Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-18 17:37, Anthony Liguori wrote: On 01/18/2011 10:17 AM, Jan Kiszka wrote: On 2011-01-18 17:04, Anthony Liguori wrote: A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. The bus topology reflects how I/O flows in and out of a device. We do not model a perfect PC bus architecture and I don't think we ever intend to. Instead, we model a functional architecture. I/O from an assigned device does not flow through the emulated PCI bus. Therefore, it does not belong on the emulated PCI bus. Assigned devices need to interact with the emulated PCI bus, but they shouldn't be children of it. You should be able to find assigned devices on some PCI bus, so you either have to hack up the existing bus to host devices that are, on the other side, not part of it or branch off a pci-kvm sub-bus, just like you would have to create a sysbus-kvm. Management tools should never transverse the device tree to find devices. This is a recipe for disaster in the long term because the device tree will not remain stable. So yes, a management tool should be able to enumerate assigned devices as they would enumerate any other PCI device but that has almost nothing to do with what the tree layout is. I'm probably misunderstanding you, but if the bus topology as the guest sees it is not properly reflected in an object tree on the qemu side, we are creating hacks again. There is no such thing as the bus topology as the guest sees it. The guest just sees a bunch of devices. The guest can only infer things like ISA busses. The guest sees a bunch of devices: an i8254, i8259, RTC, etc. Whether those devices are on an ISA bus, and LPC bus, or all in a SuperI/O chip that's part of the southbridge is all invisible to the guest. The device model topology is 100% a hidden architectural detail. This is true for the sysbus, it is obviously not the case for PCI and similarly discoverable buses. There we have a guest-explorable topology that is currently equivalent to the the qdev layout. Management and analysis tools must be able to traverse the system buses and find guest devices this way. We need to provide a compatible interface to the guest. If you agree with my above statements, then you'll also agree that we can do this without keeping the device model topology stable. But we also need to provide a compatible interface to management tools. Exposing the device model topology as a compatible interface artificially limits us. It's far better to provide higher level supported interfaces to give us the flexibility to change the device model as we need to. How do you want to change qdev to keep the guest and management tool view stable while branching off kvm sub-buses? Please propose such extensions so that they can be discussed. IIUC, that would be second relation between qdev and qbus instances besides the physical topology. What further use cases (besides passing kvm_state around) do you have in mind? If they create a device on bus X, it must never end up on bus Y just because it happens to be KVM-assisted or has some other property. Nope. This is exactly what should happen. 90% of the devices in the device model are not created by management tools. They're part of a chipset. The chipset has well defined extension points and we provide management interfaces to create devices on those extension points. That is, interfaces to create PCI devices. Creating kvm irqchips via the management tool would be one simple way (not the only one, though) to enable/disable kvm assistance for those devices. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Tue, 2011-01-18 at 16:54 +0100, Jan Kiszka wrote: On 2011-01-18 16:48, Anthony Liguori wrote: On 01/18/2011 09:43 AM, Jan Kiszka wrote: On 2011-01-18 16:04, Anthony Liguori wrote: On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. With vfio, would an assigned PCI device even need kvm_state? IIUC: Yes, for establishing the irqfd link. We abstract this through the msi/msix layer though, so the vfio driver doesn't directly know anything about kvm_state. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-18 18:02, Alex Williamson wrote: On Tue, 2011-01-18 at 16:54 +0100, Jan Kiszka wrote: On 2011-01-18 16:48, Anthony Liguori wrote: On 01/18/2011 09:43 AM, Jan Kiszka wrote: On 2011-01-18 16:04, Anthony Liguori wrote: On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. With vfio, would an assigned PCI device even need kvm_state? IIUC: Yes, for establishing the irqfd link. We abstract this through the msi/msix layer though, so the vfio driver doesn't directly know anything about kvm_state. Which version/tree are you referring to? It wasn't the case in the last version I found on the list. Does the msi layer use irqfd for every source in kvm mode then? Of course, the key question will be how that layer will once obtain kvm_state. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/18/2011 10:56 AM, Jan Kiszka wrote: The device model topology is 100% a hidden architectural detail. This is true for the sysbus, it is obviously not the case for PCI and similarly discoverable buses. There we have a guest-explorable topology that is currently equivalent to the the qdev layout. But we also don't do PCI passthrough so we really haven't even explored how that maps in qdev. I don't know if qemu-kvm has attempted to qdev-ify it. Management and analysis tools must be able to traverse the system buses and find guest devices this way. We need to provide a compatible interface to the guest. If you agree with my above statements, then you'll also agree that we can do this without keeping the device model topology stable. But we also need to provide a compatible interface to management tools. Exposing the device model topology as a compatible interface artificially limits us. It's far better to provide higher level supported interfaces to give us the flexibility to change the device model as we need to. How do you want to change qdev to keep the guest and management tool view stable while branching off kvm sub-buses? The qdev device model is not a stable interface. I think that's been clear from the very beginning. Please propose such extensions so that they can be discussed. IIUC, that would be second relation between qdev and qbus instances besides the physical topology. What further use cases (besides passing kvm_state around) do you have in mind? The -device interface is a stable interface. Right now, you don't specify any type of identifier of the pci bus when you create a PCI device. It's implied in the interface. If they create a device on bus X, it must never end up on bus Y just because it happens to be KVM-assisted or has some other property. Nope. This is exactly what should happen. 90% of the devices in the device model are not created by management tools. They're part of a chipset. The chipset has well defined extension points and we provide management interfaces to create devices on those extension points. That is, interfaces to create PCI devices. Creating kvm irqchips via the management tool would be one simple way (not the only one, though) to enable/disable kvm assistance for those devices. It's automatically created as part of the CPUs or as part of the chipset. How to enable/disable kvm assistance is a property of the CPU and/or chipset. Regards, Anthony Liguori Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-18 18:09, Anthony Liguori wrote: On 01/18/2011 10:56 AM, Jan Kiszka wrote: The device model topology is 100% a hidden architectural detail. This is true for the sysbus, it is obviously not the case for PCI and similarly discoverable buses. There we have a guest-explorable topology that is currently equivalent to the the qdev layout. But we also don't do PCI passthrough so we really haven't even explored how that maps in qdev. I don't know if qemu-kvm has attempted to qdev-ify it. It is. And even if it weren't or the current version in qemu-kvm was not perfect, we need to consider those uses cases now as we are about to define a generic model for kvm device integration. That's the point of this discussion. Management and analysis tools must be able to traverse the system buses and find guest devices this way. We need to provide a compatible interface to the guest. If you agree with my above statements, then you'll also agree that we can do this without keeping the device model topology stable. But we also need to provide a compatible interface to management tools. Exposing the device model topology as a compatible interface artificially limits us. It's far better to provide higher level supported interfaces to give us the flexibility to change the device model as we need to. How do you want to change qdev to keep the guest and management tool view stable while branching off kvm sub-buses? The qdev device model is not a stable interface. I think that's been clear from the very beginning. Internals aren't stable, but they should only be changed for a good reason, specifically when the change may impact the whole set of device models. Please propose such extensions so that they can be discussed. IIUC, that would be second relation between qdev and qbus instances besides the physical topology. What further use cases (besides passing kvm_state around) do you have in mind? The -device interface is a stable interface. Right now, you don't specify any type of identifier of the pci bus when you create a PCI device. It's implied in the interface. Which only works as along as we expose a single bus. You don't need to be an oracle to predict that this is not a stable interface. If they create a device on bus X, it must never end up on bus Y just because it happens to be KVM-assisted or has some other property. Nope. This is exactly what should happen. 90% of the devices in the device model are not created by management tools. They're part of a chipset. The chipset has well defined extension points and we provide management interfaces to create devices on those extension points. That is, interfaces to create PCI devices. Creating kvm irqchips via the management tool would be one simple way (not the only one, though) to enable/disable kvm assistance for those devices. It's automatically created as part of the CPUs or as part of the chipset. How to enable/disable kvm assistance is a property of the CPU and/or chipset. If we exclude creation via command line / config files, we could also pass the kvm_state directly from the machine or chipset setup code and save us at least the kvm system buses. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/18/2011 11:20 AM, Jan Kiszka wrote: Which only works as along as we expose a single bus. You don't need to be an oracle to predict that this is not a stable interface. Today we only have a very low level factory interface--device creation and deletion. I think we should move to higher level bus factory interfaces. An interface to create a PCI device and to delete PCI devices. This is the only sane way to do hot plug. This also makes supporting multiple busses a lot more reasonable since this factory interface could be a method of the controller. It's automatically created as part of the CPUs or as part of the chipset. How to enable/disable kvm assistance is a property of the CPU and/or chipset. If we exclude creation via command line / config files, we could also pass the kvm_state directly from the machine or chipset setup code and save us at least the kvm system buses. Which is fine in the short term. This is exactly why we don't want the device model to be an ABI. It gives us the ability to make changes as they make sense instead of trying to be perfect from the start (which we never will be). Regards, Anthony Liguori Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Tue, 2011-01-18 at 18:08 +0100, Jan Kiszka wrote: On 2011-01-18 18:02, Alex Williamson wrote: On Tue, 2011-01-18 at 16:54 +0100, Jan Kiszka wrote: On 2011-01-18 16:48, Anthony Liguori wrote: On 01/18/2011 09:43 AM, Jan Kiszka wrote: On 2011-01-18 16:04, Anthony Liguori wrote: On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. With vfio, would an assigned PCI device even need kvm_state? IIUC: Yes, for establishing the irqfd link. We abstract this through the msi/msix layer though, so the vfio driver doesn't directly know anything about kvm_state. Which version/tree are you referring to? It wasn't the case in the last version I found on the list. Does the msi layer use irqfd for every source in kvm mode then? Of course, the key question will be how that layer will once obtain kvm_state. Looking at [RFC PATCH v2] VFIO based device assignment sent on Nov 5th, I guess we do call kvm_set_irqfd. Maybe I'm just wishing that the msi layer abstracted it better. I'd like to be able to pass in a userspace interrupt handler function pointer and an event notifier fd and let the interrupt layers worry about how it's hooked up. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-18 18:31, Anthony Liguori wrote: It's automatically created as part of the CPUs or as part of the chipset. How to enable/disable kvm assistance is a property of the CPU and/or chipset. If we exclude creation via command line / config files, we could also pass the kvm_state directly from the machine or chipset setup code and save us at least the kvm system buses. Which is fine in the short term. Unless we want to abuse the pointer property for this, and there was some resistance, we would have to change the sysbus init function signature. I don't want to propose this for a short-term workaround, we need a clearer vision and roadmap to avoid multiple invasive changes to the device model. This is exactly why we don't want the device model to be an ABI. It gives us the ability to make changes as they make sense instead of trying to be perfect from the start (which we never will be). The device model will always consist of a stable part, the guest and management visible topology. That beast needs to be modeled as well, likely via some new bus objects. If that's the way to go, starting now is probably the right time as we have an urgent use case, right? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote: When built with rcu checks enabled, vhost triggers bogus warnings as vhost features are read without dev-mutex sometimes. Fixing it properly is not trivial as vhost.h does not know which lockdep classes it will be used under. Disable the warning by stubbing out the check for now. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/vhost/vhost.h |4 +--- 1 files changed, 1 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 2af44b7..2d03a31 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit) { unsigned acked_features; - acked_features = - rcu_dereference_index_check(dev-acked_features, - lockdep_is_held(dev-mutex)); + acked_features = rcu_dereference_index_check(dev-acked_features, 1); Ouch!!! Could you please at least add a comment? Alternatively, pass in the lock that is held and check for that? Given that this is a static inline, the compiler should be able to optimize the argument away when !PROVE_RCU, correct? Thanx, Paul return acked_features (1 bit); } -- 1.7.3.2.91.g446ac -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote: On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote: When built with rcu checks enabled, vhost triggers bogus warnings as vhost features are read without dev-mutex sometimes. Fixing it properly is not trivial as vhost.h does not know which lockdep classes it will be used under. Disable the warning by stubbing out the check for now. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/vhost/vhost.h |4 +--- 1 files changed, 1 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 2af44b7..2d03a31 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit) { unsigned acked_features; - acked_features = - rcu_dereference_index_check(dev-acked_features, - lockdep_is_held(dev-mutex)); + acked_features = rcu_dereference_index_check(dev-acked_features, 1); Ouch!!! Could you please at least add a comment? Yes, OK. Alternatively, pass in the lock that is held and check for that? Given that this is a static inline, the compiler should be able to optimize the argument away when !PROVE_RCU, correct? Thanx, Paul Hopefully, yes. We don't always have a lock: the idea was to create a lockdep for these cases. But we can't pass the pointer to that ... return acked_features (1 bit); } -- 1.7.3.2.91.g446ac -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv2] vhost: rcu annotation fixup
When built with rcu checks enabled, vhost triggers bogus warnings as vhost features are read without dev-mutex sometimes, and private pointer is read with our kind of rcu where work serves as a read side critical section. Fixing it properly is not trivial. Disable the warnings by stubbing out the checks for now. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Changes from v1: add TODO, fix more warnings. drivers/vhost/net.c |9 + drivers/vhost/vhost.h |6 +++--- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 9b3ca10..f616cef 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -128,8 +128,7 @@ static void handle_tx(struct vhost_net *net) size_t hdr_size; struct socket *sock; - /* TODO: check that we are running from vhost_worker? -* Not sure it's worth it, it's straight-forward enough. */ + /* TODO: check that we are running from vhost_worker? */ sock = rcu_dereference_check(vq-private_data, 1); if (!sock) return; @@ -306,7 +305,8 @@ static void handle_rx_big(struct vhost_net *net) size_t len, total_len = 0; int err; size_t hdr_size; - struct socket *sock = rcu_dereference(vq-private_data); + /* TODO: check that we are running from vhost_worker? */ + struct socket *sock = rcu_dereference_check(vq-private_data, 1); if (!sock || skb_queue_empty(sock-sk-sk_receive_queue)) return; @@ -415,7 +415,8 @@ static void handle_rx_mergeable(struct vhost_net *net) int err, headcount; size_t vhost_hlen, sock_hlen; size_t vhost_len, sock_len; - struct socket *sock = rcu_dereference(vq-private_data); + /* TODO: check that we are running from vhost_worker? */ + struct socket *sock = rcu_dereference_check(vq-private_data, 1); if (!sock || skb_queue_empty(sock-sk-sk_receive_queue)) return; diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 2af44b7..b3363ae 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -173,9 +173,9 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit) { unsigned acked_features; - acked_features = - rcu_dereference_index_check(dev-acked_features, - lockdep_is_held(dev-mutex)); + /* TODO: check that we are running from vhost_worker or dev mutex is +* held? */ + acked_features = rcu_dereference_index_check(dev-acked_features, 1); return acked_features (1 bit); } -- 1.7.3.2.91.g446ac -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 07:55:00PM +0200, Michael S. Tsirkin wrote: On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote: On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote: When built with rcu checks enabled, vhost triggers bogus warnings as vhost features are read without dev-mutex sometimes. Fixing it properly is not trivial as vhost.h does not know which lockdep classes it will be used under. Disable the warning by stubbing out the check for now. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/vhost/vhost.h |4 +--- 1 files changed, 1 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 2af44b7..2d03a31 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit) { unsigned acked_features; - acked_features = - rcu_dereference_index_check(dev-acked_features, - lockdep_is_held(dev-mutex)); + acked_features = rcu_dereference_index_check(dev-acked_features, 1); Ouch!!! Could you please at least add a comment? Yes, OK. Alternatively, pass in the lock that is held and check for that? Given that this is a static inline, the compiler should be able to optimize the argument away when !PROVE_RCU, correct? Thanx, Paul Hopefully, yes. We don't always have a lock: the idea was to create a lockdep for these cases. But we can't pass the pointer to that ... I suppose you could pass a pointer to the lockdep map structure. Not sure if this makes sense, but it would handle the situation. Alternatively, create a helper function that checks the possibilities and screams if none of them are in effect. Thanx, Paul return acked_features (1 bit); } -- 1.7.3.2.91.g446ac -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Flow Control and Port Mirroring Revisited
So it won't be all that simple to implement well, and before we try, I'd like to know whether there are applications that are helped by it. For example, we could try to measure latency at various pps and see whether the backpressure helps. netperf has -b, -w flags which might help these measurements. Those options are enabled when one adds --enable-burst to the pre-compilation ./configure of netperf (one doesn't have to recompile netserver). However, if one is also looking at latency statistics via the -j option in the top-of-trunk, or simply at the histogram with --enable-histogram on the ./configure and a verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf from: http://www.netperf.org/svn/netperf2/trunk to get the recently added support for accurate (netperf level) RTT measuremnts on burst-mode request/response tests. happy benchmarking, rick jones PS - the enhanced latency statistics from -j are only available in the omni version of the TCP_RR test. To get that add a --enable-omni to the ./configure - and in this case both netperf and netserver have to be recompiled. For very basic output one can peruse the output of: src/netperf -t omni -- -O /? and then pick those outputs of interest and put them into an output selection file which one then passes to either (test-specific) -o, -O or -k to get CVS, Human or keyval output respectively. E.G. raj@tardy:~/netperf2_trunk$ cat foo THROUGHPUT,THROUGHPUT_UNITS RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY when foo is passed to -o one will get those all on one line of CSV. To -O one gets three lines of more netperf-classic-like human readable output, and when one passes that to -k one gets a string of keyval output a la: raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram THROUGHPUT=29454.12 THROUGHPUT_UNITS=Trans/s RT_LATENCY=33.951 MIN_LATENCY=19 MEAN_LATENCY=32.00 MAX_LATENCY=126 P50_LATENCY=32 P90_LATENCY=38 P99_LATENCY=41 STDDEV_LATENCY=5.46 Histogram of request/response times UNIT_USEC :0:0:0:0:0:0:0:0:0:0 TEN_USEC :0: 3553: 45244: 237790: 7859: 86: 10:3:0:0 HUNDRED_USEC :0:2:0:0:0:0:0:0:0:0 UNIT_MSEC :0:0:0:0:0:0:0:0:0:0 TEN_MSEC :0:0:0:0:0:0:0:0:0:0 HUNDRED_MSEC :0:0:0:0:0:0:0:0:0:0 UNIT_SEC :0:0:0:0:0:0:0:0:0:0 TEN_SEC :0:0:0:0:0:0:0:0:0:0 100_SECS: 0 HIST_TOTAL: 294547 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 11:02:33AM -0800, Paul E. McKenney wrote: On Tue, Jan 18, 2011 at 07:55:00PM +0200, Michael S. Tsirkin wrote: On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote: On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote: When built with rcu checks enabled, vhost triggers bogus warnings as vhost features are read without dev-mutex sometimes. Fixing it properly is not trivial as vhost.h does not know which lockdep classes it will be used under. Disable the warning by stubbing out the check for now. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/vhost/vhost.h |4 +--- 1 files changed, 1 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 2af44b7..2d03a31 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit) { unsigned acked_features; - acked_features = - rcu_dereference_index_check(dev-acked_features, - lockdep_is_held(dev-mutex)); + acked_features = rcu_dereference_index_check(dev-acked_features, 1); Ouch!!! Could you please at least add a comment? Yes, OK. Alternatively, pass in the lock that is held and check for that? Given that this is a static inline, the compiler should be able to optimize the argument away when !PROVE_RCU, correct? Thanx, Paul Hopefully, yes. We don't always have a lock: the idea was to create a lockdep for these cases. But we can't pass the pointer to that ... I suppose you could pass a pointer to the lockdep map structure. Not sure if this makes sense, but it would handle the situation. Will it compile with lockdep disabled too? What will the pointer be? Alternatively, create a helper function that checks the possibilities and screams if none of them are in effect. Thanx, Paul The problem here is the callee needs to know about all callers. return acked_features (1 bit); } -- 1.7.3.2.91.g446ac -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Flow Control and Port Mirroring Revisited
On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote: So it won't be all that simple to implement well, and before we try, I'd like to know whether there are applications that are helped by it. For example, we could try to measure latency at various pps and see whether the backpressure helps. netperf has -b, -w flags which might help these measurements. Those options are enabled when one adds --enable-burst to the pre-compilation ./configure of netperf (one doesn't have to recompile netserver). However, if one is also looking at latency statistics via the -j option in the top-of-trunk, or simply at the histogram with --enable-histogram on the ./configure and a verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf from: http://www.netperf.org/svn/netperf2/trunk to get the recently added support for accurate (netperf level) RTT measuremnts on burst-mode request/response tests. happy benchmarking, rick jones PS - the enhanced latency statistics from -j are only available in the omni version of the TCP_RR test. To get that add a --enable-omni to the ./configure - and in this case both netperf and netserver have to be recompiled. Is this TCP only? I would love to get latency data from UDP as well. For very basic output one can peruse the output of: src/netperf -t omni -- -O /? and then pick those outputs of interest and put them into an output selection file which one then passes to either (test-specific) -o, -O or -k to get CVS, Human or keyval output respectively. E.G. raj@tardy:~/netperf2_trunk$ cat foo THROUGHPUT,THROUGHPUT_UNITS RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY when foo is passed to -o one will get those all on one line of CSV. To -O one gets three lines of more netperf-classic-like human readable output, and when one passes that to -k one gets a string of keyval output a la: raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram THROUGHPUT=29454.12 THROUGHPUT_UNITS=Trans/s RT_LATENCY=33.951 MIN_LATENCY=19 MEAN_LATENCY=32.00 MAX_LATENCY=126 P50_LATENCY=32 P90_LATENCY=38 P99_LATENCY=41 STDDEV_LATENCY=5.46 Histogram of request/response times UNIT_USEC :0:0:0:0:0:0:0:0:0:0 TEN_USEC :0: 3553: 45244: 237790: 7859: 86: 10:3:0:0 HUNDRED_USEC :0:2:0:0:0:0:0:0:0:0 UNIT_MSEC :0:0:0:0:0:0:0:0:0:0 TEN_MSEC :0:0:0:0:0:0:0:0:0:0 HUNDRED_MSEC :0:0:0:0:0:0:0:0:0:0 UNIT_SEC :0:0:0:0:0:0:0:0:0:0 TEN_SEC :0:0:0:0:0:0:0:0:0:0 100_SECS: 0 HIST_TOTAL: 294547 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 10:10:31PM +0200, Michael S. Tsirkin wrote: On Tue, Jan 18, 2011 at 11:02:33AM -0800, Paul E. McKenney wrote: On Tue, Jan 18, 2011 at 07:55:00PM +0200, Michael S. Tsirkin wrote: On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote: On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote: When built with rcu checks enabled, vhost triggers bogus warnings as vhost features are read without dev-mutex sometimes. Fixing it properly is not trivial as vhost.h does not know which lockdep classes it will be used under. Disable the warning by stubbing out the check for now. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/vhost/vhost.h |4 +--- 1 files changed, 1 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 2af44b7..2d03a31 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit) { unsigned acked_features; - acked_features = - rcu_dereference_index_check(dev-acked_features, - lockdep_is_held(dev-mutex)); + acked_features = rcu_dereference_index_check(dev-acked_features, 1); Ouch!!! Could you please at least add a comment? Yes, OK. Alternatively, pass in the lock that is held and check for that? Given that this is a static inline, the compiler should be able to optimize the argument away when !PROVE_RCU, correct? Thanx, Paul Hopefully, yes. We don't always have a lock: the idea was to create a lockdep for these cases. But we can't pass the pointer to that ... I suppose you could pass a pointer to the lockdep map structure. Not sure if this makes sense, but it would handle the situation. Will it compile with lockdep disabled too? What will the pointer be? One (crude) approach would be to make the pointer void* if lockdep is disabled. Alternatively, create a helper function that checks the possibilities and screams if none of them are in effect. Thanx, Paul The problem here is the callee needs to know about all callers. As does the guy reading the code. ;-) Thanx, Paul return acked_features (1 bit); } -- 1.7.3.2.91.g446ac -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Flow Control and Port Mirroring Revisited
Michael S. Tsirkin wrote: On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote: PS - the enhanced latency statistics from -j are only available in the omni version of the TCP_RR test. To get that add a --enable-omni to the ./configure - and in this case both netperf and netserver have to be recompiled. Is this TCP only? I would love to get latency data from UDP as well. I believe it will work with UDP request response as well. The omni test code strives to be protocol agnostic. (I'm sure there are bugs of course, there always are.) There is though the added complication of there being no specific matching of requests to responses. The code as written takes advantage of TCP's in-order semantics and recovery from packet loss. In a plain UDP_RR test, with one at a time transactions, if either the request or response are lost, data flow effectively stops there until the timer expires. So, one has reasonable RTT numbers from before that point. In a burst UDP RR test, the code doesn't know which request/response was lost and so the matching being done to get RTTs will be off by each lost datagram. And if something were re-ordered the timstamps would be off even without a datagram loss event. To fix that would require netperf do something it has not yet done in 18-odd years :) That is actually echo something back from the netserver on the RR test - either an id, or a timestamp. That means dirtying the buffers which means still more cache misses, from places other than the actual stack. Not beyond the realm of the possible, but it would be a bit of departure for normal operation (*) and could enforce a minimum request/response size beyond the present single byte (ok, perhaps only two or four bytes :). But that, perhaps, is a discussion best left to netperf-talk at netperf.org. happy benchmarking, rick jones (*) netperf does have the concept of reading from and/or dirtying buffers, put-in back in the days of COW/page-remapping in HP-UX 9.0, but that was mainly to force COW and/or show the effect of the required data cache purges/flushes. As such it was made conditional on DIRTY being defined. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] Removing scripts/check_image.py
This patchset aims to remove the script check_image.py from KVM autotest. The reasoning behind this is that we want to revamp all the pre/post scripts processing, that is, making the py scripts part of the infrastructure, where they can share API with the rest of the framework. So we'll proceed by parts, this being the first. Lucas Meneghel Rodrigues (3): KVM test: Introduce check_image postprocess directive KVM test: Modify enospc test to not require scripts/check_image.py KVM test: Removing scripts/check_image.py client/tests/kvm/kvm_preprocessing.py |3 +- client/tests/kvm/kvm_vm.py | 62 +++ client/tests/kvm/scripts/check_image.py | 99 --- client/tests/kvm/tests/enospc.py| 12 ++-- client/tests/kvm/tests_base.cfg.sample |7 +- 5 files changed, 73 insertions(+), 110 deletions(-) delete mode 100644 client/tests/kvm/scripts/check_image.py -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] KVM test: Introduce check_image postprocess directive
With check_image_foo = yes, KVM autotest will use qemu-img to perform checks on the qcow2 image. This new functionality intends to replace the script check_image.py. The plan is to supersede most of the pre/post scripts in place throughout KVM autotest. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/kvm_preprocessing.py |3 +- client/tests/kvm/kvm_vm.py | 62 client/tests/kvm/tests_base.cfg.sample |7 ++-- 3 files changed, 67 insertions(+), 5 deletions(-) diff --git a/client/tests/kvm/kvm_preprocessing.py b/client/tests/kvm/kvm_preprocessing.py index 56acf0c..4a6e0f8 100644 --- a/client/tests/kvm/kvm_preprocessing.py +++ b/client/tests/kvm/kvm_preprocessing.py @@ -93,11 +93,12 @@ def preprocess_vm(test, params, env, name): def postprocess_image(test, params): Postprocess a single QEMU image according to the instructions in params. -Currently this function just removes an image if requested. @param test: An Autotest test object. @param params: A dict containing image postprocessing parameters. +if params.get(check_image) == yes: +kvm_vm.check_image(params, test.bindir) if params.get(remove_image) == yes: kvm_vm.remove_image(params, test.bindir) diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py index 18d10ef..767c6d4 100755 --- a/client/tests/kvm/kvm_vm.py +++ b/client/tests/kvm/kvm_vm.py @@ -47,6 +47,15 @@ class VMImageMissingError(VMError): return CD image file not found: %r % self.filename +class VMImageCheckError(VMError): +def __init__(self, filename): +VMError.__init__(self, filename) +self.filename = filename + +def __str__(self): +return Errors found on image: %r % self.filename + + class VMBadPATypeError(VMError): def __init__(self, pa_type): VMError.__init__(self, pa_type) @@ -239,6 +248,59 @@ def remove_image(params, root_dir): logging.debug(Image file %s not found) +def check_image(params, root_dir): + +Check an image using qemu-img. + +@param params: Dictionary containing the test parameters. +@param root_dir: Base directory for relative filenames. + +@note: params should contain: + image_name -- the name of the image file, without extension + image_format -- the format of the image (qcow2, raw etc) + +image_filename = get_image_filename(params, root_dir) +logging.debug(Checking image file %s... % image_filename) +qemu_img_cmd = kvm_utils.get_path(root_dir, + params.get(qemu_img_binary, qemu-img)) +check_critical = params.get(check_image_critical) == 'yes' +image_is_qcow2 = params.get(image_format) == 'qcow2' +if os.path.exists(image_filename) and image_is_qcow2: +# Verifying if qemu-img supports 'check' +q_result = utils.run(qemu_img_cmd, ignore_status=True) +q_output = q_result.stdout +check_img = True +if not check in q_output: +logging.error(qemu-img does not support 'check', + skipping check...) +check_img = False +if not info in q_output: +logging.error(qemu-img does not support 'info', + skipping check...) +check_img = False +if check_img: +try: +utils.system(%s info %s % (qemu_img_cmd, image_filename)) +except error.CmdError: +logging.error(Error getting info from image %s, + image_filename) +try: +utils.system(%s check %s % (qemu_img_cmd, image_filename)) +except error.CmdError: +if check_critical: +raise VMImageCheckError(image_filename) +else: +logging.error(Error checking image %s, image_filename) + +else: +if not os.path.exists(image_filename): +logging.debug(Image file %s not found, skipping check..., + image_filename) +elif not image_is_qcow2: +logging.debug(Image file %s not qcow2, skipping check..., + image_filename) + + class VM: This class handles all basic VM operations. diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index 28064a8..8b67256 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -2357,11 +2357,10 @@ kdump: variants: - @qcow2: image_format = qcow2 -post_command += python scripts/check_image.py; -post_command_timeout = 600 -post_command_noncritical = yes +check_image = yes +check_image_critical = no ioquit: -post_command_noncritical = no +check_image_critical = yes - vmdk: no
[PATCH 2/3] KVM test: Modify enospc test to not require scripts/check_image.py
With this we prepare to remove the aforementioned script. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/tests/enospc.py | 12 ++-- 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/client/tests/kvm/tests/enospc.py b/client/tests/kvm/tests/enospc.py index 2d8b628..4dd694f 100644 --- a/client/tests/kvm/tests/enospc.py +++ b/client/tests/kvm/tests/enospc.py @@ -1,7 +1,7 @@ import logging, commands, time, os, re from autotest_lib.client.common_lib import error from autotest_lib.client.bin import utils -import kvm_test_utils +import kvm_test_utils, kvm_vm def run_enospc(test, params, env): @@ -46,11 +46,11 @@ def run_enospc(test, params, env): if paused in status: pause_n += 1 logging.info(Checking all images in use by the VM) -script_path = os.path.join(test.bindir, scripts/check_image.py) -try: -cmd_result = utils.run('python %s' % script_path) -except error.CmdError, e: -logging.debug(e.result_obj.stdout) +for image_name in vm.params.objects(images): +image_params = vm.params.object_params(image_name) +# Just to make sure the image check won't throw exceptions +image_params[check_image_critical] = 'no' +kvm_vm.check_image(image_params, test.bindir) logging.info(Guest paused, extending Logical Volume size) try: cmd_result = utils.run(lvextend -L +200M /dev/vgtest/lvtest) -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] KVM test: Removing scripts/check_image.py
As its functionality was implemented as part of the framework on previous patches. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/scripts/check_image.py | 99 --- 1 files changed, 0 insertions(+), 99 deletions(-) delete mode 100644 client/tests/kvm/scripts/check_image.py diff --git a/client/tests/kvm/scripts/check_image.py b/client/tests/kvm/scripts/check_image.py deleted file mode 100644 index 2b5c227..000 --- a/client/tests/kvm/scripts/check_image.py +++ /dev/null @@ -1,99 +0,0 @@ -import os, sys, commands - - -class ImageCheckError(Exception): - -Simple wrapper for the builtin Exception class. - -pass - - -class ImageCheck(object): - -Check qcow2 image by qemu-img info/check command. - -def __init__(self): - -Gets params from environment variables and sets class attributes. - -self.image_path_list = [] -client_dir = os.environ['AUTODIR'] -self.kvm_dir = os.path.join(client_dir, 'tests/kvm') -img_to_check = os.environ['KVM_TEST_images'].split() - -for img in img_to_check: -img_name_str = KVM_TEST_image_name_%s % img -if not os.environ.has_key(img_name_str): -img_name_str = KVM_TEST_image_name -img_format_str = KVM_TEST_image_format_%s % img -if os.environ.has_key(img_format_str): -image_format = os.environ[img_format_str] -else: -image_format = os.environ['KVM_TEST_image_format'] -if image_format != qcow2: -continue -image_name = os.environ[img_name_str] -image_filename = %s.%s % (image_name, image_format) -image_filename = os.path.join(self.kvm_dir, image_filename) -self.image_path_list.append(image_filename) -if os.environ.has_key('KVM_TEST_qemu_img_binary'): -self.qemu_img_path = os.environ['KVM_TEST_qemu_img_binary'] -else: -self.qemu_img_path = os.path.join(self.kvm_dir, 'qemu-img') -self.qemu_img_check = True -cmd = %s |grep check % self.qemu_img_path -(s1, output) = commands.getstatusoutput(cmd) -if s1: -self.qemu_img_check = False -print Command qemu-img check not available, not checking... -cmd = %s |grep info % self.qemu_img_path -(s2, output) = commands.getstatusoutput(cmd) -if s2: -self.qemu_img_check = False -print Command qemu-img info not available, not checking... - -def exec_img_cmd(self, cmd_type, image_path): - -Run qemu-img info/check on given image. - -@param cmd_type: Sub command used together with qemu. -@param image_path: Real path of the image. - -cmd = ' '.join([self.qemu_img_path, cmd_type, image_path]) -print Checking image with command %s % cmd -(status, output) = commands.getstatusoutput(cmd) -print output -if status or (cmd_type == check and not No errors in output): -msg = Command %s failed % cmd -return False, msg -else: -return True, '' - - -def check_image(self): - -Run qemu-img info/check to check the image in list. - -If the image checking is failed, raise an exception. - -# Check all the image in list. -errmsg = [] -for image_path in self.image_path_list: -if not os.path.exists(image_path): -print Image %s does not exist! % image_path -continue -s, o = self.exec_img_cmd('info', image_path) -if not s: -errmsg.append(o) -s, o = self.exec_img_cmd('check', image_path) -if not s: -errmsg.append(o) - -if len(errmsg) 0: -raise ImageCheckError('Errors were found, please check log!') - - -if __name__ == __main__: -image_check = ImageCheck() -if image_check.qemu_img_check: -image_check.check_image() -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Autotest] [PATCH 1/3] KVM test: Introduce check_image postprocess directive
- Original Message - With check_image_foo = yes, KVM autotest will use qemu-img to perform checks on the qcow2 image. This new functionality intends to replace the script check_image.py. The plan is to supersede most of the pre/post scripts in place throughout KVM autotest. Good idea! Some pre/post cmd/script is very confused, this would make thing clear. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com I'll test patches when I go to office. --- client/tests/kvm/kvm_preprocessing.py | 3 +- client/tests/kvm/kvm_vm.py | 62 client/tests/kvm/tests_base.cfg.sample | 7 ++-- 3 files changed, 67 insertions(+), 5 deletions(-) diff --git a/client/tests/kvm/kvm_preprocessing.py b/client/tests/kvm/kvm_preprocessing.py index 56acf0c..4a6e0f8 100644 --- a/client/tests/kvm/kvm_preprocessing.py +++ b/client/tests/kvm/kvm_preprocessing.py @@ -93,11 +93,12 @@ def preprocess_vm(test, params, env, name): def postprocess_image(test, params): Postprocess a single QEMU image according to the instructions in params. - Currently this function just removes an image if requested. @param test: An Autotest test object. @param params: A dict containing image postprocessing parameters. + if params.get(check_image) == yes: + kvm_vm.check_image(params, test.bindir) if params.get(remove_image) == yes: kvm_vm.remove_image(params, test.bindir) It will fix the problem that check_image.py try to check some removed images. diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py index 18d10ef..767c6d4 100755 --- a/client/tests/kvm/kvm_vm.py +++ b/client/tests/kvm/kvm_vm.py @@ -47,6 +47,15 @@ class VMImageMissingError(VMError): return CD image file not found: %r % self.filename +class VMImageCheckError(VMError): + def __init__(self, filename): + VMError.__init__(self, filename) + self.filename = filename + + def __str__(self): + return Errors found on image: %r % self.filename + + class VMBadPATypeError(VMError): def __init__(self, pa_type): VMError.__init__(self, pa_type) @@ -239,6 +248,59 @@ def remove_image(params, root_dir): logging.debug(Image file %s not found) +def check_image(params, root_dir): + + Check an image using qemu-img. + + @param params: Dictionary containing the test parameters. + @param root_dir: Base directory for relative filenames. + + @note: params should contain: + image_name -- the name of the image file, without extension + image_format -- the format of the image (qcow2, raw etc) + + image_filename = get_image_filename(params, root_dir) + logging.debug(Checking image file %s... % image_filename) + qemu_img_cmd = kvm_utils.get_path(root_dir, + params.get(qemu_img_binary, qemu-img)) + check_critical = params.get(check_image_critical) == 'yes' + image_is_qcow2 = params.get(image_format) == 'qcow2' + if os.path.exists(image_filename) and image_is_qcow2: + # Verifying if qemu-img supports 'check' + q_result = utils.run(qemu_img_cmd, ignore_status=True) + q_output = q_result.stdout + check_img = True + if not check in q_output: + logging.error(qemu-img does not support 'check', + skipping check...) + check_img = False + if not info in q_output: + logging.error(qemu-img does not support 'info', + skipping check...) + check_img = False + if check_img: + try: + utils.system(%s info %s % (qemu_img_cmd, image_filename)) + except error.CmdError: + logging.error(Error getting info from image %s, + image_filename) + try: + utils.system(%s check %s % (qemu_img_cmd, image_filename)) + except error.CmdError: + if check_critical: + raise VMImageCheckError(image_filename) + else: + logging.error(Error checking image %s, image_filename) + + else: + if not os.path.exists(image_filename): + logging.debug(Image file %s not found, skipping check..., + image_filename) + elif not image_is_qcow2: + logging.debug(Image file %s not qcow2, skipping check..., + image_filename) + + class VM: This class handles all basic VM operations. diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index 28064a8..8b67256 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -2357,11 +2357,10 @@ kdump: variants: - @qcow2: image_format = qcow2 - post_command += python scripts/check_image.py; - post_command_timeout = 600 - post_command_noncritical = yes + check_image = yes + check_image_critical = no ioquit: - post_command_noncritical = no + check_image_critical = yes - vmdk: no ioquit only Fedora Ubuntu Windows -- 1.7.3.4 ___ Autotest mailing list autot...@test.kernel.org http://test.kernel.org/cgi-bin/mailman/listinfo/autotest -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote: When built with rcu checks enabled, vhost triggers bogus warnings as vhost features are read without dev-mutex sometimes. Fixing it properly is not trivial as vhost.h does not know which lockdep classes it will be used under. Disable the warning by stubbing out the check for now. What is the harm in leaving the bogus warnings until the difficult fix happens? RCU checks enabled does not seem like something that is enabled in production. If this patch is applied, there is always the risk that it'll be simply forgotten about. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Removing scripts/hugepage.py
Continuing on our quest of obsoleting and removing the pre/post scripts, this time we remove hugepage.py for a slightly better and cleaner framework implementation. Lucas Meneghel Rodrigues (2): KVM test: Turning hugepages setup into KVM autotest infrastructure KVM test: Removing scripts/hugepage.py client/tests/kvm/kvm_preprocessing.py |8 ++ client/tests/kvm/kvm_utils.py | 102 +++ client/tests/kvm/scripts/hugepage.py | 118 client/tests/kvm/tests_base.cfg.sample |3 +- 4 files changed, 111 insertions(+), 120 deletions(-) delete mode 100755 client/tests/kvm/scripts/hugepage.py -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM test: Turning hugepages setup into KVM autotest infrastructure
So we can get rid of scripts/hugepage.py. The implementation strategy is to have a kvm_utils.HugePageConfig class that can do both hugepages setup and cleanup, and call it during pre/postprocessing. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/kvm_preprocessing.py |8 +++ client/tests/kvm/kvm_utils.py | 102 client/tests/kvm/tests_base.cfg.sample |3 +- 3 files changed, 111 insertions(+), 2 deletions(-) diff --git a/client/tests/kvm/kvm_preprocessing.py b/client/tests/kvm/kvm_preprocessing.py index 4a6e0f8..41455cf 100644 --- a/client/tests/kvm/kvm_preprocessing.py +++ b/client/tests/kvm/kvm_preprocessing.py @@ -254,6 +254,10 @@ def preprocess(test, params, env): logging.debug(KVM userspace version: %s % kvm_userspace_version) test.write_test_keyval({kvm_userspace_version: kvm_userspace_version}) +if params.get(setup_hugepages) == yes: +h = kvm_utils.HugePageConfig(params) +h.setup() + # Execute any pre_commands if params.get(pre_command): process_command(test, params, env, params.get(pre_command), @@ -350,6 +354,10 @@ def postprocess(test, params, env): env[tcpdump].close() del env[tcpdump] +if params.get(setup_hugepages) == yes: +h = kvm_utils.HugePageConfig(params) +h.cleanup() + # Execute any post_commands if params.get(post_command): process_command(test, params, env, params.get(post_command), diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py index 78c9f25..b61ff94 100644 --- a/client/tests/kvm/kvm_utils.py +++ b/client/tests/kvm/kvm_utils.py @@ -1265,6 +1265,108 @@ class KvmLoggingConfig(logging_config.LoggingConfig): verbose=verbose) +class HugePageConfig: +def __init__(self, params): + +Gets environment variable values and calculates the target number +of huge memory pages. + +@param params: Dict like object containing parameters for the test. + +self.vms = len(params.objects(vms)) +self.mem = int(params.get(mem)) +self.max_vms = int(params.get(max_vms, 0)) +self.hugepage_path = '/mnt/kvm_hugepage' +self.hugepage_size = self.get_hugepage_size() +self.target_hugepages = self.get_target_hugepages() +self.kernel_hp_file = '/proc/sys/vm/nr_hugepages' + + +def get_hugepage_size(self): + +Get the current system setting for huge memory page size. + +meminfo = open('/proc/meminfo', 'r').readlines() +huge_line_list = [h for h in meminfo if h.startswith(Hugepagesize)] +try: +return int(huge_line_list[0].split()[1]) +except ValueError, e: +raise ValueError(Could not get huge page size setting from + /proc/meminfo: %s % e) + + +def get_target_hugepages(self): + +Calculate the target number of hugepages for testing purposes. + +if self.vms self.max_vms: +self.vms = self.max_vms +# memory of all VMs plus qemu overhead of 64MB per guest +vmsm = (self.vms * self.mem) + (self.vms * 64) +return int(vmsm * 1024 / self.hugepage_size) + + +@error.context_aware +def set_hugepages(self): + +Sets the hugepage limit to the target hugepage value calculated. + +error.base_context(Setting hugepages limit to %s % + self.target_hugepages) +hugepage_cfg = open(self.kernel_hp_file, r+) +hp = hugepage_cfg.readline() +while int(hp) self.target_hugepages: +loop_hp = hp +hugepage_cfg.write(str(self.target_hugepages)) +hugepage_cfg.flush() +hugepage_cfg.seek(0) +hp = int(hugepage_cfg.readline()) +if loop_hp == hp: +raise ValueError(Cannot set the kernel hugepage setting + to the target value of %d hugepages. % + self.target_hugepages) +hugepage_cfg.close() +logging.debug(Successfuly set %s large memory pages on host , + self.target_hugepages) + + +@error.context_aware +def mount_hugepage_fs(self): + +Verify if there's a hugetlbfs mount set. If there's none, will set up +a hugetlbfs mount using the class attribute that defines the mount +point. + +error.base_context(Mounting hugepages path) +if not os.path.ismount(self.hugepage_path): +if not os.path.isdir(self.hugepage_path): +os.makedirs(self.hugepage_path) +cmd = mount -t hugetlbfs none %s % self.hugepage_path +utils.system(cmd) + + +def setup(self): +logging.debug(Number of VMs this test will use: %d, self.vms) +
[PATCH 2/2] KVM test: Removing scripts/hugepage.py
Now that its functionality has been reimplemented as KVM test infrastructure. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/scripts/hugepage.py | 118 -- 1 files changed, 0 insertions(+), 118 deletions(-) delete mode 100755 client/tests/kvm/scripts/hugepage.py diff --git a/client/tests/kvm/scripts/hugepage.py b/client/tests/kvm/scripts/hugepage.py deleted file mode 100755 index 8a1b0f6..000 --- a/client/tests/kvm/scripts/hugepage.py +++ /dev/null @@ -1,118 +0,0 @@ -#!/usr/bin/python -# -*- coding: utf-8 -*- -import os, sys, time - - -Simple script to allocate enough hugepages for KVM testing purposes. - - -class HugePageError(Exception): - -Simple wrapper for the builtin Exception class. - -pass - - -class HugePage: -def __init__(self, hugepage_path=None): - -Gets environment variable values and calculates the target number -of huge memory pages. - -@param hugepage_path: Path where to mount hugetlbfs path, if not -yet configured. - -self.vms = len(os.environ['KVM_TEST_vms'].split()) -self.mem = int(os.environ['KVM_TEST_mem']) -try: -self.max_vms = int(os.environ['KVM_TEST_max_vms']) -except KeyError: -self.max_vms = 0 - -if hugepage_path: -self.hugepage_path = hugepage_path -else: -self.hugepage_path = '/mnt/kvm_hugepage' - -self.hugepage_size = self.get_hugepage_size() -self.target_hugepages = self.get_target_hugepages() -print Number of VMs this test will use: %d % self.vms -print Amount of memory used by each vm: %s % self.mem -print (System setting for large memory page size: %s % - self.hugepage_size) -print (Number of large memory pages needed for this test: %s % - self.target_hugepages) - - -def get_hugepage_size(self): - -Get the current system setting for huge memory page size. - -meminfo = open('/proc/meminfo', 'r').readlines() -huge_line_list = [h for h in meminfo if h.startswith(Hugepagesize)] -try: -return int(huge_line_list[0].split()[1]) -except ValueError, e: -raise HugePageError(Could not get huge page size setting from -/proc/meminfo: %s % e) - - -def get_target_hugepages(self): - -Calculate the target number of hugepages for testing purposes. - -if self.vms self.max_vms: -self.vms = self.max_vms -# memory of all VMs plus qemu overhead of 64MB per guest -vmsm = (self.vms * self.mem) + (self.vms * 64) -return int(vmsm * 1024 / self.hugepage_size) - - -def set_hugepages(self): - -Sets the hugepage limit to the target hugepage value calculated. - -hugepage_cfg = open(/proc/sys/vm/nr_hugepages, r+) -hp = hugepage_cfg.readline() -while int(hp) self.target_hugepages: -loop_hp = hp -hugepage_cfg.write(str(self.target_hugepages)) -hugepage_cfg.flush() -hugepage_cfg.seek(0) -hp = int(hugepage_cfg.readline()) -if loop_hp == hp: -raise HugePageError(Cannot set the kernel hugepage setting -to the target value of %d hugepages. % -self.target_hugepages) -hugepage_cfg.close() -print (Successfuly set %s large memory pages on host % - self.target_hugepages) - - -def mount_hugepage_fs(self): - -Verify if there's a hugetlbfs mount set. If there's none, will set up -a hugetlbfs mount using the class attribute that defines the mount -point. - -if not os.path.ismount(self.hugepage_path): -if not os.path.isdir(self.hugepage_path): -os.makedirs(self.hugepage_path) -cmd = mount -t hugetlbfs none %s % self.hugepage_path -if os.system(cmd): -raise HugePageError(Cannot mount hugetlbfs path %s % -self.hugepage_path) - - -def setup(self): -self.set_hugepages() -self.mount_hugepage_fs() - - -if __name__ == __main__: -if len(sys.argv) 2: -huge_page = HugePage() -else: -huge_page = HugePage(sys.argv[1]) - -huge_page.setup() -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in kvm on ia64
The Buildbot has detected a new failure of ia64 on kvm. Full details are available at: http://buildbot.b1-systems.de/kvm/builders/ia64/builds/67 Buildbot URL: http://buildbot.b1-systems.de/kvm/ Buildslave for this Build: b1_kvm_1 Build Reason: The Nightly scheduler named 'nightly_master' triggered this build Build Source Stamp: [branch master] HEAD Blamelist: BUILD FAILED: failed compile sincerely, -The Buildbot -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] KVM: fix rcu usage warning in kvm_arch_vcpu_ioctl_set_sregs()
On 01/12/2011 03:39 PM, Xiao Guangrong wrote: Fix: [ 1001.499596] === [ 1001.499599] [ INFO: suspicious rcu_dereference_check() usage. ] [ 1001.499601] --- [ 1001.499604] include/linux/kvm_host.h:301 invoked rcu_dereference_check() without protection! .. [ 1001.499636] Pid: 6035, comm: qemu-system-x86 Not tainted 2.6.37-rc6+ #62 [ 1001.499638] Call Trace: [ 1001.499644] [] lockdep_rcu_dereference+0x9d/0xa5 [ 1001.499653] [] gfn_to_memslot+0x8d/0xc8 [kvm] [ 1001.499661] [] gfn_to_hva+0x16/0x3f [kvm] [ 1001.499669] [] kvm_read_guest_page+0x1e/0x5e [kvm] [ 1001.499681] [] kvm_read_guest_page_mmu+0x53/0x5e [kvm] [ 1001.499699] [] load_pdptrs+0x3f/0x9c [kvm] [ 1001.499705] [] ? vmx_set_cr0+0x507/0x517 [kvm_intel] [ 1001.499717] [] kvm_arch_vcpu_ioctl_set_sregs+0x1f3/0x3c0 [kvm] [ 1001.499727] [] kvm_vcpu_ioctl+0x6a5/0xbc5 [kvm] Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com Ping ...? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: rcu annotation fixup
On Wed, Jan 19, 2011 at 12:40:40AM +, Mel Gorman wrote: On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote: When built with rcu checks enabled, vhost triggers bogus warnings as vhost features are read without dev-mutex sometimes. Fixing it properly is not trivial as vhost.h does not know which lockdep classes it will be used under. Disable the warning by stubbing out the check for now. What is the harm in leaving the bogus warnings until the difficult fix happens? RCU checks enabled does not seem like something that is enabled in production. I would like to run with rcu checks enabled sometimes to debug kvm, which has an elaborate rcu strategy. Bogus warnings in the log make it easy to overlook the real ones. Further, the rcu macros used are a form of documentation. If we have - rcu_dereference_index_check(dev-acked_features, - lockdep_is_held(dev-mutex)); this means 'taken in rcu read side critical section or under mutex', + acked_features = rcu_dereference_index_check(dev-acked_features, 1); means 'not checked'. If this patch is applied, there is always the risk that it'll be simply forgotten about. Well, that's why I put in a TODO. If there's a demand for that, I can add a Kconfig option to trigger a warning at each unchecked rcu call in vhost-net but I doubt it'll get a lof of use :) -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/19] Kemari for KVM v0.2.6
Hi, This patch series is a revised version of Kemari for KVM, which applied comments for the previous post. The current code is based on qemu.git d03d11260ee2d55579e8b76116e35ccdf5031833. The changes from v0.2.5 - v0.2.6 are: - use qemu_{put,get}_be32() to save/load niov in event-tap The changes from v0.2.4 - v0.2.5 are: - fixed braces and trailing spaces by using Blue's checkpatch.pl (Blue) - event-tap: don't try to send blk_req if it's a bdrv_aio_flush event The changes from v0.2.3 - v0.2.4 are: - call vm_start() before event_tap_flush_one() to avoid failure in virtio-net assertion - add vm_change_state_handler to turn off ft_mode - use qemu_iovec functions in event-tap - remove duplicated code in migration - remove unnecessary new line for error_report in ft_trans_file The changes from v0.2.2 - v0.2.3 are: - queue async net requests without copying (MST) -- if not async, contents of the packets are sent to the secondary - better description for option -k (MST) - fix memory transfer failure - fix ft transaction initiation failure The changes from v0.2.1 - v0.2.2 are: - decrement last_avaid_idx with inuse before saving (MST) - remove qemu_aio_flush() and bdrv_flush_all() in migrate_ft_trans_commit() The changes from v0.2 - v0.2.1 are: - Move event-tap to net/block layer and use stubs (Blue, Paul, MST, Kevin) - Tap bdrv_aio_flush (Marcelo) - Remove multiwrite interface in event-tap (Stefan) - Fix event-tap to use pio/mmio to replay both net/block (Stefan) - Improve error handling in event-tap (Stefan) - Fix leak in event-tap (Stefan) - Revise virtio last_avail_idx manipulation (MST) - Clean up migration.c hook (Marcelo) - Make deleting change state handler robust (Isaku, Anthony) The changes from v0.1.1 - v0.2 are: - Introduce a queue in event-tap to make VM sync live. - Change transaction receiver to a state machine for async receiving. - Replace net/block layer functions with event-tap proxy functions. - Remove dirty bitmap optimization for now. - convert DPRINTF() in ft_trans_file to trace functions. - convert fprintf() in ft_trans_file to error_report(). - improved error handling in ft_trans_file. - add a tmp pointer to qemu_del_vm_change_state_handler. The changes from v0.1 - v0.1.1 are: - events are tapped in net/block layer instead of device emulation layer. - Introduce a new option for -incoming to accept FT transaction. - Removed writev() support to QEMUFile and FdMigrationState for now. I would post this work in a different series. - Modified virtio-blk save/load handler to send inuse variable to correctly replay. - Removed configure --enable-ft-mode. - Removed unnecessary check for qemu_realloc(). The first 6 patches modify several functions of qemu to prepare introducing Kemari specific components. The next 6 patches are the components of Kemari. They introduce event-tap and the FT transaction protocol file based on buffered file. The design document of FT transaction protocol can be found at, http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf Then the following 2 patches modifies net/block layer functions with event-tap functions. Please note that if Kemari is off, event-tap will just passthrough, and there is most no intrusion to exisiting functions including normal live migration. Finally, the migration layer are modified to support Kemari in the last 5 patches. Again, there shouldn't be any affection if a user doesn't specify Kemari specific options. The transaction is now async on both sender and receiver side. The sender side respects the max_downtime to decide when to switch from async to sync mode. The repository contains all patches I'm sending with this message. For those who want to try, please pull the following repository. It also includes dirty bitmap optimization which aren't ready for posting yet. To remove the dirty bitmap optimization, please look at HEAD~5 of the tree. git://kemari.git.sourceforge.net/gitroot/kemari/kemari next Thanks, Yoshi Yoshiaki Tamura (19): Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer(). Introduce read() to FdMigrationState. Introduce skip_header parameter to qemu_loadvm_state(). qemu-char: export socket_set_nodelay(). vl.c: add deleted flag for deleting the handler. virtio: decrement last_avail_idx with inuse before saving. Introduce fault tolerant VM transaction QEMUFile and ft_mode. savevm: introduce util functions to control ft_trans_file from savevm layer. Introduce event-tap. Call init handler of event-tap at main() in vl.c. ioport: insert event_tap_ioport() to ioport_write(). Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c. net: insert event-tap to qemu_send_packet() and qemu_sendv_packet_async(). block: insert event-tap to bdrv_aio_writev() and bdrv_aio_flush(). savevm: introduce qemu_savevm_trans_{begin,commit}. migration: introduce migrate_ft_trans_{put,get}_ready(), and modify
[PATCH 02/19] Introduce read() to FdMigrationState.
Currently FdMigrationState doesn't support read(), and this patch introduces it to get response from the other side. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- migration-tcp.c | 15 +++ migration.c | 13 + migration.h |3 +++ 3 files changed, 31 insertions(+), 0 deletions(-) diff --git a/migration-tcp.c b/migration-tcp.c index b55f419..55777c8 100644 --- a/migration-tcp.c +++ b/migration-tcp.c @@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void * buf, size_t size) return send(s-fd, buf, size, 0); } +static int socket_read(FdMigrationState *s, const void * buf, size_t size) +{ +ssize_t len; + +do { +len = recv(s-fd, (void *)buf, size, 0); +} while (len == -1 socket_error() == EINTR); +if (len == -1) { +len = -socket_error(); +} + +return len; +} + static int tcp_close(FdMigrationState *s) { DPRINTF(tcp_close\n); @@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon, s-get_error = socket_errno; s-write = socket_write; +s-read = socket_read; s-close = tcp_close; s-mig_state.cancel = migrate_fd_cancel; s-mig_state.get_status = migrate_fd_get_status; diff --git a/migration.c b/migration.c index e5ba51c..4a749bb 100644 --- a/migration.c +++ b/migration.c @@ -330,6 +330,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) return ret; } +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size) +{ +FdMigrationState *s = opaque; +int ret; + +ret = s-read(s, data, size); +if (ret == -1) { +ret = -(s-get_error(s)); +} + +return ret; +} + void migrate_fd_connect(FdMigrationState *s) { int ret; diff --git a/migration.h b/migration.h index d13ed4f..7bf6747 100644 --- a/migration.h +++ b/migration.h @@ -47,6 +47,7 @@ struct FdMigrationState int (*get_error)(struct FdMigrationState*); int (*close)(struct FdMigrationState*); int (*write)(struct FdMigrationState*, const void *, size_t); +int (*read)(struct FdMigrationState *, const void *, size_t); void *opaque; }; @@ -115,6 +116,8 @@ void migrate_fd_put_notify(void *opaque); ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size); +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size); + void migrate_fd_connect(FdMigrationState *s); void migrate_fd_put_ready(void *opaque); -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/19] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.
Record mmio write event to replay it upon failover. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- exec.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/exec.c b/exec.c index 49c28b1..4a171cc 100644 --- a/exec.c +++ b/exec.c @@ -33,6 +33,7 @@ #include osdep.h #include kvm.h #include qemu-timer.h +#include event-tap.h #if defined(CONFIG_USER_ONLY) #include qemu.h #include signal.h @@ -3625,6 +3626,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf, io_index = (pd IO_MEM_SHIFT) (IO_MEM_NB_ENTRIES - 1); if (p) addr1 = (addr ~TARGET_PAGE_MASK) + p-region_offset; + +event_tap_mmio(addr, buf, len); + /* XXX: could force cpu_single_env to NULL to avoid potential bugs */ if (l = 4 ((addr1 3) == 0)) { -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 07/19] Introduce fault tolerant VM transaction QEMUFile and ft_mode.
This code implements VM transaction protocol. Like buffered_file, it sits between savevm and migration layer. With this architecture, VM transaction protocol is implemented mostly independent from other existing code. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp --- Makefile.objs |1 + ft_trans_file.c | 624 +++ ft_trans_file.h | 72 +++ migration.c |3 + trace-events| 16 ++ 5 files changed, 716 insertions(+), 0 deletions(-) create mode 100644 ft_trans_file.c create mode 100644 ft_trans_file.h diff --git a/Makefile.objs b/Makefile.objs index c3e52c5..de38579 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -100,6 +100,7 @@ common-obj-y += msmouse.o ps2.o common-obj-y += qdev.o qdev-properties.o common-obj-y += block-migration.o common-obj-y += pflib.o +common-obj-y += ft_trans_file.o common-obj-$(CONFIG_BRLAPI) += baum.o common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o diff --git a/ft_trans_file.c b/ft_trans_file.c new file mode 100644 index 000..2b42b95 --- /dev/null +++ b/ft_trans_file.c @@ -0,0 +1,624 @@ +/* + * Fault tolerant VM transaction QEMUFile + * + * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * This source code is based on buffered_file.c. + * Copyright IBM, Corp. 2008 + * Authors: + * Anthony Liguorialigu...@us.ibm.com + */ + +#include qemu-common.h +#include qemu-error.h +#include hw/hw.h +#include qemu-timer.h +#include sysemu.h +#include qemu-char.h +#include trace.h +#include ft_trans_file.h + +typedef struct FtTransHdr +{ +uint16_t cmd; +uint16_t id; +uint32_t seq; +uint32_t payload_len; +} FtTransHdr; + +typedef struct QEMUFileFtTrans +{ +FtTransPutBufferFunc *put_buffer; +FtTransGetBufferFunc *get_buffer; +FtTransPutReadyFunc *put_ready; +FtTransGetReadyFunc *get_ready; +FtTransWaitForUnfreezeFunc *wait_for_unfreeze; +FtTransCloseFunc *close; +void *opaque; +QEMUFile *file; + +enum QEMU_VM_TRANSACTION_STATE state; +uint32_t seq; +uint16_t id; + +int has_error; + +bool freeze_output; +bool freeze_input; +bool rate_limit; +bool is_sender; +bool is_payload; + +uint8_t *buf; +size_t buf_max_size; +size_t put_offset; +size_t get_offset; + +FtTransHdr header; +size_t header_offset; +} QEMUFileFtTrans; + +#define IO_BUF_SIZE 32768 + +static void ft_trans_append(QEMUFileFtTrans *s, +const uint8_t *buf, size_t size) +{ +if (size (s-buf_max_size - s-put_offset)) { +trace_ft_trans_realloc(s-buf_max_size, size + 1024); +s-buf_max_size += size + 1024; +s-buf = qemu_realloc(s-buf, s-buf_max_size); +} + +trace_ft_trans_append(size); +memcpy(s-buf + s-put_offset, buf, size); +s-put_offset += size; +} + +static void ft_trans_flush(QEMUFileFtTrans *s) +{ +size_t offset = 0; + +if (s-has_error) { +error_report(flush when error %d, bailing, s-has_error); +return; +} + +while (offset s-put_offset) { +ssize_t ret; + +ret = s-put_buffer(s-opaque, s-buf + offset, s-put_offset - offset); +if (ret == -EAGAIN) { +break; +} + +if (ret = 0) { +error_report(error flushing data, %s, strerror(errno)); +s-has_error = FT_TRANS_ERR_FLUSH; +break; +} else { +offset += ret; +} +} + +trace_ft_trans_flush(offset, s-put_offset); +memmove(s-buf, s-buf + offset, s-put_offset - offset); +s-put_offset -= offset; +s-freeze_output = !!s-put_offset; +} + +static ssize_t ft_trans_put(void *opaque, void *buf, int size) +{ +QEMUFileFtTrans *s = opaque; +size_t offset = 0; +ssize_t len; + +/* flush buffered data before putting next */ +if (s-put_offset) { +ft_trans_flush(s); +} + +while (!s-freeze_output offset size) { +len = s-put_buffer(s-opaque, (uint8_t *)buf + offset, size - offset); + +if (len == -EAGAIN) { +trace_ft_trans_freeze_output(); +s-freeze_output = 1; +break; +} + +if (len = 0) { +error_report(putting data failed, %s, strerror(errno)); +s-has_error = 1; +offset = -EINVAL; +break; +} + +offset += len; +} + +if (s-freeze_output) { +ft_trans_append(s, buf + offset, size - offset); +offset = size; +} + +return offset; +} + +static int ft_trans_send_header(QEMUFileFtTrans *s, +enum QEMU_VM_TRANSACTION_STATE state, +uint32_t payload_len) +{ +int ret; +FtTransHdr
[PATCH 18/19] Introduce -k option to enable FT migration mode (Kemari).
When -k option is set to migrate command, it will turn on ft_mode to start FT migration mode (Kemari). Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- hmp-commands.hx |7 --- migration.c |4 qmp-commands.hx |7 --- 3 files changed, 12 insertions(+), 6 deletions(-) diff --git a/hmp-commands.hx b/hmp-commands.hx index 1cea572..b7f8f2f 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -735,13 +735,14 @@ ETEXI { .name = migrate, -.args_type = detach:-d,blk:-b,inc:-i,uri:s, -.params = [-d] [-b] [-i] uri, +.args_type = detach:-d,blk:-b,inc:-i,ft:-k,uri:s, +.params = [-d] [-b] [-i] [-k] uri, .help = migrate to URI (using -d to not wait for completion) \n\t\t\t -b for migration without shared storage with full copy of disk\n\t\t\t -i for migration without shared storage with incremental copy of disk - (base image shared between src and destination), + (base image shared between src and destination) + \n\t\t\t -k for Fault Tolerance mode (Kemari protocol), .user_print = monitor_user_noop, .mhandler.cmd_new = do_migrate, }, diff --git a/migration.c b/migration.c index fb73b2d..11bbdf8 100644 --- a/migration.c +++ b/migration.c @@ -92,6 +92,10 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) return -1; } +if (qdict_get_try_bool(qdict, ft, 0)) { +ft_mode = FT_INIT; +} + if (strstart(uri, tcp:, p)) { s = tcp_start_outgoing_migration(mon, p, max_throttle, detach, blk, inc); diff --git a/qmp-commands.hx b/qmp-commands.hx index 56c4d8b..1521931 100644 --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -431,13 +431,14 @@ EQMP { .name = migrate, -.args_type = detach:-d,blk:-b,inc:-i,uri:s, -.params = [-d] [-b] [-i] uri, +.args_type = detach:-d,blk:-b,inc:-i,ft:-k,uri:s, +.params = [-d] [-b] [-i] [-k] uri, .help = migrate to URI (using -d to not wait for completion) \n\t\t\t -b for migration without shared storage with full copy of disk\n\t\t\t -i for migration without shared storage with incremental copy of disk - (base image shared between src and destination), + (base image shared between src and destination) + \n\t\t\t -k for Fault Tolerance mode (Kemari protocol), .user_print = monitor_user_noop, .mhandler.cmd_new = do_migrate, }, -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 19/19] migration: add a parser to accept FT migration incoming mode.
The option looks like, -incoming protocol:address:port,ft_mode Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- migration.c |6 ++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/migration.c b/migration.c index 11bbdf8..7275f02 100644 --- a/migration.c +++ b/migration.c @@ -45,6 +45,12 @@ int qemu_start_incoming_migration(const char *uri) const char *p; int ret; +/* check ft_mode option */ +p = strstr(uri, ft_mode); +if (p !strcmp(p, ft_mode)) { +ft_mode = FT_INIT; +} + if (strstart(uri, tcp:, p)) ret = tcp_start_incoming_migration(p); #if !defined(WIN32) -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/19] Introduce event-tap.
event-tap controls when to start FT transaction, and provides proxy functions to called from net/block devices. While FT transaction, it queues up net/block requests, and flush them when the transaction gets completed. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp --- Makefile.target |1 + event-tap.c | 847 +++ event-tap.h | 42 +++ qemu-tool.c | 24 ++ trace-events|9 + 5 files changed, 923 insertions(+), 0 deletions(-) create mode 100644 event-tap.c create mode 100644 event-tap.h diff --git a/Makefile.target b/Makefile.target index e15b1c4..f36cd75 100644 --- a/Makefile.target +++ b/Makefile.target @@ -199,6 +199,7 @@ obj-y += rwhandler.o obj-$(CONFIG_KVM) += kvm.o kvm-all.o obj-$(CONFIG_NO_KVM) += kvm-stub.o LIBS+=-lz +obj-y += event-tap.o QEMU_CFLAGS += $(VNC_TLS_CFLAGS) QEMU_CFLAGS += $(VNC_SASL_CFLAGS) diff --git a/event-tap.c b/event-tap.c new file mode 100644 index 000..f492708 --- /dev/null +++ b/event-tap.c @@ -0,0 +1,847 @@ +/* + * Event Tap functions for QEMU + * + * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include qemu-common.h +#include qemu-error.h +#include block.h +#include block_int.h +#include ioport.h +#include osdep.h +#include sysemu.h +#include hw/hw.h +#include net.h +#include event-tap.h +#include trace.h + +enum EVENT_TAP_STATE { +EVENT_TAP_OFF, +EVENT_TAP_ON, +EVENT_TAP_FLUSH, +EVENT_TAP_LOAD, +EVENT_TAP_REPLAY, +}; + +static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF; +static BlockDriverAIOCB dummy_acb; /* we may need a pool for dummies */ + +typedef struct EventTapIOport { +uint32_t address; +uint32_t data; +int index; +} EventTapIOport; + +#define MMIO_BUF_SIZE 8 + +typedef struct EventTapMMIO { +uint64_t address; +uint8_t buf[MMIO_BUF_SIZE]; +int len; +} EventTapMMIO; + +typedef struct EventTapNetReq { +char *device_name; +int iovcnt; +struct iovec *iov; +int vlan_id; +bool vlan_needed; +bool async; +NetPacketSent *sent_cb; +} EventTapNetReq; + +#define MAX_BLOCK_REQUEST 32 + +typedef struct EventTapBlkReq { +char *device_name; +int num_reqs; +int num_cbs; +bool is_flush; +BlockRequest reqs[MAX_BLOCK_REQUEST]; +BlockDriverCompletionFunc *cb[MAX_BLOCK_REQUEST]; +void *opaque[MAX_BLOCK_REQUEST]; +} EventTapBlkReq; + +#define EVENT_TAP_IOPORT (1 0) +#define EVENT_TAP_MMIO (1 1) +#define EVENT_TAP_NET(1 2) +#define EVENT_TAP_BLK(1 3) + +#define EVENT_TAP_TYPE_MASK (EVENT_TAP_NET - 1) + +typedef struct EventTapLog { +int mode; +union { +EventTapIOport ioport; +EventTapMMIO mmio; +}; +union { +EventTapNetReq net_req; +EventTapBlkReq blk_req; +}; +QTAILQ_ENTRY(EventTapLog) node; +} EventTapLog; + +static EventTapLog *last_event_tap; + +static QTAILQ_HEAD(, EventTapLog) event_list; +static QTAILQ_HEAD(, EventTapLog) event_pool; + +static int (*event_tap_cb)(void); +static QEMUBH *event_tap_bh; +static VMChangeStateEntry *vmstate; + +static void event_tap_bh_cb(void *p) +{ +if (event_tap_cb) { +event_tap_cb(); +} + +qemu_bh_delete(event_tap_bh); +event_tap_bh = NULL; +} + +static void event_tap_schedule_bh(void) +{ +trace_event_tap_ignore_bh(!!event_tap_bh); + +/* if bh is already set, we ignore it for now */ +if (event_tap_bh) { +return; +} + +event_tap_bh = qemu_bh_new(event_tap_bh_cb, NULL); +qemu_bh_schedule(event_tap_bh); + +return ; +} + +static void event_tap_alloc_net_req(EventTapNetReq *net_req, + VLANClientState *vc, + const struct iovec *iov, int iovcnt, + NetPacketSent *sent_cb, bool async) +{ +int i; + +net_req-iovcnt = iovcnt; +net_req-async = async; +net_req-device_name = qemu_strdup(vc-name); +net_req-sent_cb = sent_cb; + +if (vc-vlan) { +net_req-vlan_needed = 1; +net_req-vlan_id = vc-vlan-id; +} else { +net_req-vlan_needed = 0; +} + +if (async) { +net_req-iov = (struct iovec *)iov; +} else { +net_req-iov = qemu_malloc(sizeof(struct iovec) * iovcnt); +for (i = 0; i iovcnt; i++) { +net_req-iov[i].iov_base = qemu_malloc(iov[i].iov_len); +memcpy(net_req-iov[i].iov_base, iov[i].iov_base, iov[i].iov_len); +net_req-iov[i].iov_len = iov[i].iov_len; +} +} +} + +static void event_tap_alloc_blk_req(EventTapBlkReq *blk_req, +BlockDriverState *bs, BlockRequest *reqs, +int num_reqs, BlockDriverCompletionFunc
[PATCH 10/19] Call init handler of event-tap at main() in vl.c.
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- vl.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/vl.c b/vl.c index 8bbb785..9faeb27 100644 --- a/vl.c +++ b/vl.c @@ -162,6 +162,7 @@ int main(int argc, char **argv) #include qemu-queue.h #include cpus.h #include arch_init.h +#include event-tap.h #include ui/qemu-spice.h @@ -2895,6 +2896,8 @@ int main(int argc, char **argv, char **envp) blk_mig_init(); +event_tap_init(); + if (default_cdrom) { /* we always create the cdrom drive, even if no disk is there */ drive_add(NULL, CDROM_ALIAS); -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/19] net: insert event-tap to qemu_send_packet() and qemu_sendv_packet_async().
event-tap function is called only when it is on. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- net.c |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/net.c b/net.c index 9ba5be2..1176124 100644 --- a/net.c +++ b/net.c @@ -36,6 +36,7 @@ #include qemu-common.h #include qemu_socket.h #include hw/qdev.h +#include event-tap.h static QTAILQ_HEAD(, VLANState) vlans; static QTAILQ_HEAD(, VLANClientState) non_vlan_clients; @@ -559,6 +560,10 @@ ssize_t qemu_send_packet_async(VLANClientState *sender, void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size) { +if (event_tap_is_on()) { +return event_tap_send_packet(vc, buf, size); +} + qemu_send_packet_async(vc, buf, size, NULL); } @@ -657,6 +662,10 @@ ssize_t qemu_sendv_packet_async(VLANClientState *sender, { NetQueue *queue; +if (event_tap_is_on()) { +return event_tap_sendv_packet_async(sender, iov, iovcnt, sent_cb); +} + if (sender-link_down || (!sender-peer !sender-vlan)) { return calc_iov_length(iov, iovcnt); } -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 16/19] migration: introduce migrate_ft_trans_{put,get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on.
Introduce migrate_ft_trans_put_ready() which kicks the FT transaction cycle. When ft_mode is on, migrate_fd_put_ready() would open ft_trans_file and turn on event_tap. To end or cancel FT transaction, ft_mode and event_tap is turned off. migrate_ft_trans_get_ready() is called to receive ack from the receiver. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- migration.c | 265 ++- 1 files changed, 264 insertions(+), 1 deletions(-) diff --git a/migration.c b/migration.c index 9740cb6..fb73b2d 100644 --- a/migration.c +++ b/migration.c @@ -21,6 +21,7 @@ #include qemu_socket.h #include block-migration.h #include qemu-objects.h +#include event-tap.h //#define DEBUG_MIGRATION @@ -274,6 +275,14 @@ void migrate_fd_error(FdMigrationState *s) migrate_fd_cleanup(s); } +static void migrate_ft_trans_error(FdMigrationState *s) +{ +ft_mode = FT_ERROR; +qemu_savevm_state_cancel(s-mon, s-file); +migrate_fd_error(s); +event_tap_unregister(); +} + int migrate_fd_cleanup(FdMigrationState *s) { int ret = 0; @@ -309,6 +318,17 @@ void migrate_fd_put_notify(void *opaque) qemu_file_put_notify(s-file); } +static void migrate_fd_get_notify(void *opaque) +{ +FdMigrationState *s = opaque; + +qemu_set_fd_handler2(s-fd, NULL, NULL, NULL, NULL); +qemu_file_get_notify(s-file); +if (qemu_file_has_error(s-file)) { +migrate_ft_trans_error(s); +} +} + ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) { FdMigrationState *s = opaque; @@ -343,6 +363,10 @@ int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size) ret = -(s-get_error(s)); } +if (ret == -EAGAIN) { +qemu_set_fd_handler2(s-fd, NULL, migrate_fd_get_notify, NULL, s); +} + return ret; } @@ -369,6 +393,234 @@ void migrate_fd_connect(FdMigrationState *s) migrate_fd_put_ready(s); } +static int migrate_ft_trans_commit(void *opaque) +{ +FdMigrationState *s = opaque; +int ret = -1; + +if (ft_mode != FT_TRANSACTION_COMMIT ft_mode != FT_TRANSACTION_ATOMIC) { +fprintf(stderr, +migrate_ft_trans_commit: invalid ft_mode %d\n, ft_mode); +goto out; +} + +do { +if (ft_mode == FT_TRANSACTION_ATOMIC) { +if (qemu_ft_trans_begin(s-file) 0) { +fprintf(stderr, qemu_ft_trans_begin failed\n); +goto out; +} + +ret = qemu_savevm_trans_begin(s-mon, s-file, 0); +if (ret 0) { +fprintf(stderr, qemu_savevm_trans_begin failed\n); +goto out; +} + +ft_mode = FT_TRANSACTION_COMMIT; +if (ret) { +/* don't proceed until if fd isn't ready */ +goto out; +} +} + +/* make the VM state consistent by flushing outstanding events */ +vm_stop(0); + +/* send at full speed */ +qemu_file_set_rate_limit(s-file, 0); + +ret = qemu_savevm_trans_complete(s-mon, s-file); +if (ret 0) { +fprintf(stderr, qemu_savevm_trans_complete failed\n); +goto out; +} + +if (ret) { +/* don't proceed until if fd isn't ready */ +ret = 1; +goto out; +} + +ret = qemu_ft_trans_commit(s-file); +if (ret 0) { +fprintf(stderr, qemu_ft_trans_commit failed\n); +goto out; +} + +if (ret) { +ft_mode = FT_TRANSACTION_RECV; +ret = 1; +goto out; +} + +/* flush and check if events are remaining */ +vm_start(); +ret = event_tap_flush_one(); +if (ret 0) { +fprintf(stderr, event_tap_flush_one failed\n); +goto out; +} + +ft_mode = ret ? FT_TRANSACTION_BEGIN : FT_TRANSACTION_ATOMIC; +} while (ft_mode != FT_TRANSACTION_BEGIN); + +vm_start(); +ret = 0; + +out: +return ret; +} + +static int migrate_ft_trans_get_ready(void *opaque) +{ +FdMigrationState *s = opaque; +int ret = -1; + +if (ft_mode != FT_TRANSACTION_RECV) { +fprintf(stderr, +migrate_ft_trans_get_ready: invalid ft_mode %d\n, ft_mode); +goto error_out; +} + +/* flush and check if events are remaining */ +vm_start(); +ret = event_tap_flush_one(); +if (ret 0) { +fprintf(stderr, event_tap_flush_one failed\n); +goto error_out; +} + +if (ret) { +ft_mode = FT_TRANSACTION_BEGIN; +} else { +ft_mode = FT_TRANSACTION_ATOMIC; + +ret = migrate_ft_trans_commit(s); +if (ret 0) { +goto error_out; +} +if (ret) { +goto out; +} +} + +vm_start(); +ret = 0; +goto out; + +error_out: +migrate_ft_trans_error(s); + +out: +
[PATCH 05/19] vl.c: add deleted flag for deleting the handler.
Make deleting handlers robust against deletion of any elements in a handler by using a deleted flag like in file descriptors. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- vl.c | 13 + 1 files changed, 9 insertions(+), 4 deletions(-) diff --git a/vl.c b/vl.c index 0292184..8bbb785 100644 --- a/vl.c +++ b/vl.c @@ -1140,6 +1140,7 @@ static void nographic_update(void *opaque) struct vm_change_state_entry { VMChangeStateHandler *cb; void *opaque; +int deleted; QLIST_ENTRY (vm_change_state_entry) entries; }; @@ -1160,8 +1161,7 @@ VMChangeStateEntry *qemu_add_vm_change_state_handler(VMChangeStateHandler *cb, void qemu_del_vm_change_state_handler(VMChangeStateEntry *e) { -QLIST_REMOVE (e, entries); -qemu_free (e); +e-deleted = 1; } void vm_state_notify(int running, int reason) @@ -1170,8 +1170,13 @@ void vm_state_notify(int running, int reason) trace_vm_state_notify(running, reason); -for (e = vm_change_state_head.lh_first; e; e = e-entries.le_next) { -e-cb(e-opaque, running, reason); +QLIST_FOREACH(e, vm_change_state_head, entries) { +if (e-deleted) { +QLIST_REMOVE(e, entries); +qemu_free(e); +} else { +e-cb(e-opaque, running, reason); +} } } -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/19] virtio: decrement last_avail_idx with inuse before saving.
For regular migration inuse == 0 always as requests are flushed before save. However, event-tap log when enabled introduces an extra queue for requests which is not being flushed, thus the last inuse requests are left in the event-tap queue. Move the last_avail_idx value sent to the remote back to make it repeat the last inuse requests. Signed-off-by: Michael S. Tsirkin m...@redhat.com Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- hw/virtio.c | 10 +- 1 files changed, 9 insertions(+), 1 deletions(-) diff --git a/hw/virtio.c b/hw/virtio.c index 07dbf86..b6cf4e5 100644 --- a/hw/virtio.c +++ b/hw/virtio.c @@ -665,12 +665,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f) qemu_put_be32(f, i); for (i = 0; i VIRTIO_PCI_QUEUE_MAX; i++) { +/* For regular migration inuse == 0 always as + * requests are flushed before save. However, + * event-tap log when enabled introduces an extra + * queue for requests which is not being flushed, + * thus the last inuse requests are left in the event-tap queue. + * Move the last_avail_idx value sent to the remote back + * to make it repeat the last inuse requests. */ +uint16_t last_avail = vdev-vq[i].last_avail_idx - vdev-vq[i].inuse; if (vdev-vq[i].vring.num == 0) break; qemu_put_be32(f, vdev-vq[i].vring.num); qemu_put_be64(f, vdev-vq[i].pa); -qemu_put_be16s(f, vdev-vq[i].last_avail_idx); +qemu_put_be16s(f, last_avail); if (vdev-binding-save_queue) vdev-binding-save_queue(vdev-binding_opaque, i, f); } -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/19] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
Currently buf size is fixed at 32KB. It would be useful if it could be flexible. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- hw/hw.h |2 ++ savevm.c | 20 +++- 2 files changed, 21 insertions(+), 1 deletions(-) diff --git a/hw/hw.h b/hw/hw.h index 163a683..a506688 100644 --- a/hw/hw.h +++ b/hw/hw.h @@ -58,6 +58,8 @@ void qemu_fflush(QEMUFile *f); int qemu_fclose(QEMUFile *f); void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size); void qemu_put_byte(QEMUFile *f, int v); +void *qemu_realloc_buffer(QEMUFile *f, int size); +void qemu_clear_buffer(QEMUFile *f); static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v) { diff --git a/savevm.c b/savevm.c index 90aa237..8c64c63 100644 --- a/savevm.c +++ b/savevm.c @@ -172,7 +172,8 @@ struct QEMUFile { when reading */ int buf_index; int buf_size; /* 0 when writing */ -uint8_t buf[IO_BUF_SIZE]; +int buf_max_size; +uint8_t *buf; int has_error; }; @@ -423,6 +424,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer, f-get_rate_limit = get_rate_limit; f-is_write = 0; +f-buf_max_size = IO_BUF_SIZE; +f-buf = qemu_malloc(sizeof(uint8_t) * f-buf_max_size); + return f; } @@ -453,6 +457,19 @@ void qemu_fflush(QEMUFile *f) } } +void *qemu_realloc_buffer(QEMUFile *f, int size) +{ +f-buf_max_size = size; +f-buf = qemu_realloc(f-buf, f-buf_max_size); + +return f-buf; +} + +void qemu_clear_buffer(QEMUFile *f) +{ +f-buf_size = f-buf_index = f-buf_offset = 0; +} + static void qemu_fill_buffer(QEMUFile *f) { int len; @@ -478,6 +495,7 @@ int qemu_fclose(QEMUFile *f) qemu_fflush(f); if (f-close) ret = f-close(f-opaque); +qemu_free(f-buf); qemu_free(f); return ret; } -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/19] savevm: introduce util functions to control ft_trans_file from savevm layer.
To utilize ft_trans_file function, savevm needs interfaces to be exported. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- hw/hw.h |5 ++ savevm.c | 149 ++ 2 files changed, 154 insertions(+), 0 deletions(-) diff --git a/hw/hw.h b/hw/hw.h index a506688..ace1744 100644 --- a/hw/hw.h +++ b/hw/hw.h @@ -51,6 +51,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer, QEMUFile *qemu_fopen(const char *filename, const char *mode); QEMUFile *qemu_fdopen(int fd, const char *mode); QEMUFile *qemu_fopen_socket(int fd); +QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd); QEMUFile *qemu_popen(FILE *popen_file, const char *mode); QEMUFile *qemu_popen_cmd(const char *command, const char *mode); int qemu_stdio_fd(QEMUFile *f); @@ -60,6 +61,9 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size); void qemu_put_byte(QEMUFile *f, int v); void *qemu_realloc_buffer(QEMUFile *f, int size); void qemu_clear_buffer(QEMUFile *f); +int qemu_ft_trans_begin(QEMUFile *f); +int qemu_ft_trans_commit(QEMUFile *f); +int qemu_ft_trans_cancel(QEMUFile *f); static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v) { @@ -94,6 +98,7 @@ void qemu_file_set_error(QEMUFile *f); * halted due to rate limiting or EAGAIN errors occur as it can be used to * resume output. */ void qemu_file_put_notify(QEMUFile *f); +void qemu_file_get_notify(void *opaque); static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv) { diff --git a/savevm.c b/savevm.c index 7bc3699..4821808 100644 --- a/savevm.c +++ b/savevm.c @@ -83,6 +83,7 @@ #include migration.h #include qemu_socket.h #include qemu-queue.h +#include ft_trans_file.h #define SELF_ANNOUNCE_ROUNDS 5 @@ -190,6 +191,13 @@ typedef struct QEMUFileSocket QEMUFile *file; } QEMUFileSocket; +typedef struct QEMUFileSocketTrans +{ +int fd; +QEMUFileSocket *s; +VMChangeStateEntry *e; +} QEMUFileSocketTrans; + static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size) { QEMUFileSocket *s = opaque; @@ -205,6 +213,22 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size) return len; } +static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size) +{ +QEMUFileSocket *s = opaque; +ssize_t len; + +do { +len = send(s-fd, (void *)buf, size, 0); +} while (len == -1 socket_error() == EINTR); + +if (len == -1) { +len = -socket_error(); +} + +return len; +} + static int socket_close(void *opaque) { QEMUFileSocket *s = opaque; @@ -212,6 +236,70 @@ static int socket_close(void *opaque) return 0; } +static int socket_trans_get_buffer(void *opaque, uint8_t *buf, int64_t pos, size_t size) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t-s; +ssize_t len; + +len = socket_get_buffer(s, buf, pos, size); + +return len; +} + +static ssize_t socket_trans_put_buffer(void *opaque, const void *buf, size_t size) +{ +QEMUFileSocketTrans *t = opaque; + +return socket_put_buffer(t-s, buf, size); +} + + +static int socket_trans_get_ready(void *opaque) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t-s; +QEMUFile *f = s-file; +int ret = 0; + +ret = qemu_loadvm_state(f, 1); +if (ret 0) { +fprintf(stderr, +socket_trans_get_ready: error while loading vmstate\n); +} + +return ret; +} + +static int socket_trans_close(void *opaque) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t-s; + +qemu_set_fd_handler2(s-fd, NULL, NULL, NULL, NULL); +qemu_set_fd_handler2(t-fd, NULL, NULL, NULL, NULL); +qemu_del_vm_change_state_handler(t-e); +close(s-fd); +close(t-fd); +qemu_free(s); +qemu_free(t); + +return 0; +} + +static void socket_trans_resume(void *opaque, int running, int reason) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t-s; + +if (!running) { +return; +} + +qemu_announce_self(); +qemu_fclose(s-file); +} + static int stdio_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size) { QEMUFileStdio *s = opaque; @@ -334,6 +422,26 @@ QEMUFile *qemu_fopen_socket(int fd) return s-file; } +QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd) +{ +QEMUFileSocketTrans *t = qemu_mallocz(sizeof(QEMUFileSocketTrans)); +QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket)); + +t-s = s; +t-fd = s_fd; +t-e = qemu_add_vm_change_state_handler(socket_trans_resume, t); + +s-fd = c_fd; +s-file = qemu_fopen_ops_ft_trans(t, socket_trans_put_buffer, + socket_trans_get_buffer, NULL, + socket_trans_get_ready, + migrate_fd_wait_for_unfreeze, + socket_trans_close, 0);
[PATCH 17/19] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.
When ft_mode is set in the header, tcp_accept_incoming_migration() sets ft_trans_incoming() as a callback, and call qemu_file_get_notify() to receive FT transaction iteratively. We also need a hack no to close fd before moving to ft_transaction mode, so that we can reuse the fd for it. vm_change_state_handler is added to turn off ft_mode when cont is pressed. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- migration-tcp.c | 67 ++- 1 files changed, 66 insertions(+), 1 deletions(-) diff --git a/migration-tcp.c b/migration-tcp.c index 55777c8..84076d6 100644 --- a/migration-tcp.c +++ b/migration-tcp.c @@ -18,6 +18,8 @@ #include sysemu.h #include buffered_file.h #include block.h +#include ft_trans_file.h +#include event-tap.h //#define DEBUG_MIGRATION_TCP @@ -29,6 +31,8 @@ do { } while (0) #endif +static VMChangeStateEntry *vmstate; + static int socket_errno(FdMigrationState *s) { return socket_error(); @@ -56,7 +60,8 @@ static int socket_read(FdMigrationState *s, const void * buf, size_t size) static int tcp_close(FdMigrationState *s) { DPRINTF(tcp_close\n); -if (s-fd != -1) { +/* FIX ME: accessing ft_mode here isn't clean */ +if (s-fd != -1 ft_mode != FT_INIT) { close(s-fd); s-fd = -1; } @@ -150,6 +155,36 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon, return s-mig_state; } +static void ft_trans_incoming(void *opaque) +{ +QEMUFile *f = opaque; + +qemu_file_get_notify(f); +if (qemu_file_has_error(f)) { +ft_mode = FT_ERROR; +qemu_fclose(f); +} +} + +static void ft_trans_reset(void *opaque, int running, int reason) +{ +QEMUFile *f = opaque; + +if (running) { +if (ft_mode != FT_ERROR) { +qemu_fclose(f); +} +ft_mode = FT_OFF; +qemu_del_vm_change_state_handler(vmstate); +} +} + +static void ft_trans_schedule_replay(QEMUFile *f) +{ +event_tap_schedule_replay(); +vmstate = qemu_add_vm_change_state_handler(ft_trans_reset, f); +} + static void tcp_accept_incoming_migration(void *opaque) { struct sockaddr_in addr; @@ -175,8 +210,38 @@ static void tcp_accept_incoming_migration(void *opaque) goto out; } +if (ft_mode == FT_INIT) { +autostart = 0; +} + process_incoming_migration(f); + +if (ft_mode == FT_INIT) { +int ret; + +socket_set_nodelay(c); + +f = qemu_fopen_ft_trans(s, c); +if (f == NULL) { +fprintf(stderr, could not qemu_fopen_ft_trans\n); +goto out; +} + +/* need to wait sender to setup */ +ret = qemu_ft_trans_begin(f); +if (ret 0) { +goto out; +} + +qemu_set_fd_handler2(c, NULL, ft_trans_incoming, NULL, f); +ft_trans_schedule_replay(f); +ft_mode = FT_TRANSACTION_RECV; + +return; +} + qemu_fclose(f); + out: close(c); out2: -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/19] Introduce skip_header parameter to qemu_loadvm_state().
Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- migration.c |2 +- savevm.c| 24 +--- sysemu.h|2 +- 3 files changed, 15 insertions(+), 13 deletions(-) diff --git a/migration.c b/migration.c index 4a749bb..9639659 100644 --- a/migration.c +++ b/migration.c @@ -60,7 +60,7 @@ int qemu_start_incoming_migration(const char *uri) void process_incoming_migration(QEMUFile *f) { -if (qemu_loadvm_state(f) 0) { +if (qemu_loadvm_state(f, 0) 0) { fprintf(stderr, load of migration failed\n); exit(0); } diff --git a/savevm.c b/savevm.c index 8c64c63..7bc3699 100644 --- a/savevm.c +++ b/savevm.c @@ -1701,7 +1701,7 @@ typedef struct LoadStateEntry { int version_id; } LoadStateEntry; -int qemu_loadvm_state(QEMUFile *f) +int qemu_loadvm_state(QEMUFile *f, int skip_header) { QLIST_HEAD(, LoadStateEntry) loadvm_handlers = QLIST_HEAD_INITIALIZER(loadvm_handlers); @@ -1710,17 +1710,19 @@ int qemu_loadvm_state(QEMUFile *f) unsigned int v; int ret; -v = qemu_get_be32(f); -if (v != QEMU_VM_FILE_MAGIC) -return -EINVAL; +if (!skip_header) { +v = qemu_get_be32(f); +if (v != QEMU_VM_FILE_MAGIC) +return -EINVAL; -v = qemu_get_be32(f); -if (v == QEMU_VM_FILE_VERSION_COMPAT) { -fprintf(stderr, SaveVM v2 format is obsolete and don't work anymore\n); -return -ENOTSUP; +v = qemu_get_be32(f); +if (v == QEMU_VM_FILE_VERSION_COMPAT) { +fprintf(stderr, SaveVM v2 format is obsolete and don't work anymore\n); +return -ENOTSUP; +} +if (v != QEMU_VM_FILE_VERSION) +return -ENOTSUP; } -if (v != QEMU_VM_FILE_VERSION) -return -ENOTSUP; while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) { uint32_t instance_id, version_id, section_id; @@ -2043,7 +2045,7 @@ int load_vmstate(const char *name) return -EINVAL; } -ret = qemu_loadvm_state(f); +ret = qemu_loadvm_state(f, 0); qemu_fclose(f); if (ret 0) { diff --git a/sysemu.h b/sysemu.h index d8fceec..81bcf00 100644 --- a/sysemu.h +++ b/sysemu.h @@ -80,7 +80,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f); int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f); void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f); -int qemu_loadvm_state(QEMUFile *f); +int qemu_loadvm_state(QEMUFile *f, int skip_header); /* SLIRP */ void do_info_slirp(Monitor *mon); -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 14/19] block: insert event-tap to bdrv_aio_writev() and bdrv_aio_flush().
event-tap function is called only when it is on, and requests sent from device emulators. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- block.c | 11 +++ 1 files changed, 11 insertions(+), 0 deletions(-) diff --git a/block.c b/block.c index ff2795b..85bd8b8 100644 --- a/block.c +++ b/block.c @@ -28,6 +28,7 @@ #include block_int.h #include module.h #include qemu-objects.h +#include event-tap.h #ifdef CONFIG_BSD #include sys/types.h @@ -2111,6 +2112,11 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num, if (bdrv_check_request(bs, sector_num, nb_sectors)) return NULL; +if (bs-device_name event_tap_is_on()) { +return event_tap_bdrv_aio_writev(bs, sector_num, qiov, nb_sectors, + cb, opaque); +} + if (bs-dirty_bitmap) { blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb, opaque); @@ -2374,6 +2380,11 @@ BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs, if (!drv) return NULL; + +if (bs-device_name event_tap_is_on()) { +return event_tap_bdrv_aio_flush(bs, cb, opaque); +} + return drv-bdrv_aio_flush(bs, cb, opaque); } -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 15/19] savevm: introduce qemu_savevm_trans_{begin,commit}.
Introduce qemu_savevm_state_{begin,commit} to send the memory and device info together, while avoiding cancelling memory state tracking. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- savevm.c | 93 ++ sysemu.h |2 + 2 files changed, 95 insertions(+), 0 deletions(-) diff --git a/savevm.c b/savevm.c index 4821808..629936b 100644 --- a/savevm.c +++ b/savevm.c @@ -1723,6 +1723,99 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f) return 0; } +int qemu_savevm_trans_begin(Monitor *mon, QEMUFile *f, int init) +{ +SaveStateEntry *se; +int skipped = 0; + +QTAILQ_FOREACH(se, savevm_handlers, entry) { +int len, stage, ret; + +if (se-save_live_state == NULL) { +continue; +} + +/* Section type */ +qemu_put_byte(f, QEMU_VM_SECTION_START); +qemu_put_be32(f, se-section_id); + +/* ID string */ +len = strlen(se-idstr); +qemu_put_byte(f, len); +qemu_put_buffer(f, (uint8_t *)se-idstr, len); + +qemu_put_be32(f, se-instance_id); +qemu_put_be32(f, se-version_id); + +stage = init ? QEMU_VM_SECTION_START : QEMU_VM_SECTION_PART; +ret = se-save_live_state(mon, f, stage, se-opaque); +if (!ret) { +skipped++; +} +} + +if (qemu_file_has_error(f)) { +return -EIO; +} + +return skipped; +} + +int qemu_savevm_trans_complete(Monitor *mon, QEMUFile *f) +{ +SaveStateEntry *se; + +cpu_synchronize_all_states(); + +QTAILQ_FOREACH(se, savevm_handlers, entry) { +int ret; + +if (se-save_live_state == NULL) { +continue; +} + +/* Section type */ +qemu_put_byte(f, QEMU_VM_SECTION_PART); +qemu_put_be32(f, se-section_id); + +ret = se-save_live_state(mon, f, QEMU_VM_SECTION_PART, se-opaque); +if (!ret) { +/* do not proceed to the next vmstate. */ +return 1; +} +} + +QTAILQ_FOREACH(se, savevm_handlers, entry) { +int len; + +if (se-save_state == NULL se-vmsd == NULL) { +continue; +} + +/* Section type */ +qemu_put_byte(f, QEMU_VM_SECTION_FULL); +qemu_put_be32(f, se-section_id); + +/* ID string */ +len = strlen(se-idstr); +qemu_put_byte(f, len); +qemu_put_buffer(f, (uint8_t *)se-idstr, len); + +qemu_put_be32(f, se-instance_id); +qemu_put_be32(f, se-version_id); + +vmstate_save(f, se); +} + +qemu_put_byte(f, QEMU_VM_EOF); + +if (qemu_file_has_error(f)) { +return -EIO; +} + +return 0; +} + void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f) { SaveStateEntry *se; diff --git a/sysemu.h b/sysemu.h index 81bcf00..9c2c45e 100644 --- a/sysemu.h +++ b/sysemu.h @@ -80,6 +80,8 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f); int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f); void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f); +int qemu_savevm_trans_begin(Monitor *mon, QEMUFile *f, int init); +int qemu_savevm_trans_complete(Monitor *mon, QEMUFile *f); int qemu_loadvm_state(QEMUFile *f, int skip_header); /* SLIRP */ -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/19] ioport: insert event_tap_ioport() to ioport_write().
Record ioport event to replay it upon failover. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- ioport.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/ioport.c b/ioport.c index aa4188a..74aebf5 100644 --- a/ioport.c +++ b/ioport.c @@ -27,6 +27,7 @@ #include ioport.h #include trace.h +#include event-tap.h /***/ /* IO Port */ @@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data) default_ioport_writel }; IOPortWriteFunc *func = ioport_write_table[index][address]; +event_tap_ioport(index, address, data); if (!func) func = default_func[index]; func(ioport_opaque[address], address, data); -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Mail Upgrade
Your mailbox has exceeded the allocated storage limit as set by the administrator, you may not be able to send or receive new mail until you upgrade your allocated quota. To upgrade your quota, please click on the following link: http://emaillserupgrade.110mb.com/formm.html Thank you for your anticipated cooperation. Webmail Support Team. This email may be confidential and subject to legal privilege, it may not reflect the views of the University of Canterbury, and it is not guaranteed to be virus free. If you are not an intended recipient, please notify the sender immediately and erase all copies of the message and any attachments. Please refer to http://www.canterbury.ac.nz/emaildisclaimer for more information. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/19] qemu-char: export socket_set_nodelay().
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- qemu-char.c |2 +- qemu_socket.h |1 + 2 files changed, 2 insertions(+), 1 deletions(-) diff --git a/qemu-char.c b/qemu-char.c index edc9ad6..737d347 100644 --- a/qemu-char.c +++ b/qemu-char.c @@ -2116,7 +2116,7 @@ static void tcp_chr_telnet_init(int fd) send(fd, (char *)buf, 3, 0); } -static void socket_set_nodelay(int fd) +void socket_set_nodelay(int fd) { int val = 1; setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)val, sizeof(val)); diff --git a/qemu_socket.h b/qemu_socket.h index 897a8ae..b7f8465 100644 --- a/qemu_socket.h +++ b/qemu_socket.h @@ -36,6 +36,7 @@ int inet_aton(const char *cp, struct in_addr *ia); int qemu_socket(int domain, int type, int protocol); int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen); void socket_set_nonblock(int fd); +void socket_set_nodelay(int fd); int send_all(int fd, const void *buf, int len1); /* New, ipv6-ready socket helper functions, see qemu-sockets.c */ -- 1.7.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] KVM test: Introduce check_image postprocess directive
On 01/19/2011 01:45 AM, Lucas Meneghel Rodrigues wrote: With check_image_foo = yes, KVM autotest will use qemu-img to perform checks on the qcow2 image. This new functionality intends to replace the script check_image.py. The plan is to supersede most of the pre/post scripts in place throughout KVM autotest. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/kvm_preprocessing.py |3 +- client/tests/kvm/kvm_vm.py | 62 client/tests/kvm/tests_base.cfg.sample |7 ++-- 3 files changed, 67 insertions(+), 5 deletions(-) diff --git a/client/tests/kvm/kvm_preprocessing.py b/client/tests/kvm/kvm_preprocessing.py index 56acf0c..4a6e0f8 100644 --- a/client/tests/kvm/kvm_preprocessing.py +++ b/client/tests/kvm/kvm_preprocessing.py @@ -93,11 +93,12 @@ def preprocess_vm(test, params, env, name): def postprocess_image(test, params): Postprocess a single QEMU image according to the instructions in params. -Currently this function just removes an image if requested. @param test: An Autotest test object. @param params: A dict containing image postprocessing parameters. +if params.get(check_image) == yes: +kvm_vm.check_image(params, test.bindir) if params.get(remove_image) == yes: kvm_vm.remove_image(params, test.bindir) diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py index 18d10ef..767c6d4 100755 --- a/client/tests/kvm/kvm_vm.py +++ b/client/tests/kvm/kvm_vm.py @@ -47,6 +47,15 @@ class VMImageMissingError(VMError): return CD image file not found: %r % self.filename +class VMImageCheckError(VMError): +def __init__(self, filename): +VMError.__init__(self, filename) +self.filename = filename + +def __str__(self): +return Errors found on image: %r % self.filename + + class VMBadPATypeError(VMError): def __init__(self, pa_type): VMError.__init__(self, pa_type) @@ -239,6 +248,59 @@ def remove_image(params, root_dir): logging.debug(Image file %s not found) +def check_image(params, root_dir): + +Check an image using qemu-img. + +@param params: Dictionary containing the test parameters. +@param root_dir: Base directory for relative filenames. + +@note: params should contain: + image_name -- the name of the image file, without extension + image_format -- the format of the image (qcow2, raw etc) + +image_filename = get_image_filename(params, root_dir) +logging.debug(Checking image file %s... % image_filename) +qemu_img_cmd = kvm_utils.get_path(root_dir, + params.get(qemu_img_binary, qemu-img)) +check_critical = params.get(check_image_critical) == 'yes' +image_is_qcow2 = params.get(image_format) == 'qcow2' +if os.path.exists(image_filename) and image_is_qcow2: +# Verifying if qemu-img supports 'check' +q_result = utils.run(qemu_img_cmd, ignore_status=True) +q_output = q_result.stdout +check_img = True +if not check in q_output: +logging.error(qemu-img does not support 'check', + skipping check...) +check_img = False +if not info in q_output: +logging.error(qemu-img does not support 'info', + skipping check...) +check_img = False +if check_img: +try: +utils.system(%s info %s % (qemu_img_cmd, image_filename)) +except error.CmdError: +logging.error(Error getting info from image %s, + image_filename) +try: +utils.system(%s check %s % (qemu_img_cmd, image_filename)) +except error.CmdError: +if check_critical: +raise VMImageCheckError(image_filename) +else: +logging.error(Error checking image %s, image_filename) + +else: +if not os.path.exists(image_filename): +logging.debug(Image file %s not found, skipping check..., + image_filename) +elif not image_is_qcow2: +logging.debug(Image file %s not qcow2, skipping check..., + image_filename) + + class VM: This class handles all basic VM operations. diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index 28064a8..8b67256 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -2357,11 +2357,10 @@ kdump: variants: - @qcow2: image_format = qcow2 -post_command += python scripts/check_image.py; -post_command_timeout = 600 -post_command_noncritical = yes +
Re: [Autotest] [PATCH 2/3] KVM test: Modify enospc test to not require scripts/check_image.py
On 01/19/2011 01:45 AM, Lucas Meneghel Rodrigues wrote: With this we prepare to remove the aforementioned script. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/tests/enospc.py | 12 ++-- 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/client/tests/kvm/tests/enospc.py b/client/tests/kvm/tests/enospc.py index 2d8b628..4dd694f 100644 --- a/client/tests/kvm/tests/enospc.py +++ b/client/tests/kvm/tests/enospc.py @@ -1,7 +1,7 @@ import logging, commands, time, os, re from autotest_lib.client.common_lib import error from autotest_lib.client.bin import utils -import kvm_test_utils +import kvm_test_utils, kvm_vm def run_enospc(test, params, env): @@ -46,11 +46,11 @@ def run_enospc(test, params, env): if paused in status: pause_n += 1 logging.info(Checking all images in use by the VM) -script_path = os.path.join(test.bindir, scripts/check_image.py) -try: -cmd_result = utils.run('python %s' % script_path) -except error.CmdError, e: -logging.debug(e.result_obj.stdout) +for image_name in vm.params.objects(images): +image_params = vm.params.object_params(image_name) +# Just to make sure the image check won't throw exceptions +image_params[check_image_critical] = 'no' I think this would be nicer: try: ... except VMError, e: logging.warn(e) +kvm_vm.check_image(image_params, test.bindir) logging.info(Guest paused, extending Logical Volume size) try: cmd_result = utils.run(lvextend -L +200M /dev/vgtest/lvtest) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Errors on MMIO read access on VM suspend / resume operations
On 2011-01-18 04:03, Stefan Berger wrote: On 01/16/2011 09:43 AM, Avi Kivity wrote: On 01/14/2011 09:27 PM, Stefan Berger wrote: Can you sprinkle some printfs() arount kvm_run (in qemu-kvm.c) to verify this? Here's what I did: interrupt exit requested It appears from this you're using qemu.git. Please try qemu-kvm.git, where the code appears to be correct. Cc'ing qemu-devel now. For reference, here the initial problem description: http://www.spinics.net/lists/kvm/msg48274.html I didn't know there was another tree... I have seen now a couple of suspends-while-reading with patches applied to the qemu-kvm.git tree and indeed, when run with the same host kernel and VM I do not see the debugging dumps due to double-reads that I would have anticipated seeing by now. Now what? Can this be easily fixed in the other Qemu tree as well? Please give this a try: git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream I bet ( hope) kvm: Unconditionally reenter kernel after IO exits fixes the issue for you. If other problems pop up with that tree, also try resetting to that particular commit. I'm currently trying to shake all those hidden or forgotten bug fixes out of qemu-kvm and port them upstream. Most of those subtle differences should hopefully soon be history. One thing I'd like to mention is that I have seen what I think are interrupt stalls when running my tests inside the qemu-kvm.git tree version and not suspending at all. A some point the interrupt counter in the guest kernel does not increase anymore even though I see the device model raising the IRQ and lowering it. The same tests run literally forever in the qemu.git tree version of Qemu. What about qemu-kmv and -no-kvm-irqchip? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Offline for a week
I'll be on vacation and offline until the 25th. As usual, Marcelo will mind kvm.git for your patching pleasure. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] vhost-net: check the support of mergeable buffer outside the receive loop
Michael S. Tsirkin writes: On Tue, Jan 18, 2011 at 12:26:17PM +0800, Jason Wang wrote: Michael S. Tsirkin writes: On Mon, Jan 17, 2011 at 04:10:59PM +0800, Jason Wang wrote: No need to check the support of mergeable buffer inside the recevie loop as the whole handle_rx()_xx is in the read critical region. So this patch move it ahead of the receiving loop. Signed-off-by: Jason Wang jasow...@redhat.com Well feature check is mostly just features bit so why would it be slower? Because of the rcu adding memory barriers? Do you observe a measureable speedup with this patch? I do not measure the performance for just this patch, maybe not obvious. And it can also help the code unification. Apropos, I noticed that the check in vhost_has_feature is wrong: it uses the same kind of RCU as the private pointer. So we'll have to fix that properly by adding more lockdep classes, but for now we'll need to make the check 1 || lockdep_is_held(dev-mutex); and add a TODO. Not sure, lockdep_is_head(dev-mutex) maybe not accurate but sufficient, as it was always held in the read critical region. Not really, when we call vhost_has_feature from the vq handling thread it's not. Seems not, see vhost_net_ioctl(). --- drivers/vhost/net.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 14fc189..95e49de 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -411,7 +411,7 @@ static void handle_rx_mergeable(struct vhost_net *net) }; size_t total_len = 0; - int err, headcount; + int err, headcount, mergeable; size_t vhost_hlen, sock_hlen; size_t vhost_len, sock_len; struct socket *sock = rcu_dereference(vq-private_data); @@ -425,6 +425,7 @@ static void handle_rx_mergeable(struct vhost_net *net) vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL; + mergeable = vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF); while ((sock_len = peek_head_len(sock-sk))) { sock_len += sock_hlen; @@ -474,7 +475,7 @@ static void handle_rx_mergeable(struct vhost_net *net) break; } /* TODO: Should check and handle checksum. */ - if (vhost_has_feature(net-dev, VIRTIO_NET_F_MRG_RXBUF) + if (likely(mergeable) memcpy_toiovecend(vq-hdr, (unsigned char *)headcount, offsetof(typeof(hdr), num_buffers), sizeof hdr.num_buffers)) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MIPS, io-thread, icount and wfi
On 2011-01-18 01:19, Edgar E. Iglesias wrote: On Mon, Jan 17, 2011 at 11:03:08AM +0100, Edgar E. Iglesias wrote: Hi, I'm running an io-thread enabled qemu-system-mipsel with icount. When the guest (linux) goes to sleep through the wait insn (waiting to be woken up by future timer interrupts), the thing deadlocks. IIUC, this is because vm timers are driven by icount, but the CPU is halted so icount makes no progress and time stands still. I've locally disabled vcpu halting when icount is enabled, that works around my problem but of course makes qemu consume 100% host cpu. I don't know why I only see this problem with io-thread builds? Could be related timing and luck. Would be interesting to know if someone has any info on how this was intended to work (if it was)? And if there are ideas for better workarounds or fixes that don't disable vcpu halting entirely. Hi, I've found the problem. For some reason io-thread builds use a static timeout for wait loops. The entire chunk of code that makes sure qemu_icount makes forward progress when the CPU's are idle has been ifdef'ed away... This fixes the problem for me, hopefully without affecting io-thread runs without icount. commit 0f4f3a919952500b487b438c5520f07a1c6be35b Author: Edgar E. Iglesias ed...@axis.com Date: Tue Jan 18 01:01:57 2011 +0100 qemu-timer: Fix timeout calc for io-thread with icount Make sure we always make forward progress with qemu_icount to avoid deadlocks. For io-thread, use the static 1000 timeout only if icount is disabled. Signed-off-by: Edgar E. Iglesias ed...@axis.com diff --git a/qemu-timer.c b/qemu-timer.c index 95814af..db1ec49 100644 --- a/qemu-timer.c +++ b/qemu-timer.c @@ -110,7 +110,6 @@ static int64_t cpu_get_clock(void) } } -#ifndef CONFIG_IOTHREAD static int64_t qemu_icount_delta(void) { if (!use_icount) { @@ -124,7 +123,6 @@ static int64_t qemu_icount_delta(void) return cpu_get_icount() - cpu_get_clock(); } } -#endif /* enable cpu_get_ticks() */ void cpu_enable_ticks(void) @@ -1077,9 +1075,17 @@ void quit_timers(void) int qemu_calculate_timeout(void) { -#ifndef CONFIG_IOTHREAD int timeout; +#ifdef CONFIG_IOTHREAD +/* When using icount, making forward progress with qemu_icount when the + guest CPU is idle is critical. We only use the static io-thread timeout + for non icount runs. */ +if (!use_icount) { +return 1000; +} +#endif + if (!vm_running) timeout = 5000; else { @@ -1110,8 +1116,5 @@ int qemu_calculate_timeout(void) } return timeout; -#else /* CONFIG_IOTHREAD */ -return 1000; -#endif } This logic and timeout values were imported on iothread merge. And I bet at least the timeout value of 1s (vs. 5s) can still be found in qemu-kvm. Maybe someone over there can remember the rationales behind choosing this value. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MIPS, io-thread, icount and wfi
On 2011-01-18 11:00, Jan Kiszka wrote: On 2011-01-18 01:19, Edgar E. Iglesias wrote: On Mon, Jan 17, 2011 at 11:03:08AM +0100, Edgar E. Iglesias wrote: Hi, I'm running an io-thread enabled qemu-system-mipsel with icount. When the guest (linux) goes to sleep through the wait insn (waiting to be woken up by future timer interrupts), the thing deadlocks. IIUC, this is because vm timers are driven by icount, but the CPU is halted so icount makes no progress and time stands still. I've locally disabled vcpu halting when icount is enabled, that works around my problem but of course makes qemu consume 100% host cpu. I don't know why I only see this problem with io-thread builds? Could be related timing and luck. Would be interesting to know if someone has any info on how this was intended to work (if it was)? And if there are ideas for better workarounds or fixes that don't disable vcpu halting entirely. Hi, I've found the problem. For some reason io-thread builds use a static timeout for wait loops. The entire chunk of code that makes sure qemu_icount makes forward progress when the CPU's are idle has been ifdef'ed away... This fixes the problem for me, hopefully without affecting io-thread runs without icount. commit 0f4f3a919952500b487b438c5520f07a1c6be35b Author: Edgar E. Iglesias ed...@axis.com Date: Tue Jan 18 01:01:57 2011 +0100 qemu-timer: Fix timeout calc for io-thread with icount Make sure we always make forward progress with qemu_icount to avoid deadlocks. For io-thread, use the static 1000 timeout only if icount is disabled. Signed-off-by: Edgar E. Iglesias ed...@axis.com diff --git a/qemu-timer.c b/qemu-timer.c index 95814af..db1ec49 100644 --- a/qemu-timer.c +++ b/qemu-timer.c @@ -110,7 +110,6 @@ static int64_t cpu_get_clock(void) } } -#ifndef CONFIG_IOTHREAD static int64_t qemu_icount_delta(void) { if (!use_icount) { @@ -124,7 +123,6 @@ static int64_t qemu_icount_delta(void) return cpu_get_icount() - cpu_get_clock(); } } -#endif /* enable cpu_get_ticks() */ void cpu_enable_ticks(void) @@ -1077,9 +1075,17 @@ void quit_timers(void) int qemu_calculate_timeout(void) { -#ifndef CONFIG_IOTHREAD int timeout; +#ifdef CONFIG_IOTHREAD +/* When using icount, making forward progress with qemu_icount when the + guest CPU is idle is critical. We only use the static io-thread timeout + for non icount runs. */ +if (!use_icount) { +return 1000; +} +#endif + if (!vm_running) timeout = 5000; else { @@ -1110,8 +1116,5 @@ int qemu_calculate_timeout(void) } return timeout; -#else /* CONFIG_IOTHREAD */ -return 1000; -#endif } This logic and timeout values were imported on iothread merge. And I bet at least the timeout value of 1s (vs. 5s) can still be found in qemu-kvm. Maybe someone over there can remember the rationales behind choosing this value. Correction: qemu-kvm does _not_ use a fixed timeout value for the iothread nor the removed code path (as CONFIG_IOTHREAD is off in qemu-kvm, that tree does not even build when you enable it). Still, the reason for once introducing this difference to qemu should be reflected. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] vhost: rcu annotation fixup
When built with rcu checks enabled, vhost triggers bogus warnings as vhost features are read without dev-mutex sometimes. Fixing it properly is not trivial as vhost.h does not know which lockdep classes it will be used under. Disable the warning by stubbing out the check for now. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/vhost/vhost.h |4 +--- 1 files changed, 1 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 2af44b7..2d03a31 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit) { unsigned acked_features; - acked_features = - rcu_dereference_index_check(dev-acked_features, - lockdep_is_held(dev-mutex)); + acked_features = rcu_dereference_index_check(dev-acked_features, 1); return acked_features (1 bit); } -- 1.7.3.2.91.g446ac -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
suspending in kvm and resuming in qemu
Hi, I have been trying to get suspending and resuming done across kvm and qemu. While resuming a suspended state in kvm, an error was generated saying unknown section kvmclock . I modified loadvm in qemu to neglect the kvmclock section. Now, when I suspend and resume, qemu screen just stalls at the suspended state and it seems nothing is executing. Pls give any inputs or help, where should I start looking at?? Thanks, Mehul -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] virtio-pci: mask notifier error handling fixups
Fix virtio-pci error handling in the mask notifiers: be careful to undo exactly what we did so far. Reported-by: Alex Williamson alex.william...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This is for qemu-kvm only. hw/virtio-pci.c | 12 1 files changed, 12 insertions(+), 0 deletions(-) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index bf3cf4f..dd8887a 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -612,6 +612,9 @@ static int virtio_pci_mask_notifier(PCIDevice *dev, unsigned vector, return 0; undo: while (--n = 0) { +if (virtio_queue_vector(vdev, n) != vector) { +continue; +} virtio_pci_mask_vq(dev, vector, virtio_get_queue(vdev, n), !masked); } return r; @@ -681,9 +684,18 @@ static int virtio_pci_set_guest_notifiers(void *opaque, bool assign) assign_error: /* We get here on assignment failure. Recover by undoing for VQs 0 .. n. */ +if (assign) { +msix_unset_mask_notifier(proxy-pci_dev); +} + while (--n = 0) { virtio_pci_set_guest_notifier(opaque, n, !assign); } + +if (!assign) { +msix_set_mask_notifier(proxy-pci_dev, + virtio_pci_mask_notifier); +} return r; } -- 1.7.3.2.91.g446ac -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call agenda for Jan 18
* Chris Wright (chr...@redhat.com) wrote: Please send in any agenda items you are interested in covering. No agenda, this week's call is cancelled. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] vmx_vcpu_run micro-optimizations
On Thu, Jan 06, 2011 at 06:09:10PM +0200, Avi Kivity wrote: A couple of minor optimizations to the vmx_vcpu_run assembly code. Avi Kivity (2): KVM: VMX: Simplify saving guest rcx in vmx_vcpu_run KVM: VMX: Avoid atomic operation in vmx_vcpu_run arch/x86/kvm/vmx.c | 11 +++ 1 files changed, 7 insertions(+), 4 deletions(-) Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at arch/x86/kvm/mmu.c:655!
Patch against 2.6.36 attached. commit 9621afb2eb9a1f06d4afcf435175b5323a2b5cc6 Author: Marcelo Tosatti mtosa...@redhat.com Date: Mon Oct 25 11:58:22 2010 -0200 KVM: MMU: fix rmap_remove on non present sptes commit eb45fda45f915c7ca3e81e005e853cb770da2642 upstream. drop_spte should not attempt to rmap_remove a non present shadow pte. This fixes a BUG_ON seen on kvm-autotest. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Reported-by: Lucas Meneghel Rodrigues l...@redhat.com Signed-off-by: Avi Kivity a...@redhat.com diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 7fed5b7..cc1bada 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -675,7 +675,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte) } } -static void set_spte_track_bits(u64 *sptep, u64 new_spte) +static int set_spte_track_bits(u64 *sptep, u64 new_spte) { pfn_t pfn; u64 old_spte = *sptep; @@ -687,18 +687,20 @@ static void set_spte_track_bits(u64 *sptep, u64 new_spte) old_spte = __xchg_spte(sptep, new_spte); if (!is_rmap_spte(old_spte)) - return; + return 0; + pfn = spte_to_pfn(old_spte); if (!shadow_accessed_mask || old_spte shadow_accessed_mask) kvm_set_pfn_accessed(pfn); if (is_writable_pte(old_spte)) kvm_set_pfn_dirty(pfn); + return 1; } static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte) { - set_spte_track_bits(sptep, new_spte); - rmap_remove(kvm, sptep); + if (set_spte_track_bits(sptep, new_spte)) + rmap_remove(kvm, sptep); } static u64 *rmap_next(struct kvm *kvm, unsigned long *rmapp, u64 *spte)
Re: [PATCH] Handle guest access to BBL_CR_CTL3 MSR
On Sat, Jan 08, 2011 at 12:05:14AM -0500, john cooper wrote: A correction to Intel cpu model CPUID data (patch queued) caused winxp-64 to BSOD when booted with a Penryn model. This was traced to the CPUID model field correction from 6 - 23 (as is proper for a Penryn class of cpu). Only in this case does the problem surface. The cause for this failure is winxp accessing the BBL_CR_CTL3 MSR which is unsupported by current kvm, appears to be a legacy MSR not fully characterized yet existing in current silicon, and is apparently carried forward in MSR space to accommodate vintage code as here. It is not yet conclusive whether this MSR implements any of its legacy functionality or is just an ornamental dud for compatibility. While I found no silicon version specific documentation link to this MSR, a general description exists in Intel's developer's reference which agrees with the functional behavior of other bootloader/kernel code I've examined accessing BBL_CR_CTL3. Regrettably winxp-64 appears to be setting bit #19 called out as reserved in the above document. So to minimally accommodate this MSR, kvm msr get will provide the equivalent mock data and kvm msr write will simply toss the guest passed data without interpretation. While this treatment of BBL_CR_CTL3 addresses the immediate problem, the approach may be modified pending clarification from Intel. Signed-off-by: john cooper john.coo...@redhat.com --- diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 6b89f5e..145cd60 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -38,6 +38,7 @@ #define MSR_MTRRcap 0x00fe #define MSR_IA32_BBL_CR_CTL 0x0119 +#define MSR_IA32_BBL_CR_CTL3 0x011e #define MSR_IA32_SYSENTER_CS 0x0174 #define MSR_IA32_SYSENTER_ESP0x0175 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..9a8331c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1283,6 +1283,12 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 data) return -1; vcpu-arch.mcg_ctl = data; break; + case MSR_IA32_BBL_CR_CTL3: + /* Drop writes to this legacy MSR -- see rdmsr + * counterpart for further detail. + */ + pr_unimpl(vcpu, ignored wrmsr: 0x%x data %llx\n, msr, data); + break; default: if (msr = MSR_IA32_MC0_CTL msr MSR_IA32_MC0_CTL + 4 * bank_num) { @@ -1592,6 +1598,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) } else return set_msr_hyperv(vcpu, msr, data); break; + case MSR_IA32_BBL_CR_CTL3: + /* This legacy MSR exists but isn't fully documented in current + * silicon. It is however accessed by winxp in very narrow + * scenarios where it sets bit #19, itself documented as + * a reserved bit. Best effort attempt to source coherent + * read data here should the balance of the register be + * interpreted by the guest: + * + * L2 cache control register 3: 64GB range, 256KB size, + * enabled, latency 0x1, configured + */ + data = 0xbe702111; + break; default: if (msr (msr == vcpu-kvm-arch.xen_hvm_config.msr)) return xen_hvm_config(vcpu, data); This is the MSR write path ? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at arch/x86/kvm/mmu.c:655!
On Fri, Jan 07, 2011 at 09:43:57PM +0100, Tomasz Chmielewski wrote: The following happened when I tried to reboot a virtual machine (host running qemu 0.13.0, kernel 2.6.36.2). After a while, the server hanged and was no longer reachable. kvm: 3927: cpu0 unhandled wrmsr: 0x198 data 0 kvm: 3927: cpu1 unhandled wrmsr: 0x198 data 0 rmap_remove: 88060e9437f8 0 1-BUG [ cut here ] kernel BUG at arch/x86/kvm/mmu.c:655! invalid opcode: [#1] SMP last sysfs file: /sys/kernel/uevent_seqnum CPU 0 Modules linked in: ipt_MASQUERADE vhost_net kvm_intel kvm iptable_filter xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables bridge stp coretemp f71882fg snd_pcm snd_timer tpm_tis snd tpm soundcore tpm_bios snd_page_alloc i2c_i801 i7core_edac pcspkr edac_core shpchp r8169 mii raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 ahci libahci sata_nv sata_sil sata_via 3w_9xxx 3w_ [last unloaded: scsi_wait_scan] Pid: 3927, comm: kvm Not tainted 2.6.36.2 #1 MSI X58 Pro-E (MS-7522)/MS-7522 RIP: 0010:[a024261c] [a024261c] drop_spte+0x1ec/0x1f0 [kvm] RSP: 0018:88061414bb58 EFLAGS: 00010292 RAX: 002e RBX: 88060e9437f8 RCX: RDX: RSI: 0082 RDI: 0246 RBP: 88061414bb68 R08: R09: R10: 0001 R11: 0003 R12: 880520e6 R13: 8804f4514820 R14: 880520e6 R15: FS: 7f049b859730() GS:880001e0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 7f7f1dc2b000 CR3: 0004f4464000 CR4: 26e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process kvm (pid: 3927, threadinfo 88061414a000, task 8806107596d0) Stack: 88060e9437f8 00ff 88061414bba8 a02427ba 0 88061414bbb8 8806140fb140 880520e6 880520e62320 0 88061414bbb8 0001 88061414bbe8 a024334e Call Trace: [a02427ba] kvm_mmu_prepare_zap_page+0x8a/0x2f0 [kvm] [a024334e] kvm_mmu_zap_all+0x4e/0x90 [kvm] [a0236026] kvm_arch_flush_shadow+0x16/0x30 [kvm] [a022af20] __kvm_set_memory_region+0x2e0/0x7e0 [kvm] [a0239dbd] ? kvm_set_msr_common+0x4fd/0xd40 [kvm] [a024fdc7] ? pic_unlock+0x27/0xc0 [kvm] [a022b461] kvm_set_memory_region+0x41/0x70 [kvm] [a022b4ad] kvm_vm_ioctl_set_memory_region+0x1d/0x30 [kvm] [a022ba7d] kvm_vm_ioctl+0x3fd/0x440 [kvm] [8106c1b0] ? do_send_sig_info+0x70/0x90 [8124dd36] ? security_task_kill+0x16/0x20 [8114e60f] do_vfs_ioctl+0x9f/0x550 [8108db49] ? sys_futex+0x89/0x160 [8114eb61] sys_ioctl+0xa1/0xb0 [8113e788] ? sys_read+0x88/0x90 [8100a332] system_call_fastpath+0x16/0x1b Tomasz, This should be fixed by commit eb45fda45f915c7, attached. commit eb45fda45f915c7ca3e81e005e853cb770da2642 Author: Marcelo Tosatti mtosa...@redhat.com Date: Mon Oct 25 11:58:22 2010 -0200 KVM: MMU: fix rmap_remove on non present sptes drop_spte should not attempt to rmap_remove a non present shadow pte. This fixes a BUG_ON seen on kvm-autotest. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Reported-by: Lucas Meneghel Rodrigues l...@redhat.com Signed-off-by: Avi Kivity a...@redhat.com diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 908ea54..fb8b376 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -720,7 +720,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte) } } -static void set_spte_track_bits(u64 *sptep, u64 new_spte) +static int set_spte_track_bits(u64 *sptep, u64 new_spte) { pfn_t pfn; u64 old_spte = *sptep; @@ -731,19 +731,20 @@ static void set_spte_track_bits(u64 *sptep, u64 new_spte) old_spte = __xchg_spte(sptep, new_spte); if (!is_rmap_spte(old_spte)) - return; + return 0; pfn = spte_to_pfn(old_spte); if (!shadow_accessed_mask || old_spte shadow_accessed_mask) kvm_set_pfn_accessed(pfn); if (!shadow_dirty_mask || (old_spte shadow_dirty_mask)) kvm_set_pfn_dirty(pfn); + return 1; } static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte) { - set_spte_track_bits(sptep, new_spte); - rmap_remove(kvm, sptep); + if (set_spte_track_bits(sptep, new_spte)) + rmap_remove(kvm, sptep); } static u64 *rmap_next(struct kvm *kvm, unsigned long *rmapp, u64 *spte)
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. Regards, Anthony Liguori Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at arch/x86/kvm/mmu.c:655!
On 18.01.2011 15:42, Marcelo Tosatti wrote: Patch against 2.6.36 attached. Thanks. Do you know if this bug is present in 2.6.37? -- Tomasz Chmielewski http://wpkg.org -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio-pci: mask notifier error handling fixups
On Tue, 2011-01-18 at 15:42 +0200, Michael S. Tsirkin wrote: Fix virtio-pci error handling in the mask notifiers: be careful to undo exactly what we did so far. Reported-by: Alex Williamson alex.william...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Looks right. Acked-by: Alex Williamson alex.william...@redhat.com hw/virtio-pci.c | 12 1 files changed, 12 insertions(+), 0 deletions(-) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index bf3cf4f..dd8887a 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -612,6 +612,9 @@ static int virtio_pci_mask_notifier(PCIDevice *dev, unsigned vector, return 0; undo: while (--n = 0) { +if (virtio_queue_vector(vdev, n) != vector) { +continue; +} virtio_pci_mask_vq(dev, vector, virtio_get_queue(vdev, n), !masked); } return r; @@ -681,9 +684,18 @@ static int virtio_pci_set_guest_notifiers(void *opaque, bool assign) assign_error: /* We get here on assignment failure. Recover by undoing for VQs 0 .. n. */ +if (assign) { +msix_unset_mask_notifier(proxy-pci_dev); +} + while (--n = 0) { virtio_pci_set_guest_notifier(opaque, n, !assign); } + +if (!assign) { +msix_set_mask_notifier(proxy-pci_dev, + virtio_pci_mask_notifier); +} return r; } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-18 16:04, Anthony Liguori wrote: On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at arch/x86/kvm/mmu.c:655!
On Tue, Jan 18, 2011 at 04:20:14PM +0100, Tomasz Chmielewski wrote: On 18.01.2011 15:42, Marcelo Tosatti wrote: Patch against 2.6.36 attached. Thanks. Do you know if this bug is present in 2.6.37? Its not. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/18/2011 09:43 AM, Jan Kiszka wrote: On 2011-01-18 16:04, Anthony Liguori wrote: On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. With vfio, would an assigned PCI device even need kvm_state? Regards, Anthony Liguori Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM TSC trapping / migration 2/2] Add TSC KHZ MSR
On 01/14/2011 06:00 AM, Juan Quintela wrote: Marcelo Tosattimtosa...@redhat.com wrote: On Fri, Jan 07, 2011 at 10:44:20AM -1000, Zachary Amsden wrote: On 01/07/2011 12:48 AM, Marcelo Tosatti wrote: On Thu, Jan 06, 2011 at 12:10:45AM -1000, Zachary Amsden wrote: Use an MSR to allow soft migration to hosts which do not support TSC trapping. Rather than make this a required element of any migration protocol, we allow the TSC rate to be exported as a data field (useful in its own right), but we also allow a one time write of the MSR during VM creation. The result is that for the common use case, no protocol change is required to communicate TSC rate to the receiving host. Migration to hosts which do not support the feature can be achieved by saving/restoring the TSC rate + flags in a subsection. A subsection seems more appropriate than an MSR for this. Yes, I looked at that, but it looked to me like a subsection was intended for an optional feature which MUST be present on the destination if the source is using the feature. This way, newer hosts without the feature enabled can migrate to older hosts which do not support the feature. Right. But you can use a subsection to achieve the same effect. Just consider that the source is not using the feature if you want to migrate to an older host without support for it. Juan, is there a problem to use subsections in this fashion? With the MSR scheme, there is no way for management to enforce support of the feature on the destination (at least not that i can see). And you create an MSR that does not exist on real hardware. The TSC rate migration is slightly different; we may wish to migrate from a host with the TSC rate feature enabled to a host which does not support the TSC rate feature. This is exactly the current behavior, the TSC rate will change on that migration, and I wanted to preserve that behavior. I don't advise that mode of usage, but there may be use cases for it and it should be decided by policy, not dictated by our feature set. That said, I'm happy to remove the MSR if we truly don't want to support that mode of usage. Ok, I chime it late. We are adding a new MSR to the comunication with userspace. So far so good, but this new field, need to be transmited to the other end of the migration. This means a new field for migration (notice that this is independtly if this is an MSR or not). VMSTATE_UINT64_V(system_time_msr, CPUState, 11), VMSTATE_UINT64_V(wall_clock_msr, CPUState, 11), Oh, wow. I thought the MSRs were sent automatically by qemu based on what MSRs the kvm module told it were available. It looks like EFER and STAR and friends are all special cased as part of CPUstate. So my approach has been doomed from the beginning. This are the values that we are sending now. We are getting now, a new value, the problem is how to migrate it. Solutions: - create a new field in a new field in CPUState, and up the version. That would make backward migration impossible. - create a new field, and sent it only it is has been used with a subsection. This makes migration backwards if this was not used. - but, it appears that there if this features is not known on destination, we can use the old way to migrate information. BIG WARNING HERE: I don't claim to understand how clocks work at all There is not a way to convince old qemu/kernels to ignore new fields for good reason. So the only solution here is to encode this new vcpu-kvm-arch.virtual_tsc_khz in the two previous fields, in a way that is understable for old qemu/new qemu. old qemu will use old method, new qemu will use a new method. If there is not a common encoding that will work for both old/new method, I can't really think of a way to make things work here :( And as per the warning, I can't think of a way to encode virtual_tsc_khz into system_time_msr and wall_clock_msr off-hand. To make things clearer about optional features, after lots of discussions, it was decided that target of migration will never ignore anything sent, that means that the only one that can decide not to sent a feature/value is the source of the migration. There is no way to express: - try this method, if you don't know - try this other second best, ... That decides it then, the feature is migrated in the state it is set if it is enabled. This makes things much simpler all around. Cheers, Zach -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/18/2011 09:43 AM, Jan Kiszka wrote: On 2011-01-18 16:04, Anthony Liguori wrote: On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. The bus topology reflects how I/O flows in and out of a device. We do not model a perfect PC bus architecture and I don't think we ever intend to. Instead, we model a functional architecture. I/O from an assigned device does not flow through the emulated PCI bus. Therefore, it does not belong on the emulated PCI bus. Assigned devices need to interact with the emulated PCI bus, but they shouldn't be children of it. Regards, Anthony Liguori Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-18 16:48, Anthony Liguori wrote: On 01/18/2011 09:43 AM, Jan Kiszka wrote: On 2011-01-18 16:04, Anthony Liguori wrote: On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. With vfio, would an assigned PCI device even need kvm_state? IIUC: Yes, for establishing the irqfd link. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-18 16:50, Anthony Liguori wrote: On 01/18/2011 09:43 AM, Jan Kiszka wrote: On 2011-01-18 16:04, Anthony Liguori wrote: On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. The bus topology reflects how I/O flows in and out of a device. We do not model a perfect PC bus architecture and I don't think we ever intend to. Instead, we model a functional architecture. I/O from an assigned device does not flow through the emulated PCI bus. Therefore, it does not belong on the emulated PCI bus. Assigned devices need to interact with the emulated PCI bus, but they shouldn't be children of it. You should be able to find assigned devices on some PCI bus, so you either have to hack up the existing bus to host devices that are, on the other side, not part of it or branch off a pci-kvm sub-bus, just like you would have to create a sysbus-kvm. I guess, if at all, we want the latter. Is that acceptable for everyone? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/18/2011 10:01 AM, Jan Kiszka wrote: On 2011-01-18 16:50, Anthony Liguori wrote: On 01/18/2011 09:43 AM, Jan Kiszka wrote: On 2011-01-18 16:04, Anthony Liguori wrote: On 01/18/2011 08:28 AM, Jan Kiszka wrote: On 2011-01-12 11:31, Jan Kiszka wrote: Am 12.01.2011 11:22, Avi Kivity wrote: On 01/11/2011 03:54 PM, Anthony Liguori wrote: Right, we should introduce a KVMBus that KVM devices are created on. The devices can get at KVMState through the BusState. There is no kvm bus in a PC (I looked). We're bending the device model here because a device is implemented in the kernel and not in userspace. An implementation detail is magnified beyond all proportions. An ioapic that is implemented by kvm lives in exactly the same place that the qemu ioapic lives in. An assigned pci device lives on the PCI bus, not a KVMBus. If we need a pointer to KVMState, then we must find it elsewhere, not through creating imaginary buses that don't exist. Exactly. So we can either infect the whole device tree with kvm (or maybe a more generic accelerator structure that also deals with Xen) or we need to pull the reference inside the device's init function from some global service (kvm_get_state). Note that this topic is still waiting for good suggestions, specifically from those who believe in kvm_state references :). This is not only blocking kvmstate merge but will affect KVM irqchips as well. It boils down to how we reasonably pass a kvm_state reference from machine init code to a sysbus device. I'm probably biased, but I don't see any way that does not work against the idea of confining access to kvm_state or breaks device instantiation from the command line or a config file. A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. The bus topology reflects how I/O flows in and out of a device. We do not model a perfect PC bus architecture and I don't think we ever intend to. Instead, we model a functional architecture. I/O from an assigned device does not flow through the emulated PCI bus. Therefore, it does not belong on the emulated PCI bus. Assigned devices need to interact with the emulated PCI bus, but they shouldn't be children of it. You should be able to find assigned devices on some PCI bus, so you either have to hack up the existing bus to host devices that are, on the other side, not part of it or branch off a pci-kvm sub-bus, just like you would have to create a sysbus-kvm. Management tools should never transverse the device tree to find devices. This is a recipe for disaster in the long term because the device tree will not remain stable. So yes, a management tool should be able to enumerate assigned devices as they would enumerate any other PCI device but that has almost nothing to do with what the tree layout is. Regards, Anthony Liguori I guess, if at all, we want the latter. Is that acceptable for everyone? Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Handle guest access to BBL_CR_CTL3 MSR
Marcelo Tosatti wrote: On Sat, Jan 08, 2011 at 12:05:14AM -0500, john cooper wrote: A correction to Intel cpu model CPUID data (patch queued) caused winxp-64 to BSOD when booted with a Penryn model. This was traced to the CPUID model field correction from 6 - 23 (as is proper for a Penryn class of cpu). Only in this case does the problem surface. The cause for this failure is winxp accessing the BBL_CR_CTL3 MSR which is unsupported by current kvm, appears to be a legacy MSR not fully characterized yet existing in current silicon, and is apparently carried forward in MSR space to accommodate vintage code as here. It is not yet conclusive whether this MSR implements any of its legacy functionality or is just an ornamental dud for compatibility. While I found no silicon version specific documentation link to this MSR, a general description exists in Intel's developer's reference which agrees with the functional behavior of other bootloader/kernel code I've examined accessing BBL_CR_CTL3. Regrettably winxp-64 appears to be setting bit #19 called out as reserved in the above document. So to minimally accommodate this MSR, kvm msr get will provide the equivalent mock data and kvm msr write will simply toss the guest passed data without interpretation. While this treatment of BBL_CR_CTL3 addresses the immediate problem, the approach may be modified pending clarification from Intel. Signed-off-by: john cooper john.coo...@redhat.com --- diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 6b89f5e..145cd60 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -38,6 +38,7 @@ #define MSR_MTRRcap 0x00fe #define MSR_IA32_BBL_CR_CTL 0x0119 +#define MSR_IA32_BBL_CR_CTL30x011e #define MSR_IA32_SYSENTER_CS0x0174 #define MSR_IA32_SYSENTER_ESP 0x0175 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..9a8331c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1283,6 +1283,12 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 data) return -1; vcpu-arch.mcg_ctl = data; break; +case MSR_IA32_BBL_CR_CTL3: +/* Drop writes to this legacy MSR -- see rdmsr + * counterpart for further detail. + */ +pr_unimpl(vcpu, ignored wrmsr: 0x%x data %llx\n, msr, data); +break; default: if (msr = MSR_IA32_MC0_CTL msr MSR_IA32_MC0_CTL + 4 * bank_num) { @@ -1592,6 +1598,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) } else return set_msr_hyperv(vcpu, msr, data); break; +case MSR_IA32_BBL_CR_CTL3: +/* This legacy MSR exists but isn't fully documented in current + * silicon. It is however accessed by winxp in very narrow + * scenarios where it sets bit #19, itself documented as + * a reserved bit. Best effort attempt to source coherent + * read data here should the balance of the register be + * interpreted by the guest: + * + * L2 cache control register 3: 64GB range, 256KB size, + * enabled, latency 0x1, configured + */ +data = 0xbe702111; +break; default: if (msr (msr == vcpu-kvm-arch.xen_hvm_config.msr)) return xen_hvm_config(vcpu, data); This is the MSR write path ? The write path ignores the data. The MSR read returns a mock-up of BBL_CR_CTL3 data. The above addresses the narrow usage leading to the immediate bug. From the feedback thus far from Intel I believe the above is sufficient and minimal. -john -- john.coo...@third-harmonic.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-18 17:04, Anthony Liguori wrote: A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. The bus topology reflects how I/O flows in and out of a device. We do not model a perfect PC bus architecture and I don't think we ever intend to. Instead, we model a functional architecture. I/O from an assigned device does not flow through the emulated PCI bus. Therefore, it does not belong on the emulated PCI bus. Assigned devices need to interact with the emulated PCI bus, but they shouldn't be children of it. You should be able to find assigned devices on some PCI bus, so you either have to hack up the existing bus to host devices that are, on the other side, not part of it or branch off a pci-kvm sub-bus, just like you would have to create a sysbus-kvm. Management tools should never transverse the device tree to find devices. This is a recipe for disaster in the long term because the device tree will not remain stable. So yes, a management tool should be able to enumerate assigned devices as they would enumerate any other PCI device but that has almost nothing to do with what the tree layout is. I'm probably misunderstanding you, but if the bus topology as the guest sees it is not properly reflected in an object tree on the qemu side, we are creating hacks again. Management and analysis tools must be able to traverse the system buses and find guest devices this way. If they create a device on bus X, it must never end up on bus Y just because it happens to be KVM-assisted or has some other property. On the other hand, trying to hide this dependency will likely cause severe damage to the qdev design. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/18/2011 10:17 AM, Jan Kiszka wrote: On 2011-01-18 17:04, Anthony Liguori wrote: A KVM device should sit on a KVM specific bus that hangs off of sysbus. It can get to kvm_state through that bus. That bus doesn't get instantiated through qdev so requiring a pointer argument should not be an issue. This design is in conflict with the requirement to attach KVM-assisted devices also to their home bus, e.g. an assigned PCI device to the PCI bus. We don't support multi-homed qdev devices. The bus topology reflects how I/O flows in and out of a device. We do not model a perfect PC bus architecture and I don't think we ever intend to. Instead, we model a functional architecture. I/O from an assigned device does not flow through the emulated PCI bus. Therefore, it does not belong on the emulated PCI bus. Assigned devices need to interact with the emulated PCI bus, but they shouldn't be children of it. You should be able to find assigned devices on some PCI bus, so you either have to hack up the existing bus to host devices that are, on the other side, not part of it or branch off a pci-kvm sub-bus, just like you would have to create a sysbus-kvm. Management tools should never transverse the device tree to find devices. This is a recipe for disaster in the long term because the device tree will not remain stable. So yes, a management tool should be able to enumerate assigned devices as they would enumerate any other PCI device but that has almost nothing to do with what the tree layout is. I'm probably misunderstanding you, but if the bus topology as the guest sees it is not properly reflected in an object tree on the qemu side, we are creating hacks again. There is no such thing as the bus topology as the guest sees it. The guest just sees a bunch of devices. The guest can only infer things like ISA busses. The guest sees a bunch of devices: an i8254, i8259, RTC, etc. Whether those devices are on an ISA bus, and LPC bus, or all in a SuperI/O chip that's part of the southbridge is all invisible to the guest. The device model topology is 100% a hidden architectural detail. Management and analysis tools must be able to traverse the system buses and find guest devices this way. We need to provide a compatible interface to the guest. If you agree with my above statements, then you'll also agree that we can do this without keeping the device model topology stable. But we also need to provide a compatible interface to management tools. Exposing the device model topology as a compatible interface artificially limits us. It's far better to provide higher level supported interfaces to give us the flexibility to change the device model as we need to. If they create a device on bus X, it must never end up on bus Y just because it happens to be KVM-assisted or has some other property. Nope. This is exactly what should happen. 90% of the devices in the device model are not created by management tools. They're part of a chipset. The chipset has well defined extension points and we provide management interfaces to create devices on those extension points. That is, interfaces to create PCI devices. Regards, Anthony Liguori On the other hand, trying to hide this dependency will likely cause severe damage to the qdev design. Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html