Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/10/2011 10:23 PM, Anthony Liguori wrote: I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They're implemented by kvm. What deeper interaction do you have in mind? They may be replaced by KVM but if you look at the PIT, this is done by having two distinct devices. The KVM specific device can (and should) be instantiated with kvm_state. The way the IOAPIC/APIC/PIC is handled in qemu-kvm is nasty. The kernel devices are separate devices and that should be reflected in the device tree. I don't see why. Those are just two different implementations for the same guest visible device. It's like saying IDE should be seen differently if it's backed by qcow2 or qed. The device tree is about the guest view of devices. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/10/2011 10:11 PM, Anthony Liguori wrote: On 01/08/2011 02:47 AM, Jan Kiszka wrote: OK, but I don't want to argue about the ioeventfd API. So let's put this case aside. :) I often reply too quickly without explaining myself. Let me use ioeventfd as an example to highlight why KVMState is a good thing. In real life, PIO and MMIO are never directly communicated to the device from the processor. Instead, they go through a series of other devices. In the case of something like an ISA device, a PIO first goes to the chipset into the PCI complex, it will then go through a PCI-to-ISA bridge via subtractive decoding, and then forward over the ISA device where it will be interpreted by some device. The path to the chipset may be shared among different processors but it may also be unique. The APIC is the best example as there are historic APICs that hung directly off of the CPUs such that the same MMIO access across different CPUs did not go to the same device. This is why the APIC emulation in QEMU is so weird because we don't model this behavior correctly. This means that a PIO operation needs to flow from a CPUState to a DeviceState. It can then flow through to another DeviceState until it's finally handled. The first problem with ioeventfd is that it's a per-VM operation. It should be per VCPU. Just consider ioeventfd as something that hooks the system bus, not the processor-chipset link, and the problem goes away. In practice, any per-cpu io port (for SMM or power management) would need synchronous handling; an eventfd isn't a suitable way to communicate it. But even if this were the case, the path that a PIO operation takes should not be impacted by ioeventfd. IOW, a device shouldn't be allocating an eventfd() and handing it to a magical KVM call. Instead, a device should register a callback for a particular port in the same way it always does. *As an optimization*, we should have another interface that says that these values are only valid for this IO port. That would let us create eventfds and register things behind the scenes. The semantics are different. The normal callbacks are synchronous; the vcpu is halted until the callback is serviced. For most callbacks, this is critical, not just per-cpu things like vmport (example: cirrus back switching). I agree it shouldn't be done by a kvm-specific kvm call, but instead by an API, but that API needs to be explicitly asynchronous. When we further thread qemu, we'd also need to specify which thread is to execute the callback (and the implementation would add the eventfd to the thread's fd poll list). That means we can handle TCG, older KVM kernels, and newer KVM kernels without any special support in the device model. It also means that the device models never have to worry about KVMState because there's an entirely different piece of code that's consulting the set of special ports and then deciding how to handle it. The result is better, more portable code that doesn't have KVM-isms. Yes. If passing state around seems to be ugly, it's probably because we're not abstracting things correctly. Removing the state and making it implicit is the wrong solution. I agree with the general sentiment that utilizing the fact that a variable is global to make it implicit is bad from a software engineering point of view. By restricting access to variables and functions, you can enforce modularity on the code. Much like the private: specifier in C++ and other languages. Fixing the abstraction is the right solution (or living with the ugliness until someone else is motivated to fix it properly). And with that too, especially the parenthesized statement. We have qemu-kvm that is overly pragmatic and trie[sd] not to avoid modifying qemu as much as possible. We have the qemu.git kvm implementation that tries a perfectionist approach that failed because most of the users need the featured and tested pragmatic approach. The two mix like oil and water. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 03:01 AM, Avi Kivity wrote: On 01/10/2011 10:23 PM, Anthony Liguori wrote: I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They're implemented by kvm. What deeper interaction do you have in mind? The emulated ioapic/pit/pic do not interact with KVM at all. The KVM versions should be completely separate devices. They may be replaced by KVM but if you look at the PIT, this is done by having two distinct devices. The KVM specific device can (and should) be instantiated with kvm_state. The way the IOAPIC/APIC/PIC is handled in qemu-kvm is nasty. The kernel devices are separate devices and that should be reflected in the device tree. I don't see why. Those are just two different implementations for the same guest visible device. Right, they should appear the same to the guest but the fact that they're two different implementations should be reflected in the device tree. It's like saying IDE should be seen differently if it's backed by qcow2 or qed. No, it's not at all. Advantages of separating KVM devices: 1) it becomes very clear what functionality is handled in the kernel verses in userspace (you can actually look at the code and tell) 2) a user can explicitly create either the emulated version of the device or the in-kernel version of the device (no need for -no-kvm-irqchip) 3) a user can pass parameters directly to the in-kernel version of the device that are different from the userspace version (like selecting different interrupt catch-up methods) Regards, Anthony Liguori The device tree is about the guest view of devices. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 11.01.2011, at 15:00, Anthony Liguori wrote: On 01/11/2011 03:01 AM, Avi Kivity wrote: On 01/10/2011 10:23 PM, Anthony Liguori wrote: I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They're implemented by kvm. What deeper interaction do you have in mind? The emulated ioapic/pit/pic do not interact with KVM at all. The KVM versions should be completely separate devices. They may be replaced by KVM but if you look at the PIT, this is done by having two distinct devices. The KVM specific device can (and should) be instantiated with kvm_state. The way the IOAPIC/APIC/PIC is handled in qemu-kvm is nasty. The kernel devices are separate devices and that should be reflected in the device tree. I don't see why. Those are just two different implementations for the same guest visible device. Right, they should appear the same to the guest but the fact that they're two different implementations should be reflected in the device tree. It's like saying IDE should be seen differently if it's backed by qcow2 or qed. No, it's not at all. Advantages of separating KVM devices: 1) it becomes very clear what functionality is handled in the kernel verses in userspace (you can actually look at the code and tell) 2) a user can explicitly create either the emulated version of the device or the in-kernel version of the device (no need for -no-kvm-irqchip) 3) a user can pass parameters directly to the in-kernel version of the device that are different from the userspace version (like selecting different interrupt catch-up methods) Disadvantages: 1) you lose migration / savevm between KVM and non-KVM VMs I'm not saying this is unsolvable, but it's certainly something that bothers me :). Some sort of meta-device for KVM implemented devices and emulated devices would be nice. That device would then be the one state gets saved/restored from. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 08:06 AM, Alexander Graf wrote: On 11.01.2011, at 15:00, Anthony Liguori wrote: On 01/11/2011 03:01 AM, Avi Kivity wrote: On 01/10/2011 10:23 PM, Anthony Liguori wrote: I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They're implemented by kvm. What deeper interaction do you have in mind? The emulated ioapic/pit/pic do not interact with KVM at all. The KVM versions should be completely separate devices. They may be replaced by KVM but if you look at the PIT, this is done by having two distinct devices. The KVM specific device can (and should) be instantiated with kvm_state. The way the IOAPIC/APIC/PIC is handled in qemu-kvm is nasty. The kernel devices are separate devices and that should be reflected in the device tree. I don't see why. Those are just two different implementations for the same guest visible device. Right, they should appear the same to the guest but the fact that they're two different implementations should be reflected in the device tree. It's like saying IDE should be seen differently if it's backed by qcow2 or qed. No, it's not at all. Advantages of separating KVM devices: 1) it becomes very clear what functionality is handled in the kernel verses in userspace (you can actually look at the code and tell) 2) a user can explicitly create either the emulated version of the device or the in-kernel version of the device (no need for -no-kvm-irqchip) 3) a user can pass parameters directly to the in-kernel version of the device that are different from the userspace version (like selecting different interrupt catch-up methods) Disadvantages: 1) you lose migration / savevm between KVM and non-KVM VMs This doesn't work today and it's never worked. KVM exposes things that TCG cannot emulate (like pvclock). Even as two devices, nothing prevents it from working. Both devices just have to support each other's savevm format. If they use the same code, it makes it very easy. Take a look at how the KVM PIT is implemented for an example of this. Regards, Anthony Liguori I'm not saying this is unsolvable, but it's certainly something that bothers me :). Some sort of meta-device for KVM implemented devices and emulated devices would be nice. That device would then be the one state gets saved/restored from. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 04:00 PM, Anthony Liguori wrote: On 01/11/2011 03:01 AM, Avi Kivity wrote: On 01/10/2011 10:23 PM, Anthony Liguori wrote: I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They're implemented by kvm. What deeper interaction do you have in mind? The emulated ioapic/pit/pic do not interact with KVM at all. How can they not interact with kvm if they're implemented by kvm? I really don't follow here. The KVM versions should be completely separate devices. Why? I don't see why. Those are just two different implementations for the same guest visible device. Right, they should appear the same to the guest but the fact that they're two different implementations should be reflected in the device tree. Why? To move beyond single-word questions, what is the purpose of the device tree? In my mind, it reflects the virtual hardware. What's important is that we have a PIC, virtio network adapter, and IDE disk. Not that they're backed by kvm, vhost-net, and qcow2. It's like saying IDE should be seen differently if it's backed by qcow2 or qed. No, it's not at all. Advantages of separating KVM devices: 1) it becomes very clear what functionality is handled in the kernel verses in userspace (you can actually look at the code and tell) How something is implemented is not important, certainly not important enough to expose to the user as an monitor or live migration ABI. 2) a user can explicitly create either the emulated version of the device or the in-kernel version of the device (no need for -no-kvm-irqchip) -device ioapic,model=kernel vs. -device kvm-ioapic? Is it really important to do that? 110% of the time we want the kernel irqchips. The remaining -10% are only used for testing. 3) a user can pass parameters directly to the in-kernel version of the device that are different from the userspace version (like selecting different interrupt catch-up methods) -device pit,model=qemu,catchup=slew error: catchup=slew not supported in this model I'm not overly concerned about the implementation part. Though I think it's better to have a single implementation with kvm acting as an accelerator, having it the other way is no big deal. What I am worried about is exposing it as a monitor and migration ABI. IMO the only important thing is the spec that the device implements, not what piece of code implements it. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 04:09 PM, Anthony Liguori wrote: Disadvantages: 1) you lose migration / savevm between KVM and non-KVM VMs This doesn't work today and it's never worked. KVM exposes things that TCG cannot emulate (like pvclock). If you run kvm without pvclock, or implement pvclock in qemu, it works fine. It should work fine for the PIT, PIC, and IOAPIC (never tried it myself). If we decide to have a kernel hpet implementation, for example, it would be good to be able to live migrate from a version without kernel hpet, to a version with kernel hpet, and have the kernel hpet enabled. Even as two devices, nothing prevents it from working. Both devices just have to support each other's savevm format. If they use the same code, it makes it very easy. Take a look at how the KVM PIT is implemented for an example of this. They need to use the same device id then. And if they share code, that indicates that they need to be the same device even more. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 11.01.2011, at 15:09, Anthony Liguori wrote: On 01/11/2011 08:06 AM, Alexander Graf wrote: On 11.01.2011, at 15:00, Anthony Liguori wrote: On 01/11/2011 03:01 AM, Avi Kivity wrote: On 01/10/2011 10:23 PM, Anthony Liguori wrote: I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They're implemented by kvm. What deeper interaction do you have in mind? The emulated ioapic/pit/pic do not interact with KVM at all. The KVM versions should be completely separate devices. They may be replaced by KVM but if you look at the PIT, this is done by having two distinct devices. The KVM specific device can (and should) be instantiated with kvm_state. The way the IOAPIC/APIC/PIC is handled in qemu-kvm is nasty. The kernel devices are separate devices and that should be reflected in the device tree. I don't see why. Those are just two different implementations for the same guest visible device. Right, they should appear the same to the guest but the fact that they're two different implementations should be reflected in the device tree. It's like saying IDE should be seen differently if it's backed by qcow2 or qed. No, it's not at all. Advantages of separating KVM devices: 1) it becomes very clear what functionality is handled in the kernel verses in userspace (you can actually look at the code and tell) 2) a user can explicitly create either the emulated version of the device or the in-kernel version of the device (no need for -no-kvm-irqchip) 3) a user can pass parameters directly to the in-kernel version of the device that are different from the userspace version (like selecting different interrupt catch-up methods) Disadvantages: 1) you lose migration / savevm between KVM and non-KVM VMs This doesn't work today and it's never worked. KVM exposes things that TCG cannot emulate (like pvclock). Those cases simply shouldn't exist and hurt us (or at least me). I had exactly the pvclock issue with xenner. Xenner can't do proper timekeeping in emulation mode. So implementing an emulated pvclock implementation is (pretty low) on my todo list. Even as two devices, nothing prevents it from working. Both devices just have to support each other's savevm format. If they use the same code, it makes it very easy. Take a look at how the KVM PIT is implemented for an example of this. If that's all it takes, fine. It makes it pretty hard to enforce, but I guess we can get away with that :). Making devices separate basically hurts abstraction. I don't see any use case where we should have a KVM device without emulation equivalent. For the CPU we also think of KVM as an accelerator instead of a separate device, no? :) Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 08:18 AM, Avi Kivity wrote: On 01/11/2011 04:00 PM, Anthony Liguori wrote: On 01/11/2011 03:01 AM, Avi Kivity wrote: On 01/10/2011 10:23 PM, Anthony Liguori wrote: I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They're implemented by kvm. What deeper interaction do you have in mind? The emulated ioapic/pit/pic do not interact with KVM at all. How can they not interact with kvm if they're implemented by kvm? I really don't follow here. emulated ioapic/pit/pic == versions implemented in QEMU. That's what I'm trying to say. When not using the KVM versions of the devices, there are no interactions with KVM. The KVM versions should be completely separate devices. Why? Because the KVM versions are replacements. I don't see why. Those are just two different implementations for the same guest visible device. Right, they should appear the same to the guest but the fact that they're two different implementations should be reflected in the device tree. Why? To move beyond single-word questions, what is the purpose of the device tree? In my mind, it reflects the virtual hardware. What's important is that we have a PIC, virtio network adapter, and IDE disk. Not that they're backed by kvm, vhost-net, and qcow2. Let me give a very concrete example to illustrate my point. One thing I have on my TODO is to implement catch-up support for the emulated devices. I want to implement three modes of catch-up support: drop, fast, and gradual. Gradual is the best policy IMHO but fast is necessary on older kernels without highres timers. Drop is necessary to maintain compatibility with what we have today. The kernel PIT only implements one mode and even if the other two were added, even the newest version of QEMU needs to deal with the fact that there's old kernels out there with PIT's that only do fast. So how does this get exposed to management tools? Do you check for drift-mode=fast and transparently enable the KVM pit? Do you fail if anything but drift-mode=fast is specified? We need to have the following mechanisms: 1) the ability to select an in-kernel PIT vs. a userspace PIT 2) an independent mechanism to configure the userspace PIT 3) an independent mechanism to configure the in-kernel PIT. The best way to do this is to make the in-kernel PIT a separate device. Then we get all of this for free. 2) a user can explicitly create either the emulated version of the device or the in-kernel version of the device (no need for -no-kvm-irqchip) -device ioapic,model=kernel vs. -device kvm-ioapic? Is it really important to do that? 110% of the time we want the kernel irqchips. The remaining -10% are only used for testing. If model=kernel makes the support options different, then you end up introduce another layer of option validation. By using the later form, you get to leverage the option validation of qdev plus it makes it much clearer to users what options are supported in what model because now the documentation is explicit about it. 3) a user can pass parameters directly to the in-kernel version of the device that are different from the userspace version (like selecting different interrupt catch-up methods) -device pit,model=qemu,catchup=slew error: catchup=slew not supported in this model I'm not overly concerned about the implementation part. Though I think it's better to have a single implementation with kvm acting as an accelerator, having it the other way is no big deal. What I am worried about is exposing it as a monitor and migration ABI. IMO the only important thing is the spec that the device implements, not what piece of code implements it. Just as we do in the PIT, there's nothing wrong with making the device's migration compatible. I'm not entirely sure what your concerns about the monitor are but there's simply no way to hide the fact that a device is implemented in KVM at the monitor level. But really, is this something that management tools want? I doubt it. I think they want to have ultimate control over what gets created with us providing a recommended set of defaults. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 08:22 AM, Avi Kivity wrote: On 01/11/2011 04:09 PM, Anthony Liguori wrote: Disadvantages: 1) you lose migration / savevm between KVM and non-KVM VMs This doesn't work today and it's never worked. KVM exposes things that TCG cannot emulate (like pvclock). If you run kvm without pvclock, or implement pvclock in qemu, it works fine. It should work fine for the PIT, PIC, and IOAPIC (never tried it myself). If we decide to have a kernel hpet implementation, for example, it would be good to be able to live migrate from a version without kernel hpet, to a version with kernel hpet, and have the kernel hpet enabled. Even as two devices, nothing prevents it from working. Both devices just have to support each other's savevm format. If they use the same code, it makes it very easy. Take a look at how the KVM PIT is implemented for an example of this. They need to use the same device id then. And if they share code, that indicates that they need to be the same device even more. No, it really doesn't :-) Cirrus VGA and std VGA share a lot of code. But that doesn't mean that we treat them as one device. And BTW, there are guest visible differences between the KVM IOAPIC/PIC/PIT than the QEMU versions. The only reason PIT live migration works today is because usually delivers all interrupts quickly. But it actually does maintain state in the work queue that isn't saved. If PIT tried to implement gradual catchup, there would be no way not to expose that state to userspace. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 04:28 PM, Anthony Liguori wrote: On 01/11/2011 08:18 AM, Avi Kivity wrote: On 01/11/2011 04:00 PM, Anthony Liguori wrote: On 01/11/2011 03:01 AM, Avi Kivity wrote: On 01/10/2011 10:23 PM, Anthony Liguori wrote: I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They're implemented by kvm. What deeper interaction do you have in mind? The emulated ioapic/pit/pic do not interact with KVM at all. How can they not interact with kvm if they're implemented by kvm? I really don't follow here. emulated ioapic/pit/pic == versions implemented in QEMU. That's what I'm trying to say. When not using the KVM versions of the devices, there are no interactions with KVM. Okay. Isn't that the same for the cpu? Yet we use the same CPUState and are live-migration compatible (as long as cpuids match). The KVM versions should be completely separate devices. Why? Because the KVM versions are replacements. Only the implementation. The guest doesn't see the replacement. They have exactly the same state. I don't see why. Those are just two different implementations for the same guest visible device. Right, they should appear the same to the guest but the fact that they're two different implementations should be reflected in the device tree. Why? To move beyond single-word questions, what is the purpose of the device tree? In my mind, it reflects the virtual hardware. What's important is that we have a PIC, virtio network adapter, and IDE disk. Not that they're backed by kvm, vhost-net, and qcow2. Let me give a very concrete example to illustrate my point. One thing I have on my TODO is to implement catch-up support for the emulated devices. I want to implement three modes of catch-up support: drop, fast, and gradual. Gradual is the best policy IMHO but fast is necessary on older kernels without highres timers. Drop is necessary to maintain compatibility with what we have today. The kernel PIT only implements one mode and even if the other two were added, even the newest version of QEMU needs to deal with the fact that there's old kernels out there with PIT's that only do fast. So how does this get exposed to management tools? Do you check for drift-mode=fast and transparently enable the KVM pit? Do you fail if anything but drift-mode=fast is specified? We need to have the following mechanisms: 1) the ability to select an in-kernel PIT vs. a userspace PIT 2) an independent mechanism to configure the userspace PIT 3) an independent mechanism to configure the in-kernel PIT. The best way to do this is to make the in-kernel PIT a separate device. Then we get all of this for free. And it buys us live migration and ABI issues for the same price. Really, can't we do class i8254 { ... virtual void set_catchup_policy(std::string policy) = 0; ... } to deal with the differences? 2) a user can explicitly create either the emulated version of the device or the in-kernel version of the device (no need for -no-kvm-irqchip) -device ioapic,model=kernel vs. -device kvm-ioapic? Is it really important to do that? 110% of the time we want the kernel irqchips. The remaining -10% are only used for testing. If model=kernel makes the support options different, then you end up introduce another layer of option validation. By using the later form, you get to leverage the option validation of qdev plus it makes it much clearer to users what options are supported in what model because now the documentation is explicit about it. Option validation = internals. ABI = ABI. We can deal with the former in any number of ways, but exposing it to the ABI is forever. 3) a user can pass parameters directly to the in-kernel version of the device that are different from the userspace version (like selecting different interrupt catch-up methods) -device pit,model=qemu,catchup=slew error: catchup=slew not supported in this model I'm not overly concerned about the implementation part. Though I think it's better to have a single implementation with kvm acting as an accelerator, having it the other way is no big deal. What I am worried about is exposing it as a monitor and migration ABI. IMO the only important thing is the spec that the device implements, not what piece of code implements it. Just as we do in the PIT, there's nothing wrong with making the device's migration compatible. Then the two devices have the same migration section id? That's my biggest worry. Not really worried about PIT and PIC (no one uses the user PIT now), but more about future devices moving into the kernel, if we have to do that. I'm not entirely sure what your concerns about the monitor are but there's simply no way to hide the fact that a device is implemented in KVM at the monitor
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 04:36 PM, Anthony Liguori wrote: They need to use the same device id then. And if they share code, that indicates that they need to be the same device even more. No, it really doesn't :-) Cirrus VGA and std VGA share a lot of code. But that doesn't mean that we treat them as one device. Cirrus and VGA really are separate devices. They share code because on evolved from the other, and is backwards compatible with the other. i8254 and i8254-kvm did not evolve from each other, both are implementations of the i8254 spec, and both are 100% compatible with each other (modulu bugs). And BTW, there are guest visible differences between the KVM IOAPIC/PIC/PIT than the QEMU versions. The only reason PIT live migration works today is because usually delivers all interrupts quickly. But it actually does maintain state in the work queue that isn't saved. If PIT tried to implement gradual catchup, there would be no way not to expose that state to userspace. Why not? Whatever state the kernel keeps, we expose to userspace and allow sending it over the wire. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 08:56 AM, Avi Kivity wrote: On 01/11/2011 04:36 PM, Anthony Liguori wrote: They need to use the same device id then. And if they share code, that indicates that they need to be the same device even more. No, it really doesn't :-) Cirrus VGA and std VGA share a lot of code. But that doesn't mean that we treat them as one device. Cirrus and VGA really are separate devices. They share code because on evolved from the other, and is backwards compatible with the other. i8254 and i8254-kvm did not evolve from each other, Actually, they did, but that's besides the point. both are implementations of the i8254 spec, and both are 100% compatible with each other (modulu bugs). And BTW, there are guest visible differences between the KVM IOAPIC/PIC/PIT than the QEMU versions. The only reason PIT live migration works today is because usually delivers all interrupts quickly. But it actually does maintain state in the work queue that isn't saved. If PIT tried to implement gradual catchup, there would be no way not to expose that state to userspace. Why not? Whatever state the kernel keeps, we expose to userspace and allow sending it over the wire. What exactly is the scenario you're concerned about? Migration between userspace HPET and in-kernel HPET? One thing I've been considering is essentially migration filters. It would be a set of rules that essentially were hpet-kvm.* = hpet.* which would allow migration from hpet to hpet-kvm given a translation of state. I think this sort of higher level ruleset would make it easier to support migration between versions of the device model. Of course, that only gives you a forward path. It doesn't give you a backwards path. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 11.01.2011, at 16:12, Anthony Liguori wrote: On 01/11/2011 08:56 AM, Avi Kivity wrote: On 01/11/2011 04:36 PM, Anthony Liguori wrote: They need to use the same device id then. And if they share code, that indicates that they need to be the same device even more. No, it really doesn't :-) Cirrus VGA and std VGA share a lot of code. But that doesn't mean that we treat them as one device. Cirrus and VGA really are separate devices. They share code because on evolved from the other, and is backwards compatible with the other. i8254 and i8254-kvm did not evolve from each other, Actually, they did, but that's besides the point. both are implementations of the i8254 spec, and both are 100% compatible with each other (modulu bugs). And BTW, there are guest visible differences between the KVM IOAPIC/PIC/PIT than the QEMU versions. The only reason PIT live migration works today is because usually delivers all interrupts quickly. But it actually does maintain state in the work queue that isn't saved. If PIT tried to implement gradual catchup, there would be no way not to expose that state to userspace. Why not? Whatever state the kernel keeps, we expose to userspace and allow sending it over the wire. What exactly is the scenario you're concerned about? Migration between userspace HPET and in-kernel HPET? One thing I've been considering is essentially migration filters. It would be a set of rules that essentially were hpet-kvm.* = hpet.* which would allow migration from hpet to hpet-kvm given a translation of state. I think this sort of higher level ruleset would make it easier to support migration between versions of the device model. Of course, that only gives you a forward path. It doesn't give you a backwards path. Why not? Just include the version in the rule set and define a backwards rule if it's easy to do. If not, migration isn't possible. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 05:12 PM, Anthony Liguori wrote: No, it really doesn't :-) Cirrus VGA and std VGA share a lot of code. But that doesn't mean that we treat them as one device. Cirrus and VGA really are separate devices. They share code because on evolved from the other, and is backwards compatible with the other. i8254 and i8254-kvm did not evolve from each other, Actually, they did, but that's besides the point. The code did, the devices did not. Why not? Whatever state the kernel keeps, we expose to userspace and allow sending it over the wire. What exactly is the scenario you're concerned about? Migration between userspace HPET and in-kernel HPET? Yes. To a lesser extent, a client doing 'info hpet' or similar and failing for kernel hpet. One thing I've been considering is essentially migration filters. It would be a set of rules that essentially were hpet-kvm.* = hpet.* which would allow migration from hpet to hpet-kvm given a translation of state. I think this sort of higher level ruleset would make it easier to support migration between versions of the device model. Of course, that only gives you a forward path. It doesn't give you a backwards path. It would be easier to have them use the same device id in the first place. If it looks like an i8254, quacks like an i8254, and live migrates like an i8254, it's probably an i8254. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 09:37 AM, Avi Kivity wrote: Why not? Whatever state the kernel keeps, we expose to userspace and allow sending it over the wire. What exactly is the scenario you're concerned about? Migration between userspace HPET and in-kernel HPET? Yes. To a lesser extent, a client doing 'info hpet' or similar and failing for kernel hpet. That's pretty easy to address. One thing I've been considering is essentially migration filters. It would be a set of rules that essentially were hpet-kvm.* = hpet.* which would allow migration from hpet to hpet-kvm given a translation of state. I think this sort of higher level ruleset would make it easier to support migration between versions of the device model. Of course, that only gives you a forward path. It doesn't give you a backwards path. It would be easier to have them use the same device id in the first place. If it looks like an i8254, quacks like an i8254, and live migrates like an i8254, it's probably an i8254. And that's fine. I'm not suggesting you call it i8253. But it's two separate implementations. We should make that visible, not try to hide it. It's an important detail. Imagine getting a sosreport that includes a dump of the device tree. You really want to see something in there that tells you it's an in-kernel PIT and not the userspace one. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 05:55 PM, Anthony Liguori wrote: One thing I've been considering is essentially migration filters. It would be a set of rules that essentially were hpet-kvm.* = hpet.* which would allow migration from hpet to hpet-kvm given a translation of state. I think this sort of higher level ruleset would make it easier to support migration between versions of the device model. Of course, that only gives you a forward path. It doesn't give you a backwards path. It would be easier to have them use the same device id in the first place. If it looks like an i8254, quacks like an i8254, and live migrates like an i8254, it's probably an i8254. And that's fine. I'm not suggesting you call it i8253. But it's two separate implementations. We should make that visible, not try to hide it. It's an important detail. Visible, yes, but not in live migration, or in 'info i8254', or similar. We can live migrate between qcow2 and qed (using block migration), we should be able to do the same for the two i8254 implementations. I'm not happy about separate implementations, but that's a minor details. We can change it 2n+1 times without anybody noticing. Not so about ABI stuff. Imagine getting a sosreport that includes a dump of the device tree. You really want to see something in there that tells you it's an in-kernel PIT and not the userspace one. Sure. Not the device tree though. The command line would give all the information? Or 'info i8254' can say something about the implementation. I don't want to have the user say 'info i8254-kvm'. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
Visible, yes, but not in live migration, or in 'info i8254', or similar. We can live migrate between qcow2 and qed (using block migration), we should be able to do the same for the two i8254 implementations. I'm not happy about separate implementations, but that's a minor details. We can change it 2n+1 times without anybody noticing. Not so about ABI stuff. Imagine getting a sosreport that includes a dump of the device tree. You really want to see something in there that tells you it's an in-kernel PIT and not the userspace one. Sure. Not the device tree though. The command line would give all the information? Then it's a one off option. We really want as much info as possible stored in the device tree. Or 'info i8254' can say something about the implementation. I don't want to have the user say 'info i8254-kvm'. info doesn't take a qdev device so yes, it can show whatever we want it to show. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/11/2011 06:26 PM, Anthony Liguori wrote: Visible, yes, but not in live migration, or in 'info i8254', or similar. We can live migrate between qcow2 and qed (using block migration), we should be able to do the same for the two i8254 implementations. I'm not happy about separate implementations, but that's a minor details. We can change it 2n+1 times without anybody noticing. Not so about ABI stuff. Imagine getting a sosreport that includes a dump of the device tree. You really want to see something in there that tells you it's an in-kernel PIT and not the userspace one. Sure. Not the device tree though. The command line would give all the information? Then it's a one off option. We really want as much info as possible stored in the device tree. Or 'info i8254' can say something about the implementation. I don't want to have the user say 'info i8254-kvm'. info doesn't take a qdev device so yes, it can show whatever we want it to show. It may be a qdev read-only attribute (and thus not migrated?) if we have to have it in qdev for some reason. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/08/2011 02:47 AM, Jan Kiszka wrote: OK, but I don't want to argue about the ioeventfd API. So let's put this case aside. :) I often reply too quickly without explaining myself. Let me use ioeventfd as an example to highlight why KVMState is a good thing. In real life, PIO and MMIO are never directly communicated to the device from the processor. Instead, they go through a series of other devices. In the case of something like an ISA device, a PIO first goes to the chipset into the PCI complex, it will then go through a PCI-to-ISA bridge via subtractive decoding, and then forward over the ISA device where it will be interpreted by some device. The path to the chipset may be shared among different processors but it may also be unique. The APIC is the best example as there are historic APICs that hung directly off of the CPUs such that the same MMIO access across different CPUs did not go to the same device. This is why the APIC emulation in QEMU is so weird because we don't model this behavior correctly. This means that a PIO operation needs to flow from a CPUState to a DeviceState. It can then flow through to another DeviceState until it's finally handled. The first problem with ioeventfd is that it's a per-VM operation. It should be per VCPU. But even if this were the case, the path that a PIO operation takes should not be impacted by ioeventfd. IOW, a device shouldn't be allocating an eventfd() and handing it to a magical KVM call. Instead, a device should register a callback for a particular port in the same way it always does. *As an optimization*, we should have another interface that says that these values are only valid for this IO port. That would let us create eventfds and register things behind the scenes. That means we can handle TCG, older KVM kernels, and newer KVM kernels without any special support in the device model. It also means that the device models never have to worry about KVMState because there's an entirely different piece of code that's consulting the set of special ports and then deciding how to handle it. The result is better, more portable code that doesn't have KVM-isms. If passing state around seems to be ugly, it's probably because we're not abstracting things correctly. Removing the state and making it implicit is the wrong solution. Fixing the abstraction is the right solution (or living with the ugliness until someone else is motivated to fix it properly). Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
Am 10.01.2011 20:59, Anthony Liguori wrote: On 01/08/2011 02:47 AM, Jan Kiszka wrote: Am 08.01.2011 00:27, Anthony Liguori wrote: On 01/07/2011 03:03 AM, Jan Kiszka wrote: Am 06.01.2011 20:24, Anthony Liguori wrote: On 01/06/2011 11:56 AM, Marcelo Tosatti wrote: From: Jan Kiszkajan.kis...@siemens.com QEMU supports only one VM, so there is only one kvm_state per process, and we gain nothing passing a reference to it around. Eliminate any need to refer to it outside of kvm-all.c. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Alexander Grafag...@suse.de Signed-off-by: Marcelo Tosattimtosa...@redhat.com I think this is a big mistake. Obviously, I don't share your concerns. :) Having to manage kvm_state keeps the abstraction lines well defined. How does it help? Otherwise, it's far too easy for portions of code to call into KVM functions that really shouldn't. I can't imagine we gain anything from requiring kvm_check_extension callers to hold a kvm_state capability. Yes, it's now much easier to call kvm_[vm_]ioctl, but that's the key point of this change: So far we primarily complicated the internal interface between generic and arch-dependent kvm parts by requiring kvm_state joggling. But external users already find interfaces without this restriction (kvm_log_*, kvm_ioeventfd_*, ...). That's because it's at least complicated to _cleanly_ pass kvm_state references to all users that need it - e.g. sysbus devices like kvmclock or upcoming in-kernel irqchips. I think you're basically making my point for me. ioeventfd is a broken interface. It shouldn't be a VM ioctl but rather a VCPU ioctl because PIO events are dispatched on a per-VCPU basis. OK, but I don't want to argue about the ioeventfd API. So let's put this case aside. :) kvm_state is available as part of CPU state so it's quite easy to get at if these interfaces just took a CPUState argument (and they should). My point is definitely NOT about cpu-bound devices. That case is clear and is not touched at all by this patch. My point is about devices that have clear system scope like kvmclock, ioapic, pit, pic, I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. I don't know enough about kvmclock. It's just the same. whatever-the-future-will-bring. And about KVM services that have global scope like capability checks and other feature explorations or VM configurations done by the KVM arch code. You still didn't explain what we gain in these concrete scenarios by handing the technically redundant abstraction kvm_state around, especially _inside_ the KVM core. If you have to pass around a KVMState pointer, you establish an explicit relationship and communication between subsystems. Any place where the global KVMState is used is a red flag that something is wrong. It is and will be _only_ used inside kvm-all.c. Again: What is the benefit of restricting access to kvm_check_extension this way? I don't see what the advantage to making all of the KVMState global and implicit. It seems like a big step backwards to me. Can you give a very concrete example of where you think it results in easier to understand code as I don't see how making relationships implicit ever makes code easier to understand? The best example does not yet exist (fortunately): Just look at patch 28 and then try to pass some kvm_state reference to the kvmclock device. Is this handle worth changing the sysbus API? Jan signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
Am 10.01.2011 21:11, Anthony Liguori wrote: On 01/08/2011 02:47 AM, Jan Kiszka wrote: OK, but I don't want to argue about the ioeventfd API. So let's put this case aside. :) I often reply too quickly without explaining myself. Let me use ioeventfd as an example to highlight why KVMState is a good thing. In real life, PIO and MMIO are never directly communicated to the device from the processor. Instead, they go through a series of other devices. In the case of something like an ISA device, a PIO first goes to the chipset into the PCI complex, it will then go through a PCI-to-ISA bridge via subtractive decoding, and then forward over the ISA device where it will be interpreted by some device. The path to the chipset may be shared among different processors but it may also be unique. The APIC is the best example as there are historic APICs that hung directly off of the CPUs such that the same MMIO access across different CPUs did not go to the same device. This is why the APIC emulation in QEMU is so weird because we don't model this behavior correctly. This means that a PIO operation needs to flow from a CPUState to a DeviceState. It can then flow through to another DeviceState until it's finally handled. The first problem with ioeventfd is that it's a per-VM operation. It should be per VCPU. But even if this were the case, the path that a PIO operation takes should not be impacted by ioeventfd. IOW, a device shouldn't be allocating an eventfd() and handing it to a magical KVM call. Instead, a device should register a callback for a particular port in the same way it always does. *As an optimization*, we should have another interface that says that these values are only valid for this IO port. That would let us create eventfds and register things behind the scenes. That means we can handle TCG, older KVM kernels, and newer KVM kernels without any special support in the device model. It also means that the device models never have to worry about KVMState because there's an entirely different piece of code that's consulting the set of special ports and then deciding how to handle it. The result is better, more portable code that doesn't have KVM-isms. If passing state around seems to be ugly, it's probably because we're not abstracting things correctly. Removing the state and making it implicit is the wrong solution. Fixing the abstraction is the right solution (or living with the ugliness until someone else is motivated to fix it properly). Look at my other reply, it should answer this. ioeventfd is the wrong example IMHO as one may argue about its relation to VCPUS. Jan signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/10/2011 02:12 PM, Jan Kiszka wrote: Am 10.01.2011 20:59, Anthony Liguori wrote: On 01/08/2011 02:47 AM, Jan Kiszka wrote: Am 08.01.2011 00:27, Anthony Liguori wrote: On 01/07/2011 03:03 AM, Jan Kiszka wrote: Am 06.01.2011 20:24, Anthony Liguori wrote: On 01/06/2011 11:56 AM, Marcelo Tosatti wrote: From: Jan Kiszkajan.kis...@siemens.com QEMU supports only one VM, so there is only one kvm_state per process, and we gain nothing passing a reference to it around. Eliminate any need to refer to it outside of kvm-all.c. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Alexander Grafag...@suse.de Signed-off-by: Marcelo Tosattimtosa...@redhat.com I think this is a big mistake. Obviously, I don't share your concerns. :) Having to manage kvm_state keeps the abstraction lines well defined. How does it help? Otherwise, it's far too easy for portions of code to call into KVM functions that really shouldn't. I can't imagine we gain anything from requiring kvm_check_extension callers to hold a kvm_state capability. Yes, it's now much easier to call kvm_[vm_]ioctl, but that's the key point of this change: So far we primarily complicated the internal interface between generic and arch-dependent kvm parts by requiring kvm_state joggling. But external users already find interfaces without this restriction (kvm_log_*, kvm_ioeventfd_*, ...). That's because it's at least complicated to _cleanly_ pass kvm_state references to all users that need it - e.g. sysbus devices like kvmclock or upcoming in-kernel irqchips. I think you're basically making my point for me. ioeventfd is a broken interface. It shouldn't be a VM ioctl but rather a VCPU ioctl because PIO events are dispatched on a per-VCPU basis. OK, but I don't want to argue about the ioeventfd API. So let's put this case aside. :) kvm_state is available as part of CPU state so it's quite easy to get at if these interfaces just took a CPUState argument (and they should). My point is definitely NOT about cpu-bound devices. That case is clear and is not touched at all by this patch. My point is about devices that have clear system scope like kvmclock, ioapic, pit, pic, I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They may be replaced by KVM but if you look at the PIT, this is done by having two distinct devices. The KVM specific device can (and should) be instantiated with kvm_state. The way the IOAPIC/APIC/PIC is handled in qemu-kvm is nasty. The kernel devices are separate devices and that should be reflected in the device tree. I don't know enough about kvmclock. It's just the same. whatever-the-future-will-bring. And about KVM services that have global scope like capability checks and other feature explorations or VM configurations done by the KVM arch code. You still didn't explain what we gain in these concrete scenarios by handing the technically redundant abstraction kvm_state around, especially _inside_ the KVM core. If you have to pass around a KVMState pointer, you establish an explicit relationship and communication between subsystems. Any place where the global KVMState is used is a red flag that something is wrong. It is and will be _only_ used inside kvm-all.c. Again: What is the benefit of restricting access to kvm_check_extension this way? The more places that need to deal with KVM compatibility code, the worse we will be because it's more opportunities to get it wrong. I don't see what the advantage to making all of the KVMState global and implicit. It seems like a big step backwards to me. Can you give a very concrete example of where you think it results in easier to understand code as I don't see how making relationships implicit ever makes code easier to understand? The best example does not yet exist (fortunately): Just look at patch 28 and then try to pass some kvm_state reference to the kvmclock device. Is this handle worth changing the sysbus API? Let me look at that patch and reply there. Regards, Anthony Liguori Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
Am 10.01.2011 21:23, Anthony Liguori wrote: On 01/10/2011 02:12 PM, Jan Kiszka wrote: Am 10.01.2011 20:59, Anthony Liguori wrote: On 01/08/2011 02:47 AM, Jan Kiszka wrote: Am 08.01.2011 00:27, Anthony Liguori wrote: On 01/07/2011 03:03 AM, Jan Kiszka wrote: Am 06.01.2011 20:24, Anthony Liguori wrote: On 01/06/2011 11:56 AM, Marcelo Tosatti wrote: From: Jan Kiszkajan.kis...@siemens.com QEMU supports only one VM, so there is only one kvm_state per process, and we gain nothing passing a reference to it around. Eliminate any need to refer to it outside of kvm-all.c. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Alexander Grafag...@suse.de Signed-off-by: Marcelo Tosattimtosa...@redhat.com I think this is a big mistake. Obviously, I don't share your concerns. :) Having to manage kvm_state keeps the abstraction lines well defined. How does it help? Otherwise, it's far too easy for portions of code to call into KVM functions that really shouldn't. I can't imagine we gain anything from requiring kvm_check_extension callers to hold a kvm_state capability. Yes, it's now much easier to call kvm_[vm_]ioctl, but that's the key point of this change: So far we primarily complicated the internal interface between generic and arch-dependent kvm parts by requiring kvm_state joggling. But external users already find interfaces without this restriction (kvm_log_*, kvm_ioeventfd_*, ...). That's because it's at least complicated to _cleanly_ pass kvm_state references to all users that need it - e.g. sysbus devices like kvmclock or upcoming in-kernel irqchips. I think you're basically making my point for me. ioeventfd is a broken interface. It shouldn't be a VM ioctl but rather a VCPU ioctl because PIO events are dispatched on a per-VCPU basis. OK, but I don't want to argue about the ioeventfd API. So let's put this case aside. :) kvm_state is available as part of CPU state so it's quite easy to get at if these interfaces just took a CPUState argument (and they should). My point is definitely NOT about cpu-bound devices. That case is clear and is not touched at all by this patch. My point is about devices that have clear system scope like kvmclock, ioapic, pit, pic, I don't see how ioapic, pit, or pic have a system scope. They are not bound to any CPU like the APIC which you may have in mind. And none of the above interact with KVM. They may be replaced by KVM but if you look at the PIT, this is done by having two distinct devices. The KVM specific device can (and should) be instantiated with kvm_state. The way the IOAPIC/APIC/PIC is handled in qemu-kvm is nasty. The kernel devices are separate devices and that should be reflected in the device tree. If separate device or hack to existing one - both need to sync their user space state with the kernel when QEMU asks them to. That's how they have to interact with KVM all the time. Same for kvmclock if you want to look at a really trivial example. I don't know enough about kvmclock. It's just the same. whatever-the-future-will-bring. And about KVM services that have global scope like capability checks and other feature explorations or VM configurations done by the KVM arch code. You still didn't explain what we gain in these concrete scenarios by handing the technically redundant abstraction kvm_state around, especially _inside_ the KVM core. If you have to pass around a KVMState pointer, you establish an explicit relationship and communication between subsystems. Any place where the global KVMState is used is a red flag that something is wrong. It is and will be _only_ used inside kvm-all.c. Again: What is the benefit of restricting access to kvm_check_extension this way? The more places that need to deal with KVM compatibility code, the worse we will be because it's more opportunities to get it wrong. That code belongs where the related logic is. IMHO, it would be a needless abstraction to push in-kernel access services and workaround definitions in the KVM core instead of the KVM device model code - provided there is only one user. But this discussion is a bit abstract right now as we do not yet have anything more complex than kvmclock on the table for QEMU. I don't see what the advantage to making all of the KVMState global and implicit. It seems like a big step backwards to me. Can you give a very concrete example of where you think it results in easier to understand code as I don't see how making relationships implicit ever makes code easier to understand? The best example does not yet exist (fortunately): Just look at patch 28 and then try to pass some kvm_state reference to the kvmclock
Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
Am 08.01.2011 00:27, Anthony Liguori wrote: On 01/07/2011 03:03 AM, Jan Kiszka wrote: Am 06.01.2011 20:24, Anthony Liguori wrote: On 01/06/2011 11:56 AM, Marcelo Tosatti wrote: From: Jan Kiszkajan.kis...@siemens.com QEMU supports only one VM, so there is only one kvm_state per process, and we gain nothing passing a reference to it around. Eliminate any need to refer to it outside of kvm-all.c. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Alexander Grafag...@suse.de Signed-off-by: Marcelo Tosattimtosa...@redhat.com I think this is a big mistake. Obviously, I don't share your concerns. :) Having to manage kvm_state keeps the abstraction lines well defined. How does it help? Otherwise, it's far too easy for portions of code to call into KVM functions that really shouldn't. I can't imagine we gain anything from requiring kvm_check_extension callers to hold a kvm_state capability. Yes, it's now much easier to call kvm_[vm_]ioctl, but that's the key point of this change: So far we primarily complicated the internal interface between generic and arch-dependent kvm parts by requiring kvm_state joggling. But external users already find interfaces without this restriction (kvm_log_*, kvm_ioeventfd_*, ...). That's because it's at least complicated to _cleanly_ pass kvm_state references to all users that need it - e.g. sysbus devices like kvmclock or upcoming in-kernel irqchips. I think you're basically making my point for me. ioeventfd is a broken interface. It shouldn't be a VM ioctl but rather a VCPU ioctl because PIO events are dispatched on a per-VCPU basis. OK, but I don't want to argue about the ioeventfd API. So let's put this case aside. :) kvm_state is available as part of CPU state so it's quite easy to get at if these interfaces just took a CPUState argument (and they should). My point is definitely NOT about cpu-bound devices. That case is clear and is not touched at all by this patch. My point is about devices that have clear system scope like kvmclock, ioapic, pit, pic, whatever-the-future-will-bring. And about KVM services that have global scope like capability checks and other feature explorations or VM configurations done by the KVM arch code. You still didn't explain what we gain in these concrete scenarios by handing the technically redundant abstraction kvm_state around, especially _inside_ the KVM core. Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
Am 06.01.2011 20:24, Anthony Liguori wrote: On 01/06/2011 11:56 AM, Marcelo Tosatti wrote: From: Jan Kiszkajan.kis...@siemens.com QEMU supports only one VM, so there is only one kvm_state per process, and we gain nothing passing a reference to it around. Eliminate any need to refer to it outside of kvm-all.c. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Alexander Grafag...@suse.de Signed-off-by: Marcelo Tosattimtosa...@redhat.com I think this is a big mistake. Obviously, I don't share your concerns. :) Having to manage kvm_state keeps the abstraction lines well defined. How does it help? Otherwise, it's far too easy for portions of code to call into KVM functions that really shouldn't. I can't imagine we gain anything from requiring kvm_check_extension callers to hold a kvm_state capability. Yes, it's now much easier to call kvm_[vm_]ioctl, but that's the key point of this change: So far we primarily complicated the internal interface between generic and arch-dependent kvm parts by requiring kvm_state joggling. But external users already find interfaces without this restriction (kvm_log_*, kvm_ioeventfd_*, ...). That's because it's at least complicated to _cleanly_ pass kvm_state references to all users that need it - e.g. sysbus devices like kvmclock or upcoming in-kernel irqchips. Let's just stop this artificial abstraction that has no practical use and focus on detecting layering violations via code review. That's more reliable IMHO. Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/07/2011 03:03 AM, Jan Kiszka wrote: Am 06.01.2011 20:24, Anthony Liguori wrote: On 01/06/2011 11:56 AM, Marcelo Tosatti wrote: From: Jan Kiszkajan.kis...@siemens.com QEMU supports only one VM, so there is only one kvm_state per process, and we gain nothing passing a reference to it around. Eliminate any need to refer to it outside of kvm-all.c. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Alexander Grafag...@suse.de Signed-off-by: Marcelo Tosattimtosa...@redhat.com I think this is a big mistake. Obviously, I don't share your concerns. :) Having to manage kvm_state keeps the abstraction lines well defined. How does it help? Otherwise, it's far too easy for portions of code to call into KVM functions that really shouldn't. I can't imagine we gain anything from requiring kvm_check_extension callers to hold a kvm_state capability. Yes, it's now much easier to call kvm_[vm_]ioctl, but that's the key point of this change: So far we primarily complicated the internal interface between generic and arch-dependent kvm parts by requiring kvm_state joggling. But external users already find interfaces without this restriction (kvm_log_*, kvm_ioeventfd_*, ...). That's because it's at least complicated to _cleanly_ pass kvm_state references to all users that need it - e.g. sysbus devices like kvmclock or upcoming in-kernel irqchips. I think you're basically making my point for me. ioeventfd is a broken interface. It shouldn't be a VM ioctl but rather a VCPU ioctl because PIO events are dispatched on a per-VCPU basis. kvm_state is available as part of CPU state so it's quite easy to get at if these interfaces just took a CPUState argument (and they should). Let's just stop this artificial abstraction that has no practical use and focus on detecting layering violations via code review. That's more reliable IMHO. Documenting relationships between devices and the CPU is a very important task. Being able to grep for cpu_single_env to see what devices models are interacting with the CPU is a very good thing. When you've got these interactions hidden in a spaghetti of function calls, things become impossible to understand. Regards, Anthony Liguori Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 26/35] kvm: Eliminate KVMState arguments
From: Jan Kiszka jan.kis...@siemens.com QEMU supports only one VM, so there is only one kvm_state per process, and we gain nothing passing a reference to it around. Eliminate any need to refer to it outside of kvm-all.c. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Alexander Graf ag...@suse.de Signed-off-by: Marcelo Tosatti mtosa...@redhat.com --- cpu-defs.h|2 - kvm-all.c | 232 + kvm-stub.c|2 +- kvm.h | 15 +-- target-i386/cpuid.c |9 +- target-i386/kvm.c | 77 target-i386/kvm_x86.h |3 + target-ppc/kvm.c | 12 ++-- target-s390x/kvm.c|8 +-- 9 files changed, 160 insertions(+), 200 deletions(-) diff --git a/cpu-defs.h b/cpu-defs.h index 8d4bf86..0e04239 100644 --- a/cpu-defs.h +++ b/cpu-defs.h @@ -131,7 +131,6 @@ typedef struct icount_decr_u16 { #endif struct kvm_run; -struct KVMState; struct qemu_work_item; typedef struct CPUBreakpoint { @@ -207,7 +206,6 @@ typedef struct CPUWatchpoint { struct QemuCond *halt_cond; \ struct qemu_work_item *queued_work_first, *queued_work_last;\ const char *cpu_model_str; \ -struct KVMState *kvm_state; \ struct kvm_run *kvm_run;\ int kvm_fd; \ int kvm_vcpu_dirty; diff --git a/kvm-all.c b/kvm-all.c index ef2ca3b..d8820c7 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -52,8 +52,7 @@ typedef struct KVMSlot typedef struct kvm_dirty_log KVMDirtyLog; -struct KVMState -{ +static struct KVMState { KVMSlot slots[32]; int fd; int vmfd; @@ -72,21 +71,19 @@ struct KVMState int irqchip_in_kernel; int pit_in_kernel; int xsave, xcrs; -}; - -static KVMState *kvm_state; +} kvm_state; -static KVMSlot *kvm_alloc_slot(KVMState *s) +static KVMSlot *kvm_alloc_slot(void) { int i; -for (i = 0; i ARRAY_SIZE(s-slots); i++) { +for (i = 0; i ARRAY_SIZE(kvm_state.slots); i++) { /* KVM private memory slots */ if (i = 8 i 12) { continue; } -if (s-slots[i].memory_size == 0) { -return s-slots[i]; +if (kvm_state.slots[i].memory_size == 0) { +return kvm_state.slots[i]; } } @@ -94,14 +91,13 @@ static KVMSlot *kvm_alloc_slot(KVMState *s) abort(); } -static KVMSlot *kvm_lookup_matching_slot(KVMState *s, - target_phys_addr_t start_addr, +static KVMSlot *kvm_lookup_matching_slot(target_phys_addr_t start_addr, target_phys_addr_t end_addr) { int i; -for (i = 0; i ARRAY_SIZE(s-slots); i++) { -KVMSlot *mem = s-slots[i]; +for (i = 0; i ARRAY_SIZE(kvm_state.slots); i++) { +KVMSlot *mem = kvm_state.slots[i]; if (start_addr == mem-start_addr end_addr == mem-start_addr + mem-memory_size) { @@ -115,15 +111,14 @@ static KVMSlot *kvm_lookup_matching_slot(KVMState *s, /* * Find overlapping slot with lowest start address */ -static KVMSlot *kvm_lookup_overlapping_slot(KVMState *s, -target_phys_addr_t start_addr, +static KVMSlot *kvm_lookup_overlapping_slot(target_phys_addr_t start_addr, target_phys_addr_t end_addr) { KVMSlot *found = NULL; int i; -for (i = 0; i ARRAY_SIZE(s-slots); i++) { -KVMSlot *mem = s-slots[i]; +for (i = 0; i ARRAY_SIZE(kvm_state.slots); i++) { +KVMSlot *mem = kvm_state.slots[i]; if (mem-memory_size == 0 || (found found-start_addr mem-start_addr)) { @@ -139,13 +134,13 @@ static KVMSlot *kvm_lookup_overlapping_slot(KVMState *s, return found; } -int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr, +int kvm_physical_memory_addr_from_ram(ram_addr_t ram_addr, target_phys_addr_t *phys_addr) { int i; -for (i = 0; i ARRAY_SIZE(s-slots); i++) { -KVMSlot *mem = s-slots[i]; +for (i = 0; i ARRAY_SIZE(kvm_state.slots); i++) { +KVMSlot *mem = kvm_state.slots[i]; if (ram_addr = mem-phys_offset ram_addr mem-phys_offset + mem-memory_size) { @@ -157,7 +152,7 @@ int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr, return 0; } -static int kvm_set_user_memory_region(KVMState *s, KVMSlot *slot) +static int kvm_set_user_memory_region(KVMSlot *slot) { struct kvm_userspace_memory_region mem; @@ -166,10 +161,10 @@ static int kvm_set_user_memory_region(KVMState *s, KVMSlot *slot) mem.memory_size = slot-memory_size; mem.userspace_addr = (unsigned
Re: [PATCH 26/35] kvm: Eliminate KVMState arguments
On 01/06/2011 11:56 AM, Marcelo Tosatti wrote: From: Jan Kiszkajan.kis...@siemens.com QEMU supports only one VM, so there is only one kvm_state per process, and we gain nothing passing a reference to it around. Eliminate any need to refer to it outside of kvm-all.c. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Alexander Grafag...@suse.de Signed-off-by: Marcelo Tosattimtosa...@redhat.com I think this is a big mistake. Having to manage kvm_state keeps the abstraction lines well defined. Otherwise, it's far too easy for portions of code to call into KVM functions that really shouldn't. Regards, Anthony Liguori --- cpu-defs.h|2 - kvm-all.c | 232 + kvm-stub.c|2 +- kvm.h | 15 +-- target-i386/cpuid.c |9 +- target-i386/kvm.c | 77 target-i386/kvm_x86.h |3 + target-ppc/kvm.c | 12 ++-- target-s390x/kvm.c|8 +-- 9 files changed, 160 insertions(+), 200 deletions(-) diff --git a/cpu-defs.h b/cpu-defs.h index 8d4bf86..0e04239 100644 --- a/cpu-defs.h +++ b/cpu-defs.h @@ -131,7 +131,6 @@ typedef struct icount_decr_u16 { #endif struct kvm_run; -struct KVMState; struct qemu_work_item; typedef struct CPUBreakpoint { @@ -207,7 +206,6 @@ typedef struct CPUWatchpoint { struct QemuCond *halt_cond; \ struct qemu_work_item *queued_work_first, *queued_work_last;\ const char *cpu_model_str; \ -struct KVMState *kvm_state; \ struct kvm_run *kvm_run;\ int kvm_fd; \ int kvm_vcpu_dirty; diff --git a/kvm-all.c b/kvm-all.c index ef2ca3b..d8820c7 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -52,8 +52,7 @@ typedef struct KVMSlot typedef struct kvm_dirty_log KVMDirtyLog; -struct KVMState -{ +static struct KVMState { KVMSlot slots[32]; int fd; int vmfd; @@ -72,21 +71,19 @@ struct KVMState int irqchip_in_kernel; int pit_in_kernel; int xsave, xcrs; -}; - -static KVMState *kvm_state; +} kvm_state; -static KVMSlot *kvm_alloc_slot(KVMState *s) +static KVMSlot *kvm_alloc_slot(void) { int i; -for (i = 0; i ARRAY_SIZE(s-slots); i++) { +for (i = 0; i ARRAY_SIZE(kvm_state.slots); i++) { /* KVM private memory slots */ if (i= 8 i 12) { continue; } -if (s-slots[i].memory_size == 0) { -returns-slots[i]; +if (kvm_state.slots[i].memory_size == 0) { +returnkvm_state.slots[i]; } } @@ -94,14 +91,13 @@ static KVMSlot *kvm_alloc_slot(KVMState *s) abort(); } -static KVMSlot *kvm_lookup_matching_slot(KVMState *s, - target_phys_addr_t start_addr, +static KVMSlot *kvm_lookup_matching_slot(target_phys_addr_t start_addr, target_phys_addr_t end_addr) { int i; -for (i = 0; i ARRAY_SIZE(s-slots); i++) { -KVMSlot *mem =s-slots[i]; +for (i = 0; i ARRAY_SIZE(kvm_state.slots); i++) { +KVMSlot *mem =kvm_state.slots[i]; if (start_addr == mem-start_addr end_addr == mem-start_addr + mem-memory_size) { @@ -115,15 +111,14 @@ static KVMSlot *kvm_lookup_matching_slot(KVMState *s, /* * Find overlapping slot with lowest start address */ -static KVMSlot *kvm_lookup_overlapping_slot(KVMState *s, -target_phys_addr_t start_addr, +static KVMSlot *kvm_lookup_overlapping_slot(target_phys_addr_t start_addr, target_phys_addr_t end_addr) { KVMSlot *found = NULL; int i; -for (i = 0; i ARRAY_SIZE(s-slots); i++) { -KVMSlot *mem =s-slots[i]; +for (i = 0; i ARRAY_SIZE(kvm_state.slots); i++) { +KVMSlot *mem =kvm_state.slots[i]; if (mem-memory_size == 0 || (found found-start_addr mem-start_addr)) { @@ -139,13 +134,13 @@ static KVMSlot *kvm_lookup_overlapping_slot(KVMState *s, return found; } -int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr, +int kvm_physical_memory_addr_from_ram(ram_addr_t ram_addr, target_phys_addr_t *phys_addr) { int i; -for (i = 0; i ARRAY_SIZE(s-slots); i++) { -KVMSlot *mem =s-slots[i]; +for (i = 0; i ARRAY_SIZE(kvm_state.slots); i++) { +KVMSlot *mem =kvm_state.slots[i]; if (ram_addr= mem-phys_offset ram_addr mem-phys_offset + mem-memory_size) { @@ -157,7 +152,7 @@ int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr, return 0; } -static int kvm_set_user_memory_region(KVMState *s,