Re: [Qemu-devel] DOS VM problem with QEMU-KVM and newer kernels
On Mon, Apr 16, 2012 at 12:25:37PM +0200, Jan Kiszka wrote:
> On 2012-04-15 11:44, Avi Kivity wrote:
> > Jan, Joerg, was an AMD erratum published for the bug?
> 
> It wasn't an erratum but a documented feature limitation in the AMD
> architecture that was simply ignored by the old code.

Right. But there is an erratum on K8 (only) which Kevin ran into. It is
documented as Erratum 701, and the bug is that no EXITINTINFO is stored
on a task-switch intercept on K8.

	Joerg

-- 
AMD Operating System Research Center
Advanced Micro Devices GmbH, Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Fri, Aug 26, 2011 at 12:24:23AM -0400, David Gibson wrote:
> On Thu, Aug 25, 2011 at 08:25:45AM -0500, Alexander Graf wrote:
> > On 25.08.2011, at 07:31, Roedel, Joerg wrote:
> > > For mmio we could stop the guest and replace the mmio region with
> > > a region that is filled with 0xff, no?
> > 
> > Sure, but that happens in user space. The question is how does
> > kernel space enforce an MMIO region to not be mapped after the
> > hotplug event occurred? Keep in mind that user space is pretty much
> > untrusted here - it doesn't have to be QEMU. It could just as well
> > be a generic user space driver. And that can just ignore hotplug
> > events.
> 
> We're saying you hard yank the mapping from the userspace process.
> That is, you invalidate all its PTEs mapping the MMIO space, and
> don't let it fault them back in. As I see it there are two options:
> (a) make subsequent accesses from userspace or the guest result in a
> SIGBUS that userspace must either handle or die from, or (b) replace
> the mapping with a dummy RO mapping containing 0xff, with any trapped
> writes emulated as nops.

The biggest problem with this approach is that it has to happen in the
context of the given process. Linux can't really modify an mm which
belongs to another context in a safe way.

The more I think about this, the more I come to the conclusion that it
would be best to just kill the process accessing the device if it is
manually de-assigned from vfio. It should be a non-standard path
anyway, so it doesn't make a lot of sense to implement complicated
handling semantics for it, no?

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Fri, Aug 26, 2011 at 12:20:00AM -0400, David Gibson wrote:
> On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
> > On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
> > > On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
> > > > I don't see a reason to make this meta-grouping static. It
> > > > would harm flexibility on x86. I think it makes things easier
> > > > on power but there are options on that platform to get the
> > > > dynamic solution too.
> > > 
> > > I think several people are misreading what Ben means by static.
> > > I would prefer to say 'persistent', in that the meta-groups
> > > lifetime is not tied to an fd, but they can be freely created,
> > > altered and removed during runtime.
> > 
> > Even if it can be altered at runtime, from a usability perspective
> > it is certainly best to handle these groups directly in qemu. Or
> > are there strong reasons to do it somewhere else?
> 
> Funny, Ben and I think usability demands it be the other way around.

The reason is that you mean usability for the programmer and I mean it
for the actual user of qemu :)

> If the meta-groups are transient - that is, lifetime tied to an fd -
> then any program that wants to use meta-groups *must* know the
> interfaces for creating one, whatever they are. But if they're
> persistent, the admin can use other tools to create the meta-group
> and then just hand it to a program to use, since the interfaces for
> _using_ a meta-group are identical to those for an atomic group.
> 
> This doesn't preclude a program from being meta-group aware, and
> creating its own if it wants to, of course. My guess is that qemu
> would not want to build its own meta-groups, but libvirt probably
> would.

Doing it in libvirt makes it really hard for a plain user of qemu to
assign more than one device to a guest. What I want is that a user
just types

	qemu -device vfio,host=00:01.0 -device vfio,host=00:02.0 ...

and it just works. Qemu creates the meta-groups and they are
automatically destroyed when qemu exits.

That the programs are not aware of meta-groups is not a big problem,
because all software using vfio still needs to be written :)

Btw, with this concept the programmer can still decide not to use
meta-groups and just multiplex the mappings to all open device-fds it
uses.

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Wed, Aug 24, 2011 at 10:56:13AM -0400, Alex Williamson wrote:
> On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote:
> > A side-note: Might it be better to expose assigned devices in a
> > guest on a separate bus? This will make it easier to emulate an
> > IOMMU for the guest inside qemu.
> 
> I think we want that option, sure. A lot of guests aren't going to
> support hotplugging buses though, so I think our default,
> map-the-entire-guest model should still be using bus 0. The ACPI
> gets a lot more complicated for that model too; dynamic SSDTs?
> Thanks,

Ok, if only AMD-Vi is to be emulated then it is not strictly
necessary. For this IOMMU we can specify that devices on the same bus
belong to different IOMMUs. So we can implement one IOMMU that handles
internal qemu devices and one that handles pass-through devices. Not
sure if this is possible with VT-d too. VT-d emulation would probably
also require the emulation of a PCIe bridge for those devices, no?

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
Hi Alex,

On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
> Is this roughly what you're thinking of for the iommu_group
> component? Adding a dev_to_group iommu ops callback lets us
> consolidate the sysfs support in the iommu base. Would AMD-Vi do
> something similar (or exactly the same) for group #s? Thanks,

The concept looks good, I have some comments, though. On AMD-Vi the
implementation would look a bit different because there is a
data-structure where the information can be gathered from, so no need
for PCI bus scanning there.

> diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> index 6e6b6a1..6b54c1a 100644
> --- a/drivers/base/iommu.c
> +++ b/drivers/base/iommu.c
> @@ -17,20 +17,56 @@
>   */
>  
>  #include <linux/bug.h>
> +#include <linux/device.h>
>  #include <linux/types.h>
>  #include <linux/module.h>
>  #include <linux/slab.h>
>  #include <linux/errno.h>
>  #include <linux/iommu.h>
> +#include <linux/pci.h>
>  
>  static struct iommu_ops *iommu_ops;
>  
> +static ssize_t show_iommu_group(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%lx", iommu_dev_to_group(dev));

Probably add a 0x prefix so userspace knows the format?

> +}
> +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
> +
> +static int add_iommu_group(struct device *dev, void *unused)
> +{
> +	if (iommu_dev_to_group(dev) >= 0)
> +		return device_create_file(dev, &dev_attr_iommu_group);
> +
> +	return 0;
> +}
> +
> +static int device_notifier(struct notifier_block *nb,
> +			   unsigned long action, void *data)
> +{
> +	struct device *dev = data;
> +
> +	if (action == BUS_NOTIFY_ADD_DEVICE)
> +		return add_iommu_group(dev, NULL);
> +
> +	return 0;
> +}
> +
> +static struct notifier_block device_nb = {
> +	.notifier_call = device_notifier,
> +};
> +
>  void register_iommu(struct iommu_ops *ops)
>  {
>  	if (iommu_ops)
>  		BUG();
>  
>  	iommu_ops = ops;
> +
> +	/* FIXME - non-PCI, really want for_each_bus() */
> +	bus_register_notifier(&pci_bus_type, &device_nb);
> +	bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
>  }

We need to solve this differently. ARM is starting to use the
iommu-api too and this definitely does not work there. One possible
solution might be to make the iommu-ops per-bus.

>  bool iommu_found(void)
> @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
>  }
>  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
>  
> +long iommu_dev_to_group(struct device *dev)
> +{
> +	if (iommu_ops->dev_to_group)
> +		return iommu_ops->dev_to_group(dev);
> +	return -ENODEV;
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_to_group);

Please rename this to iommu_device_group(). The dev_to_group name
suggests a conversion, but it is actually just a property of the
device. Also the return type should not be long but something that
fits into 32 bits on all platforms. Since you use -ENODEV, probably
s32 is a good choice.

> +
>  int iommu_map(struct iommu_domain *domain, unsigned long iova,
>  	      phys_addr_t paddr, int gfp_order, int prot)
>  {
> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> index f02c34d..477259c 100644
> --- a/drivers/pci/intel-iommu.c
> +++ b/drivers/pci/intel-iommu.c
> @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
>  static int dmar_forcedac;
>  static int intel_iommu_strict;
>  static int intel_iommu_superpage = 1;
> +static int intel_iommu_no_mf_groups;
>  
>  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
>  static DEFINE_SPINLOCK(device_domain_lock);
> @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
>  			printk(KERN_INFO
>  				"Intel-IOMMU: disable supported super page\n");
>  			intel_iommu_superpage = 0;
> +		} else if (!strncmp(str, "no_mf_groups", 12)) {
> +			printk(KERN_INFO
> +				"Intel-IOMMU: disable separate groups for multifunction devices\n");
> +			intel_iommu_no_mf_groups = 1;

This should really be a global iommu option and not be VT-d specific.

>  		str += strcspn(str, ",");
> @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
>  	return 0;
>  }
>  
> +/* Group numbers are arbitrary.  Devices with the same group number
> + * indicate the iommu cannot differentiate between them.  To avoid
> + * tracking used groups we just use the seg|bus|devfn of the lowest
> + * level we're able to differentiate devices */
> +static long intel_iommu_dev_to_group(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct pci_dev *bridge;
> +	union {
> +		struct {
> +			u8 devfn;
> +			u8 bus;
> +			u16 segment;
> +		} pci;
> +		u32 group;
> +	} id;
> +
> +	if (iommu_no_mapping(dev))
> +		return -ENODEV;
> +
> +
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
> On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> > On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> > > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> > > > Handling it through fds is a good idea. This makes sure that
> > > > everything belongs to one process. I am not really sure yet if
> > > > we go the way to just bind plain groups together or if we
> > > > create meta-groups. The meta-groups thing seems somewhat
> > > > cleaner, though.
> > > 
> > > I'm leaning towards binding because we need to make it dynamic,
> > > but I don't really have a good picture of the lifecycle of a
> > > meta-group.
> > 
> > In my view the life-cycle of the meta-group is a subrange of the
> > qemu instance's life-cycle.
> 
> I guess I mean the lifecycle of a super-group that's actually exposed
> as a new group in sysfs. Who creates it? How? How are groups
> dynamically added and removed from the super-group? The group merging
> makes sense to me because it's largely just an optimization that qemu
> will try to merge groups. If it works, great. If not, it manages them
> separately. When all the devices from a group are unplugged, unmerge
> the group if necessary.

Right. The super-group thing is an optimization.

> We need to try the polite method of attempting to hot unplug the
> device from qemu first, which the current vfio code already
> implements. We can then escalate if it doesn't respond. The current
> code calls abort in qemu if the guest doesn't respond, but I agree we
> should also be enforcing this at the kernel interface.

I think the problem with the hard-unplug is that we don't have a good
revoke mechanism for the mmio mmaps. For mmio we could stop the guest
and replace the mmio region with a region that is filled with 0xff,
no?

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Thu, Aug 25, 2011 at 11:38:09AM -0400, Don Dutile wrote:
> On 08/25/2011 06:54 AM, Roedel, Joerg wrote:
> > We need to solve this differently. ARM is starting to use the
> > iommu-api too and this definitely does not work there. One possible
> > solution might be to make the iommu-ops per-bus.
> 
> When you think of a system where there isn't just one bus-type with
> iommu support, it makes more sense. Additionally, it also allows the
> long-term architecture to use different types of IOMMUs on each bus
> segment -- think per-PCIe-switch/bridge IOMMUs -- esp. 'tuned'
> IOMMUs -- ones better geared for networks, ones better geared for
> direct-attach disk hba's.

Not sure how likely it is to have different types of IOMMUs within a
given bus-type. But if they become reality, we can multiplex in the
iommu-api without much hassle :)

For now, something like bus_set_iommu() or bus_register_iommu() would
provide a nice way to do bus-specific setups for a given iommu
implementation.

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> > Handling it through fds is a good idea. This makes sure that
> > everything belongs to one process. I am not really sure yet if we
> > go the way to just bind plain groups together or if we create
> > meta-groups. The meta-groups thing seems somewhat cleaner, though.
> 
> I'm leaning towards binding because we need to make it dynamic, but
> I don't really have a good picture of the lifecycle of a meta-group.

In my view the life-cycle of the meta-group is a subrange of the
qemu instance's life-cycle.

> > Putting the process to sleep (which would be uninterruptible) seems
> > bad. The process would sleep until the guest releases the
> > device-group, which can take days or months. The best thing (and
> > the most intrusive :-) ) is to change the PCI core to allow
> > unbindings to fail, I think. But this probably further complicates
> > the way to upstream VFIO...
> 
> Yes, it's not ideal but I think it's sufficient for now and if we
> later get support for returning an error from release, we can set a
> timeout after notifying the user to make use of that. Thanks,

Ben had the idea of just forcing a hard-unplug of the device from the
guest. That's probably the best way to deal with it, I think. VFIO
sends a notification to qemu that the device is gone and qemu informs
the guest in some way about it.

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Tue, Aug 23, 2011 at 07:35:37PM -0400, Benjamin Herrenschmidt wrote:
> On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote:
> > Hmm, good idea. But as far as I know the hotplug-event needs to be
> > in the guest _before_ the device is actually unplugged (so that the
> > guest can unbind its driver first). That somehow brings back the
> > sleep-idea and the timeout in the .release function.
> 
> That's for normal assisted hotplug, but don't we support hard
> hotplug? I mean, things like cardbus, thunderbolt (if we ever support
> that) etc... will need it, and some platforms do support hard hotplug
> of PCIe devices. (That's why drivers should never spin on MMIO
> waiting for a 1 bit to clear without a timeout :-)

Right, that's probably the best semantic for this issue then. The
worst thing that happens is that the admin crashes the guest.

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote:
> On 8/23/11 4:04 AM, Joerg Roedel <joerg.roe...@amd.com> wrote:
> > That makes uiommu basically the same as the meta-groups, right?
> 
> Yes, functionality seems the same, thus my suggestion to keep uiommu
> explicit. Is there some need for group-groups besides defining sets
> of groups which share IOMMU resources?
> 
> I do all this stuff (bringing up sets of devices which may share an
> IOMMU domain) dynamically from C applications. I don't really want
> some static (boot-time or sysfs fiddling) supergroup config unless
> there is a good reason KVM/power needs it. As you say in your next
> email, doing it all from ioctls is very easy, programmatically.

I don't see a reason to make this meta-grouping static. It would harm
flexibility on x86. I think it makes things easier on power, but there
are options on that platform to get the dynamic solution too.

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
> On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
> > I don't see a reason to make this meta-grouping static. It would
> > harm flexibility on x86. I think it makes things easier on power
> > but there are options on that platform to get the dynamic solution
> > too.
> 
> I think several people are misreading what Ben means by static. I
> would prefer to say 'persistent', in that the meta-groups lifetime is
> not tied to an fd, but they can be freely created, altered and
> removed during runtime.

Even if it can be altered at runtime, from a usability perspective it
is certainly best to handle these groups directly in qemu. Or are
there strong reasons to do it somewhere else?

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote:
> > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > assigned to a guest, there can also be an ioctl to bind a group to
> > an address-space of another group (certainly needs some care to not
> > allow that both groups belong to different processes).
> > 
> > Btw, a problem we haven't talked about yet entirely is
> > driver-deassignment. User space can decide to de-assign the device
> > from vfio while a fd is open on it. With PCI there is no way to let
> > this fail (the .release function returns void last time I checked).
> > Is this a problem, and if yes, how do we handle it?
> 
> We can treat it as a hard unplug (like a cardbus gone away). IE.
> Dispose of the direct mappings (switch to MMIO emulation) and return
> all ff's from reads (& ignore writes). Then send an unplug event via
> whatever mechanism the platform provides (ACPI hotplug controller on
> x86 for example, we haven't quite sorted out what to do on power for
> hotplug yet).

Hmm, good idea. But as far as I know the hotplug-event needs to be in
the guest _before_ the device is actually unplugged (so that the guest
can unbind its driver first). That somehow brings back the sleep-idea
and the timeout in the .release function.

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Mon, Aug 22, 2011 at 03:17:00PM -0400, Alex Williamson wrote:
> On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote:
> > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > assigned to a guest, there can also be an ioctl to bind a group to
> > an address-space of another group (certainly needs some care to not
> > allow that both groups belong to different processes).
> 
> That's an interesting idea. Maybe an interface similar to the current
> uiommu interface, where you open() the 2nd group fd and pass the fd
> via ioctl to the primary group. IOMMUs that don't support this would
> fail the attach-device callback, which would fail the ioctl to bind
> them. It will need to be designed so any group can be removed from
> the super-set and the remaining group(s) still work. This feels like
> something that can be added after we get an initial implementation.

Handling it through fds is a good idea. This makes sure that
everything belongs to one process. I am not really sure yet if we go
the way to just bind plain groups together or if we create
meta-groups. The meta-groups thing seems somewhat cleaner, though.

> > Btw, a problem we haven't talked about yet entirely is
> > driver-deassignment. User space can decide to de-assign the device
> > from vfio while a fd is open on it. With PCI there is no way to let
> > this fail (the .release function returns void last time I checked).
> > Is this a problem, and if yes, how do we handle it?
> 
> The current vfio has the same problem, we can't unbind a device from
> vfio while it's attached to a guest. I think we'd use the same
> solution too; send out a netlink packet for a device removal and have
> the .remove call sleep on a wait_event(, refcnt == 0). We could also
> set a timeout and SIGBUS the PIDs holding the device if they don't
> return it willingly. Thanks,

Putting the process to sleep (which would be uninterruptible) seems
bad. The process would sleep until the guest releases the
device-group, which can take days or months.

The best thing (and the most intrusive :-) ) is to change the PCI core
to allow unbindings to fail, I think. But this probably further
complicates the way to upstream VFIO...

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote:
> On 08/22/2011 01:46 PM, Joerg Roedel wrote:
> > That does not work. The bridge in question may not even be visible
> > as a PCI device, so you can't link to it. This is the case on a few
> > PCIe cards which only have a PCIx chip and a PCIe-to-PCIx bridge to
> > implement the PCIe interface (yes, I have seen those cards).
> 
> How does the kernel detect that devices behind the invisible bridge
> must be assigned as a unit?

On the AMD IOMMU side this information is stored in the IVRS ACPI
table. Not sure about the VT-d side, though.

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Mon, Aug 22, 2011 at 08:42:35AM -0400, Avi Kivity wrote:
> On 08/22/2011 03:36 PM, Roedel, Joerg wrote:
> > On the AMD IOMMU side this information is stored in the IVRS ACPI
> > table. Not sure about the VT-d side, though.
> 
> I see. There is no sysfs node representing it?

No. It also doesn't exist as a 'struct pci_dev'. This caused problems
in the AMD IOMMU driver in the past and I needed to fix that. That's
where I know it from :)

> I'd rather not add another meaningless identifier.

Well, I don't think it's really meaningless, but we need some way to
communicate the information about device groups to userspace.

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
> On 08/22/2011 03:55 PM, Roedel, Joerg wrote:
> > Well, I don't think it's really meaningless, but we need some way
> > to communicate the information about device groups to userspace.
> 
> I mean the contents of the group descriptor. There are enough 42s in
> the kernel, it's better if we can replace a synthetic number with
> something meaningful.

If we only look at PCI then a Segment:Bus:Dev.Fn number would be
sufficient, of course. But the idea was to make it generic enough so
that it works with !PCI too.

	Joerg
Re: [Qemu-devel] kvm PCI assignment VFIO ramblings
On Mon, Aug 22, 2011 at 09:17:41AM -0400, Avi Kivity wrote:
> On 08/22/2011 04:15 PM, Roedel, Joerg wrote:
> > On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
> > > On 08/22/2011 03:55 PM, Roedel, Joerg wrote:
> > > > Well, I don't think it's really meaningless, but we need some
> > > > way to communicate the information about device groups to
> > > > userspace.
> > > 
> > > I mean the contents of the group descriptor. There are enough 42s
> > > in the kernel, it's better if we can replace a synthetic number
> > > with something meaningful.
> > 
> > If we only look at PCI then a Segment:Bus:Dev.Fn number would be
> > sufficient, of course. But the idea was to make it generic enough
> > so that it works with !PCI too.
> 
> We could make it an arch-defined string instead of a symlink. So it
> doesn't return 42, rather something that can be used by the admin to
> figure out what the problem was.

Well, ok, it would certainly differ from the in-kernel representation
then and introduce new architecture dependencies into libvirt. But if
the 'group-string' is more meaningful to users then it's certainly
good. Suggestions?

	Joerg
Re: [Qemu-devel] semantics of -cpu host and check/enforce
On Fri, Jun 10, 2011 at 05:36:37PM -0400, Eduardo Habkost wrote:
> We have 3 sets of cpu features that may or may not be included in
> '-cpu host':
> 
> A) Features that are supported by the host and that KVM can already
>    emulate, or don't need KVM support to be used;
> B) Features that may be not supported by the host but can be
>    emulated by KVM (e.g. the SVM features, or x2apic);
> C) Features that are supported by the host but KVM can't emulate.
>    Divided in:
>    C1) features we can't emulate and we know about it
>        (e.g. dtes64)[1]
>    C2) features we possibly can't emulate but we don't even know
>        about it (e.g. features added to recent CPUs).
> 
> It seems obvious that all the features in group A must always be
> included in '-cpu host', but what about features in the B or C
> groups?
> 
> About group B: it looks like we are not being consistent. For
> example, svm_features has every bit enabled when using '-cpu host'
> even if the host doesn't support them; in other cases (e.g. x2apic),
> it is not enabled by '-cpu host' unless the host already supports it.

SVM is a special feature. We can't just forward the host cpuid to the
guest because every SVM feature we provide to the guest needs
emulation in the kernel module. The kernel module will tell qemu-kvm
about its feature set via the GET_SUPPORTED_CPUID ioctl. So the idea
behind -cpu host and SVM features is that qemu-kvm enables all of them
and masks out everything that is not supported by the kernel module.

Note that the kernel might even emulate features that are not
supported on the host, like the vmcb-clean-bits, so we really need to
ask the kernel what is supported for the guest.

Regards,

	Joerg
Re: [Qemu-devel] drop -enable-nesting
On Tue, May 31, 2011 at 04:58:16AM -0400, Avi Kivity wrote:
> On 05/31/2011 11:44 AM, Daniel P. Berrange wrote:
> > > I think it's safe to drop -enable-nesting immediately. Dan, does
> > > libvirt make use of it?
> > 
> > Yes, but it should be safe to drop it. Currently, if the user
> > specifies a CPU with the 'svm' flag present in libvirt guest XML,
> > then we will pass args '-cpu +svm -enable-nesting'. So if we drop
> > -enable-nesting, then libvirt will simply omit it and everything
> > should still work because we have still got +svm set.

But qemu will complain about an option it can't parse. The best choice
is probably to keep the option and make it a nop for the lifetime of
qemu-kvm. Optionally qemu-kvm can print a warning about the deprecated
option.

	Joerg
Re: [Qemu-devel] [PATCH 3/7] cpu model bug fixes and definition corrections: Add kvm emulated x2apic flag to config defined cpu models
On Sat, May 28, 2011 at 04:39:13AM -0400, Jan Kiszka wrote:
> Jörg, how to deal with -enable-nesting in qemu-kvm to align behavior
> with upstream?

My personal preference is to just remove it. In upstream qemu it is
enabled/disabled by +/-svm. -enable-nesting is just a historic thing
which can be wiped out.

	Joerg
Re: [Qemu-devel] drop -enable-nesting
On Mon, May 30, 2011 at 11:04:02AM -0400, Jan Kiszka wrote:
> On 2011-05-30 16:38, Nadav Har'El wrote:
> > On Mon, May 30, 2011, Jan Kiszka wrote about "drop -enable-nesting
> > (was: [PATCH 3/7] cpu model bug fixes and definition
> > corrections...)":
> > > On 2011-05-30 10:18, Roedel, Joerg wrote:
> > > > On Sat, May 28, 2011 at 04:39:13AM -0400, Jan Kiszka wrote:
> > > > > Jörg, how to deal with -enable-nesting in qemu-kvm to align
> > > > > behavior with upstream?
> > > > 
> > > > My personal preference is to just remove it. In upstream qemu
> > > > it is enabled/disabled by +/-svm. -enable-nesting is just a
> > > > historic thing which can be wiped out.
> > 
> > -enable-nesting could remain as a synonym for enabling either VMX
> > or SVM in the guest, depending on what was available in the host
> > (because KVM now supports both nested SVM and nested VMX, but not
> > SVM-on-VMX or vice versa).
> 
> Why? Once nesting is stable (I think SVM already is), there is no
> reason for an explicit enable. And you can always mask it out via
> -cpu.
> 
> BTW, what are the defaults for SVM right now in qemu-kvm and
> upstream? Enable if the modeled CPU supports it?

qemu-kvm still needs -enable-nesting, otherwise it is disabled.
Upstream qemu should enable it unconditionally (can be disabled with
-cpu ,-svm).

	Joerg
[Qemu-devel] Re: [PATCH 08/10] ahci: add ahci emulation
On Fri, Nov 19, 2010 at 04:12:43AM -0500, Gerd Hoffmann wrote:
> > > > +static void ahci_check_irq(AHCIState *s)
> > > 
> > > MSI support would be nice to have.
> > 
> > Alrighty, I added MSI support. But I only have a single vector in
> > use atm, as nvec > 0 doesn't get written to the PCI config space so
> > the code trips on that later on.
> 
> For multiple vectors MSI-X is the best choice as it fixes a few
> design issues MSI has when it comes to multiple vectors. I think
> Linux never ever uses MSI with multiple vectors anyway.

Linux doesn't even support MSI with multiple vectors today.

	Joerg
[Qemu-devel] Re: [PATCH 1/3] Make kvm64 the default cpu model when kvm_enabled()
(Sorry for the late reply) On Thu, Oct 07, 2010 at 08:48:06AM -0400, Anthony Liguori wrote: On 10/07/2010 03:42 AM, Roedel, Joerg wrote: On Wed, Oct 06, 2010 at 03:24:59PM -0400, Anthony Liguori wrote:

+    qemu_compat_version = machine->compat_version;
+
     if (display_type == DT_NOGRAPHIC) {
         if (default_parallel)
             add_device_config(DEV_PARALLEL, "null");
-- 
1.7.0.4

Looks fine to me, given CPUs are not in qdev. Anthony? The idea is fine, but why not just add the default CPU to the machine description? If I remember correctly the reason was that the machine description was not accessible in the cpuid initialization path because it is a function-local variable. Not tested at all, but I think the attached patch addresses it in a pretty nice way. There are a couple of ways you could support your patch on top of this. You could add a kvm_cpu_model to the machine structure that gets defaulted to if kvm_enabled(). You could also introduce a new KVM machine type that gets defaulted to if no explicit machine is specified. I had something similar in mind but then I realized that we need at least a cpu_model and a cpu_model_kvm to distinguish between the TCG and the KVM case. Further, the QEMUMachine data structure is used for all architectures in QEMU and the model names only make sense for x86. So I decided for the compat-version way (which doesn't mean I object against this one ;-) ) Joerg

From d2370c88cef4b07d48ba3c4804e35ae2db8db7c0 Mon Sep 17 00:00:00 2001
From: Anthony Liguori aligu...@us.ibm.com
Date: Thu, 7 Oct 2010 07:43:42 -0500
Subject: [PATCH] machine: make default cpu model part of machine structure

Signed-off-by: Anthony Liguori aligu...@us.ibm.com

diff --git a/hw/boards.h b/hw/boards.h
index 6f0f0d7..8c6ef27 100644
--- a/hw/boards.h
+++ b/hw/boards.h
@@ -16,6 +16,7 @@ typedef struct QEMUMachine {
     const char *name;
     const char *alias;
     const char *desc;
+    const char *cpu_model;
     QEMUMachineInitFunc *init;
     int use_scsi;
     int max_cpus;
diff --git a/hw/pc.c b/hw/pc.c
index 69b13bf..0826107 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -866,14 +866,6 @@ void pc_cpus_init(const char *cpu_model)
     int i;
 
     /* init CPUs */
-    if (cpu_model == NULL) {
-#ifdef TARGET_X86_64
-        cpu_model = "qemu64";
-#else
-        cpu_model = "qemu32";
-#endif
-    }
-
     for (i = 0; i < smp_cpus; i++) {
         pc_new_cpu(cpu_model);
     }
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 12359a7..919b4d6 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -204,17 +204,22 @@ static void pc_init_isa(ram_addr_t ram_size,
                         const char *initrd_filename,
                         const char *cpu_model)
 {
-    if (cpu_model == NULL)
-        cpu_model = "486";
     pc_init1(ram_size, boot_device, kernel_filename,
              kernel_cmdline, initrd_filename, cpu_model, 0);
 }
 
+#ifdef TARGET_X86_64
+#define DEF_CPU_MODEL "qemu64"
+#else
+#define DEF_CPU_MODEL "qemu32"
+#endif
+
 static QEMUMachine pc_machine = {
     .name = "pc-0.13",
     .alias = "pc",
     .desc = "Standard PC",
+    .cpu_model = DEF_CPU_MODEL,
     .init = pc_init_pci,
     .max_cpus = 255,
     .is_default = 1,
@@ -223,6 +228,7 @@ static QEMUMachine pc_machine = {
 static QEMUMachine pc_machine_v0_12 = {
     .name = "pc-0.12",
     .desc = "Standard PC",
+    .cpu_model = DEF_CPU_MODEL,
     .init = pc_init_pci,
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
@@ -242,6 +248,7 @@ static QEMUMachine pc_machine_v0_12 = {
 static QEMUMachine pc_machine_v0_11 = {
     .name = "pc-0.11",
     .desc = "Standard PC, qemu 0.11",
+    .cpu_model = DEF_CPU_MODEL,
     .init = pc_init_pci,
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
@@ -277,6 +284,7 @@ static QEMUMachine pc_machine_v0_11 = {
 static QEMUMachine pc_machine_v0_10 = {
     .name = "pc-0.10",
     .desc = "Standard PC, qemu 0.10",
+    .cpu_model = DEF_CPU_MODEL,
     .init = pc_init_pci,
     .max_cpus = 255,
     .compat_props = (GlobalProperty[]) {
@@ -324,6 +332,7 @@ static QEMUMachine pc_machine_v0_10 = {
 static QEMUMachine isapc_machine = {
     .name = "isapc",
     .desc = "ISA-only PC",
+    .cpu_model = "486",
     .init = pc_init_isa,
     .max_cpus = 1,
 };
diff --git a/vl.c b/vl.c
index df414ef..3a55cc8 100644
--- a/vl.c
+++ b/vl.c
@@ -2904,6 +2904,10 @@ int main(int argc, char **argv, char **envp)
     }
     qemu_add_globals();
 
+    if (cpu_model == NULL) {
+        cpu_model = machine->cpu_model;
+    }
+
     machine->init(ram_size, boot_devices, kernel_filename,
                   kernel_cmdline, initrd_filename, cpu_model);
-- 
1.7.0.4
[Qemu-devel] Re: [PATCH 1/3] Make kvm64 the default cpu model when kvm_enabled()
On Wed, Oct 06, 2010 at 03:24:59PM -0400, Anthony Liguori wrote:

+    qemu_compat_version = machine->compat_version;
+
     if (display_type == DT_NOGRAPHIC) {
         if (default_parallel)
             add_device_config(DEV_PARALLEL, "null");
-- 
1.7.0.4

Looks fine to me, given CPUs are not in qdev. Anthony? The idea is fine, but why not just add the default CPU to the machine description? If I remember correctly the reason was that the machine description was not accessible in the cpuid initialization path because it is a function-local variable. I could have made it a global variable but considered the compat_version approach simpler. The qemu_compat_version might also be useful at other places. Joerg
[Qemu-devel] Re: [PATCH 3/3] Add svm cpuid features
On Mon, Sep 27, 2010 at 12:22:13PM -0400, Avi Kivity wrote: On 09/27/2010 05:40 PM, Roedel, Joerg wrote: On Mon, Sep 27, 2010 at 11:02:23AM -0400, Avi Kivity wrote: On 09/27/2010 04:58 PM, Avi Kivity wrote: On 09/27/2010 03:16 PM, Joerg Roedel wrote: This patch adds the svm cpuid feature flags to the qemu initialization path. It also adds the svm features available on phenom to its cpu-definition and extends the host cpu type to support all svm features KVM can provide. This should really be all svm features since we'll later mask them with kvm capabilities. This way, if we later add more capabilities, we automatically gain support in userspace. Yes, that is what -cpu host does with these patches applied. The svm_features variable is set to -1 in this case and masked out later with the KVM capabilities. Well, running current uq/master I get: has_svm: can't execute cpuid 0x800a kvm: no hardware support ? Weird, it worked here as I tested it. I had it on qemu/master and with all three patches. But patch 1 should not make the difference. I'll take a look; have you pushed the failing uq/master? What was your command line? Joerg
[Qemu-devel] Re: [PATCH 3/3] Add svm cpuid features
On Tue, Sep 28, 2010 at 05:37:58AM -0400, Avi Kivity wrote: On 09/28/2010 11:28 AM, Roedel, Joerg wrote: Weird, it worked here as I tested it. I had it on qemu/master and with all three patches. But patch 1 should not make the difference. I'll take a look; have you pushed the failing uq/master? Yes, 8fe6a21c76. What was your command line? qemu-system-x86_64 -m 2G -cpu kvm64,+svm,+npt -enable-kvm ... Note this is qemu.git, so -enable-kvm is needed. Ok, I apparently forgot to force the CPUID xlevel to be 0x800A when SVM is enabled, probably because I only tested CPU models where xlevel already defaults to 0x800A. Attached is a fix, thanks for catching this. Joerg

From d18d973d037f1af1869e114aea1f076ea3a1ee7a Mon Sep 17 00:00:00 2001
From: Joerg Roedel joerg.roe...@amd.com
Date: Tue, 28 Sep 2010 11:58:49 +0200
Subject: [PATCH 1/1] qemu: Force xlevel to at least 0x800A if SVM is enabled

This patch forces the extended CPUID level to be at least 0x800A if SVM is enabled in the CPU model.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 target-i386/cpuid.c |    5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index 0e0bf60..0630fe1 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -714,6 +714,11 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model)
     x86_cpu_def->ext3_features &= ~minus_ext3_features;
     x86_cpu_def->kvm_features &= ~minus_kvm_features;
     x86_cpu_def->svm_features &= ~minus_svm_features;
+    if ((x86_cpu_def->ext3_features & CPUID_EXT3_SVM) &&
+        (x86_cpu_def->xlevel < 0x800A)) {
+        /* Force xlevel to at least 0x800A if SVM enabled */
+        x86_cpu_def->xlevel = 0x800A;
+    }
     if (check_cpuid) {
         if (check_features_against_host(x86_cpu_def) && enforce_cpuid)
             goto error;
-- 
1.7.0.4
[Qemu-devel] Re: [PATCH 3/3] Add svm cpuid features
On Mon, Sep 27, 2010 at 11:02:23AM -0400, Avi Kivity wrote: On 09/27/2010 04:58 PM, Avi Kivity wrote: On 09/27/2010 03:16 PM, Joerg Roedel wrote: This patch adds the svm cpuid feature flags to the qemu initialization path. It also adds the svm features available on phenom to its cpu-definition and extends the host cpu type to support all svm features KVM can provide. This should really be all svm features since we'll later mask them with kvm capabilities. This way, if we later add more capabilities, we automatically gain support in userspace. Yes, that is what -cpu host does with these patches applied. The svm_features variable is set to -1 in this case and masked out later with the KVM capabilities. (or rather, all svm features that don't need save/restore support in userspace; I don't think any do, beyond svm itself) I applied 2 and 3 (to branch uq/master). If you want to add more svm features, please send a patch. For patch 1, I'd like a review by someone who understands the compat code. Great, thanks. Joerg
[Qemu-devel] Re: [PATCH 3/3] Add svm cpuid features
On Thu, Sep 16, 2010 at 10:06:48AM -0400, Avi Kivity wrote: On 09/14/2010 05:52 PM, Joerg Roedel wrote: This patch adds the svm cpuid feature flags to the qemu initialization path. It also adds the svm features available on phenom to its cpu-definition and extends the host cpu type to support all svm features KVM can provide. Does phenom really have vmcbclean? I thought it was a really new feature. No, but we could emulate it on almost all hardware that has SVM. It's basically the same as with x2apic. Joerg
[Qemu-devel] Re: [PATCH 1/3] Make kvm64 the default cpu model when kvm_enabled()
On Thu, Sep 16, 2010 at 10:03:19AM -0400, Avi Kivity wrote: On 09/14/2010 05:52 PM, Joerg Roedel wrote: As requested by Alex this patch makes kvm64 the default CPU model when qemu is started with -enable-kvm. This breaks compatibility; if started with -M 0.13 or prior we should default to qemu64. Ok, I can change that. But is it OK to make kvm64 the default for anything > 0.13? Joerg
[Qemu-devel] Re: [PATCH 3/3] Add svm cpuid features
On Thu, Sep 16, 2010 at 10:17:09AM -0400, Alexander Graf wrote: Roedel, Joerg wrote: On Thu, Sep 16, 2010 at 10:06:48AM -0400, Avi Kivity wrote: On 09/14/2010 05:52 PM, Joerg Roedel wrote: This patch adds the svm cpuid feature flags to the qemu initialization path. It also adds the svm features available on phenom to its cpu-definition and extends the host cpu type to support all svm features KVM can provide. Does phenom really have vmcbclean? I thought it was a really new feature. No, but we could emulate it on almost all hardware that has SVM. It's basically the same as with x2apic. -cpu non-host is not about what we could emulate, but what the hardware really is. Features not supported by the particular CPU do not belong there. I'll remove it then. It would have been nice to have it there because it would improve the performance of svm emulation. But if it is about emulating the real cpu then I'll just drop it. Joerg
[Qemu-devel] Re: [PATCH 3/3] Add svm cpuid features
On Thu, Sep 16, 2010 at 10:28:01AM -0400, Alexander Graf wrote: Roedel, Joerg wrote: I'll remove it then. It would have been nice to have it there because it would improve the performance of svm emulation. But if it is about emulating the real cpu then I'll just drop it. speed == -cpu host :). If the performance increase is significant, we can always add a new cpu type for a cpu that does support those flags so people can use it there and still be migration safe. In fact, that would even benefit you guys as you'd sell more new machines for speed ;). Well, since it's emulated completely in software it would be available on all hosts that start with -cpu phenom (at least if they use the same kvm version). So this would not break migration. Joerg
[Qemu-devel] Re: [PATCH 1/3] Make kvm64 the default cpu model when kvm_enabled()
On Tue, Sep 14, 2010 at 11:58:03AM -0400, Alexander Graf wrote:

+    if (kvm_enabled())
+        cpu_model = DEFAULT_KVM_CPU_MODEL;
+    else
+        cpu_model = DEFAULT_QEMU_CPU_MODEL;

Braces :(. Okay, here is the new patch:

From f49e78edbd4143d05128228d9cc24bd5abc3abf1 Mon Sep 17 00:00:00 2001
From: Joerg Roedel joerg.roe...@amd.com
Date: Tue, 14 Sep 2010 16:52:11 +0200
Subject: [PATCH 1/3] Make kvm64 the default cpu model when kvm_enabled()

As requested by Alex this patch makes kvm64 the default CPU model when qemu is started with -enable-kvm.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 hw/pc.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 69b13bf..a6355f3 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -40,6 +40,16 @@
 #include "sysbus.h"
 #include "sysemu.h"
 #include "blockdev.h"
+#include "kvm.h"
+
+
+#ifdef TARGET_X86_64
+#define DEFAULT_KVM_CPU_MODEL "kvm64"
+#define DEFAULT_QEMU_CPU_MODEL "qemu64"
+#else
+#define DEFAULT_KVM_CPU_MODEL "kvm32"
+#define DEFAULT_QEMU_CPU_MODEL "qemu32"
+#endif
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -867,11 +877,11 @@ void pc_cpus_init(const char *cpu_model)
 
     /* init CPUs */
     if (cpu_model == NULL) {
-#ifdef TARGET_X86_64
-        cpu_model = "qemu64";
-#else
-        cpu_model = "qemu32";
-#endif
+        if (kvm_enabled()) {
+            cpu_model = DEFAULT_KVM_CPU_MODEL;
+        } else {
+            cpu_model = DEFAULT_QEMU_CPU_MODEL;
+        }
     }
 
     for (i = 0; i < smp_cpus; i++) {
-- 
1.7.0.4
Re: [Qemu-devel] [PATCH 1/3] qemu-kvm: Invert svm-flag setting logic
On Tue, Sep 07, 2010 at 08:59:52AM -0400, Avi Kivity wrote: On 09/07/2010 03:55 PM, Alexander Graf wrote: Was it broken? How? When migrating inside l2 context, we're missing information. My idea back then was to force the l1 guest out of l2 context every time we want to migrate, but I'm not sure this has happened. I thought that was implemented, but not sure (though it isn't clean architecturally). Even then I'm not sure we're transferring the GIF. Argh, we aren't. Migration with nested-svm doesn't work reliably today. But that is on my list. Joerg
Re: [Qemu-devel] [PATCH 1/3] qemu-kvm: Invert svm-flag setting logic
On Tue, Sep 07, 2010 at 10:14:46AM -0400, Alexander Graf wrote: On 07.09.2010, at 16:12, Roedel, Joerg wrote: On Tue, Sep 07, 2010 at 09:51:33AM -0400, Avi Kivity wrote: On 09/07/2010 03:23 PM, Alexander Graf wrote: I think we should get rid of kvm_nested and -enable-nesting. Instead, we should enable the SVM bit in the host and qemu64 cpu types, but not in kvm64. This way users are safe to not use nested svm, but can choose to do so if they like. Also, it should be possible to do something like -cpu kvm64,flags=+svm. Without having to mess with -enable-nesting. I agree, -enable-nesting is redundant with -cpu ,+svm. This patchset makes -enable-nesting practically a synonym for -cpu ,+svm. So we can safely remove -enable-nesting in the future. Why not just not introduce it? :) It's always hard to get rid of legacy. Because it's already there? The patches are against qemu-kvm.
Re: [Qemu-devel] [PATCH 1/3] qemu-kvm: Invert svm-flag setting logic
On Tue, Sep 07, 2010 at 10:16:10AM -0400, Alexander Graf wrote: Oh, because this is for qemu-kvm. Uh - how about a nice little patch that makes things work in qemu.git, leave out the -enable-nesting piece and just keep the -enable-nesting backwards compat patch in qemu-kvm.git - or maybe even remove it there. I can do this. Is it better to merge all this stuff into qemu.git instead of qemu-kvm.git? Then I'll target qemu.git again. Joerg
Re: [Qemu-devel] [PATCH 1/3] qemu-kvm: Invert svm-flag setting logic
On Tue, Sep 07, 2010 at 09:51:33AM -0400, Avi Kivity wrote: On 09/07/2010 03:23 PM, Alexander Graf wrote: I think we should get rid of kvm_nested and -enable-nesting. Instead, we should enable the SVM bit in the host and qemu64 cpu types, but not in kvm64. This way users are safe to not use nested svm, but can choose to do so if they like. Also, it should be possible to do something like -cpu kvm64,flags=+svm. Without having to mess with -enable-nesting. I agree, -enable-nesting is redundant with -cpu ,+svm. This patchset makes -enable-nesting practically a synonym for -cpu ,+svm. So we can safely remove -enable-nesting in the future. Joerg
[Qemu-devel] Re: SVM emulation: EVENTINJ marked valid when a pagefault happens while issuing a software interrupt
On Thu, May 27, 2010 at 02:53:18PM -0400, Jan Kiszka wrote: Based on the KVM code (which is known to work perfectly :) ), I think you are right: SVM apparently clears the valid bit in EVENTINJ during VMRUN once it starts processing the injection, not after it as it's the case in current QEMU. But better ask the experts: Jörg, Gleb? SVM always clears the vmcb.eventinj on vmrun because every exception is injected right after vmrun finished and cpu is in guest mode. It can happen (for example if taking the exception causes a page fault) that the vmcb.eventinj field is copied to vmcb.exit_int_info. Also note that at this point there is a difference between hardware svm and the nested-svm implementation in kvm. The hardware always takes the exception first before checking for any other intercept condition. This basically means that exit_int_info is only set when the injected event could not be delivered due to other conditions in the guest (page fault, nested page-fault, ...). In nested-svm you can get a valid exit_int_info when an interrupt or nmi is pending too. In the software implementation these intercepts are taken before the event is delivered and you find the event in vmcb.exit_int_info. This is not forbidden in the svm architecture and I have not found a hypervisor that has a problem with this different behavior. I have a patch here which changes this in nested-svm, but it introduces more problems than it fixes. Joerg
[Qemu-devel] Re: SVM emulation: EVENTINJ marked valid when a pagefault happens while issuing a software interrupt
On Thu, May 27, 2010 at 06:20:00PM -0400, Jan Kiszka wrote: Erik van der Kouwe wrote: Problem is: I'm compiling in Linux and testing in MINIX. Testing on the real hardware would require a reboot every time. Moreover, it might screw up my system if I make bad mistakes (the MINIX filesystem is easily corrupted). Use Linux+KVM as host OS, it can also run VMMs as guests (aka nested SVM). And you could even debug those guests just like when you would run QEMU in emulation mode. In contrast to SVM emulation, nesting is fairly stable AFAIK. And it is faster. At least it is more stable than any other nested-svm implementation I know of ;-) There are issues with kvmclock when you run kvm-on-kvm and you should not expect windows-based hypervisors to run without problems. Beside that, for running kvm-on-kvm and xen-on-kvm it is indeed fairly stable :-) Linux source tree (2.6.31-ubuntu), arch/x86/kvm/svm.c, end of function nested_svm_vmrun. Here event_inj and event_inj_err are copied from a different VMCB, effectively clearing the value set by the CPU. Maybe this isn't where I should have been looking though? The interesting part is in nested_svm_vmexit. There you have this piece of code:

	/*
	 * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have
	 * to make sure that we do not lose injected events. So check event_inj
	 * here and copy it to exit_int_info if it is valid.
	 * Exit_int_info and event_inj can't be both valid because the case
	 * below only happens on a VMRUN instruction intercept which has
	 * no valid exit_int_info set.
	 */
	if (vmcb->control.event_inj & SVM_EVTINJ_VALID) {
		struct vmcb_control_area *nc = &nested_vmcb->control;
		nc->exit_int_info = vmcb->control.event_inj;
		nc->exit_int_info_err = vmcb->control.event_inj_err;
	}

and a few lines later:

	nested_vmcb->control.event_inj = 0;
	nested_vmcb->control.event_inj_err = 0;

... which takes care of this situation. The vmcb.eventinj field is _defined_ to be zero on a #vmexit. Joerg
[Qemu-devel] Re: SVM emulation: EVENTINJ marked valid when a pagefault happens while issuing a software interrupt
On Fri, May 28, 2010 at 02:10:59AM -0400, Jan Kiszka wrote: Erik van der Kouwe wrote: In my experience, if I provide the -enable-kvm switch then the guest VMM never detects the presence of virtualization support. Does this only work on AMD hardware? Or do I need to supply some additional parameter to make it work? Yes, forgot to mention: -enable-nesting, and you need qemu-kvm. This feature hasn't been merged upstream yet. And the svm-emulation is only available on AMD hardware. Joerg
[Qemu-devel] Re: SVM emulation: EVENTINJ marked valid when a pagefault happens while issuing a software interrupt
On Fri, May 28, 2010 at 03:45:09AM -0400, Erik van der Kouwe wrote: This is OK, the problem is the event_inj field rather than the exit_int_info field. From what I've seen the SVM specification neither specifies that the CPU writes to this field nor does it explicitly forbid it. Given the lack of clarity in the specification it may be safest to deal with this in the same way as the hardware does (although I don't know which way this is, it seems intuitively unlikely that the hardware would set event_inj to valid). The AMD64 Architecture Programmer's Manual Volume 2 states in section 15.19: When an event is injected by means of this mechanism, the VMRUN instruction causes the guest to unconditionally take the specified exception or interrupt before executing the first guest instruction. Which implicitly means that. But it could be documented more explicitly, that's right :) Joerg
Re: [Qemu-devel] Re: SVM emulation: EVENTINJ marked valid when a pagefault happens while issuing a software interrupt
On Fri, May 28, 2010 at 09:30:19AM -0400, Erik van der Kouwe wrote: Would be nice to have nested VMX support though, given that Intel CPUs are so much more common than AMDs. I've been searching on the Dell website (in the Netherlands) for a laptop recently and I couldn't find a single AMD model. That doesn't mean they are uncommon ;-) There are many vendors that offer notebooks with AMD processors. Joerg
Re: [Qemu-devel] Re: SVM emulation: EVENTINJ marked valid when a pagefault happens while issuing a software interrupt
On Fri, May 28, 2010 at 09:20:31AM -0400, Jamie Lokier wrote: Roedel, Joerg wrote: On Fri, May 28, 2010 at 02:10:59AM -0400, Jan Kiszka wrote: Erik van der Kouwe wrote: In my experience, if I provide the -enable-kvm switch then the guest VMM never detects the presence of virtualization support. Does this only work on AMD hardware? Or do I need to supply some additional parameter to make it work? Yes, forgot to mention: -enable-nesting, and you need qemu-kvm. This feature hasn't been merged upstream yet. And the svm-emulation is only available on AMD hardware. I assume you mean nested SVM emulation in a KVM guest is only available on real AMD hardware? Right. Is this due to something inherent, or just a limitation of the KVM code not handling all the necessary traps in kvm-intel? The SVM and VMX extensions are architecturally very different in many details. This makes it very hard to emulate VMX on SVM or vice versa. I am not even sure if it is possible at all to emulate one extension in an architecturally complete way using the other extension. Joerg