Re: [PATCH v2 4/5] KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit
On 08/18/2015 10:57 PM, Paolo Bonzini wrote: On 18/08/2015 11:30, Avi Kivity wrote: KVM_USER_EXIT in practice should be so rare (at least with in-kernel LAPIC) that I don't think this matters. KVM_USER_EXIT is relatively uninteresting, it only exists to provide an alternative to signals that doesn't require expensive atomics on each and every KVM_RUN. :( Ah, so the idea is to remove the cost of changing the signal mask? Yes, it's explained in the cover letter. Yes, although it looks like a thread-local operation, it takes a process-wide lock. IIRC the lock was only task-wide and uncontended. Problem is, it's on the node that created the thread rather than the node that is running it, and inter-node atomics are really, really slow. Cached inter-node atomics are (relatively) fast, but I think it really is a process-wide lock: sigprocmask calls:

void __set_current_blocked(const sigset_t *newset)
{
	struct task_struct *tsk = current;

	spin_lock_irq(&tsk->sighand->siglock);
	__set_task_blocked(tsk, newset);
	spin_unlock_irq(&tsk->sighand->siglock);
}

struct sighand_struct {
	atomic_t		count;
	struct k_sigaction	action[_NSIG];
	spinlock_t		siglock;
	wait_queue_head_t	signalfd_wqh;
};

Since sigaction is usually process-wide, I conclude that so is tsk->sighand. For guests spanning >1 host NUMA nodes it's not really practical to ensure that the thread is created on the right node. Even for guests that fit into 1 host node, if you rely on AutoNUMA the VCPUs are created too early for AutoNUMA to have any effect. And newer machines have frighteningly small nodes (two nodes per socket, so it's something like 7 pCPUs if you don't have hyper-threading enabled). True, the NUMA penalty within the same socket is not huge, but it still costs a few thousand clock cycles on vmexit.flat and this feature sweeps it away completely. I expect most user wakeups are via irqfd, so indeed the performance of KVM_USER_EXIT is uninteresting. Yup, either irqfd or KVM_SET_SIGNAL_MSI.
Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 4/5] KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit
On 08/17/2015 04:15 PM, Paolo Bonzini wrote: On 16/08/2015 13:27, Avi Kivity wrote: On 08/05/2015 07:33 PM, Radim Krčmář wrote: The guest can use KVM_USER_EXIT instead of a signal-based exiting to userspace. Availability depends on KVM_CAP_USER_EXIT. Only x86 is implemented so far. Signed-off-by: Radim Krčmář --- v2: * use vcpu ioctl instead of vm one [4/5] * shrink kvm_user_exit from 64 to 32 bytes [4/5] Documentation/virtual/kvm/api.txt | 30 ++ arch/x86/kvm/x86.c| 24 include/uapi/linux/kvm.h | 7 +++ virt/kvm/kvm_main.c | 5 +++-- 4 files changed, 64 insertions(+), 2 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 3c714d43a717..c5844f0b8e7c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -3020,6 +3020,36 @@ Returns: 0 on success, -1 on error Queues an SMI on the thread's vcpu. + +4.97 KVM_USER_EXIT + +Capability: KVM_CAP_USER_EXIT +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_user_exit (in) +Returns: 0 on success, + -EFAULT if the parameter couldn't be read, + -EINVAL if 'reserved' is not zeroed, + +struct kvm_user_exit { +__u8 reserved[32]; +}; + +The ioctl is asynchronous to VCPU execution and can be issued from all threads. +format This breaks an invariant of vcpu ioctls, and also forces a cacheline bounce when we fget() the vcpu fd. KVM_USER_EXIT in practice should be so rare (at least with in-kernel LAPIC) that I don't think this matters. KVM_USER_EXIT is relatively uninteresting, it only exists to provide an alternative to signals that doesn't require expensive atomics on each and every KVM_RUN. :( Ah, so the idea is to remove the cost of changing the signal mask? Yes, although it looks like a thread-local operation, it takes a process-wide lock. I expect most user wakeups are via irqfd, so indeed the performance of KVM_USER_EXIT is uninteresting. 
Re: [PATCH v2 4/5] KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit
On 08/05/2015 07:33 PM, Radim Krčmář wrote: The guest can use KVM_USER_EXIT instead of a signal-based exiting to userspace. Availability depends on KVM_CAP_USER_EXIT. Only x86 is implemented so far. Signed-off-by: Radim Krčmář --- v2: * use vcpu ioctl instead of vm one [4/5] * shrink kvm_user_exit from 64 to 32 bytes [4/5] Documentation/virtual/kvm/api.txt | 30 ++ arch/x86/kvm/x86.c| 24 include/uapi/linux/kvm.h | 7 +++ virt/kvm/kvm_main.c | 5 +++-- 4 files changed, 64 insertions(+), 2 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 3c714d43a717..c5844f0b8e7c 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -3020,6 +3020,36 @@ Returns: 0 on success, -1 on error Queues an SMI on the thread's vcpu. + +4.97 KVM_USER_EXIT + +Capability: KVM_CAP_USER_EXIT +Architectures: x86 +Type: vcpu ioctl +Parameters: struct kvm_user_exit (in) +Returns: 0 on success, + -EFAULT if the parameter couldn't be read, + -EINVAL if 'reserved' is not zeroed, + +struct kvm_user_exit { + __u8 reserved[32]; +}; + +The ioctl is asynchronous to VCPU execution and can be issued from all threads. +format This breaks an invariant of vcpu ioctls, and also forces a cacheline bounce when we fget() the vcpu fd. We should really try to avoid this. One options is to have a KVM_VCPU_MAKE_EXITFD vcpu ioctl, which returns an eventfd that you then write into. You can make as many exitfds as you like, one for each waking thread, so they never cause cacheline conflicts. Edit: I see the invariant was already broken. But the other comment stands. +Make vcpu_id exit to userspace as soon as possible. If the VCPU is not running +in kernel at the time, it will exit early on the next call to KVM_RUN. +If the VCPU was going to exit because of other reasons when KVM_USER_EXIT was +issued, it will keep the original exit reason and not exit early on next +KVM_RUN. 
+If VCPU exited because of KVM_USER_EXIT, the exit reason is KVM_EXIT_REQUEST. + +This ioctl has very similar effect (same sans some races on userspace exit) as +sending a signal (that is blocked in userspace and set in KVM_SET_SIGNAL_MASK) +to the VCPU thread. + + + 5. The kvm_run structure diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3493457ad0a1..27d777eb34e4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2466,6 +2466,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_ASSIGN_DEV_IRQ: case KVM_CAP_PCI_2_3: #endif + case KVM_CAP_USER_EXIT: r = 1; break; case KVM_CAP_X86_SMM: @@ -3077,6 +3078,20 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu) return 0; } +int kvm_vcpu_ioctl_user_exit(struct kvm_vcpu *vcpu, struct kvm_user_exit *info) +{ + struct kvm_user_exit valid = {}; + BUILD_BUG_ON(sizeof(valid) != 32); + + if (memcmp(info, &valid, sizeof(valid))) + return -EINVAL; + + kvm_make_request(KVM_REQ_EXIT, vcpu); + kvm_vcpu_kick(vcpu); + + return 0; +} + long kvm_arch_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -3341,6 +3356,15 @@ long kvm_arch_vcpu_ioctl(struct file *filp, r = kvm_set_guest_paused(vcpu); goto out; } + case KVM_USER_EXIT: { + struct kvm_user_exit info; + + r = -EFAULT; + if (copy_from_user(&info, argp, sizeof(info))) + goto out; + r = kvm_vcpu_ioctl_user_exit(vcpu, &info); + break; + } default: r = -EINVAL; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index d996a7cdb4d2..bc5a1abe9626 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -826,6 +826,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_X86_SMM 117 #define KVM_CAP_MULTI_ADDRESS_SPACE 118 #define KVM_CAP_SPLIT_IRQCHIP 119 +#define KVM_CAP_USER_EXIT 120 #ifdef KVM_CAP_IRQ_ROUTING @@ -1008,6 +1009,10 @@ struct kvm_device_attr { __u64 addr; /* userspace address of attr data */ }; +struct kvm_user_exit { + __u8 reserved[32]; +}; + #define KVM_DEV_VFIO_GROUP 1 
#define KVM_DEV_VFIO_GROUP_ADD 1 #define KVM_DEV_VFIO_GROUP_DEL 2 @@ -1119,6 +1124,8 @@ struct kvm_s390_ucas_mapping { #define KVM_ARM_SET_DEVICE_ADDR _IOW(KVMIO, 0xab, struct kvm_arm_device_addr) /* Available with KVM_CAP_PPC_RTAS */ #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO, 0xac, struct kvm_rtas_token_args) +/* Available with KVM_CAP_USER_EXIT */ +#define KVM_USER_EXIT _IOW(KVM
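From userspace, the proposed ioctl would be a one-liner issued from any thread. The sketch below models the kernel-side validation from kvm_vcpu_ioctl_user_exit() (any non-zero byte in 'reserved' yields -EINVAL) with a local copy of the uapi struct; the actual ioctl call, shown in the comment, assumes the series is applied so that KVM_USER_EXIT is defined in the patched <linux/kvm.h>:

```c
#include <errno.h>
#include <string.h>

/* Local mirror of the proposed uapi structure: 32 reserved bytes. */
struct kvm_user_exit {
    unsigned char reserved[32];
};

/* Userspace model of the kernel's check: reject any non-zero byte,
 * exactly as the patch's memcmp() against a zeroed struct does. */
static int validate_user_exit(const struct kvm_user_exit *info)
{
    static const struct kvm_user_exit valid; /* all zeroes */

    if (memcmp(info, &valid, sizeof(valid)))
        return -EINVAL;
    return 0;
}

/* With the series applied, kicking a VCPU from a waker thread would be:
 *
 *     struct kvm_user_exit info = {};
 *     if (ioctl(vcpu_fd, KVM_USER_EXIT, &info) < 0)
 *         perror("KVM_USER_EXIT");
 *
 * after which the VCPU thread's KVM_RUN returns with
 * run->exit_reason == KVM_EXIT_REQUEST.
 */
```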
Re: [v4 08/16] KVM: kvm-vfio: User API for IRQ forwarding
On 06/12/2015 06:41 PM, Alex Williamson wrote: On Fri, 2015-06-12 at 00:23 +, Wu, Feng wrote: -Original Message- From: Avi Kivity [mailto:avi.kiv...@gmail.com] Sent: Friday, June 12, 2015 3:59 AM To: Wu, Feng; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Cc: pbonz...@redhat.com; mtosa...@redhat.com; alex.william...@redhat.com; eric.au...@linaro.org Subject: Re: [v4 08/16] KVM: kvm-vfio: User API for IRQ forwarding On 06/11/2015 01:51 PM, Feng Wu wrote: From: Eric Auger This patch adds and documents a new KVM_DEV_VFIO_DEVICE group and 2 device attributes: KVM_DEV_VFIO_DEVICE_FORWARD_IRQ, KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ. The purpose is to be able to set a VFIO device IRQ as forwarded or not forwarded. the command takes as argument a handle to a new struct named kvm_vfio_dev_irq. Is there no way to do this automatically? After all, vfio knows that a device interrupt is forwarded to some eventfd, and kvm knows that some eventfd is forwarded to a guest interrupt. If they compare notes through a central registry, they can figure out that the interrupt needs to be forwarded. Oh, just like Eric mentioned in his reply, this description is out of context of this series, I will remove them in the next version. I suspect Avi's question was more general. While forward/unforward is out of context for this series, it's very similar in nature to enabling/disabling posted interrupts. So I think the question remains whether we really need userspace to participate in creating this shortcut or if kvm and vfio can some how orchestrate figuring it out automatically. Personally I don't know how we could do it automatically. We've always relied on userspace to independently setup vfio and kvm such that neither have any idea that the other is there and update each side independently when anything changes. So it seems consistent to continue that here. It doesn't seem like there's much to gain performance-wise either, updates should be a relatively rare event I'd expect. 
There's really no metadata associated with an eventfd, so "comparing notes" automatically might imply some central registration entity. That immediately sounds like a much more complex solution, but maybe Avi has some ideas to manage it. Thanks, The idea is to have a central registry maintained by a posted interrupts manager. Both vfio and kvm pass the filp (along with extra information) to the posted interrupts manager, which, when it detects a filp match, tells each of them what to do. The advantages are:
- old userspace gains the optimization without change
- a userspace API is more expensive to maintain than internal kernel interfaces (CVEs, documentation, maintaining backwards compatibility)
- if you can do it without a new interface, this indicates that all the information in the new interface is redundant. That means you have to check it for consistency with the existing information, so it's extra work (likely, it's exactly what the posted interrupt manager would be doing anyway).
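The "posted interrupts manager" idea can be sketched as a small registry keyed by the eventfd's file pointer: each side registers independently, and the manager acts once both sides have named the same filp. Everything below (names, fields, table size) is hypothetical, just to make the matching step concrete:

```c
#include <stddef.h>

/* Hypothetical registry entry: the producer (VFIO, which signals the
 * eventfd) and the consumer (KVM, which waits on it) register separately. */
struct pi_entry {
    const void *filp;  /* the eventfd's struct file, used as the key */
    int producer_irq;  /* host IRQ, -1 until the producer registers */
    int consumer_gsi;  /* guest GSI, -1 until the consumer registers */
};

#define PI_MAX 16
static struct pi_entry pi_table[PI_MAX];
static int pi_used;

/* Find the entry for 'filp', creating it on first sight. */
static struct pi_entry *pi_find(const void *filp)
{
    for (int i = 0; i < pi_used; i++)
        if (pi_table[i].filp == filp)
            return &pi_table[i];
    if (pi_used == PI_MAX)
        return NULL;
    pi_table[pi_used] = (struct pi_entry){ filp, -1, -1 };
    return &pi_table[pi_used++];
}

/* Each returns 1 when both sides have now registered the same filp,
 * i.e. the manager knows host IRQ -> guest GSI and can set up the
 * shortcut without any userspace involvement. */
static int pi_register_producer(const void *filp, int irq)
{
    struct pi_entry *e = pi_find(filp);

    if (!e)
        return 0;
    e->producer_irq = irq;
    return e->consumer_gsi >= 0;
}

static int pi_register_consumer(const void *filp, int gsi)
{
    struct pi_entry *e = pi_find(filp);

    if (!e)
        return 0;
    e->consumer_gsi = gsi;
    return e->producer_irq >= 0;
}
```

For what it's worth, mainline later grew a mechanism along these lines (the IRQ bypass manager), which matches producers and consumers on a shared token in much this way.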
Re: [v4 08/16] KVM: kvm-vfio: User API for IRQ forwarding
On 06/11/2015 01:51 PM, Feng Wu wrote: From: Eric Auger This patch adds and documents a new KVM_DEV_VFIO_DEVICE group and 2 device attributes: KVM_DEV_VFIO_DEVICE_FORWARD_IRQ, KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ. The purpose is to be able to set a VFIO device IRQ as forwarded or not forwarded. the command takes as argument a handle to a new struct named kvm_vfio_dev_irq. Is there no way to do this automatically? After all, vfio knows that a device interrupt is forwarded to some eventfd, and kvm knows that some eventfd is forwarded to a guest interrupt. If they compare notes through a central registry, they can figure out that the interrupt needs to be forwarded. Signed-off-by: Eric Auger --- Documentation/virtual/kvm/devices/vfio.txt | 34 -- include/uapi/linux/kvm.h | 12 +++ 2 files changed, 40 insertions(+), 6 deletions(-) diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt index ef51740..6186e6d 100644 --- a/Documentation/virtual/kvm/devices/vfio.txt +++ b/Documentation/virtual/kvm/devices/vfio.txt @@ -4,15 +4,20 @@ VFIO virtual device Device types supported: KVM_DEV_TYPE_VFIO -Only one VFIO instance may be created per VM. The created device -tracks VFIO groups in use by the VM and features of those groups -important to the correctness and acceleration of the VM. As groups -are enabled and disabled for use by the VM, KVM should be updated -about their presence. When registered with KVM, a reference to the -VFIO-group is held by KVM. +Only one VFIO instance may be created per VM. + +The created device tracks VFIO groups in use by the VM and features +of those groups important to the correctness and acceleration of +the VM. As groups are enabled and disabled for use by the VM, KVM +should be updated about their presence. When registered with KVM, +a reference to the VFIO-group is held by KVM. + +The device also enables to control some IRQ settings of VFIO devices: +forwarding/posting. 
Groups: KVM_DEV_VFIO_GROUP + KVM_DEV_VFIO_DEVICE KVM_DEV_VFIO_GROUP attributes: KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking @@ -20,3 +25,20 @@ KVM_DEV_VFIO_GROUP attributes: For each, kvm_device_attr.addr points to an int32_t file descriptor for the VFIO group. + +KVM_DEV_VFIO_DEVICE attributes: + KVM_DEV_VFIO_DEVICE_FORWARD_IRQ: set a VFIO device IRQ as forwarded + KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ: set a VFIO device IRQ as not forwarded + +For each, kvm_device_attr.addr points to a kvm_vfio_dev_irq struct. + +When forwarded, a physical IRQ is completed by the guest and not by the +host. This requires HW support in the interrupt controller. + +Forwarding can only be set when the corresponding VFIO IRQ is not masked +(whether through the VFIO_DEVICE_SET_IRQS command or as a consequence of this +IRQ being currently handled) or active at interrupt controller level. +In such a situation, -EAGAIN is returned. It is advised to set the +forwarding before the VFIO signaling is set up; this avoids trial and error. + +Unforwarding can happen at any time. diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 4b60056..798f3e4 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -999,6 +999,9 @@ struct kvm_device_attr { #define KVM_DEV_VFIO_GROUP 1 #define KVM_DEV_VFIO_GROUP_ADD 1 #define KVM_DEV_VFIO_GROUP_DEL 2 +#define KVM_DEV_VFIO_DEVICE 2 +#define KVM_DEV_VFIO_DEVICE_FORWARD_IRQ 1 +#define KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ 2 enum kvm_device_type { KVM_DEV_TYPE_FSL_MPIC_20= 1, @@ -1018,6 +1021,15 @@ enum kvm_device_type { KVM_DEV_TYPE_MAX, }; +struct kvm_vfio_dev_irq { + __u32 argsz; /* structure length */ + __u32 fd; /* file descriptor of the VFIO device */ + __u32 index; /* VFIO device IRQ index */ + __u32 start; /* start of subindex range */ + __u32 count; /* size of subindex range */ + __u32 gsi[]; /* gsi, ie.
virtual IRQ number */ +}; + /* * ioctls for VM fds */
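Because kvm_vfio_dev_irq ends in a flexible gsi[] array, the caller must size argsz for 'count' elements. A hedged sketch of how userspace would build the request, assuming the series is applied (a local copy of the proposed struct is used here, with __u32 approximated as unsigned int); the final step would be a KVM_SET_DEVICE_ATTR on the VFIO-KVM device fd, as noted in the comment:

```c
#include <stdlib.h>
#include <string.h>

/* Local mirror of the proposed uapi structure. */
struct kvm_vfio_dev_irq {
    unsigned int argsz;  /* structure length, including gsi[] */
    unsigned int fd;     /* file descriptor of the VFIO device */
    unsigned int index;  /* VFIO device IRQ index */
    unsigned int start;  /* start of subindex range */
    unsigned int count;  /* size of subindex range */
    unsigned int gsi[];  /* one guest GSI per subindex */
};

/* Total allocation/argsz for 'count' subindexes. */
static size_t dev_irq_size(unsigned int count)
{
    return sizeof(struct kvm_vfio_dev_irq) + count * sizeof(unsigned int);
}

/* Build a forward request covering subindexes [0, count). */
static struct kvm_vfio_dev_irq *build_forward(int vfio_fd, unsigned int index,
                                              const unsigned int *gsis,
                                              unsigned int count)
{
    struct kvm_vfio_dev_irq *fwd = calloc(1, dev_irq_size(count));

    if (!fwd)
        return NULL;
    fwd->argsz = dev_irq_size(count);
    fwd->fd = vfio_fd;
    fwd->index = index;
    fwd->start = 0;
    fwd->count = count;
    memcpy(fwd->gsi, gsis, count * sizeof(*gsis));
    /* Userspace would then issue KVM_SET_DEVICE_ATTR on the VFIO-KVM
     * device fd with group = KVM_DEV_VFIO_DEVICE,
     * attr = KVM_DEV_VFIO_DEVICE_FORWARD_IRQ, addr = (__u64)fwd. */
    return fwd;
}
```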
Re: [PATCH v2 00/13] SMM implementation for KVM
On 05/27/2015 08:05 PM, Paolo Bonzini wrote: This brings together the remaining parts of SMM. For now I've left the "weird" interaction between SMM and NMI blocking, and I'm using the same format for the state save area (which is also the one used by QEMU) as the RFC. It builds on the previous cleanup patches, which (with the exception of "KVM: x86: pass kvm_mmu_page to gfn_to_rmap") are now in kvm/queue. The first six patches are more or less the same as the previous version, while the address spaces part hopefully touches all affected functions now. Patches 1-6 implement the SMM API and world switch; patches 7-12 implements the multiple address spaces; patch 13 ties the loose ends and advertises the capability. Tested with SeaBIOS and OVMF, where SMM provides the trusted base for secure boot. Nice work. While I did not do a thorough review, the mmu bits look robust. Thanks, Paolo Paolo Bonzini (13): KVM: x86: introduce num_emulated_msrs KVM: x86: pass host_initiated to functions that read MSRs KVM: x86: pass the whole hflags field to emulator and back KVM: x86: API changes for SMM support KVM: x86: stubs for SMM support KVM: x86: save/load state on SMM switch KVM: add vcpu-specific functions to read/write/translate GFNs KVM: implement multiple address spaces KVM: x86: pass kvm_mmu_page to gfn_to_rmap KVM: x86: use vcpu-specific functions to read/write/translate GFNs KVM: x86: work on all available address spaces KVM: x86: add SMM to the MMU role, support SMRAM address space KVM: x86: advertise KVM_CAP_X86_SMM Documentation/virtual/kvm/api.txt| 52 ++- arch/powerpc/include/asm/kvm_book3s_64.h | 2 +- arch/x86/include/asm/kvm_emulate.h | 9 +- arch/x86/include/asm/kvm_host.h | 44 ++- arch/x86/include/asm/vmx.h | 1 + arch/x86/include/uapi/asm/kvm.h | 11 +- arch/x86/kvm/cpuid.h | 8 + arch/x86/kvm/emulate.c | 262 +- arch/x86/kvm/kvm_cache_regs.h| 5 + arch/x86/kvm/lapic.c | 4 +- arch/x86/kvm/mmu.c | 171 +- arch/x86/kvm/mmu_audit.c | 16 +- arch/x86/kvm/paging_tmpl.h | 18 
+- arch/x86/kvm/svm.c | 73 ++-- arch/x86/kvm/trace.h | 22 ++ arch/x86/kvm/vmx.c | 106 +++--- arch/x86/kvm/x86.c | 562 ++- include/linux/kvm_host.h | 49 ++- include/uapi/linux/kvm.h | 6 +- virt/kvm/kvm_main.c | 237 ++--- 20 files changed, 1337 insertions(+), 321 deletions(-)
Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU
On 05/27/2015 12:30 PM, Paolo Bonzini wrote: On 26/05/2015 23:25, Christopher Covington wrote: On 05/25/2015 08:53 AM, Paolo Bonzini wrote: On 22/05/2015 13:12, Daniel P. Berrange wrote: In particular I don't see why we need to have a SATA controller and ISA/LPC bridge in every virt machine - root PCI bus only should be possible, as you can provide disks via virtio-blk or virtio-scsi and serial, parallel, mouse, floppy via PCI devices and/or by adding a USB bus in the cases where you really need one. I think removing the ISA/LPC bridge is hard. It includes the real-time clock and fw_cfg, for example. Could VirtIO specified replacements make sense for these peripherals? Not really. virtio is too heavyweight and you'd be reinventing the wheel unnecessarily. For example, ARM's "-M virt" uses a pl031 block for the RTC, and also uses fw_cfg. Another commonly used ISA device is the UART, for which again -M virt uses a pl011. The RTC can be replaced by kvmclock, the keyboard by virtio-console. Maybe we can provide an msr- or pci-based interface to fw_cfg.
Re: [RFC PATCH 2/4] KVM: x86: Add KVM exit for IOAPIC EOIs
On 05/27/2015 05:06 AM, Steve Rutherford wrote: On Sun, May 24, 2015 at 07:46:03PM +0300, Avi Kivity wrote: On 05/13/2015 04:47 AM, Steve Rutherford wrote: Adds KVM_EXIT_IOAPIC_EOI which passes the interrupt vector up to userspace. Uses a per VCPU exit bitmap to decide whether or not the IOAPIC needs to be informed (which is identical to the EOI_EXIT_BITMAP field used by modern x86 processors, but can also be used to elide kvm IOAPIC EOI exits on older processors). [Note: A prototype using ResampleFDs found that decoupling the EOI from the VCPU's thread made it possible for the VCPU to not see a recent EOI after reentering the guest. This does not match real hardware.] Compile tested for Intel x86. Signed-off-by: Steve Rutherford --- Documentation/virtual/kvm/api.txt | 10 ++ arch/x86/include/asm/kvm_host.h | 3 +++ arch/x86/kvm/lapic.c | 9 + arch/x86/kvm/x86.c| 11 +++ include/linux/kvm_host.h | 1 + include/uapi/linux/kvm.h | 5 + 6 files changed, 39 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 0744b4e..dd92996 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -3285,6 +3285,16 @@ Valid values for 'type' are: */ __u64 kvm_valid_regs; __u64 kvm_dirty_regs; + + /* KVM_EXIT_IOAPIC_EOI */ +struct { + __u8 vector; +} eoi; + +Indicates that an eoi of a level triggered IOAPIC interrupt on vector has +occurred, which should be handled by the userspace IOAPIC. Triggers when +the Irqchip has been split between userspace and the kernel. + The ioapic is a global resource, so it doesn't make sense for information about it to be returned in a per-vcpu structure. EOI exits are a per-vcpu behavior, so this doesn't seem all that strange. (or to block the vcpu while it is being processed). Blocking doesn't feel clean, but doesn't seem all that bad, given that these operations are relatively rare on modern configurations. Agree, maybe the realtime people have an interest here.
The way I'd model it is to emulate the APIC bus that connects local APICs and the IOAPIC, using a socket pair. When the user-space ioapic wants to inject an interrupt, it sends a message to the local APICs which then inject it, and when it's ack'ed the EOI is sent back on the same bus. Although I'm not certain about this, it sounds to me like this would require a kernel thread to be waiting (in some way) on this socket, which seems rather heavy handed. It's been a while since I did kernel programming, but I think you can queue a callback to be called when an I/O is ready, and not require a thread. IIRC we do that with irqfd to cause an interrupt to be injected.
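Avi's "emulated APIC bus" maps naturally onto a socket pair: one end held by the userspace IOAPIC, the other by the local-APIC side, with inject messages going one way and EOI acks coming back. A hedged userspace sketch (the two-byte wire format is invented for illustration):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Invented wire format for the modeled APIC bus. */
enum { MSG_INJECT = 0, MSG_EOI = 1 };

struct apic_bus_msg {
    unsigned char type;   /* MSG_INJECT or MSG_EOI */
    unsigned char vector; /* interrupt vector */
};

/* IOAPIC side: put a vector on the bus. */
static int ioapic_inject(int bus_fd, unsigned char vector)
{
    struct apic_bus_msg m = { MSG_INJECT, vector };

    return write(bus_fd, &m, sizeof(m)) == sizeof(m) ? 0 : -1;
}

/* LAPIC side: take the vector, "deliver" it, and ack with an EOI
 * message on the same bus once the guest EOIs. */
static int lapic_service_one(int bus_fd, unsigned char *vector_out)
{
    struct apic_bus_msg m;

    if (read(bus_fd, &m, sizeof(m)) != sizeof(m) || m.type != MSG_INJECT)
        return -1;
    *vector_out = m.vector;
    m.type = MSG_EOI;
    return write(bus_fd, &m, sizeof(m)) == sizeof(m) ? 0 : -1;
}

/* IOAPIC side: collect the EOI so level-triggered state can be
 * reevaluated without a synchronous userspace exit. */
static int ioapic_wait_eoi(int bus_fd, unsigned char *vector_out)
{
    struct apic_bus_msg m;

    if (read(bus_fd, &m, sizeof(m)) != sizeof(m) || m.type != MSG_EOI)
        return -1;
    *vector_out = m.vector;
    return 0;
}
```

In the kernel, as the reply notes, the read side would hang off an I/O-ready callback (as irqfd does) rather than a dedicated thread.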
Re: [PATCH 11/12] KVM: x86: add KVM_MEM_X86_SMRAM memory slot flag
On 05/08/2015 02:20 PM, Paolo Bonzini wrote: This adds an arch-specific memslot flag that hides slots unless the VCPU is in system management mode. Some care is needed in order to limit the overhead of x86_gfn_to_memslot when compared with gfn_to_memslot. Thankfully, we have __gfn_to_memslot and search_memslots which are the same, so we can add some extra output to search_memslots. The compiler will optimize it as dead code in __gfn_to_memslot, and will use it to thread jumps in x86_gfn_to_memslot. Signed-off-by: Paolo Bonzini --- Documentation/virtual/kvm/api.txt | 18 -- arch/x86/include/uapi/asm/kvm.h | 3 +++ arch/x86/kvm/smram.c | 25 +++-- include/linux/kvm_host.h | 14 ++ virt/kvm/kvm_main.c | 4 5 files changed, 52 insertions(+), 12 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 51523b70b6b2..2bc99ae040da 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -933,18 +933,24 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr be identical. This allows large pages in the guest to be backed by large pages in the host. -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, -to make a new slot read-only. In this case, writes to this memory will be -posted to userspace as KVM_EXIT_MMIO exits. +The flags field supports two architecture-independent flags: +KVM_MEM_LOG_DIRTY_PAGES and KVM_MEM_READONLY. The former can be set +to instruct KVM to keep track of writes to memory within the slot. +See KVM_GET_DIRTY_LOG ioctl to know how to use it. The latter can +be set, if KVM_CAP_READONLY_MEM capability allows it, to make a new +slot read-only. 
In this case, writes to this memory will be posted to +userspace as KVM_EXIT_MMIO exits. When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. For example, an mmap() that affects the region will be made visible immediately. Another example is madvise(MADV_DROP). +Each architectures can support other specific flags. Right now there is +only one defined. On x86, if KVM_CAP_X86_SMM is available, the +KVM_MEM_X86_SMRAM flag will hide the slot to VCPUs that are not +in system management mode. Is this generic enough? For example, a system could configure itself so that an SMRAM region goes to mmio, hiding real RAM. I see two alternatives: - have three states: SMM, !SMM, both - define two tables: SMM, !SMM, both spanning the entire address space you should probably document how dirty bitmap handling happens in the presence of SMM. + It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl. The KVM_SET_MEMORY_REGION does not allow fine grained control over memory allocation and is deprecated. diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index 30100a3c1bed..46df15bc844f 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -45,6 +45,9 @@ #define __KVM_HAVE_XCRS #define __KVM_HAVE_READONLY_MEM +#define __KVM_ARCH_VALID_FLAGS KVM_MEM_X86_SMRAM +#define KVM_MEM_X86_SMRAM (1 << 24) + /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 diff --git a/arch/x86/kvm/smram.c b/arch/x86/kvm/smram.c index 73616edab631..e7dd933673a4 100644 --- a/arch/x86/kvm/smram.c +++ b/arch/x86/kvm/smram.c @@ -19,10 +19,23 @@ #include #include +#include "kvm_cache_regs.h" struct kvm_memory_slot *x86_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn) { - struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn); + /* By using search_memslots directly the compiler can optimize away +* the "if (found)" check below. 
+ * +* It cannot do the same for gfn_to_memslot because it is not inlined, +* and it also cannot do the same for __gfn_to_memslot because the +* kernel is compiled with -fno-delete-null-pointer-checks. +*/ + bool found; + struct kvm_memslots *memslots = kvm_memslots(vcpu->kvm); + struct kvm_memory_slot *slot = search_memslots(memslots, gfn, &found); + + if (found && unlikely(slot->flags & KVM_MEM_X86_SMRAM) && !is_smm(vcpu)) + return NULL; return slot; } @@ -112,7 +125,15 @@ EXPORT_SYMBOL_GPL(x86_read_guest); int x86_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc, gpa_t gpa, unsigned long len) { - return kvm_gfn_to_hva_cache_init(kvm, ghc, gpa, len); + int r = kvm_gfn_to_hva_ca
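Stripped of the inlining subtleties, the lookup being discussed reduces to: find the slot covering the gfn, then hide it when it carries the SMRAM flag and the VCPU is not in SMM. A userspace model of that check (slot layout invented; the flag value is the one from the patch, non-overlapping slots assumed, which is exactly the simplification Avi's MMIO-overlay comment questions):

```c
#include <stddef.h>

#define KVM_MEM_X86_SMRAM (1u << 24) /* flag value from the patch */

struct memslot {
    unsigned long base_gfn;
    unsigned long npages;
    unsigned int flags;
};

/* Model of x86_gfn_to_memslot(): an SMRAM-flagged slot is visible
 * only while the VCPU is in system management mode. */
static const struct memslot *gfn_to_memslot(const struct memslot *slots,
                                            size_t n, unsigned long gfn,
                                            int in_smm)
{
    for (size_t i = 0; i < n; i++) {
        const struct memslot *s = &slots[i];

        if (gfn < s->base_gfn || gfn >= s->base_gfn + s->npages)
            continue;
        if ((s->flags & KVM_MEM_X86_SMRAM) && !in_smm)
            return NULL; /* SMRAM is hidden outside SMM */
        return s;
    }
    return NULL;
}
```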
Re: [RFC PATCH 2/4] KVM: x86: Add KVM exit for IOAPIC EOIs
On 05/13/2015 04:47 AM, Steve Rutherford wrote: Adds KVM_EXIT_IOAPIC_EOI which passes the interrupt vector up to userspace. Uses a per VCPU exit bitmap to decide whether or not the IOAPIC needs to be informed (which is identical to the EOI_EXIT_BITMAP field used by modern x86 processors, but can also be used to elide kvm IOAPIC EOI exits on older processors). [Note: A prototype using ResampleFDs found that decoupling the EOI from the VCPU's thread made it possible for the VCPU to not see a recent EOI after reentering the guest. This does not match real hardware.] Compile tested for Intel x86. Signed-off-by: Steve Rutherford --- Documentation/virtual/kvm/api.txt | 10 ++ arch/x86/include/asm/kvm_host.h | 3 +++ arch/x86/kvm/lapic.c | 9 + arch/x86/kvm/x86.c| 11 +++ include/linux/kvm_host.h | 1 + include/uapi/linux/kvm.h | 5 + 6 files changed, 39 insertions(+) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 0744b4e..dd92996 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -3285,6 +3285,16 @@ Valid values for 'type' are: */ __u64 kvm_valid_regs; __u64 kvm_dirty_regs; + + /* KVM_EXIT_IOAPIC_EOI */ +struct { + __u8 vector; +} eoi; + +Indicates that an eoi of a level triggered IOAPIC interrupt on vector has +occurred, which should be handled by the userspace IOAPIC. Triggers when +the Irqchip has been split between userspace and the kernel. + The ioapic is a global resource, so it doesn't make sense for information about it to be returned in a per-vcpu structure (or to block the vcpu while it is being processed). The way I'd model it is to emulate the APIC bus that connects local APICs and the IOAPIC, using a socket pair. When the user-space ioapic wants to inject an interrupt, it sends a message to the local APICs which then inject it, and when it's ack'ed the EOI is sent back on the same bus. It's true that the APIC bus no longer exists, but modern processors still pretend it does. 
Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU
On 05/21/2015 07:21 PM, Paolo Bonzini wrote: On 21/05/2015 17:48, Avi Kivity wrote: Lovely! Note you have memcpy.o instead of memcpy.c. Doh, and it's not used anyway. Check the repository, and let me know if OSv boots with it (it probably needs ACPI; Linux doesn't boot virtio without ACPI). Yes, it requires ACPI. We don't implement the pre-ACPI bootstrap methods.
Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU
On 05/21/2015 04:51 PM, Paolo Bonzini wrote: Some of you may have heard about the "Clear Containers" initiative from Intel, which couples KVM with various kernel tricks to create extremely lightweight virtual machines. The experimental Clear Containers setup requires only 18-20 MB to launch a virtual machine, and needs about 60 ms to boot. Now, as all of you probably know, "QEMU is great for running Windows or legacy Linux guests, but that flexibility comes at a hefty price. Not only does all of the emulation consume memory, it also requires some form of low-level firmware in the guest as well. All of this adds quite a bit to virtual-machine startup times (500 to 700 milliseconds is not unusual)". Right? In fact, it's for this reason that Clear Containers uses kvmtool instead of QEMU. No, wrong! In fact, reporting bad performance is pretty much the same as throwing down the gauntlet. Enter qboot, a minimal x86 firmware that runs on QEMU and, together with a slimmed-down QEMU configuration, boots a virtual machine in 40 milliseconds[2] on an Ivy Bridge Core i7 processor. qboot is available at git://github.com/bonzini/qboot.git. In all the glory of its 8KB of code, it brings together various existing open source components:
* a minimal (really minimal) 16-bit BIOS runtime based on kvmtool's own BIOS
* a couple of hardware initialization routines written mostly from scratch but with good help from SeaBIOS source code
* a minimal 32-bit libc based on kvm-unit-tests
* the Linux loader from QEMU itself
The repository has more information on how to achieve fast boot times, and examples of using qboot. Right now there is a limit of 8 MB for vmlinuz+initrd+cmdline, which however should be enough for initrd-less containers. The first commit to qboot is more or less 24 hours old, so there is definitely more work to do, in particular to extract ACPI tables from QEMU and present them to the guest.
This is probably another day of work or so, and it will enable multiprocessor guests with little or no impact on the boot times. SMBIOS information is also available from QEMU. On the QEMU side, there is no support yet for persistent memory and the NFIT tables from ACPI 6.0. Once that (and ACPI support) is added, qboot will automatically start using it. Happy hacking! Lovely! Note you have memcpy.o instead of memcpy.c.
Re: SVM: vmload/vmsave-free VM exits?
On 04/13/2015 08:57 PM, Jan Kiszka wrote: On 2015-04-13 19:48, Avi Kivity wrote: I think that Xen does (or did) something along the lines of disabling IST usage (by playing with the descriptors in the IDT) and then re-enabling them when exiting to userspace. So we would reuse that active stack for the current IST users until then. Yes. But I bet there are subtle details that prevent a simple switch at IDT level. Hmm, no low-hanging fruit it seems... For sure. It's not insurmountable, but fairly hard. [17] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/26712/ That thread proposed the complete IST removal. But, given that we still have it 7 years later, Well, it's not as if a crack team of kernel hackers was laboring night and day to remove it, but... I suppose that was not very welcome in general. Simply removing it is impossible, or an NMI happening immediately after SYSCALL will hit user-provided %rsp. Thanks, Jan PS: For the Jailhouse readers: we don't use IST. You don't have userspace, yes? Only guests?
Re: SVM: vmload/vmsave-free VM exits?
On 04/13/2015 08:41 PM, Avi Kivity wrote: On 04/13/2015 08:35 PM, Jan Kiszka wrote: On 2015-04-13 19:29, Avi Kivity wrote: On 04/13/2015 10:01 AM, Jan Kiszka wrote: On 2015-04-07 07:43, Jan Kiszka wrote: On 2015-04-05 19:12, Valentine Sinitsyn wrote: Hi Jan, On 05.04.2015 13:31, Jan Kiszka wrote: studying the VM exit logic of Jailhouse, I was wondering when AMD's vmload/vmsave can be avoided. Jailhouse as well as KVM currently use these instructions unconditionally. However, I think both only need GS.base, i.e. the per-cpu base address, to be saved and restored if no user space exit or no CPU migration is involved (both is always true for Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also still uses rsp-based per-cpu variables. So the question boils down to what is generally faster: A) vmload vmrun vmsave B) wrmsrl(MSR_GS_BASE, guest_gs_base) vmrun rdmsrl(MSR_GS_BASE, guest_gs_base) Of course, KVM also has to take into account that heavyweight exits still require vmload/vmsave, thus become more expensive with B) due to the additional MSR accesses. Any thoughts or results of previous experiments? That's a good question, I also thought about it when I was finalizing Jailhouse AMD port. I tried "lightweight exits" with apic-demo but it didn't seem to affect the latency in any noticeable way. That's why I decided not to push the patch (in fact, I was even unable to find it now). Note however that how AMD chips store host state during VM switches are implementation-specific. I did my quick experiments on one CPU only, so your mileage may vary. Regarding your question, I feel B will be faster anyways but again I'm afraid that the gain could be within statistical error of the experiment. It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more towards 600 if they are colder (added some usleep to each loop in the test). I've tested via vmmcall from guest userspace under Jailhouse. KVM should be adjustable in a similar way. 
Attached the benchmark, patch will be in the Jailhouse next branch soon. We need to check more CPU types, though. Avi, I found some preparatory patches of yours from 2010 [1]. Do you happen to remember if it was never completed for a technical reason? IIRC, I came to the conclusion that it was impossible. Something about TR.size not receiving a reasonable value. Let me see. To my understanding, TR doesn't play a role until we leave ring 0 again. Or what could make the CPU look for any of the fields in the 64-bit TSS before that? Exceptions that utilize the IST. I found a writeup [17] that describes this, but I think it's even more impossible than that writeup implies. I think that Xen does (or did) something along the lines of disabling IST usage (by playing with the descriptors in the IDT) and then re-enabling them when exiting to userspace. [17] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/26712/ Jan
Re: SVM: vmload/vmsave-free VM exits?
On 04/13/2015 08:35 PM, Jan Kiszka wrote: On 2015-04-13 19:29, Avi Kivity wrote: On 04/13/2015 10:01 AM, Jan Kiszka wrote: On 2015-04-07 07:43, Jan Kiszka wrote: On 2015-04-05 19:12, Valentine Sinitsyn wrote: Hi Jan, On 05.04.2015 13:31, Jan Kiszka wrote: studying the VM exit logic of Jailhouse, I was wondering when AMD's vmload/vmsave can be avoided. Jailhouse as well as KVM currently use these instructions unconditionally. However, I think both only need GS.base, i.e. the per-cpu base address, to be saved and restored if no user space exit or no CPU migration is involved (both is always true for Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also still uses rsp-based per-cpu variables. So the question boils down to what is generally faster: A) vmload vmrun vmsave B) wrmsrl(MSR_GS_BASE, guest_gs_base) vmrun rdmsrl(MSR_GS_BASE, guest_gs_base) Of course, KVM also has to take into account that heavyweight exits still require vmload/vmsave, thus become more expensive with B) due to the additional MSR accesses. Any thoughts or results of previous experiments? That's a good question, I also thought about it when I was finalizing Jailhouse AMD port. I tried "lightweight exits" with apic-demo but it didn't seem to affect the latency in any noticeable way. That's why I decided not to push the patch (in fact, I was even unable to find it now). Note however that how AMD chips store host state during VM switches are implementation-specific. I did my quick experiments on one CPU only, so your mileage may vary. Regarding your question, I feel B will be faster anyways but again I'm afraid that the gain could be within statistical error of the experiment. It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more towards 600 if they are colder (added some usleep to each loop in the test). I've tested via vmmcall from guest userspace under Jailhouse. KVM should be adjustable in a similar way. 
Attached the benchmark, patch will be in the Jailhouse next branch soon. We need to check more CPU types, though. Avi, I found some preparatory patches of yours from 2010 [1]. Do you happen to remember if it was never completed for a technical reason? IIRC, I came to the conclusion that it was impossible. Something about TR.size not receiving a reasonable value. Let me see. To my understanding, TR doesn't play a role until we leave ring 0 again. Or what could make the CPU look for any of the fields in the 64-bit TSS before that? Exceptions that utilize the IST. I found a writeup [17] that describes this, but I think it's even more impossible than that writeup implies. [17] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/26712/ Jan
Re: SVM: vmload/vmsave-free VM exits?
On 04/13/2015 10:01 AM, Jan Kiszka wrote: On 2015-04-07 07:43, Jan Kiszka wrote: On 2015-04-05 19:12, Valentine Sinitsyn wrote: Hi Jan, On 05.04.2015 13:31, Jan Kiszka wrote: studying the VM exit logic of Jailhouse, I was wondering when AMD's vmload/vmsave can be avoided. Jailhouse as well as KVM currently use these instructions unconditionally. However, I think both only need GS.base, i.e. the per-cpu base address, to be saved and restored if no user space exit or no CPU migration is involved (both are always true for Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also still uses rsp-based per-cpu variables. So the question boils down to what is generally faster:

A) vmload
   vmrun
   vmsave

B) wrmsrl(MSR_GS_BASE, guest_gs_base)
   vmrun
   rdmsrl(MSR_GS_BASE, guest_gs_base)

Of course, KVM also has to take into account that heavyweight exits still require vmload/vmsave, thus become more expensive with B) due to the additional MSR accesses. Any thoughts or results of previous experiments? That's a good question, I also thought about it when I was finalizing the Jailhouse AMD port. I tried "lightweight exits" with apic-demo but it didn't seem to affect the latency in any noticeable way. That's why I decided not to push the patch (in fact, I was even unable to find it now). Note however that how AMD chips store host state during VM switches is implementation-specific. I did my quick experiments on one CPU only, so your mileage may vary. Regarding your question, I feel B will be faster anyway, but again I'm afraid that the gain could be within the statistical error of the experiment. It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more towards 600 if they are colder (added some usleep to each loop in the test). I've tested via vmmcall from guest userspace under Jailhouse. KVM should be adjustable in a similar way. Attached the benchmark, patch will be in the Jailhouse next branch soon. We need to check more CPU types, though.
Avi, I found some preparatory patches of yours from 2010 [1]. Do you happen to remember if it was never completed for a technical reason? IIRC, I came to the conclusion that it was impossible. Something about TR.size not receiving a reasonable value. Let me see. Joel, can you comment on the benefit of variant B) for the various AMD CPUs? Is it always positive? Thanks, Jan [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61455
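The trade-off behind Jan's question can be put into a back-of-the-envelope model. Only the ~160-cycle hot-cache delta comes from the measurement in this thread; the per-pair MSR cost and the exit mix below are made-up illustration numbers, chosen so that A minus B equals 160 cycles per lightweight exit:

```python
# Variant A pays vmload+vmsave on every exit; variant B swaps GS.base via
# wrmsr/rdmsr instead, but heavyweight exits still need vmload+vmsave on
# top of the (now wasted) MSR pair. Cycle figures are illustrative only.
def cycles_a(light, heavy, vmload_vmsave=260):
    return (light + heavy) * vmload_vmsave

def cycles_b(light, heavy, vmload_vmsave=260, msr_pair=100):
    return light * msr_pair + heavy * (vmload_vmsave + msr_pair)

light, heavy = 95_000, 5_000          # hypothetical exits per second
saved = cycles_a(light, heavy) - cycles_b(light, heavy)
print(saved)                          # positive => variant B wins for this mix
```

With these numbers each lightweight exit saves `vmload_vmsave - msr_pair` cycles while each heavyweight exit costs an extra `msr_pair`, so B only loses when heavyweight exits dominate the mix.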
Re: x86: Question regarding the reset value of LINT0
On 04/09/2015 09:21 PM, Nadav Amit wrote: Bandan Das wrote: Nadav Amit writes: Jan Kiszka wrote: On 2015-04-08 19:40, Nadav Amit wrote: Jan Kiszka wrote: On 2015-04-08 18:59, Nadav Amit wrote: Jan Kiszka wrote: On 2015-04-08 18:40, Nadav Amit wrote: Hi, I would appreciate if someone explains the reason for enabling LINT0 during APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local Vector Table”, which says all LVT registers are reset to 0x10000. In kvm_lapic_reset, I see:

apic_set_reg(apic, APIC_LVT0, SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));

Which is actually pretty similar to QEMU’s apic_reset_common:

if (bsp) {
    /*
     * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization
     * time typically by BIOS, so PIC interrupt can be delivered to the
     * processor when local APIC is enabled.
     */
    s->lvt[APIC_LVT_LINT0] = 0x700;
}

Yet, in both cases, I miss the point - if it is typically done by the BIOS, why does QEMU or KVM enable it? BTW: KVM seems to run fine without it, and I think setting it causes me problems in certain cases. I suspect it has some historic BIOS backgrounds. Already tried to find more information in the git logs of both code bases? Or something that indicates that SeaBIOS or BochsBIOS once didn't do this initialization? Thanks. I found no indication of such thing. QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says: Don't route PIC interrupts through the local APIC if the local APIC config says so. By Ari Kivity. Maybe Avi Kivity knows this guy. ths? That should have been Thiemo Seufer (IIRC), but he just committed the code back then (and is no longer with us, sadly). Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke about Avi knowing “Ari”). Ah. No problem. My brain apparently fixed that typo up unnoticed. But if that commit went in without any BIOS changes around it, QEMU simply had to do the job of the latter to keep things working. So should I leave it as is?
Can I at least disable in KVM during INIT (and leave it as is for RESET)? No, I don't think there is a need to leave this inaccurate for QEMU if our included BIOS gets it right. I don't know what the backward bug-compatibility of KVM is, though. Maybe you can identify since when our BIOS is fine so that we can discuss time frames. I think that it was addressed in commit 19c1a7692bf65fc40e56f93ad00cc3eefaad22a4 ("Initialize the LINT LVTs on the local APIC of the BSP.”) So it should be included in seabios 0.5.0, which means qemu 0.12 - so we are talking about the end of 2009 or start of 2010. The probability that someone will use a newer version of kernel with something as old as 0.12 is probably minimal. I think it's ok to change it with a comment indicating the reason. To be on the safe side, however, a user changeable switch is something worth considering. I don’t see any existing mechanism for KVM to be aware of its user type and version. I do see another case of KVM hacks that are intended for fixing very old QEMU bugs (see 3a624e29c75 changes in vmx_set_segment, which are from pretty much the same time-frame of the issue I try to fix). Since this is something which would follow around, please advise what would be the format. A new ioctl that would supply the userspace “type” (according to predefined constants) and version? That would be madness. KVM shouldn't even know that qemu exists, let alone track its versions. Simply add a new toggle KVM_USE_STANDARD_LAPIC_LVT_INIT and document that userspace MUST use it. Old userspace won't, and will get the old buggy behavior.
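To make the two LINT0 values in this thread concrete, here is a sketch of the LVT encoding (field layout per the SDM's local-vector-table figure; the helper mirrors what the SET_APIC_DELIVERY_MODE macro computes):

```python
APIC_MODE_EXTINT = 0x7        # delivery-mode field value for ExtInt
APIC_LVT_MASKED = 1 << 16     # bit 16: interrupt masked

def set_apic_delivery_mode(reg, mode):
    # the delivery mode occupies bits 10:8 of an LVT entry
    return (reg & ~(0x7 << 8)) | (mode << 8)

# What kvm_lapic_reset and QEMU's apic_reset_common program into LINT0:
lint0_kvm = set_apic_delivery_mode(0, APIC_MODE_EXTINT)
print(hex(lint0_kvm))         # 0x700, matching s->lvt[APIC_LVT_LINT0] = 0x700

# What the SDM specifies as the LVT reset value: everything clear, masked
lint0_reset = APIC_LVT_MASKED
print(hex(lint0_reset))       # 0x10000
```

So the disagreement in the thread is exactly between 0x700 (unmasked, ExtInt delivery) and the architectural reset value 0x10000 (masked, fixed delivery).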
Re: 2 CPU Conformance Issue in KVM/x86
On 03/10/2015 12:47 PM, Paolo Bonzini wrote: On 09/03/2015 20:49, Avi Kivity wrote: Yes, and it checked that MAXPHYADDR != 52 before. If you want to set only one bit, making that bit 51 makes sense anyway for simplicity, so it is still 99.9% academic. Once processors appear with MAXPHYADDR = 52, the remaining 0.1% will become more relevant. The current limit is IIRC 46 or 48 (on Haswell Xeons). It will be interesting to have processors with 52 bits of physical address and 48 bits of virtual address. HIGHMEM for x86_64? Or 5-level page tables? I wonder why Intel chose exactly 52... HIGHMEM seems more likely than 5-level page tables. Certainly it wouldn't need hacks like Ingo's 4G-4G. My bitcoins are on 5-level page tables. HIGHMEM is too much pain. 50 bits == 1 PiB. That's quite an amount of RAM. Not that 64 TiB is not "quite an amount of RAM". :) Depends on how many browser tabs you have open, I guess.
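The arithmetic behind those closing numbers, as a quick check (x86-64 paging resolves 9 virtual-address bits per table level on top of the 12-bit page offset):

```python
def va_bits(levels):
    # 4 KiB pages: 12 offset bits, plus 9 bits per page-table level
    # (512 eight-byte entries per 4 KiB table)
    return 12 + 9 * levels

print(va_bits(4))        # 48: today's 4-level paging
print(va_bits(5))        # 57: what a fifth level would buy

TiB = 2 ** 40
print(2 ** 46 // TiB)    # 64 TiB, the ~46-bit limit mentioned above
print(2 ** 50 // TiB)    # 1024 TiB, i.e. the 1 PiB at 50 bits
print(2 ** 52 // TiB)    # 4096 TiB, the full MAXPHYADDR=52 space
```

This is why 52-bit physical with 48-bit virtual addresses would be awkward: physical memory could exceed what a 4-level virtual address space can map at once, forcing either HIGHMEM-style windowing or a fifth paging level.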
Re: 2 CPU Conformance Issue in KVM/x86
On 03/09/2015 09:33 PM, Paolo Bonzini wrote: On 09/03/2015 18:08, Avi Kivity wrote: Is the issue emulating a higher MAXPHYADDR on the guest than is available on the host? I don't think there's any need to support that. No, indeed. The only problem is that the failure mode is quite horrible (you get a triple fault, possibly while the guest is running). Can't qemu simply check for it? If you need that, disable EPT. :)
Re: 2 CPU Conformance Issue in KVM/x86
On 03/09/2015 09:38 PM, Paolo Bonzini wrote: On 09/03/2015 20:19, Avi Kivity wrote: I can't think of one with reasonable performance either. Perhaps the maintainers could raise the issue with Intel. It looks academic but it can happen in real life -- KVM for example used to rely on reserved bits faults (it set all bits in the PTE so it wouldn't have been caught by this). Yes, and it checked that MAXPHYADDR != 52 before. If you want to set only one bit, making that bit 51 makes sense anyway for simplicity, so it is still 99.9% academic. Once processors appear with MAXPHYADDR = 52, the remaining 0.1% will become more relevant. The current limit is IIRC 46 or 48 (on Haswell Xeons). It will be interesting to have processors with 52 bits of physical address and 48 bits of virtual address. HIGHMEM for x86_64? Or 5-level page tables? 50 bits == 1 PiB. That's quite an amount of RAM.
Re: 2 CPU Conformance Issue in KVM/x86
On 03/09/2015 09:07 PM, Nadav Amit wrote: Avi Kivity wrote: On 03/09/2015 07:51 PM, Nadav Amit wrote: Avi Kivity wrote: On 03/03/2015 11:52 AM, Paolo Bonzini wrote: In this case, the VM might expect exceptions when PTE bits which are higher than the maximum (reported) address width are set, and it would not get such exceptions. This problem can easily be experienced by small change to the existing KVM unit-tests. There are many variants to this problem, and the only solution which I consider complete is to report to the VM the maximum (52) physical address width to the VM, configure the VM to exit on #PF with reserved-bit error-codes, and then emulate these faulting instructions. Not even that would be a definitive solution. If the guest tries to map RAM (e.g. a PCI BAR that is backed by RAM) above the host MAXPHYADDR, you would get EPT misconfiguration vmexits. I think there is no way to emulate physical address width correctly, except by disabling EPT. Is the issue emulating a higher MAXPHYADDR on the guest than is available on the host? I don't think there's any need to support that. Emulating a lower setting on the guest than is available on the host is, I think, desirable. Whether it would work depends on the relative priority of EPT misconfiguration exits vs. page table permission faults. Thanks for the feedback. Guest page-table permissions faults got priority over EPT misconfiguration. KVM can even be set to trap page-table permission faults, at least in VT-x. Anyhow, I don’t think it is enough. Why is it not enough? If you trap a permission fault, you can inject any exception error code you like. Because there is no real permission fault. In the following example, the VM expects one (VM’s MAXPHYADDR=40), but there isn’t (Host’s MAXPHYADDR=46), so the hypervisor cannot trap it. It can only trap all #PF, which is obviously too intrusive. There are three cases: 1) The guest has marked the page as not present. 
In this case, no reserved bits are set and the guest should receive its #PF.

2) The page is present and the permissions are sufficient. In this case, you will get an EPT misconfiguration and can proceed to inject a #PF with the reserved bit flag set.

3) The page is present but permissions are not sufficient. In this case you can trap the fault via the PFEC_MASK register and inject a #PF to the guest.

So you can emulate it and only trap permission faults. It's still too expensive though. Here is an example: My machine has MAXPHYADDR of 46. I modified the kvm-unit-tests access test to set pte.45 instead of pte.51, which from the VM point-of-view should cause the #PF error-code to indicate the reserved bits are set (just as pte.51 does). Here is one error from the log:

test pte.p pte.45 pde.p user: FAIL: error code 5 expected d
Dump mapping: address: 1234
--L4: 304b007
--L3: 304c007
--L2: 304d001
--L1: 2201

This is with an ept misconfig programmed into that address, yes? A reserved bit in the PTE is set - from the VM point-of-view. If there wasn’t another cause for #PF, it would lead to EPT violation/misconfig. As you can see, the #PF should have had two reasons: reserved bits, and user access to supervisor only page. The error-code however does not indicate the reserved-bits are set. Note that KVM did not trap any exit on that faulting instruction, as otherwise it would try to emulate the instruction and assuming it is supported (and that the #PF was not on an instruction fetch), should be able to emulate the #PF correctly. [ The test actually crashes soon after this error due to these reasons. ] Anyhow, that is the reason for me to assume that having the maximum MAXPHYADDR is better. Well, that doesn't work for the reasons Paolo noted. The guest can have an ivshmem device attached, and map it above a host-supported virtual address, and suddenly it goes slow. I fully understand. That’s the reason I don’t have a reasonable solution.
I can't think of one with reasonable performance either. Perhaps the maintainers could raise the issue with Intel. It looks academic but it can happen in real life -- KVM for example used to rely on reserved bits faults (it set all bits in the PTE so it wouldn't have been caught by this).
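Avi's three cases, together with the reserved-bit test Nadav describes, can be sketched as a toy model. Everything here is hypothetical naming for illustration; the guest sees MAXPHYADDR=40 and the host has 46, as in the example above:

```python
PTE_PRESENT = 1 << 0

def reserved_addr_mask(maxphyaddr):
    # PTE physical-address bits occupy 12..51; bits MAXPHYADDR..51 must
    # be zero, so they form the reserved mask for a given address width.
    return ((1 << 52) - 1) & ~((1 << maxphyaddr) - 1)

def classify(pte, perms_ok, guest_maxphyaddr=40):
    """Toy model: which path would handle a guest access through this PTE."""
    if not (pte & PTE_PRESENT):
        return "native #PF (P=0)"            # case 1: delivered directly
    if not perms_ok:
        # case 3: the permission fault wins over EPT misconfig; hardware
        # omits RSVD, so a hypervisor trapping via the PFEC mask would
        # have to OR RSVD into the injected error code itself
        return "trapped #PF via PFEC mask"
    if pte & reserved_addr_mask(guest_maxphyaddr):
        return "EPT misconfig -> inject #PF with RSVD"   # case 2
    return "no fault"

pte = PTE_PRESENT | (1 << 45)   # reserved for the guest, valid on the host
print(classify(pte, perms_ok=True))
print(bool(pte & reserved_addr_mask(46)))   # False: host hardware won't fault
```

The failing test in the log is exactly case 3 with reserved bits also set: the hardware reports the permission fault (error code 5) and nothing adds the RSVD bit the guest expects (error code d).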
Re: 2 CPU Conformance Issue in KVM/x86
On 03/09/2015 07:51 PM, Nadav Amit wrote: Avi Kivity wrote: On 03/03/2015 11:52 AM, Paolo Bonzini wrote: In this case, the VM might expect exceptions when PTE bits which are higher than the maximum (reported) address width are set, and it would not get such exceptions. This problem can easily be experienced by small change to the existing KVM unit-tests. There are many variants to this problem, and the only solution which I consider complete is to report to the VM the maximum (52) physical address width to the VM, configure the VM to exit on #PF with reserved-bit error-codes, and then emulate these faulting instructions. Not even that would be a definitive solution. If the guest tries to map RAM (e.g. a PCI BAR that is backed by RAM) above the host MAXPHYADDR, you would get EPT misconfiguration vmexits. I think there is no way to emulate physical address width correctly, except by disabling EPT. Is the issue emulating a higher MAXPHYADDR on the guest than is available on the host? I don't think there's any need to support that. Emulating a lower setting on the guest than is available on the host is, I think, desirable. Whether it would work depends on the relative priority of EPT misconfiguration exits vs. page table permission faults. Thanks for the feedback. Guest page-table permissions faults got priority over EPT misconfiguration. KVM can even be set to trap page-table permission faults, at least in VT-x. Anyhow, I don’t think it is enough. Why is it not enough? If you trap a permission fault, you can inject any exception error code you like. Here is an example My machine has MAXPHYADDR of 46. I modified kvm-unit-tests access test to set pte.45 instead of pte.51, which from the VM point-of-view should cause the #PF error-code indicate the reserved bits are set (just as pte.51 does). 
Here is one error from the log:

test pte.p pte.45 pde.p user: FAIL: error code 5 expected d
Dump mapping: address: 1234
--L4: 304b007
--L3: 304c007
--L2: 304d001
--L1: 2201

This is with an ept misconfig programmed into that address, yes? As you can see, the #PF should have had two reasons: reserved bits, and user access to supervisor only page. The error-code however does not indicate the reserved-bits are set. Note that KVM did not trap any exit on that faulting instruction, as otherwise it would try to emulate the instruction and assuming it is supported (and that the #PF was not on an instruction fetch), should be able to emulate the #PF correctly. [ The test actually crashes soon after this error due to these reasons. ] Anyhow, that is the reason for me to assume that having the maximum MAXPHYADDR is better. Well, that doesn't work for the reasons Paolo noted. The guest can have an ivshmem device attached, and map it above a host-supported virtual address, and suddenly it goes slow.
Re: 2 CPU Conformance Issue in KVM/x86
On 03/03/2015 11:52 AM, Paolo Bonzini wrote: In this case, the VM might expect exceptions when PTE bits which are higher than the maximum (reported) address width are set, and it would not get such exceptions. This problem can easily be experienced by a small change to the existing KVM unit-tests. There are many variants to this problem, and the only solution which I consider complete is to report to the VM the maximum (52) physical address width, configure the VM to exit on #PF with reserved-bit error-codes, and then emulate these faulting instructions. Not even that would be a definitive solution. If the guest tries to map RAM (e.g. a PCI BAR that is backed by RAM) above the host MAXPHYADDR, you would get EPT misconfiguration vmexits. I think there is no way to emulate physical address width correctly, except by disabling EPT. Is the issue emulating a higher MAXPHYADDR on the guest than is available on the host? I don't think there's any need to support that. Emulating a lower setting on the guest than is available on the host is, I think, desirable. Whether it would work depends on the relative priority of EPT misconfiguration exits vs. page table permission faults.
Re: Seeking a KVM benchmark
On 11/10/2014 02:15 PM, Paolo Bonzini wrote: On 10/11/2014 11:45, Gleb Natapov wrote: I tried making also the other shared MSRs the same between guest and host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier has nothing to do. That saves about 400-500 cycles on inl_from_qemu. I do want to dig out my old Core 2 and see how the new test fares, but it really looks like your patch will be in 3.19. Please test on a wide variety of HW before the final decision. Yes, definitely. Also it would be nice to ask Intel what the expected overhead is. It is awesome if they manage to add EFER switching with non-measurable overhead, but also hard to believe :) So let's see what happens. Sneak preview: the result is definitely worth asking Intel about. I ran these benchmarks with a stock 3.16.6 KVM. Instead I patched kvm-unit-tests to set EFER.SCE in enable_nx. This makes it much simpler for others to reproduce the results. I only ran the inl_from_qemu test. Perf stat reports that the processor goes from 0.46 to 0.66 instructions per cycle, which is consistent with the improvement from 19k to 12k cycles per iteration.
Unpatched KVM-unit-tests:

 3,385,586,563 cycles                    # 3.189 GHz                     [83.25%]
 2,475,979,685 stalled-cycles-frontend   # 73.13% frontend cycles idle   [83.37%]
 2,083,556,270 stalled-cycles-backend    # 61.54% backend cycles idle    [66.71%]
 1,573,854,041 instructions              # 0.46 insns per cycle
                                         # 1.57 stalled cycles per insn  [83.20%]

   1.108486526 seconds time elapsed

Patched KVM-unit-tests:

 3,252,297,378 cycles                    # 3.147 GHz                     [83.32%]
 2,010,266,184 stalled-cycles-frontend   # 61.81% frontend cycles idle   [83.36%]
 1,560,371,769 stalled-cycles-backend    # 47.98% backend cycles idle    [66.51%]
 2,133,698,018 instructions              # 0.66 insns per cycle
                                         # 0.94 stalled cycles per insn  [83.45%]

   1.072395697 seconds time elapsed

Playing with other events shows that the unpatched benchmark has an awful load of TLB misses:

Unpatched:

        30,311 iTLB-loads
   464,641,844 dTLB-loads
    10,813,839 dTLB-load-misses   #     2.33% of all dTLB cache hits
    20,436,027 iTLB-load-misses   # 67421.16% of all iTLB cache hits

Patched:

     1,440,033 iTLB-loads
   640,970,836 dTLB-loads
     2,345,112 dTLB-load-misses   #  0.37% of all dTLB cache hits
       270,884 iTLB-load-misses   # 18.81% of all iTLB cache hits

This is 100% reproducible. The meaning of the numbers is clearer if you look up the raw event numbers in the Intel manuals:

- iTLB-loads is 85h/10h aka "perf -e r1085": "Number of cache load STLB [second-level TLB] hits. No page walk."
- iTLB-load-misses is 85h/01h aka r185: "Misses in all ITLB levels that cause page walks."
So for example event 85h/04h aka r485 ("Cycle PMH is busy with a walk.") and friends show that the unpatched KVM wastes about 0.1 seconds more than the patched KVM on page walks:

Unpatched:

    24,430,676 r449 (cycles on dTLB store miss page walks)
   196,017,693 r408 (cycles on dTLB load miss page walks)
   213,266,243 r485 (cycles on iTLB miss page walks)
   433,714,612 total

Patched:

    22,583,440 r449 (cycles on dTLB store miss page walks)
    40,452,018 r408 (cycles on dTLB load miss page walks)
     2,115,981 r485 (cycles on iTLB miss page walks)
    65,151,439 total

These 0.1 seconds probably are all on instructions that would have been fast, since the slow instructions responsible for the low IPC are the microcoded instructions including VMX and other privileged stuff. Similarly, BDh/20h counts STLB flushes, which are about 260k in unpatched KVM and 3k in patched KVM. Let's see where they come from:

Unpatched:

+  98.97%  qemu-kvm  [kernel.kallsyms]  [k] native_write_msr_safe
+   0.70%  qemu-kvm  [kernel.kallsyms]  [k] page_fault

It's expected that most TLB misses happen just before a page fault (there are also events to count how many TLB misses do result in a page fault, if you care about that), and thus are accounted to the first instruction of the exception handler. We do not know what causes second-level TLB _flushes_ but it's quite expected that you'll have a TLB miss after them and possibly a page fault. And anyway 98.97% of them coming from native_write_msr_safe is totally anomalous. A patched benchmark shows no second-level TLB flush occurs after a WRMSR:

+  72.41%  qemu-kvm  [kernel.kallsyms]  [k] page_fault
+   9.07%  qemu-kvm  [kvm_intel]        [k] vmx_flush_tlb
+   6.60%  qemu-kvm  [kernel.kallsyms]  [k] set_pte_vaddr_pud
+
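As a sanity check, the figures in this message hang together arithmetically (assuming perf's percentages are plain miss/load ratios and the walk totals are cycles at the ~3.15 GHz the runs reported):

```python
# iTLB miss/hit ratios from the perf output above
print(round(100 * 20_436_027 / 30_311, 2))    # the anomalous unpatched ratio
print(round(100 * 270_884 / 1_440_033, 2))    # the sane patched ratio

# gap between the two page-walk cycle totals, expressed in seconds
walk_delta = 433_714_612 - 65_151_439
print(round(walk_delta / 3.15e9, 2))          # roughly the "about 0.1 seconds"
```

The per-event walk cycles also sum exactly to the two totals (24,430,676 + 196,017,693 + 213,266,243 = 433,714,612 and 22,583,440 + 40,452,018 + 2,115,981 = 65,151,439), so the listing is internally consistent.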
Re: [PATCH 2/4] KVM: nSVM: propagate the NPF EXITINFO to the guest
On 09/02/2014 07:46 PM, Paolo Bonzini wrote:

	 */
	if (unlikely(real_gfn == UNMAPPED_GVA))
		goto error;
@@ -1974,10 +1974,28 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
 {
 	struct vcpu_svm *svm = to_svm(vcpu);

-	svm->vmcb->control.exit_code = SVM_EXIT_NPF;
-	svm->vmcb->control.exit_code_hi = 0;
-	svm->vmcb->control.exit_info_1 = fault->error_code;
-	svm->vmcb->control.exit_info_2 = fault->address;
+	/*
+	 * We can keep the value that the processor stored in the VMCB,
+	 * but make up something sensible if we hit the WARN.
+	 */
+	if (WARN_ON(svm->vmcb->control.exit_code != SVM_EXIT_NPF)) {
+		svm->vmcb->control.exit_code = SVM_EXIT_NPF;
+		svm->vmcb->control.exit_code_hi = 0;
+		svm->vmcb->control.exit_info_1 = (1ULL << 32);
+		svm->vmcb->control.exit_info_2 = fault->address;
+	}

It's been a while since I looked into this, but is an injected NPF exit always the result of a real NPF exit? I think so, but that's why I CCed you. :) It could always be the result of emulation into which L0 was tricked. I don't think it's a safe assumption. How about an io-port emulated on L1 but passed through to L2 by the nested hypervisor. On emulation of INS or OUTS, KVM would need to read/write to an L2 address space, It would need to read/write to *L1* (that's where the VMCB's IOIO map lies), which could result in a regular page fault injected into L1. Paolo maybe causing NPF faults to be injected. In this case an IOIO exit would cause an injected NPF exit for L1.
Re: [PATCH 2/5] KVM: x86: drop fpu_activate hook
On 08/18/2014 01:51 PM, Paolo Bonzini wrote:
Il 18/08/2014 12:26, Avi Kivity ha scritto:
On 08/18/2014 01:20 PM, Paolo Bonzini wrote:
Il 18/08/2014 11:50, Wanpeng Li ha scritto:

The fpu_activate hook was introduced by commit 6b52d186 (KVM: Activate fpu on clts); however, it currently has no user, so this patch drops it.

Reviewed-by: Yang Zhang
Signed-off-by: Wanpeng Li
---
 arch/x86/include/asm/kvm_host.h | 1 -
 arch/x86/kvm/svm.c              | 1 -
 arch/x86/kvm/vmx.c              | 1 -
 3 files changed, 3 deletions(-)

    diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
    index 5724601..b68f3e5 100644
    --- a/arch/x86/include/asm/kvm_host.h
    +++ b/arch/x86/include/asm/kvm_host.h
    @@ -710,7 +710,6 @@ struct kvm_x86_ops {
        void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
        unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
        void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
    -   void (*fpu_activate)(struct kvm_vcpu *vcpu);
        void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
        void (*tlb_flush)(struct kvm_vcpu *vcpu);
    diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
    index ddf7427..1f49c86 100644
    --- a/arch/x86/kvm/svm.c
    +++ b/arch/x86/kvm/svm.c
    @@ -4349,7 +4349,6 @@ static struct kvm_x86_ops svm_x86_ops = {
        .cache_reg = svm_cache_reg,
        .get_rflags = svm_get_rflags,
        .set_rflags = svm_set_rflags,
    -   .fpu_activate = svm_fpu_activate,
        .fpu_deactivate = svm_fpu_deactivate,
        .tlb_flush = svm_flush_tlb,
    diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
    index 71cbee5..2963303 100644
    --- a/arch/x86/kvm/vmx.c
    +++ b/arch/x86/kvm/vmx.c
    @@ -8896,7 +8896,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
        .cache_reg = vmx_cache_reg,
        .get_rflags = vmx_get_rflags,
        .set_rflags = vmx_set_rflags,
    -   .fpu_activate = vmx_fpu_activate,
        .fpu_deactivate = vmx_fpu_deactivate,
        .tlb_flush = vmx_flush_tlb,

Avi/Gleb, do you remember any particular reason for this?

IIRC (vaguely) if we expect the fpu to be used in the near future, we activate it eagerly so that we don't fault when it is used.
Prevents the sequence:

    guest user:    use fpu -> #NM
    host:          reflect #NM to guest
    guest kernel:  CLTS
    guest kernel:  switch fpu state -> #NM
    host:          switch fpu
    guest kernel:  switch fpu state (restarted)
    guest user:    use fpu (restarted)

Why was the user removed? Full-time eager fpu?

No, I mean any reason to keep the hooks. If there are no callers, I can't think of any. In the meanwhile I found it myself:

    commit 2d04a05bd7e93c13f13a82ac40de4065a99d069b
    Author: Avi Kivity
    Date:   Wed Apr 20 15:32:49 2011 +0300

        KVM: x86 emulator: emulate CLTS internally

        Avoid using ctxt->vcpu; we can do everything with ->get_cr() and
        ->set_cr().  A side effect is that we no longer activate the fpu on
        emulated CLTS; but that should be very rare.

        Signed-off-by: Avi Kivity

vmx_fpu_activate and svm_fpu_activate are still called on #NM and CLTS, but never from common code after the above patch. Activation on CLTS is currently VMX-only; I guess on AMD we could check the decode assists' CR_VALID bit and instruction length to detect CLTS.
Re: [PATCH] KVM: x86 emulator: emulate MOVNTDQ
On 07/11/2014 11:40 PM, Paolo Bonzini wrote:
Il 11/07/2014 22:05, Alex Williamson ha scritto:

Which will return 'true' for this whether I specify Aligned or not. If the standard convention is to make it explicit, I'm happy to add the extra flag, but I think we already #GP on unaligned as implemented here. Thanks,

We should still specify Aligned if the corresponding AVX instruction requires an aligned operand. ISTR that this is not the case for MOVNTDQ, so your patch is correct. I'll check the SDM more carefully next Monday.

The explicitly aligned/unaligned instructions have an A or a U to indicate this (e.g. MOVDQU = explicitly unaligned, MOVDQA = explicitly aligned, MOVNTDQ = default).
Re: [PATCH 4/4] kvm: Implement PEBS virtualization
On 06/24/2014 07:45 PM, Marcelo Tosatti wrote:
On Sun, Jun 22, 2014 at 09:02:25PM +0200, Andi Kleen wrote:

First, it's not sufficient to pin the debug store area, you also have to pin the guest page tables that are used to map the debug store. But even if you do that, as soon as the guest fork()s, it will create a new pgd which the host will be free to swap out. The processor can then attempt a PEBS store to an unmapped address which will fail, even though the guest is configured correctly.

That's a good point. You're right of course. The only way I can think around it would be to intercept CR3 writes while PEBS is active and always pin all the table pages leading to the PEBS buffer. That's slow, but should be only needed while PEBS is running.

-Andi

Suppose that can be done separately from the pinned spte patchset. And it requires accounting into mlock limits as well, as noted.

One set of pagetables per pinned virtual address leading down to the last translations is sufficient per-vcpu. Or 4, and use the CR3 exit filter to prevent vmexits among the last 4 LRU CR3 values.
Re: [PATCH 4/4] kvm: Implement PEBS virtualization
On 05/30/2014 04:12 AM, Andi Kleen wrote:
From: Andi Kleen

PEBS (Precise Event Based Sampling) profiling is very powerful, allowing improved sampling precision and much additional information, like address or TSX abort profiling. cycles:p and :pp use PEBS.

This patch enables PEBS profiling in KVM guests. PEBS writes profiling records to a virtual address in memory. Since the guest controls the virtual address space, the PEBS record is delivered directly to the guest buffer. We set up the PEBS state so that it works correctly. The CPU cannot handle any kind of fault during these guest writes. To avoid any problems with guest pages being swapped by the host, we pin the pages when the PEBS buffer is set up, by intercepting that MSR. Typically profilers only set up a single page, so pinning that is not a big problem. The pinning is currently limited to 17 pages (64K+1).

In theory the guest can change its own page tables after the PEBS setup. The host has no way to track that with EPT. But if a guest did that it could only crash itself. It's not expected that normal profilers do that.

Talking a bit with Gleb about this, I think this is impossible. First, it's not sufficient to pin the debug store area, you also have to pin the guest page tables that are used to map the debug store. But even if you do that, as soon as the guest fork()s, it will create a new pgd which the host will be free to swap out. The processor can then attempt a PEBS store to an unmapped address which will fail, even though the guest is configured correctly.
Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
On 06/19/2014 09:26 PM, Marcelo Tosatti wrote: On Thu, Jun 19, 2014 at 11:01:06AM +0300, Avi Kivity wrote: On 06/19/2014 02:12 AM, mtosa...@redhat.com wrote: Allow vcpus to pin spte translations by: 1) Creating a per-vcpu list of pinned ranges. 2) On mmu reload request: - Fault ranges. - Mark sptes with a pinned bit. - Mark shadow pages as pinned. 3) Then modify the following actions: - Page age => skip spte flush. - MMU notifiers => force mmu reload request (which kicks cpu out of guest mode). - GET_DIRTY_LOG => force mmu reload request. - SLAB shrinker => skip shadow page deletion. TDP-only. +int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu, + gfn_t base_gfn, unsigned long npages) +{ + struct kvm_pinned_page_range *p; + + mutex_lock(&vcpu->arch.pinned_mmu_mutex); + list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) { + if (p->base_gfn == base_gfn && p->npages == npages) { + mutex_unlock(&vcpu->arch.pinned_mmu_mutex); + return -EEXIST; + } + } + mutex_unlock(&vcpu->arch.pinned_mmu_mutex); + + if (vcpu->arch.nr_pinned_ranges >= + KVM_MAX_PER_VCPU_PINNED_RANGE) + return -ENOSPC; + + p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL); + if (!p) + return -ENOMEM; + + vcpu->arch.nr_pinned_ranges++; + + trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages); + + INIT_LIST_HEAD(&p->link); + p->base_gfn = base_gfn; + p->npages = npages; + mutex_lock(&vcpu->arch.pinned_mmu_mutex); + list_add(&p->link, &vcpu->arch.pinned_mmu_pages); + mutex_unlock(&vcpu->arch.pinned_mmu_mutex); + kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu); + + return 0; +} + What happens if ranges overlap (within a vcpu, cross-vcpu)? The page(s) are faulted multiple times if ranges overlap within a vcpu. I see no reason to disallow overlapping ranges. Do you? Not really. Just making sure nothing horrible happens. Or if a range overflows and wraps around 0? Pagefault fails on vm-entry -> KVM_REQ_TRIPLE_FAULT. Will double check for overflows to make sure. 
Will the loop terminate? Looks like you're limiting the number of ranges, but not the number of pages, so a guest can lock all of its memory. Yes. The page pinning at get_page time can also lock all of guest memory. I'm sure that can't be good. Maybe subject this pinning to the task mlock limit. + +/* + * Pin KVM MMU page translations. This guarantees, for valid + * addresses registered by kvm_mmu_register_pinned_range (valid address + * meaning address which posses sufficient information for fault to + * be resolved), valid translations exist while in guest mode and + * therefore no VM-exits due to faults will occur. + * + * Failure to instantiate pages will abort guest entry. + * + * Page frames should be pinned with get_page in advance. + * + * Pinning is not guaranteed while executing as L2 guest. Does this undermine security? PEBS writes should not be enabled when L2 guest is executing. What prevents L1 for setting up PEBS MSRs for L2? + list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) { + gfn_t gfn_offset; + + for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) { + gfn_t gfn = p->base_gfn + gfn_offset; + int r; + bool pinned = false; + + r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT, +PFERR_WRITE_MASK, false, +true, &pinned); + /* MMU notifier sequence window: retry */ + if (!r && !pinned) + kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu); + if (r) { + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + break; + } + + } + } + mutex_unlock(&vcpu->arch.pinned_mmu_mutex); +} + int kvm_mmu_load(struct kvm_vcpu *vcpu) { int r; @@ -3916,6 +4101,7 @@ goto out; /* set_cr3() should ensure TLB has been flushed */ vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa); + kvm_mmu_pin_pages(vcpu); out: return r; } I don't see where you unpin pages, so even if you limit the number of pinned pages, a guest can pin all of memory by iterating over all of memory and pinning it a chunk at a time. The caller should be
Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
On 06/19/2014 02:12 AM, mtosa...@redhat.com wrote: Allow vcpus to pin spte translations by: 1) Creating a per-vcpu list of pinned ranges. 2) On mmu reload request: - Fault ranges. - Mark sptes with a pinned bit. - Mark shadow pages as pinned. 3) Then modify the following actions: - Page age => skip spte flush. - MMU notifiers => force mmu reload request (which kicks cpu out of guest mode). - GET_DIRTY_LOG => force mmu reload request. - SLAB shrinker => skip shadow page deletion. TDP-only. +int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu, + gfn_t base_gfn, unsigned long npages) +{ + struct kvm_pinned_page_range *p; + + mutex_lock(&vcpu->arch.pinned_mmu_mutex); + list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) { + if (p->base_gfn == base_gfn && p->npages == npages) { + mutex_unlock(&vcpu->arch.pinned_mmu_mutex); + return -EEXIST; + } + } + mutex_unlock(&vcpu->arch.pinned_mmu_mutex); + + if (vcpu->arch.nr_pinned_ranges >= + KVM_MAX_PER_VCPU_PINNED_RANGE) + return -ENOSPC; + + p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL); + if (!p) + return -ENOMEM; + + vcpu->arch.nr_pinned_ranges++; + + trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages); + + INIT_LIST_HEAD(&p->link); + p->base_gfn = base_gfn; + p->npages = npages; + mutex_lock(&vcpu->arch.pinned_mmu_mutex); + list_add(&p->link, &vcpu->arch.pinned_mmu_pages); + mutex_unlock(&vcpu->arch.pinned_mmu_mutex); + kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu); + + return 0; +} + What happens if ranges overlap (within a vcpu, cross-vcpu)? Or if a range overflows and wraps around 0? Or if it does not refer to RAM? Looks like you're limiting the number of ranges, but not the number of pages, so a guest can lock all of its memory. + +/* + * Pin KVM MMU page translations. 
This guarantees, for valid + * addresses registered by kvm_mmu_register_pinned_range (valid address + * meaning address which posses sufficient information for fault to + * be resolved), valid translations exist while in guest mode and + * therefore no VM-exits due to faults will occur. + * + * Failure to instantiate pages will abort guest entry. + * + * Page frames should be pinned with get_page in advance. + * + * Pinning is not guaranteed while executing as L2 guest. Does this undermine security? + * + */ + +static void kvm_mmu_pin_pages(struct kvm_vcpu *vcpu) +{ + struct kvm_pinned_page_range *p; + + if (is_guest_mode(vcpu)) + return; + + if (!vcpu->arch.mmu.direct_map) + return; + + ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa)); + + mutex_lock(&vcpu->arch.pinned_mmu_mutex); Is the mutex actually needed? It seems it's only taken in vcpu context, so the vcpu mutex should be sufficient. + list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) { + gfn_t gfn_offset; + + for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) { + gfn_t gfn = p->base_gfn + gfn_offset; + int r; + bool pinned = false; + + r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT, +PFERR_WRITE_MASK, false, +true, &pinned); + /* MMU notifier sequence window: retry */ + if (!r && !pinned) + kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu); + if (r) { + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + break; + } + + } + } + mutex_unlock(&vcpu->arch.pinned_mmu_mutex); +} + int kvm_mmu_load(struct kvm_vcpu *vcpu) { int r; @@ -3916,6 +4101,7 @@ goto out; /* set_cr3() should ensure TLB has been flushed */ vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa); + kvm_mmu_pin_pages(vcpu); out: return r; } I don't see where you unpin pages, so even if you limit the number of pinned pages, a guest can pin all of memory by iterating over all of memory and pinning it a chunk at a time. You might try something similar to guest MTRR handling. 
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On 01/28/2014 01:27 AM, Benjamin Herrenschmidt wrote:
On Wed, 2014-01-22 at 17:29 +, Peter Maydell wrote:

Basically if it would be on real bus, get byte value that corresponds to phys_addr + 0 address place it into data[0], get byte value that corresponds to phys_addr + 1 address place it into data[1], etc.

This just isn't how real buses work.

Actually it can be :-) There is no "address + 1, address + 2". There is a single address for the memory transaction and a set of data on data lines and some separate size information. How the device at the far end of the bus chooses to respond to 32 bit accesses to address X versus 8 bit accesses to addresses X through X+3 is entirely its own business and unrelated to the CPU. However the bus has a definition of what byte lane is the lowest in address order. Byte order invariance is an important function of all busses. I think that trying to treat it any differently than an address ordered series of bytes is going to turn into a complete and inextricable mess.

I agree. The two options are: (address, byte array, length) and (address, value, word size, endianness). The first is the KVM ABI, the second is how MemoryRegions work. Both are valid, but the first is more general (it supports the 3-byte accesses sometimes generated on x86).

(It would be perfectly possible to have a device which when you read from address X as 32 bits returned 0x12345678, when you read from address X as 16 bits returned 0x9abc, returned 0x42 for an 8 bit read from X+1, and so on. Having byte reads from X..X+3 return values corresponding to parts of the 32 bit access is purely a convention.)

Right, it's possible. It's also stupid and not how most modern devices and busses work. Besides there is no reason why that can't be implemented with Victor's proposal anyway.

Right.
Re: [Qemu-devel] KVM and variable-endianness guest CPUs
On 01/22/2014 12:22 PM, Peter Maydell wrote: On 22 January 2014 05:39, Victor Kamensky wrote: Hi Guys, Christoffer and I had a bit heated chat :) on this subject last night. Christoffer, really appreciate your time! We did not really reach agreement during the chat and Christoffer asked me to follow up on this thread. Here it goes. Sorry, it is very long email. I don't believe we can assign any endianity to mmio.data[] byte array. I believe mmio.data[] and mmio.len acts just memcpy and that is all. As memcpy does not imply any endianity of underlying data mmio.data[] should not either. This email is about five times too long to be actually useful, but the major issue here is that the data being transferred is not just a bag of bytes. The data[] array plus the size field are being (mis)used to indicate that the memory transaction is one of: * an 8 bit access * a 16 bit access of some uint16_t value * a 32 bit access of some uint32_t value * a 64 bit access of some uint64_t value exactly as a CPU hardware bus would do. It's because the API is defined in this awkward way with a uint8_t[] array that we need to specify how both sides should go from the actual properties of the memory transaction (value and size) to filling in the array. That is not how x86 hardware works. Back when there was a bus, there were no address lines A0-A2; instead we had 8 byte enables BE0-BE7. A memory transaction placed the qword address on the address lines and asserted the byte enables for the appropriate byte, word, dword, or qword, shifted for the low order bits of the address. If you generated an unaligned access, the transaction was split into two, so an 8-byte write might appear as a 5-byte write followed by a 3-byte write. In fact, the two halves of the transaction might go to different devices, or one might go to a device and another to memory. PCI works the same way. 
Furthermore, device endianness is entirely irrelevant for deciding the properties of mmio.data[], because the thing we're modelling here is essentially the CPU->bus interface. In real hardware, the properties of individual devices on the bus are irrelevant to how the CPU's interface to the bus behaves, and similarly here the properties of emulated devices don't affect how KVM's interface to QEMU userspace needs to work.

MemoryRegion's 'endianness' field, incidentally, is a dreadful mess that we should get rid of. It is attempting to model the property that some buses/bridges have of doing byte-lane-swaps on data that passes through as a property of the device itself. It would be better if we modelled it properly, with container regions having possible byte-swapping and devices just being devices.

No, that is not what it is modelling. Suppose a little-endian CPU writes a dword 0x12345678 to address 0 of a device, and reads back a byte from address 0. What value do you read back? Some (most) devices will return 0x78, others will return 0x12. Other devices don't support mixed sizes at all, but many do. PCI configuration space is an example; it is common to read both Device ID and Vendor ID with a single 32-bit transaction, but you can also read them separately with two 16-bit transactions. Because PCI is little-endian, the Vendor ID at address 0 will be returned as the low word of the 32-bit read on a little-endian processor.

If you remove device endianness from memory regions, you have to pass the data as arrays of bytes (like the KVM interface) and let the device assemble words from those bytes itself, taking into consideration its own endianness. What MemoryRegion's endianness does is let the device declare its endianness to the API and let it do all the work.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:31 PM, Paolo Bonzini wrote:
Il 28/11/2013 12:23, Gleb Natapov ha scritto:

Unless what ? :)

Unless reader is scheduled out?

Yes. Or unless my brain is scheduled out in the middle of a sentence.

So we will have to disable preemption in a reader to prevent big latencies for a writer, no?

I don't think that's necessary. The writer itself could also be scheduled out, and the reader critical sections are really small. Let's wait for Zhang to try SRCU and report back.

I think readers have preemption disabled already in that context (irqfd writes).
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:22 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote: On 11/28/2013 01:02 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote: On 11/28/2013 12:11 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote: On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto: Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, &irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose guest does: 1: change msi interrupt by writing to pci register 2: read the pci register to flush the write 3: zero idt I am pretty certain that this code can get interrupt after step 2 on real HW, but I cannot tell if guest can rely on it to be delivered exactly after read instruction or it can be delayed by couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. Why not fix userspace to use KVM_SIGNAL_MSI instead? 
Shouldn't it work with old userspace too? Maybe I misunderstood your intent. Zhanghaoyu said that the problem mostly hurts in real-time telecom environment, so I propose how he can fix the problem in his specific environment. It will not fix older userspace obviously, but kernel fix will also require kernel update and usually updating userspace is easier. Isn't the latency due to interrupt migration causing long synchronize_rcu()s? How does KVM_SIGNAL_MSI help? If MSI is delivered using KVM_SIGNAL_MSI as opposite to via an entry in irq routing table changing MSI configuration should not cause update to irq routing table (not saying this is what happens with current QEMU, but theoretically there is not reason to update routing table in this case). I see. That pushes the problem to userspace, which uses traditional locking, so the problem disappears until qemu starts using rcu too to manage this. There is also irqfd, however. We could also do a KVM_UPDATE_IRQFD to change the payload it delivers, but that has exactly the same problems. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:02 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote: On 11/28/2013 12:11 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote: On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto: Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, &irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose guest does: 1: change msi interrupt by writing to pci register 2: read the pci register to flush the write 3: zero idt I am pretty certain that this code can get interrupt after step 2 on real HW, but I cannot tell if guest can rely on it to be delivered exactly after read instruction or it can be delayed by couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. Why not fix userspace to use KVM_SIGNAL_MSI instead? Shouldn't it work with old userspace too? Maybe I misunderstood your intent. 
Zhanghaoyu said that the problem mostly hurts in real-time telecom environment, so I propose how he can fix the problem in his specific environment. It will not fix older userspace obviously, but kernel fix will also require kernel update and usually updating userspace is easier.

Isn't the latency due to interrupt migration causing long synchronize_rcu()s? How does KVM_SIGNAL_MSI help?

The problem occurs with assigned devices too AFAICT.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:10 PM, Paolo Bonzini wrote:
Il 28/11/2013 12:09, Gleb Natapov ha scritto:

- if there are no callbacks, but there are readers, synchronize_srcu busy-loops for some time checking if the readers complete. After a while (20 us for synchronize_srcu, 120 us for synchronize_srcu_expedited) it gives up and starts using a workqueue to poll every millisecond. This should never happen unless

Unless what ? :)

Unless reader is scheduled out?

Yes. Or unless my brain is scheduled out in the middle of a sentence.

You need a grace period. Or just sleep().
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 12:40 PM, Paolo Bonzini wrote:
Il 28/11/2013 11:16, Avi Kivity ha scritto:

The QRCU I linked would work great latency-wise (it has roughly the same latency of an rwsem but readers are lock-free). However, the locked operations in the read path would hurt because of cache misses, so it's not good either.

I guess srcu would work. Do you know what's the typical write-side latency there? (in terms of what it waits for, not nanoseconds).

If there's no concurrent reader, it's zero if I read the code right. Otherwise it depends:

- if there are many callbacks, only 10 of them are processed per millisecond. But unless there are concurrent synchronize_srcu calls there should not be any callback at all. If all VCPUs were to furiously change the MSIs, the latency could go up to #vcpu/10 milliseconds.

- if there are no callbacks, but there are readers, synchronize_srcu busy-loops for some time checking if the readers complete. After a while (20 us for synchronize_srcu, 120 us for synchronize_srcu_expedited) it gives up and starts using a workqueue to poll every millisecond. This should never happen unless

So, given the very restricted usage this SRCU would have, we probably can expect synchronize_srcu_expedited to finish its job in the busy-looping phase, and 120 us should be the expected maximum latency---more likely to be an order of magnitude smaller, and in very rare cases higher.

Looks like it's a good fit then.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 11:53 AM, Paolo Bonzini wrote: On 28/11/2013 10:49, Avi Kivity wrote: Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. call_rcu() has the problem of rate limiting, too. It wasn't such a great idea, I think. The QRCU I linked would work great latency-wise (it has roughly the same latency of an rwsem but readers are lock-free). However, the locked operations in the read path would hurt because of cache misses, so it's not good either. I guess srcu would work. Do you know what's the typical write-side latency there? (in terms of what it waits for, not nanoseconds).
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 12:11 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote: On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: On 28/11/2013 07:27, Zhanghaoyu (A) wrote: Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, &irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we use call_rcu() (temporarily setting aside the problem that Gleb pointed out) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose the guest does: 1: change msi interrupt by writing to pci register 2: read the pci register to flush the write 3: zero idt I am pretty certain that this code can get an interrupt after step 2 on real HW, but I cannot tell if the guest can rely on it being delivered exactly after the read instruction, or whether it can be delayed by a couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. Why not fix userspace to use KVM_SIGNAL_MSI instead? Shouldn't it work with old userspace too? Maybe I misunderstood your intent. 
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: On 28/11/2013 07:27, Zhanghaoyu (A) wrote: Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, &irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we use call_rcu() (temporarily setting aside the problem that Gleb pointed out) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose the guest does: 1: change msi interrupt by writing to pci register 2: read the pci register to flush the write 3: zero idt I am pretty certain that this code can get an interrupt after step 2 on real HW, but I cannot tell if the guest can rely on it being delivered exactly after the read instruction, or whether it can be delayed by a couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:28 PM, Paolo Bonzini wrote: On 26/11/2013 17:24, Gleb Natapov wrote: VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, &irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now: disable_irq() VCPU writes to routing table e = entry from IRQ routing table kvm_set_msi_irq(e, &irq); kvm_irq_delivery_to_apic_fast(); kvm_irq_routing_update(kvm, new); synchronize_rcu() VCPU resumes execution enable_irq() receive stale irq Adding a "disable/enable IRQs" looks like a relatively big change. But perhaps it's not for some reason I'm missing. Those are guest operations, which may not be there at all.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:24 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: "the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update"). With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use an outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, &irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now: disable_irq() VCPU writes to routing table e = entry from IRQ routing table kvm_set_msi_irq(e, &irq); kvm_irq_delivery_to_apic_fast(); kvm_irq_routing_update(kvm, new); synchronize_rcu() VCPU resumes execution enable_irq() receive stale irq Suppose the guest did not disable_irq() and enable_irq(), but instead had a pci read where you have the enable_irq(). 
After the read you cannot have a stale irq (assuming the read flushes the irq all the way to the APIC).
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote: On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote: On 11/26/2013 05:58 PM, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. I'm not arguing with that, but a minor comment below: (BTW, PCI memory writes are posted, but configuration writes are not). MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. FYI, a PCI read flushes the interrupt itself in, too. I guess that kills the optimization then. Maybe you can do qrcu, whatever that is.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:58 PM, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. (BTW, PCI memory writes are posted, but configuration writes are not). MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. The fact that Linux does interrupt migration from within the interrupt handler also shows that someone else believes that it is the only safe place to do it.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:28 PM, Paolo Bonzini wrote: On 26/11/2013 16:25, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why?
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:20 PM, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: "the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update"). With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use an outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, &irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. 
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:03 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote: On 11/26/2013 04:46 PM, Paolo Bonzini wrote: On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now: mutex_lock(&kvm->irq_lock); old = kvm->irq_routing; kvm_irq_routing_update(kvm, new); mutex_unlock(&kvm->irq_lock); synchronize_rcu(); kfree(old); return 0; Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: "the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update"). With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use an outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Consider this guest code: write msi entry, directing the interrupt away from this vcpu; nop; memset(&idt, 0, sizeof(idt)); Currently, this code will never trigger a triple fault. With the change to call_rcu(), it may. Now it may be that the guest does not expect this to work (PCI writes are posted, and interrupts can be delayed indefinitely by the pci fabric), but we don't know if there's a path that guarantees the guest something that we're taking away with this change. 
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 04:46 PM, Paolo Bonzini wrote: On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now: mutex_lock(&kvm->irq_lock); old = kvm->irq_routing; kvm_irq_routing_update(kvm, new); mutex_unlock(&kvm->irq_lock); synchronize_rcu(); kfree(old); return 0; Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: "the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update"). With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). It's another question whether the hardware provides the same guarantee. If you eliminate the synchronize_rcu, new interrupts would see the new routing table, while interrupts already in flight will get a dangling pointer. Sure, if you drop the synchronize_rcu(), you have to add call_rcu().
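The synchronize_rcu() vs. call_rcu() trade-off discussed above can be sketched as a userspace analog: the writer publishes the new table immediately, and the old one is queued for freeing after a grace period rather than freed synchronously. Everything here (routing_table, call_rcu_analog, grace_period_elapsed) is an illustrative stand-in, not the kernel's rcu_assign_pointer()/call_rcu() API.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Userspace sketch of deferred reclamation a la call_rcu(). */

struct routing_table { int nr_entries; };

struct rcu_head {
    struct rcu_head *next;
    void (*func)(struct rcu_head *);
};

static struct routing_table *_Atomic current_table;
static struct rcu_head *pending;         /* callbacks awaiting a grace period */
static int frees_done;

static void call_rcu_analog(struct rcu_head *head, void (*func)(struct rcu_head *))
{
    head->func = func;
    head->next = pending;
    pending = head;                      /* deferred, not freed synchronously */
}

/* Invoked once all pre-existing readers are known to have finished. */
static void grace_period_elapsed(void)
{
    while (pending) {
        struct rcu_head *h = pending;
        pending = h->next;
        h->func(h);
    }
}

struct old_table_box { struct rcu_head rcu; struct routing_table *table; };

static void free_table_cb(struct rcu_head *h)
{
    /* rcu is the first member, so the cast is a poor man's container_of */
    struct old_table_box *box = (struct old_table_box *)h;
    free(box->table);
    free(box);
    frees_done++;
}

static void update_routing_table(struct routing_table *next_table)
{
    struct routing_table *old = atomic_exchange(&current_table, next_table);
    if (old) {
        struct old_table_box *box = malloc(sizeof(*box));
        box->table = old;
        call_rcu_analog(&box->rcu, free_table_cb);
    }
}
```

The point of the thread is exactly what this makes visible: readers that loaded the old pointer before the exchange keep using it safely until the grace period, but they may deliver to a stale destination after the writer has already returned.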
Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong wrote: > > On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti wrote: I'm not really following, but note that parent_pte predates EPT (and the use of rcu in kvm), so all the complexity that results from trying to pack as many list entries as possible into a cache line can be dropped. Most setups now would have exactly one list entry, which is handled specially anyway. Alternatively, the trick of storing multiple entries in one list entry can be moved to generic code; it may be useful to others.
Re: [PATCH] KVM: fix sil/dil/bpl/spl in the mod/rm fields
On Thu, May 30, 2013 at 7:34 PM, Paolo Bonzini wrote: > On 30/05/2013 17:34, Paolo Bonzini wrote: >> On 30/05/2013 16:35, Paolo Bonzini wrote: >>> The x86-64 extended low-byte registers were fetched correctly from reg, >>> but not from mod/rm. >>> >>> This fixes another bug in the boot of RHEL5.9 64-bit, but it is still >>> not enough. >> >> Well, it is enough but it takes 2 minutes to reach the point where >> hardware virtualization is used. It is doing a lot of stuff in >> emulation mode because FS and GS have leftovers from the A20 test: >> >> FS = 9300 DPL=0 DS16 [-WA] >> GS = 0000 9300 DPL=0 DS16 [-WA] >> >> 0x000113be: in $0x92,%al >> 0x000113c0: or $0x2,%al >> 0x000113c2: out %al,$0x92 >> 0x000113c4: xor %ax,%ax >> 0x000113c6: mov %ax,%fs >> 0x000113c8: dec %ax >> 0x000113c9: mov %ax,%gs >> 0x000113cb: inc %ax >> 0x000113cc: mov %ax,%fs:0x200 >> 0x000113d0: cmp %gs:0x210,%ax >> 0x000113d5: je 0x113cb >> >> The DPL < RPL test fails. Any ideas? Should we introduce a new >> intermediate value for emulate_invalid_guest_state (0=none, 1=some, 2=full)? > > One idea could be to replace invalid descriptors with NULL ones. Then > you can intercept this in the #GP handler and trigger emulation for that > instruction only. Won't work, vmx won't let you enter in such a configuration. Maybe you can detect the exact code sequence (%eip, some instructions, register state) and clear %fs and %gs.
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Tue, Feb 26, 2013 at 10:12 AM, Gleb Natapov wrote: >> > But I do not see how to implement it efficiently without an interface change. The > idea is basically to register an ACK notifier for the RTC interrupt but terminate > it in the kernel instead of reporting to userspace. The kernel should know > somehow what GSI it should register the ACK notifier for. Right, my idea does not help at all. > There are a couple > of solutions: > 1. New interface to tell what GSI to track. > - Cons: KVM_IRQ_LINE_STATUS will stop working for old QEMUs that do > not use it. New special-purpose API. > 2. Register an ACK notifier only for the RTC. > - Cons: KVM_IRQ_LINE_STATUS will work only for the RTC. This is the only > thing that uses it currently, but such a change will make it a one-trick > pony officially. > 3. Register ACK notifiers for all edge interrupts in the IOAPIC. > - Cons: with APICv an edge interrupt does not require EOI to do a vmexit. > Registering ACK notifiers for all of them will make them all > do vmexits. > 4. Combination of 1 and 3. Register ACK notifiers for all edge interrupts > in the IOAPIC and provide an API to drop unneeded notifiers. > - Cons: New special-purpose API. > 5. Make KVM_IRQ_LINE_STATUS register an ack notifier, document it as more expensive than KVM_IRQ_LINE, and change qemu to use KVM_IRQ_LINE when it is sufficient. Pros: fully backwards compatible. Cons: to realize full performance, need to patch qemu.
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
On Mon, Feb 25, 2013 at 7:43 PM, Gleb Natapov wrote: > >> > 3. Do not report KVM_IRQ_LINE_STATUS capability and move RTC to use EOI >> > notifiers for interrupt reinjection. This requires us to add interface >> > for reporting EOI to userspace. This is not in the scope of this >> > patchset. Cons: need to introduce new interface (and the one that will >> > not work on AMD BTW) >> > >> > Other ideas? >> >> 1. inject RTC interrupt >> 2. not see EOI >> 3. inject RTC interrupt >> 4. due to 2, report coalesced >> > That's the 3 in my list, no? No, this idea doesn't involve user interface changes. We still report through KVM_IRQ_LINE_STATUS or whatever it's called.
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
>> > I see a couple of possible solutions: > 1. Do what Avi said. Make KVM_IRQ_LINE_STATUS be synchronous. Cons: > current QEMU uses KVM_IRQ_LINE_STATUS always, which means that it > will be slow on newer kernels You could backport the qemu change, verify that it builds, and push it to stable branches. It's hardly ideal, but if nothing else comes up then it's a solution. > 2. Make KVM_IRQ_LINE_STATUS report coalescing only when the vcpu is not > running during injection. This assumes that if the vcpu is running and does > not process the interrupt it is the guest's fault, and the same can happen on real > HW too. Coalescing when the vcpu is not running, though, is the result of CPU > overcommit and should be reported. Cons: the interface definition is kind of > murky. You still have a window where the vcpu is scheduled out with guest interrupts disabled, then scheduled back in, and before it manages to handle the interrupt, the second one hits. It's worth testing though. > 3. Do not report the KVM_IRQ_LINE_STATUS capability and move the RTC to use EOI > notifiers for interrupt reinjection. This requires us to add an interface > for reporting EOI to userspace. This is not in the scope of this > patchset. Cons: need to introduce a new interface (and one that will > not work on AMD, BTW) > > Other ideas? 1. inject RTC interrupt 2. not see EOI 3. inject RTC interrupt 4. due to 2, report coalesced AMD can still use IRR test-and-set.
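The IRR test-and-set idea mentioned above (an injection is coalesced when the vector's IRR bit is already set, i.e. the previous instance has not been EOI'd yet) can be sketched like this. The flat bitmap and function names are illustrative, not KVM's actual LAPIC layout.

```c
#include <stdint.h>

/* Sketch: 256 interrupt vectors modeled as a flat 256-bit IRR bitmap. */
static uint32_t irr[8];

enum { COALESCED = 0, DELIVERED = 1 };

/* Injection: set the vector's IRR bit; if it was already set, the
 * previous interrupt has not been acked yet, so this one coalesces. */
static int apic_test_and_set_irr(int vector)
{
    uint32_t mask = 1u << (vector & 31);
    uint32_t *word = &irr[vector >> 5];
    int was_set = (*word & mask) != 0;
    *word |= mask;
    return was_set ? COALESCED : DELIVERED;
}

/* Guest EOI: clear the IRR bit so the next injection is "delivered". */
static void apic_eoi(int vector)
{
    irr[vector >> 5] &= ~(1u << (vector & 31));
}
```

This matches the RTC re-injection use case in the thread: inject, see no EOI, inject again, and report the second injection as coalesced.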
Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On Sun, Feb 24, 2013 at 9:15 PM, Jan Kiszka wrote: >> >> They all need consistency checks, otherwise userspace or the guest can >> inject inconsistent values and perhaps exploit the host. > > To my understanding, the hardware does this for us: If we try to enter > the guest (L1, L2) with invalid CRx bits set or cleared, we get an > error, at least on Intel. But I bet AMD does so as well - and, if not, > it would make this test specific again. The point is that kvm code may depend on this consistency, so we fail before the hardware has made the check.
Re: [PATCH v4 2/2] KVM: VMX: Add Posted Interrupt supporting
I didn't really follow, but is the root cause the need to keep track of interrupt coalescing? If so we can recommend that users use KVM_IRQ_LINE when coalescing is unneeded, and move interrupt injection with irq coalescing support to vcpu context. It's not pleasant to cause a performance regression, so it should be a last resort (or maybe do both if it helps). On Sun, Feb 24, 2013 at 8:08 PM, Marcelo Tosatti wrote: > On Sun, Feb 24, 2013 at 04:19:17PM +0200, Gleb Natapov wrote: >> On Sun, Feb 24, 2013 at 01:55:07PM +, Zhang, Yang Z wrote: >> > > I do not think it fixes it. There is no guaranty that IPI will be >> > > processed by remote cpu while sending cpu is still in locked section, so >> > > the same race may happen regardless. As you say above there are 3 >> > > contexts, but only two use locks. >> > See following logic, I think the problem you mentioned will not happened >> > with current logic. >> > >> > get lock >> > if test_pir (this will ensure there is no in flight IPI for same >> > interrupt. Since we are taking the lock now, no IPI will be sent before we >> > release the lock and no pir->irr is performed by hardware for same >> > interrupt.) > > Does the PIR IPI interrupt returns synchronously _after_ PIR->IRR transfer > is made? Don't think so. > > PIR IPI interrupt returns after remote CPU has acked interrupt receival, > not after remote CPU has acked _and_ performed PIR->IRR transfer. > > Right? (yes, right, see step 4. below). > > Should try to make it easier to drop the lock later, so depend on it as > little as possible (also please document what it protects in detail). > > Also note: > > " > 3. The processor clears the outstanding-notification bit in the > posted-interrupt descriptor. This is done atomically > so as to leave the remainder of the descriptor unmodified (e.g., with a > locked AND operation). > 4. 
The processor writes zero to the EOI register in the local APIC; this > dismisses the interrupt with the posted-interrupt > notification vector from the local APIC. > 5. The logical processor performs a logical-OR of PIR into VIRR and > clears PIR. No other agent can read or write a > PIR bit (or group of bits) between the time it is read (to determine > what to OR into VIRR) and when it is cleared. > " > > So checking the ON bit does not mean the HW has finished performing the > PIR->VIRR transfer (which means the ON bit can only be used as an indication > of whether to send the interrupt, not whether the PIR->VIRR transfer is > finished). > > So it's fine to do > > -> atomic set pir > -> if (atomic_test_and_set(PIR ON bit)) > -> send IPI > > But HW can transfer two distinct bits, set by different serialized instances > of kvm_set_irq (protected with a lock), in a single PIR->IRR pass. > >> I do not see where those assumptions are coming from. Testing pir does >> not guarantee that the IPI is not processed by the VCPU right now. >> >> iothread: vcpu: >> send_irq() >> lock(pir) >> check pir and irr >> set pir >> send IPI (*) >> unlock(pir) >> >> send_irq() >> lock(pir) >> receive IPI (*) >> atomic { >> pir_tmp = pir >> pir = 0 >> } >> check pir and irr >> irr &= pir_tmp >> set pir >> send IPI >> unlock(pir) >> >> At this point both pir and irr are set and the interrupt may be coalesced, >> but it is reported as delivered. > > s/"set pir"/"injected = !t_a_s(pir)"/ >> So what prevents the scenario above from happening? >> >> -- >> Gleb.
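The "injected = !test_and_set(ON bit)" discipline from the exchange above can be sketched as follows: every injector sets its vector's PIR bit, but only the injector that flips the outstanding-notification (ON) bit from clear to set sends the notification IPI. The struct layout and names are illustrative, not the architectural posted-interrupt descriptor.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of a posted-interrupt descriptor; illustrative layout only. */
struct posted_irq_desc {
    _Atomic uint32_t pir[8];   /* posted-interrupt requests, 256 bits */
    atomic_flag on;            /* outstanding-notification bit */
};

static struct posted_irq_desc g_pid = { .on = ATOMIC_FLAG_INIT };
static int ipis_sent;

static void send_notification_ipi(void)
{
    ipis_sent++;               /* stand-in for the real notification IPI */
}

static void post_interrupt(struct posted_irq_desc *pid, int vector)
{
    /* Always record the vector in PIR... */
    atomic_fetch_or(&pid->pir[vector >> 5], 1u << (vector & 31));

    /* ...but only the thread that wins the ON-bit race sends the IPI;
     * concurrent injectors piggyback on the pending notification. */
    if (!atomic_flag_test_and_set(&pid->on))
        send_notification_ipi();
}
```

This is the cheap path Marcelo's quote is about: a second injection before the CPU clears ON posts its bit without a second IPI, and the single PIR->VIRR pass picks up both bits.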
Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On Sun, Feb 24, 2013 at 12:49 PM, Jan Kiszka wrote: > On 2013-02-24 11:11, Avi Kivity wrote: >> On Sun, Feb 24, 2013 at 11:40 AM, Jan Kiszka wrote: >>>>>> We have the same problem in KVM_SET_SREGS. >>>>> >>>>> I don't see the problem. kvm_arch_vcpu_ioctl_set_sregs open-codes the >>>>> state update, not applying any transition checks. >>>> >>>> That's the problem. We have this open coding in three different >>>> places (KVM_SET_SREGS, nvmx, nsvm). >>>> >>>> It's not as if vmx_set_cr0() is defined as "kvm_set_cr0() without the >>>> transition checks". >>> >>> ...and without mmu updates. The latter is done via or after the closing >>> cr3 update. Interestingly, nsvm does not perform kvm_set_cr3 on vmexit >>> when in npt mode. Seems things aren't that regular. >> >> We do want the mmu updates. Of course they can't be attached to >> kvm_set_cr0_without_the_checks() since there are cross-register >> dependencies. >> >> Option 1 is kvm_set_multiple_cr(). This does the checks and updates, >> but only after all registers are updated. >> Option 2 is kvm_begin_cr_transaction()/kvm_commit_cr_transaction(). >> Prettier and more flexible, but more complicated to implement. >> > The only thing that these three use cases truly have in common is the > closing kvm_mmu_reset_context. Maybe nvmx and nsvm can share a bit more. > But let's get nVMX right first, then think about sharing. They all need consistency checks, otherwise userspace or the guest can inject inconsistent values and perhaps exploit the host.
Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On Sun, Feb 24, 2013 at 11:40 AM, Jan Kiszka wrote: We have the same problem in KVM_SET_SREGS. >>> >>> I don't see the problem. kvm_arch_vcpu_ioctl_set_sregs open-codes the >>> state update, not applying any transition checks. >> >> That's the problem. We have this open coding in three different >> places (KVM_SET_SREGS, nvmx, nsvm). >> >> It's not as if vmx_set_cr0() is defined as "kvm_set_cr0() without the >> transition checks". > > ...and without mmu updates. The latter is done via or after the closing > cr3 update. Interestingly, nsvm does not perform kvm_set_cr3 on vmexit > when in npt mode. Seems things aren't that regular. We do want the mmu updates. Of course they can't be attached to kvm_set_cr0_without_the_checks() since there are cross-register dependencies. Option 1 is kvm_set_multiple_cr(). This does the checks and updates, but only after all registers are updated. Option 2 is kvm_begin_cr_transaction()/kvm_commit_cr_transaction(). Prettier and more flexible, but more complicated to implement.
Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On Sun, Feb 24, 2013 at 11:01 AM, Jan Kiszka wrote: > On 2013-02-24 09:56, Avi Kivity wrote: >> On Sat, Feb 23, 2013 at 11:57 PM, Jan Kiszka wrote: >>> On 2013-02-23 22:45, Nadav Har'El wrote: >>>> On Sat, Feb 23, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Replace >>>> kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state": >>>>> -kvm_set_cr0(vcpu, vmcs12->host_cr0); >>>>> +vmx_set_cr0(vcpu, vmcs12->host_cr0); >>>> >>>> I don't remember now why I did this (and I'm not looking at the code), >>>> but this you'll need to really test carefully, including >>>> shadow-on-shadow mode (ept=0 in L0), to verify you're not missing any >>>> important side-effect of kvm_set_cr0. >>>> >>>> Also, if I remember correctly, during nVMX's review, Avi Kivity asked >>>> in several places that when I called vmx_set_cr0, I should instead call >>>> kvm_set_cr0(), because it does some extra stuff and does some extra >>>> checks. Hmm, see, see this: >>>> http://markmail.org/message/hhidqyhbo2mrgxxc >>>> >>>> where Avi asked for the reverse patch you're attempting now. >>> >>> At least, kvm_set_cr0 can't be used as it assumes an otherwise >>> consistent guest state and an explicitly initiated transition - which is >>> naturally not the case while emulating a vmexit. >> >> We have the same problem in KVM_SET_SREGS. > > I don't see the problem. kvm_arch_vcpu_ioctl_set_sregs open-codes the > state update, not applying any transition checks. That's the problem. We have this open coding in three different places (KVM_SET_SREGS, nvmx, nsvm). It's not as if vmx_set_cr0() is defined as "kvm_set_cr0() without the transition checks".
Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On Sat, Feb 23, 2013 at 11:57 PM, Jan Kiszka wrote: > On 2013-02-23 22:45, Nadav Har'El wrote: >> On Sat, Feb 23, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Replace >> kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state": >>> -kvm_set_cr0(vcpu, vmcs12->host_cr0); >>> +vmx_set_cr0(vcpu, vmcs12->host_cr0); >> >> I don't remember now why I did this (and I'm not looking at the code), >> but this you'll need to really test carefully, including >> shadow-on-shadow mode (ept=0 in L0), to verify you're not missing any >> important side-effect of kvm_set_cr0. >> >> Also, if I remember correctly, during nVMX's review, Avi Kivity asked >> in several places that when I called vmx_set_cr0, I should instead call >> kvm_set_cr0(), because it does some extra stuff and does some extra >> checks. Hmm, see, see this: >> http://markmail.org/message/hhidqyhbo2mrgxxc >> >> where Avi asked for the reverse patch you're attempting now. > > At least, kvm_set_cr0 can't be used as it assumes an otherwise > consistent guest state and an explicitly initiated transition - which is > naturally not the case while emulating a vmexit. We have the same problem in KVM_SET_SREGS.
Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interrupt on vmexit
On Thu, Feb 21, 2013 at 10:58 AM, Zhang, Yang Z wrote: > Thanks. Here is code after changing, please review it: > > asm( > "mov %0, %%" _ASM_DX " \n\t" > #ifdef CONFIG_X86_64 > "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t" > "and $0xfff0, %%" _ASM_SP " \n\t" > "mov %%ss, %%" _ASM_AX " \n\t" > "push %%" _ASM_AX " \n\t" > "push %%" _ASM_BX " \n\t" > #endif > "pushf \n\t" > "orl $0x200, (%%" _ASM_SP ") \n\t" > __ASM_SIZE(push) " %c[cs] \n\t" > "call *%% " _ASM_DX " \n\t" > : : "m"(entry), [cs]"i"(__KERNEL_CS) : > #ifdef CONFIG_X86_64 > "rax", "rbx", "rdx" > #else > "edx" > #endif > ); For cleanliness, you can provide more parameters via the constraints: "m"(entry) -> [entry]"r"(entry) (and then use "call *[entry]") use "push %c[ss]" for ss and [sp]"=&r"(tmp) instead of rdx/edx as the temporary for rsp
Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interrupt on vmexit
On Wed, Feb 20, 2013 at 3:10 PM, Zhang, Yang Z wrote: >>>> push %%cs >>> "push %%cs" is invalid in x86_64. >> >> Oops. 'push[lq] $__KERNEL_CS' then. > Is this right? Just copy it from other file. > > #define __STR(X) #X > #define STR(X) __STR(X) > > #ifdef CONFIG_X86_64 > "pushq $"STR(__KERNEL_CS)" \n\t" > #else > "pushl $"STR(__KERNEL_CS)" \n\t" > #endif > > #undef STR > #undef __STR > Use __ASM_SIZE and an immediate operand for __KERNEL_CS: asm ( ... : : [cs]"i"(__KERNEL_CS) ); and the code will be cleaner.
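The two approaches under discussion can be contrasted in a standalone snippet. The stringification macro bakes the selector into the asm template as text at preprocessing time; the suggested alternative passes it as an "i" (immediate) constraint instead. This is only an illustration: SELECTOR is a stand-in value, not the real __KERNEL_CS (which lives in asm/segment.h), and the asm form is shown in a comment rather than executed:

```c
#include <string.h>

/* The two-level macro is needed so the argument is macro-expanded
 * before being stringified. */
#define __STR(X) #X
#define STR(X)   __STR(X)

#define SELECTOR 0x10   /* stand-in for __KERNEL_CS, chosen for illustration */

/* With the STR() trick, the selector becomes literal text pasted into
 * the asm template string: */
static const char *push_insn = "pushl $" STR(SELECTOR) " \n\t";

/* The cleaner alternative avoids string pasting entirely by handing the
 * constant to the compiler as an immediate operand, roughly:
 *
 *   asm(__ASM_SIZE(push) " $%c[cs] \n\t" ...
 *       : : [cs]"i"(__KERNEL_CS));
 *
 * where the %c output modifier emits the bare constant. No STR/__STR
 * machinery, and no #undef cleanup afterwards. */
```

The observable difference is purely in how the template string is built; the generated instruction is the same push of a constant selector.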
Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interrupt on vmexit
On Wed, Feb 20, 2013 at 4:46 AM, Zhang, Yang Z wrote: >>> >>> +static void vmx_handle_external_intr(struct kvm_vcpu *vcpu) +{ + >>> u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); + + /* + >>>* If external interrupt exists, IF bit is set in rflags/eflags on >>> the +* interrupt stack frame, and interrupt will be enabled on >>> a return +* from interrupt handler. +*/ + if >>> ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK)) + >>> == (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) { >>> + unsigned int vector; + unsigned long >>> entry; + gate_desc *desc; + struct vcpu_vmx >>> *vmx = to_vmx(vcpu); + + vector = exit_intr_info & >>> INTR_INFO_VECTOR_MASK; +#ifdef CONFIG_X86_64 + desc = >>> (void *)vmx->host_idt_base + vector * 16; +#else + desc = >>> (void *)vmx->host_idt_base + vector * 8; +#endif + + >>> entry = gate_offset(*desc); + asm( + >>> "mov %0, %%" _ASM_DX " \n\t" +#ifdef CONFIG_X86_64 + >>> "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t" + >>> "and $0xfff0, %%" _ASM_SP " \n\t" + >>> "mov %%ss, %%" _ASM_AX " \n\t" + "push %%" >>> _ASM_AX " \n\t" + "push %%" _ASM_BX " \n\t" >>> +#endif >> >> Are we sure no interrupts are using the IST feature? I guess it's unlikely. > Linux uses IST for NMI, stack fault, machine-check, double fault and debug > interrupt . No external interrupt will use it. > This feature is only for external interrupt. So we don't need to check it > here. Ok, thanks for checking. > >> >>> + "pushf \n\t" >>> + "pop %%" _ASM_AX " \n\t" >>> + "or $0x200, %%" _ASM_AX " \n\t" >>> + "push %%" _ASM_AX " \n\t" >> >> Can simplify to pushf; orl $0x200, %%rsp. > Sure. > >>> + "mov %%cs, %%" _ASM_AX " \n\t" >>> + "push %%" _ASM_AX " \n\t" >> >> push %%cs > "push %%cs" is invalid in x86_64. Oops. 'push[lq] $__KERNEL_CS' then. > >>> + "push intr_return \n\t" >> >> push $1f. Or even combine with the next instruction, and call %rdx. > Which is faster? jmp or call? 
Actually it doesn't matter, the processor is clever enough to minimize the difference. But the code is simpler and shorter with 'call'.
Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interrupt on vmexit
On Tue, Feb 19, 2013 at 3:39 PM, Yang Zhang wrote: > From: Yang Zhang > > The "acknowledge interrupt on exit" feature controls processor behavior > for external interrupt acknowledgement. When this control is set, the > processor acknowledges the interrupt controller to acquire the > interrupt vector on VM exit. > > After enabling this feature, an interrupt that arrives while the target cpu is > running in vmx non-root mode will be handled by the vmx handler instead of the > handler in the idt. Currently, the vmx handler only fakes an interrupt stack and > jumps to the idt table to let the real handler handle it. Later, we will recognize > the interrupt and only deliver interrupts that do not belong to the current vcpu > through the idt table. > Interrupts that belong to the current vcpu will be handled inside the vmx handler. > This will reduce KVM's interrupt handling cost. > > Also, the interrupt enable logic changes when this feature is turned on: > Before this patch, the hypervisor called local_irq_enable() to enable interrupts > directly. Now the IF bit is set on the interrupt stack frame, and interrupts will be > enabled on a return from the interrupt handler if an external interrupt exists. If > there is no external interrupt, local_irq_enable() is still called. > > Refer to Intel SDM volume 3, chapter 33.2. > > +static void vmx_handle_external_intr(struct kvm_vcpu *vcpu) > +{ > + u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); > + > + /* > +* If external interrupt exists, IF bit is set in rflags/eflags on the > +* interrupt stack frame, and interrupt will be enabled on a return > +* from interrupt handler. 
> +*/ > + if ((exit_intr_info & (INTR_INFO_VALID_MASK | > INTR_INFO_INTR_TYPE_MASK)) > + == (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) { > + unsigned int vector; > + unsigned long entry; > + gate_desc *desc; > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + > + vector = exit_intr_info & INTR_INFO_VECTOR_MASK; > +#ifdef CONFIG_X86_64 > + desc = (void *)vmx->host_idt_base + vector * 16; > +#else > + desc = (void *)vmx->host_idt_base + vector * 8; > +#endif > + > + entry = gate_offset(*desc); > + asm( > + "mov %0, %%" _ASM_DX " \n\t" > +#ifdef CONFIG_X86_64 > + "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t" > + "and $0xfff0, %%" _ASM_SP " \n\t" > + "mov %%ss, %%" _ASM_AX " \n\t" > + "push %%" _ASM_AX " \n\t" > + "push %%" _ASM_BX " \n\t" > +#endif Are we sure no interrupts are using the IST feature? I guess it's unlikely. > + "pushf \n\t" > + "pop %%" _ASM_AX " \n\t" > + "or $0x200, %%" _ASM_AX " \n\t" > + "push %%" _ASM_AX " \n\t" Can simplify to pushf; orl $0x200, (%%rsp). > + "mov %%cs, %%" _ASM_AX " \n\t" > + "push %%" _ASM_AX " \n\t" push %%cs > + "push intr_return \n\t" push $1f. Or even combine with the next instruction, and call %rdx. > + "jmp *%% " _ASM_DX " \n\t" > + "1: \n\t" > + ".pushsection .rodata \n\t" > + ".global intr_return \n\t" > + "intr_return: " _ASM_PTR " 1b \n\t" > + ".popsection \n\t" > + : : "m"(entry) : > +#ifdef CONFIG_X86_64 > + "rax", "rbx", "rdx" > +#else > + "eax", "edx" > +#endif > + ); > + } else > + local_irq_enable(); > +} > +
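The asm block under review hand-builds the frame that the CPU itself would have pushed had the interrupt been delivered through the IDT. The real handler then runs as if invoked normally, and its iret both returns into KVM and re-enables interrupts, because bit 9 (IF — the 0x200 above) is forced on in the saved flags. A rough sketch of the 64-bit frame being constructed (illustrative layout, not a normative SDM figure):

```
 higher addresses
 +--------------------+
 |  SS                |  saved stack segment
 |  RSP (original)    |  saved stack pointer (stack 16-byte aligned first)
 |  RFLAGS | 0x200    |  IF forced on, so iret re-enables interrupts
 |  CS                |  saved code segment
 |  return address    |  intr_return / $1f, back into KVM
 +--------------------+  <- stack pointer when jumping into the handler
 lower addresses
```

On 32-bit there is no privilege change, so the CPU would not push SS:ESP for such an interrupt — which is why the SS/RSP pushes sit under CONFIG_X86_64 in the patch.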
[PATCH] emulator: add MUL tests
Signed-off-by: Avi Kivity --- x86/emulator.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/x86/emulator.c b/x86/emulator.c index a128e13..96576e5 100644 --- a/x86/emulator.c +++ b/x86/emulator.c @@ -583,9 +583,9 @@ static void test_imul(ulong *mem) report("imul rax, mem, imm", a == 0x1D950BDE1D950BC8L); } -static void test_div(long *mem) +static void test_muldiv(long *mem) { -long a, d; +long a, d, aa, dd; u8 ex = 1; *mem = 0; a = 1; d = 2; @@ -598,6 +598,19 @@ static void test_div(long *mem) : "+a"(a), "+d"(d), "+q"(ex) : "m"(*mem)); report("divq (1)", a == 0x1ffb1b963b33ul && d == 0x273ba4384ede2ul && !ex); +aa = 0x; dd = 0x; +*mem = 0x; a = aa; d = dd; +asm("mulb %2" : "+a"(a), "+d"(d) : "m"(*mem)); +report("mulb mem", a == 0x0363 && d == dd); +*mem = 0x; a = aa; d = dd; +asm("mulw %2" : "+a"(a), "+d"(d) : "m"(*mem)); +report("mulw mem", a == 0xc963 && d == 0x0369); +*mem = 0x; a = aa; d = dd; +asm("mull %2" : "+a"(a), "+d"(d) : "m"(*mem)); +report("mull mem", a == 0x962fc963 && d == 0x369d036); +*mem = 0x; a = aa; d = dd; +asm("mulq %2" : "+a"(a), "+d"(d) : "m"(*mem)); +report("mulq mem", a == 0x2fc962fc962fc963 && d == 0x369d0369d0369d0); } typedef unsigned __attribute__((vector_size(16))) sse128; @@ -934,7 +947,7 @@ int main() test_btc(mem); test_bsfbsr(mem); test_imul(mem); - test_div(mem); + test_muldiv(mem); test_sse(mem); test_mmx(mem); test_rip_relative(mem, insn_ram); -- 1.8.1.2
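The tests above pin down the widening behavior of one-operand MUL at each operand size. As an independent cross-check outside the emulator and the kvm-unit-tests harness, the same semantics can be expressed in plain C, with gcc/clang's 128-bit integer playing the role of RDX:RAX (this is an illustration of the architectural behavior, not the test code itself):

```c
#include <stdint.h>

/* One-operand MUL widens: the product of two N-bit values lands in a
 * 2N-bit "extended accumulator" (AX, DX:AX, EDX:EAX, or RDX:RAX). */
struct wide { uint64_t lo, hi; };   /* lo plays RAX, hi plays RDX */

/* mulq semantics: RDX:RAX = RAX * src. */
static struct wide mulq(uint64_t rax, uint64_t src)
{
    unsigned __int128 p = (unsigned __int128)rax * src;
    return (struct wide){ .lo = (uint64_t)p, .hi = (uint64_t)(p >> 64) };
}

/* mulb semantics: AX = AL * src8; DX is untouched, which is exactly
 * what the "d == dd" check in the mulb test above asserts. */
static uint16_t mulb(uint8_t al, uint8_t src)
{
    return (uint16_t)al * src;
}
```

For example, mulb with AL = src = 0xff yields AX = 0xfe01, and mulq of the all-ones value by 2 produces a carry into the high half.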
[PATCH 2/8] KVM: x86 emulator: decode extended accumulator explicitly
Single-operand MUL and DIV access an extended accumulator: AX for byte instructions, and DX:AX, EDX:EAX, or RDX:RAX for larger-sized instructions. Add support for fetching the extended accumulator. In order not to change things too much, RDX is loaded into Src2, which is already loaded by fastop(). This avoids increasing register pressure on i386. Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 18c86b5..aa8516e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -60,6 +60,8 @@ #define OpGS 25ull /* GS */ #define OpMem826ull /* 8-bit zero extended memory operand */ #define OpImm64 27ull /* Sign extended 16/32/64-bit immediate */ +#define OpAccLo 29ull /* Low part of extended acc (AX/AX/EAX/RAX) */ +#define OpAccHi 30ull /* High part of extended acc (-/DX/EDX/RDX) */ #define OpBits 5 /* Width of operand field */ #define OpMask ((1ull << OpBits) - 1) @@ -85,6 +87,7 @@ #define DstMem64(OpMem64 << DstShift) #define DstImmUByte (OpImmUByte << DstShift) #define DstDX (OpDX << DstShift) +#define DstAccLo(OpAccLo << DstShift) #define DstMask (OpMask << DstShift) /* Source operand type. */ #define SrcShift6 @@ -106,6 +109,7 @@ #define SrcImm64(OpImm64 << SrcShift) #define SrcDX (OpDX << SrcShift) #define SrcMem8 (OpMem8 << SrcShift) +#define SrcAccHi(OpAccHi << SrcShift) #define SrcMask (OpMask << SrcShift) #define BitOp (1<<11) #define MemAbs (1<<12) /* Memory operand is absolute displacement */ @@ -154,6 +158,8 @@ #define NoWrite ((u64)1 << 45) /* No writeback */ #define SrcWrite((u64)1 << 46) /* Write back src operand */ +#define DstXacc (DstAccLo | SrcAccHi | SrcWrite) + #define X2(x...) x, x #define X3(x...) X2(x), x #define X4(x...) 
X2(x), X2(x) @@ -4129,6 +4135,22 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op, fetch_register_operand(op); op->orig_val = op->val; break; + case OpAccLo: + op->type = OP_REG; + op->bytes = (ctxt->d & ByteOp) ? 2 : ctxt->op_bytes; + op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX); + fetch_register_operand(op); + op->orig_val = op->val; + break; + case OpAccHi: + if (ctxt->d & ByteOp) + break; + op->type = OP_REG; + op->bytes = ctxt->op_bytes; + op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RDX); + fetch_register_operand(op); + op->orig_val = op->val; + break; case OpDI: op->type = OP_MEM; op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes; -- 1.8.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] KVM: x86 emulator: Switch fastop src operand to RDX
This makes OpAccHi useful. Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 26 +- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index d51f6f4..fe91e70 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -176,8 +176,8 @@ /* * fastop functions have a special calling convention: * - * dst:[rdx]:rax (in/out) - * src:rbx(in/out) + * dst:rax(in/out) + * src:rdx(in/out) * src2: rcx(in) * flags: rflags (in/out) * @@ -482,19 +482,19 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)); #define FASTOP2(op) \ FOP_START(op) \ - FOP2E(op##b, al, bl) \ - FOP2E(op##w, ax, bx) \ - FOP2E(op##l, eax, ebx) \ - ON64(FOP2E(op##q, rax, rbx)) \ + FOP2E(op##b, al, dl) \ + FOP2E(op##w, ax, dx) \ + FOP2E(op##l, eax, edx) \ + ON64(FOP2E(op##q, rax, rdx)) \ FOP_END /* 2 operand, word only */ #define FASTOP2W(op) \ FOP_START(op) \ FOPNOP() \ - FOP2E(op##w, ax, bx) \ - FOP2E(op##l, eax, ebx) \ - ON64(FOP2E(op##q, rax, rbx)) \ + FOP2E(op##w, ax, dx) \ + FOP2E(op##l, eax, edx) \ + ON64(FOP2E(op##q, rax, rdx)) \ FOP_END /* 2 operand, src is CL */ @@ -513,9 +513,9 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)); #define FASTOP3WCL(op) \ FOP_START(op) \ FOPNOP() \ - FOP3E(op##w, ax, bx, cl) \ - FOP3E(op##l, eax, ebx, cl) \ - ON64(FOP3E(op##q, rax, rbx, cl)) \ + FOP3E(op##w, ax, dx, cl) \ + FOP3E(op##l, eax, edx, cl) \ + ON64(FOP3E(op##q, rax, rdx, cl)) \ FOP_END /* Special case for SETcc - 1 instruction per cc */ @@ -4521,7 +4521,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)) ulong flags = (ctxt->eflags & EFLAGS_MASK) | X86_EFLAGS_IF; fop += __ffs(ctxt->dst.bytes) * FASTOP_SIZE; asm("push %[flags]; popf; call *%[fastop]; pushf; pop %[flags]\n" - : "+a"(ctxt->dst.val), "+b"(ctxt->src.val), [flags]"+D"(flags) + : "+a"(ctxt->dst.val), "+d"(ctxt->src.val), [flags]"+D"(flags) : "c"(ctxt->src2.val), [fastop]"S"(fop)); 
ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK); return X86EMUL_CONTINUE; -- 1.8.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/8] KVM: x86 emulator: drop unused old-style inline emulation
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 198 - 1 file changed, 198 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 89f56bb..a706e52 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -282,174 +282,17 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) } /* - * Instruction emulation: - * Most instructions are emulated directly via a fragment of inline assembly - * code. This allows us to save/restore EFLAGS and thus very easily pick up - * any modified flags. - */ - -#if defined(CONFIG_X86_64) -#define _LO32 "k" /* force 32-bit operand */ -#define _STK "%%rsp" /* stack pointer */ -#elif defined(__i386__) -#define _LO32 "" /* force 32-bit operand */ -#define _STK "%%esp" /* stack pointer */ -#endif - -/* * These EFLAGS bits are restored from saved value during emulation, and * any changes are written back to the saved value after emulation. */ #define EFLAGS_MASK (EFLG_OF|EFLG_SF|EFLG_ZF|EFLG_AF|EFLG_PF|EFLG_CF) -/* Before executing instruction: restore necessary bits in EFLAGS. */ -#define _PRE_EFLAGS(_sav, _msk, _tmp) \ - /* EFLAGS = (_sav & _msk) | (EFLAGS & ~_msk); _sav &= ~_msk; */ \ - "movl %"_sav",%"_LO32 _tmp"; " \ - "push %"_tmp"; "\ - "push %"_tmp"; "\ - "movl %"_msk",%"_LO32 _tmp"; " \ - "andl %"_LO32 _tmp",("_STK"); " \ - "pushf; " \ - "notl %"_LO32 _tmp"; " \ - "andl %"_LO32 _tmp",("_STK"); " \ - "andl %"_LO32 _tmp","__stringify(BITS_PER_LONG/4)"("_STK"); " \ - "pop %"_tmp"; "\ - "orl %"_LO32 _tmp",("_STK"); " \ - "popf; "\ - "pop %"_sav"; " - -/* After executing instruction: write-back necessary bits in EFLAGS. 
*/ -#define _POST_EFLAGS(_sav, _msk, _tmp) \ - /* _sav |= EFLAGS & _msk; */\ - "pushf; " \ - "pop %"_tmp"; "\ - "andl %"_msk",%"_LO32 _tmp"; " \ - "orl %"_LO32 _tmp",%"_sav"; " - #ifdef CONFIG_X86_64 #define ON64(x) x #else #define ON64(x) #endif -#define emulate_2op(ctxt, _op, _x, _y, _suffix, _dsttype) \ - do {\ - __asm__ __volatile__ ( \ - _PRE_EFLAGS("0", "4", "2") \ - _op _suffix " %"_x"3,%1; " \ - _POST_EFLAGS("0", "4", "2") \ - : "=m" ((ctxt)->eflags),\ - "+q" (*(_dsttype*)&(ctxt)->dst.val), \ - "=&r" (_tmp) \ - : _y ((ctxt)->src.val), "i" (EFLAGS_MASK)); \ - } while (0) - - -/* Raw emulation: instruction has two explicit operands. */ -#define __emulate_2op_nobyte(ctxt,_op,_wx,_wy,_lx,_ly,_qx,_qy) \ - do {\ - unsigned long _tmp; \ - \ - switch ((ctxt)->dst.bytes) {\ - case 2: \ - emulate_2op(ctxt,_op,_wx,_wy,"w",u16); \ - break; \ - case 4: \ - emulate_2op(ctxt,_op,_lx,_ly,"l",u32); \ - break; \ - case 8:
[PATCH 5/8] KVM: x86 emulator: convert single-operand MUL/IMUL to fastop
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 35 --- 1 file changed, 16 insertions(+), 19 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index fe91e70..0f0c15e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -477,6 +477,15 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)); ON64(FOP1E(op##q, rax)) \ FOP_END +/* 1-operand, using src2 (for MUL/DIV r/m) */ +#define FASTOP1SRC2(op, name) \ + FOP_START(name) \ + FOP1E(op, cl) \ + FOP1E(op, cx) \ + FOP1E(op, ecx) \ + ON64(FOP1E(op, rcx)) \ + FOP_END + #define FOP2E(op, dst, src) \ FOP_ALIGN #op " %" #src ", %" #dst " \n\t" FOP_RET @@ -990,6 +999,9 @@ FASTOP2(xor); FASTOP2(cmp); FASTOP2(test); +FASTOP1SRC2(mul, mul_ex); +FASTOP1SRC2(imul, imul_ex); + FASTOP3WCL(shld); FASTOP3WCL(shrd); @@ -2104,22 +2116,6 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static int em_mul_ex(struct x86_emulate_ctxt *ctxt) -{ - u8 ex = 0; - - emulate_1op_rax_rdx(ctxt, "mul", ex); - return X86EMUL_CONTINUE; -} - -static int em_imul_ex(struct x86_emulate_ctxt *ctxt) -{ - u8 ex = 0; - - emulate_1op_rax_rdx(ctxt, "imul", ex); - return X86EMUL_CONTINUE; -} - static int em_div_ex(struct x86_emulate_ctxt *ctxt) { u8 de = 0; @@ -3702,8 +3698,8 @@ static const struct opcode group3[] = { F(DstMem | SrcImm | NoWrite, em_test), F(DstMem | SrcNone | Lock, em_not), F(DstMem | SrcNone | Lock, em_neg), - I(DstXacc | Src2Mem, em_mul_ex), - I(DstXacc | Src2Mem, em_imul_ex), + F(DstXacc | Src2Mem, em_mul_ex), + F(DstXacc | Src2Mem, em_imul_ex), I(DstXacc | Src2Mem, em_div_ex), I(DstXacc | Src2Mem, em_idiv_ex), }; @@ -4519,7 +4515,8 @@ static void fetch_possible_mmx_operand(struct x86_emulate_ctxt *ctxt, static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)) { ulong flags = (ctxt->eflags & EFLAGS_MASK) | X86_EFLAGS_IF; - fop += __ffs(ctxt->dst.bytes) * FASTOP_SIZE; + if (!(ctxt->d & ByteOp)) + fop += 
__ffs(ctxt->dst.bytes) * FASTOP_SIZE; asm("push %[flags]; popf; call *%[fastop]; pushf; pop %[flags]\n" : "+a"(ctxt->dst.val), "+d"(ctxt->src.val), [flags]"+D"(flags) : "c"(ctxt->src2.val), [fastop]"S"(fop)); -- 1.8.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/8] KVM: x86 emulator: convert XADD to fastop
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 10 +++--- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index a706e52..2f895c2 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -847,6 +847,8 @@ FASTOP2W(bts); FASTOP2W(btr); FASTOP2W(btc); +FASTOP2(xadd); + static u8 test_cc(unsigned int condition, unsigned long flags) { u8 rc; @@ -3824,7 +3826,7 @@ static const struct opcode twobyte_table[256] = { F(DstReg | SrcMem | ModRM, em_bsf), F(DstReg | SrcMem | ModRM, em_bsr), D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov), /* 0xC0 - 0xC7 */ - D2bv(DstMem | SrcReg | ModRM | Lock), + F2bv(DstMem | SrcReg | ModRM | SrcWrite | Lock, em_xadd), N, D(DstMem | SrcReg | ModRM | Mov), N, N, N, GD(0, &group9), /* 0xC8 - 0xCF */ @@ -4643,12 +4645,6 @@ twobyte_insn: ctxt->dst.val = (ctxt->src.bytes == 1) ? (s8) ctxt->src.val : (s16) ctxt->src.val; break; - case 0xc0 ... 0xc1: /* xadd */ - fastop(ctxt, em_add); - /* Write back the register source. */ - ctxt->src.val = ctxt->dst.orig_val; - write_register_operand(&ctxt->src); - break; case 0xc3: /* movnti */ ctxt->dst.bytes = ctxt->op_bytes; ctxt->dst.val = (ctxt->op_bytes == 4) ? (u32) ctxt->src.val : -- 1.8.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
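XADD is the one instruction in this series that writes back both operands, which is exactly what the SrcWrite writeback path from patch 1 exists for. Its architectural effect, sketched as a plain C illustration (the non-atomic, no-LOCK case):

```c
#include <stdint.h>

/* XADD dst, src:  temp = dst; dst = dst + src; src = temp.
 * Both operands are written back - the source register receives the
 * old destination value - hence the SrcWrite decode flag above. */
static void xadd32(uint32_t *dst, uint32_t *src)
{
    uint32_t old = *dst;
    *dst += *src;
    *src = old;
}
```

With a LOCK prefix this read-modify-write is the primitive behind atomic fetch-and-add, which is why the table entry keeps the Lock flag.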
[PATCH 6/8] KVM: x86 emulator: convert DIV/IDIV to fastop
Since DIV and IDIV can generate exceptions, we need an additional output parameter indicating whether an exception has occurred. To avoid increasing register pressure on i386, we use %rsi, which is already allocated for the fastop code pointer. Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 51 +- 1 file changed, 26 insertions(+), 25 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 0f0c15e..89f56bb 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -180,6 +180,7 @@ * src:rdx(in/out) * src2: rcx(in) * flags: rflags (in/out) + * ex: rsi(in:nonzero, out:zero if exception) * * Moreover, they are all exactly FASTOP_SIZE bytes long, so functions for * different operand sizes can be reached by calculation, rather than a jump @@ -467,7 +468,10 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)); #define FOPNOP() FOP_ALIGN FOP_RET #define FOP1E(op, dst) \ - FOP_ALIGN #op " %" #dst " \n\t" FOP_RET + FOP_ALIGN "10: " #op " %" #dst " \n\t" FOP_RET + +#define FOP1EEX(op, dst) \ + FOP1E(op, dst) _ASM_EXTABLE(10b, kvm_fastop_exception) #define FASTOP1(op) \ FOP_START(op) \ @@ -486,6 +490,15 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)); ON64(FOP1E(op, rcx)) \ FOP_END +/* 1-operand, using src2 (for MUL/DIV r/m), with exceptions */ +#define FASTOP1SRC2EX(op, name) \ + FOP_START(name) \ + FOP1EEX(op, cl) \ + FOP1EEX(op, cx) \ + FOP1EEX(op, ecx) \ + ON64(FOP1EEX(op, rcx)) \ + FOP_END + #define FOP2E(op, dst, src) \ FOP_ALIGN #op " %" #src ", %" #dst " \n\t" FOP_RET @@ -530,6 +543,9 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)); /* Special case for SETcc - 1 instruction per cc */ #define FOP_SETCC(op) ".align 4; " #op " %al; ret \n\t" +asm(".global kvm_fastop_exception \n" +"kvm_fastop_exception: xor %esi, %esi; ret"); + FOP_START(setcc) FOP_SETCC(seto) FOP_SETCC(setno) @@ -1001,6 +1017,8 @@ FASTOP2(test); FASTOP1SRC2(mul, mul_ex); 
FASTOP1SRC2(imul, imul_ex); +FASTOP1SRC2EX(div, div_ex); +FASTOP1SRC2EX(idiv, idiv_ex); FASTOP3WCL(shld); FASTOP3WCL(shrd); @@ -2116,26 +2134,6 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static int em_div_ex(struct x86_emulate_ctxt *ctxt) -{ - u8 de = 0; - - emulate_1op_rax_rdx(ctxt, "div", de); - if (de) - return emulate_de(ctxt); - return X86EMUL_CONTINUE; -} - -static int em_idiv_ex(struct x86_emulate_ctxt *ctxt) -{ - u8 de = 0; - - emulate_1op_rax_rdx(ctxt, "idiv", de); - if (de) - return emulate_de(ctxt); - return X86EMUL_CONTINUE; -} - static int em_grp45(struct x86_emulate_ctxt *ctxt) { int rc = X86EMUL_CONTINUE; @@ -3700,8 +3698,8 @@ static const struct opcode group3[] = { F(DstMem | SrcNone | Lock, em_neg), F(DstXacc | Src2Mem, em_mul_ex), F(DstXacc | Src2Mem, em_imul_ex), - I(DstXacc | Src2Mem, em_div_ex), - I(DstXacc | Src2Mem, em_idiv_ex), + F(DstXacc | Src2Mem, em_div_ex), + F(DstXacc | Src2Mem, em_idiv_ex), }; static const struct opcode group4[] = { @@ -4518,9 +4516,12 @@ static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)) if (!(ctxt->d & ByteOp)) fop += __ffs(ctxt->dst.bytes) * FASTOP_SIZE; asm("push %[flags]; popf; call *%[fastop]; pushf; pop %[flags]\n" - : "+a"(ctxt->dst.val), "+d"(ctxt->src.val), [flags]"+D"(flags) - : "c"(ctxt->src2.val), [fastop]"S"(fop)); + : "+a"(ctxt->dst.val), "+d"(ctxt->src.val), [flags]"+D"(flags), + [fastop]"+S"(fop) + : "c"(ctxt->src2.val)); ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK); + if (!fop) + return emulate_de(ctxt); return X86EMUL_CONTINUE; } -- 1.8.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
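The exception-table path above exists because DIV faults with #DE in two cases: a zero divisor, or a quotient that does not fit in the destination register. For the unsigned case, the conditions that the kvm_fastop_exception fixup must catch can be sketched in C (div_ex here is a hypothetical standalone helper for illustration, not the kernel symbol):

```c
#include <stdbool.h>
#include <stdint.h>

/* Unsigned 64-bit DIV with a 128-bit dividend in hi:lo (RDX:RAX).
 * Returns false exactly when the real instruction would raise #DE:
 * divisor == 0, or the quotient overflows 64 bits - equivalently,
 * hi >= divisor, since then (hi:lo)/divisor >= 2^64. */
static bool div_ex(uint64_t hi, uint64_t lo, uint64_t divisor,
                   uint64_t *quot, uint64_t *rem)
{
    if (divisor == 0 || hi >= divisor)
        return false;                          /* #DE */
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    *quot = (uint64_t)(n / divisor);
    *rem  = (uint64_t)(n % divisor);
    return true;
}
```

On real hardware the fault fires during execution rather than being pre-checked, which is why the fastop version uses _ASM_EXTABLE to recover and zero %rsi, and why fastop() translates a zeroed pointer into emulate_de().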
[PATCH 3/8] KVM: x86 emulator: switch MUL/DIV to DstXacc
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 15 --- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index aa8516e..d51f6f4 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -139,6 +139,7 @@ /* Source 2 operand type */ #define Src2Shift (30) #define Src2None(OpNone << Src2Shift) +#define Src2Mem (OpMem << Src2Shift) #define Src2CL (OpCL << Src2Shift) #define Src2ImmByte (OpImmByte << Src2Shift) #define Src2One (OpOne << Src2Shift) @@ -542,8 +543,8 @@ FOP_END; #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \ do {\ unsigned long _tmp; \ - ulong *rax = reg_rmw((ctxt), VCPU_REGS_RAX);\ - ulong *rdx = reg_rmw((ctxt), VCPU_REGS_RDX);\ + ulong *rax = &ctxt->dst.val;\ + ulong *rdx = &ctxt->src.val;\ \ __asm__ __volatile__ ( \ _PRE_EFLAGS("0", "5", "1") \ @@ -558,7 +559,7 @@ FOP_END; _ASM_EXTABLE(1b, 3b)\ : "=m" ((ctxt)->eflags), "=&r" (_tmp), \ "+a" (*rax), "+d" (*rdx), "+qm"(_ex) \ - : "i" (EFLAGS_MASK), "m" ((ctxt)->src.val));\ + : "i" (EFLAGS_MASK), "m" ((ctxt)->src2.val)); \ } while (0) /* instruction has only one source operand, destination is implicit (e.g. mul, div, imul, idiv) */ @@ -3701,10 +3702,10 @@ static const struct opcode group3[] = { F(DstMem | SrcImm | NoWrite, em_test), F(DstMem | SrcNone | Lock, em_not), F(DstMem | SrcNone | Lock, em_neg), - I(SrcMem, em_mul_ex), - I(SrcMem, em_imul_ex), - I(SrcMem, em_div_ex), - I(SrcMem, em_idiv_ex), + I(DstXacc | Src2Mem, em_mul_ex), + I(DstXacc | Src2Mem, em_imul_ex), + I(DstXacc | Src2Mem, em_div_ex), + I(DstXacc | Src2Mem, em_idiv_ex), }; static const struct opcode group4[] = { -- 1.8.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] KVM: x86 emulator: add support for writing back the source operand
Some instructions write back the source operand, not just the destination. Add support for doing this via the decode flags. Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 47 ++- 1 file changed, 26 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 2b11318..18c86b5 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -152,6 +152,7 @@ #define Avx ((u64)1 << 43) /* Advanced Vector Extensions */ #define Fastop ((u64)1 << 44) /* Use opcode::u.fastop */ #define NoWrite ((u64)1 << 45) /* No writeback */ +#define SrcWrite((u64)1 << 46) /* Write back src operand */ #define X2(x...) x, x #define X3(x...) X2(x), x @@ -1708,45 +1709,42 @@ static void write_register_operand(struct operand *op) } } -static int writeback(struct x86_emulate_ctxt *ctxt) +static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op) { int rc; - if (ctxt->d & NoWrite) - return X86EMUL_CONTINUE; - - switch (ctxt->dst.type) { + switch (op->type) { case OP_REG: - write_register_operand(&ctxt->dst); + write_register_operand(op); break; case OP_MEM: if (ctxt->lock_prefix) rc = segmented_cmpxchg(ctxt, - ctxt->dst.addr.mem, - &ctxt->dst.orig_val, - &ctxt->dst.val, - ctxt->dst.bytes); + op->addr.mem, + &op->orig_val, + &op->val, + op->bytes); else rc = segmented_write(ctxt, -ctxt->dst.addr.mem, -&ctxt->dst.val, -ctxt->dst.bytes); +op->addr.mem, +&op->val, +op->bytes); if (rc != X86EMUL_CONTINUE) return rc; break; case OP_MEM_STR: rc = segmented_write(ctxt, - ctxt->dst.addr.mem, - ctxt->dst.data, - ctxt->dst.bytes * ctxt->dst.count); + op->addr.mem, + op->data, + op->bytes * op->count); if (rc != X86EMUL_CONTINUE) return rc; break; case OP_XMM: - write_sse_reg(ctxt, &ctxt->dst.vec_val, ctxt->dst.addr.xmm); + write_sse_reg(ctxt, &op->vec_val, op->addr.xmm); break; case OP_MM: - write_mmx_reg(ctxt, &ctxt->dst.mm_val, ctxt->dst.addr.mm); + write_mmx_reg(ctxt, &op->mm_val, op->addr.mm); break; case OP_NONE: /* no writeback */ @@ 
-4717,9 +4715,16 @@ special_insn: goto done; writeback: - rc = writeback(ctxt); - if (rc != X86EMUL_CONTINUE) - goto done; + if (!(ctxt->d & NoWrite)) { + rc = writeback(ctxt, &ctxt->dst); + if (rc != X86EMUL_CONTINUE) + goto done; + } + if (ctxt->d & SrcWrite) { + rc = writeback(ctxt, &ctxt->src); + if (rc != X86EMUL_CONTINUE) + goto done; + } /* * restore dst type in case the decoding will be reused -- 1.8.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/8] Convert MUL/DIV to fastop
This patchset converts MUL/DIV to fastop. We depart from the plan, which was
to encode rdx:rax as a single operand, and instead encode it as two separate
operands. This reduces register pressure on i386, and is simpler besides.
As a bonus, XADD is converted as well.

The series results in a nice code size reduction:

   text    data     bss     dec     hex filename
  60147       0       0   60147    eaf3 arch/x86/kvm/emulate.o.before
  56899       0       0   56899    de43 arch/x86/kvm/emulate.o.after

Avi Kivity (8):
  KVM: x86 emulator: add support for writing back the source operand
  KVM: x86 emulator: decode extended accumulator explicitly
  KVM: x86 emulator: switch MUL/DIV to DstXacc
  KVM: x86 emulator: Switch fastop src operand to RDX
  KVM: x86 emulator: convert single-operand MUL/IMUL to fastop
  KVM: x86 emulator: convert DIV/IDIV to fastop
  KVM: x86 emulator: drop unused old-style inline emulation
  KVM: x86 emulator: convert XADD to fastop

 arch/x86/kvm/emulate.c | 388 ++---
 1 file changed, 106 insertions(+), 282 deletions(-)
-- 
1.8.1.2
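The point about register pressure comes from MUL/DIV producing (or consuming) a double-width value in rdx:rax. A hedged C sketch, not the kernel code, of what "two separate operands" means — the double-width product is split into independent low and high halves, each written back on its own:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one-operand MUL's result encoding: rather than one
 * rdx:rax pair, the product is kept as two single-width halves (low half
 * for EAX, high half for EDX) that can be written back independently. */
struct mul_result {
	uint32_t lo;	/* would land in EAX */
	uint32_t hi;	/* would land in EDX */
};

static struct mul_result mul32_lohi(uint32_t a, uint32_t b)
{
	uint64_t prod = (uint64_t)a * b;	/* 32x32 -> 64 unsigned multiply */
	struct mul_result r = { (uint32_t)prod, (uint32_t)(prod >> 32) };

	return r;
}
```

On i386 a combined 64-bit operand would tie up two of the few available registers at once; two 32-bit operands let the allocator handle each half separately.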
Re: linux-next: Tree for Jan 25 (kvm)
On Sat, Jan 26, 2013 at 1:46 AM, Stephen Rothwell wrote:
> On Fri, 25 Jan 2013 08:53:58 -0800 Randy Dunlap wrote:
>>
>> Seeing lots of this error on i386:
>>
>> arch/x86/kvm/emulate.c:1016: Error: unsupported for `push'
>
> Caused by commit 9ae9febae950 ("KVM: x86 emulator: covert SETCC to
> fastop") from the kvm tree.  cc's added.
>

I posted a fix.
[PATCH] KVM: x86 emulator: fix test_cc() build failure on i386
'pushq' doesn't exist on i386.  Replace with 'push', which should work
since the operand is a register.

Signed-off-by: Avi Kivity

---
 arch/x86/kvm/emulate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e99fb72..2b11318 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1013,7 +1013,7 @@ static u8 test_cc(unsigned int condition, unsigned long flags)
 	void (*fop)(void) = (void *)em_setcc + 4 * (condition & 0xf);
 
 	flags = (flags & EFLAGS_MASK) | X86_EFLAGS_IF;
-	asm("pushq %[flags]; popf; call *%[fastop]"
+	asm("push %[flags]; popf; call *%[fastop]"
 	    : "=a"(rc) : [fastop]"r"(fop), [flags]"r"(flags));
 	return rc;
 }
-- 
1.8.1
Re: [PATCH v2 8/8] KVM: x86 emulator: convert a few freestanding emulations to fastop
On Wed, Jan 23, 2013 at 2:21 AM, Marcelo Tosatti wrote:
> Missing signed off by.

Signed-off-by: Avi Kivity
[PATCH v2 6/8] KVM: x86 emulator: convert 2-operand IMUL to fastop
Signed-off-by: Avi Kivity

---
 arch/x86/kvm/emulate.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 62014dc..45ddec8 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -441,6 +441,8 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
 	}						\
 	} while (0)
 
+static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *));
+
 #define FOP_ALIGN ".align " __stringify(FASTOP_SIZE) " \n\t"
 #define FOP_RET   "ret \n\t"
 
@@ -3051,6 +3053,8 @@ FASTOP2(test);
 FASTOP3WCL(shld);
 FASTOP3WCL(shrd);
 
+FASTOP2W(imul);
+
 static int em_xchg(struct x86_emulate_ctxt *ctxt)
 {
 	/* Write back the register source. */
@@ -3063,16 +3067,10 @@ static int em_xchg(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }
 
-static int em_imul(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_2op_SrcV_nobyte(ctxt, "imul");
-	return X86EMUL_CONTINUE;
-}
-
 static int em_imul_3op(struct x86_emulate_ctxt *ctxt)
 {
 	ctxt->dst.val = ctxt->src2.val;
-	return em_imul(ctxt);
+	return fastop(ctxt, em_imul);
 }
 
 static int em_cwd(struct x86_emulate_ctxt *ctxt)
@@ -4010,7 +4008,7 @@ static const struct opcode twobyte_table[256] = {
 	F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
 	F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
 	F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
-	D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
+	D(ModRM), F(DstReg | SrcMem | ModRM, em_imul),
 	/* 0xB0 - 0xB7 */
 	I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg),
 	I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
-- 
1.8.0.2
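For reference, the semantics being delegated to the hardware here: two-operand IMUL keeps only the low half of the full signed product, and CF/OF report whether that truncation lost information. A hedged C model (not the emulator's stub):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of 2-operand IMUL on 32-bit operands: compute the
 * full signed product, return the truncated low half, and report via
 * cf_of whether CF=OF would be set (i.e. the truncation lost bits). */
static int32_t imul32(int32_t a, int32_t b, bool *cf_of)
{
	int64_t full = (int64_t)a * b;	/* full 64-bit signed product */

	*cf_of = full != (int64_t)(int32_t)full;
	return (int32_t)full;
}
```

Running the real instruction via fastop gets these flag results for free, which is why the open-coded emulation can go.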
[PATCH v2 5/8] KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 76 +- 1 file changed, 26 insertions(+), 50 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index edb09e9c..62014dc 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -478,6 +478,15 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) ON64(FOP2E(op##q, rax, rbx)) \ FOP_END +/* 2 operand, word only */ +#define FASTOP2W(op) \ + FOP_START(op) \ + FOPNOP() \ + FOP2E(op##w, ax, bx) \ + FOP2E(op##l, eax, ebx) \ + ON64(FOP2E(op##q, rax, rbx)) \ + FOP_END + /* 2 operand, src is CL */ #define FASTOP2CL(op) \ FOP_START(op) \ @@ -2066,6 +2075,13 @@ FASTOP2CL(shl); FASTOP2CL(shr); FASTOP2CL(sar); +FASTOP2W(bsf); +FASTOP2W(bsr); +FASTOP2W(bt); +FASTOP2W(bts); +FASTOP2W(btr); +FASTOP2W(btc); + static int em_mul_ex(struct x86_emulate_ctxt *ctxt) { u8 ex = 0; @@ -3377,47 +3393,6 @@ static int em_sti(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static int em_bt(struct x86_emulate_ctxt *ctxt) -{ - /* Disable writeback. 
*/ - ctxt->dst.type = OP_NONE; - /* only subword offset */ - ctxt->src.val &= (ctxt->dst.bytes << 3) - 1; - - emulate_2op_SrcV_nobyte(ctxt, "bt"); - return X86EMUL_CONTINUE; -} - -static int em_bts(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "bts"); - return X86EMUL_CONTINUE; -} - -static int em_btr(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "btr"); - return X86EMUL_CONTINUE; -} - -static int em_btc(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "btc"); - return X86EMUL_CONTINUE; -} - -static int em_bsf(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "bsf"); - return X86EMUL_CONTINUE; -} - -static int em_bsr(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "bsr"); - return X86EMUL_CONTINUE; -} - static int em_cpuid(struct x86_emulate_ctxt *ctxt) { u32 eax, ebx, ecx, edx; @@ -3773,10 +3748,10 @@ static const struct group_dual group7 = { { static const struct opcode group8[] = { N, N, N, N, - I(DstMem | SrcImmByte, em_bt), - I(DstMem | SrcImmByte | Lock | PageTable, em_bts), - I(DstMem | SrcImmByte | Lock, em_btr), - I(DstMem | SrcImmByte | Lock | PageTable, em_btc), + F(DstMem | SrcImmByte | NoWrite,em_bt), + F(DstMem | SrcImmByte | Lock | PageTable, em_bts), + F(DstMem | SrcImmByte | Lock, em_btr), + F(DstMem | SrcImmByte | Lock | PageTable, em_btc), }; static const struct group_dual group9 = { { @@ -4025,28 +4000,29 @@ static const struct opcode twobyte_table[256] = { X16(D(ByteOp | DstMem | SrcNone | ModRM| Mov)), /* 0xA0 - 0xA7 */ I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg), - II(ImplicitOps, em_cpuid, cpuid), I(DstMem | SrcReg | ModRM | BitOp, em_bt), + II(ImplicitOps, em_cpuid, cpuid), + F(DstMem | SrcReg | ModRM | BitOp | NoWrite, em_bt), F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shld), F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N, /* 0xA8 - 0xAF */ I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg), DI(ImplicitOps, rsm), 
- I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts), + F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts), F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd), F(DstMem | SrcReg | Src2CL | ModRM, em_shrd), D(ModRM), I(DstReg | SrcMem | ModRM, em_imul), /* 0xB0 - 0xB7 */ I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg), I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg), - I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr), + F(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr), I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg), I(DstReg | SrcMemFAddr | ModRM | Src2GS, em_lseg), D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov), /* 0xB8 - 0xBF */ N, N, G(BitOp, group8), - I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc), - I(DstReg | SrcMem | ModRM, em_bsf), I(DstReg | SrcMem | ModRM, em_bsr), + F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc), + F(DstReg | SrcMem | ModRM, em_bsf), F(DstReg | SrcMem | ModRM, em_bsr), D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov), /* 0xC0 - 0xC7 */ D2bv(DstMem | SrcReg | ModRM | Lock), -- 1.8.0.2 -- To unsubscribe from this list: send the line "unsubscribe kvm
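The removed em_bt shows the two quirks of this instruction family: the tested bit becomes CF, and (for the register forms) the bit offset is masked to the operand width. A hypothetical C model of the semantics the fastop stubs now execute directly — not the emulator's code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the BT family on a 32-bit operand: every variant reports the
 * selected bit (what the CPU copies into CF); BTS/BTR/BTC also update the
 * operand.  The offset mask mirrors the "only subword offset" handling. */
static bool bt_op(uint32_t *val, unsigned off, char op)
{
	off &= 31;			/* subword offset for 32-bit operand */
	bool bit = (*val >> off) & 1;	/* old bit value -> CF */

	switch (op) {
	case 's': *val |=  (UINT32_C(1) << off); break;	/* bts: set */
	case 'r': *val &= ~(UINT32_C(1) << off); break;	/* btr: reset */
	case 'c': *val ^=  (UINT32_C(1) << off); break;	/* btc: complement */
	default:  break;				/* bt: test only */
	}
	return bit;
}
```

Plain BT modifies nothing, which is why its table entry gains the NoWrite flag while BTS/BTR/BTC keep Lock and normal writeback.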
[PATCH v2 7/8] KVM: x86 emulator: rearrange fastop definitions
Make fastop opcodes usable in other emulations. Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 70 +- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 45ddec8..d06354d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -972,6 +972,41 @@ static int read_descriptor(struct x86_emulate_ctxt *ctxt, return rc; } +FASTOP2(add); +FASTOP2(or); +FASTOP2(adc); +FASTOP2(sbb); +FASTOP2(and); +FASTOP2(sub); +FASTOP2(xor); +FASTOP2(cmp); +FASTOP2(test); + +FASTOP3WCL(shld); +FASTOP3WCL(shrd); + +FASTOP2W(imul); + +FASTOP1(not); +FASTOP1(neg); +FASTOP1(inc); +FASTOP1(dec); + +FASTOP2CL(rol); +FASTOP2CL(ror); +FASTOP2CL(rcl); +FASTOP2CL(rcr); +FASTOP2CL(shl); +FASTOP2CL(shr); +FASTOP2CL(sar); + +FASTOP2W(bsf); +FASTOP2W(bsr); +FASTOP2W(bt); +FASTOP2W(bts); +FASTOP2W(btr); +FASTOP2W(btc); + static u8 test_cc(unsigned int condition, unsigned long flags) { u8 rc; @@ -2064,26 +2099,6 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -FASTOP1(not); -FASTOP1(neg); -FASTOP1(inc); -FASTOP1(dec); - -FASTOP2CL(rol); -FASTOP2CL(ror); -FASTOP2CL(rcl); -FASTOP2CL(rcr); -FASTOP2CL(shl); -FASTOP2CL(shr); -FASTOP2CL(sar); - -FASTOP2W(bsf); -FASTOP2W(bsr); -FASTOP2W(bt); -FASTOP2W(bts); -FASTOP2W(btr); -FASTOP2W(btc); - static int em_mul_ex(struct x86_emulate_ctxt *ctxt) { u8 ex = 0; @@ -3040,21 +3055,6 @@ static int em_ret_near_imm(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -FASTOP2(add); -FASTOP2(or); -FASTOP2(adc); -FASTOP2(sbb); -FASTOP2(and); -FASTOP2(sub); -FASTOP2(xor); -FASTOP2(cmp); -FASTOP2(test); - -FASTOP3WCL(shld); -FASTOP3WCL(shrd); - -FASTOP2W(imul); - static int em_xchg(struct x86_emulate_ctxt *ctxt) { /* Write back the register source. */ -- 1.8.0.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 8/8] KVM: x86 emulator: convert a few freestanding emulations to fastop
---
 arch/x86/kvm/emulate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d06354d..e99fb72 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2209,7 +2209,7 @@ static int em_cmpxchg(struct x86_emulate_ctxt *ctxt)
 	/* Save real source value, then compare EAX against destination. */
 	ctxt->src.orig_val = ctxt->src.val;
 	ctxt->src.val = reg_read(ctxt, VCPU_REGS_RAX);
-	emulate_2op_SrcV(ctxt, "cmp");
+	fastop(ctxt, em_cmp);
 
 	if (ctxt->eflags & EFLG_ZF) {
 		/* Success: write back to memory. */
@@ -2977,7 +2977,7 @@ static int em_das(struct x86_emulate_ctxt *ctxt)
 	ctxt->src.type = OP_IMM;
 	ctxt->src.val = 0;
 	ctxt->src.bytes = 1;
-	emulate_2op_SrcV(ctxt, "or");
+	fastop(ctxt, em_or);
 	ctxt->eflags &= ~(X86_EFLAGS_AF | X86_EFLAGS_CF);
 	if (cf)
 		ctxt->eflags |= X86_EFLAGS_CF;
@@ -4816,7 +4816,7 @@ twobyte_insn:
 		(s16) ctxt->src.val;
 		break;
 	case 0xc0 ... 0xc1:	/* xadd */
-		emulate_2op_SrcV(ctxt, "add");
+		fastop(ctxt, em_add);
 		/* Write back the register source. */
 		ctxt->src.val = ctxt->dst.orig_val;
 		write_register_operand(&ctxt->src);
-- 
1.8.0.2
[PATCH v2 3/8] KVM: x86 emulator: covert SETCC to fastop
This is a bit of a special case since we don't have the usual byte/word/long/quad switch; instead we switch on the condition code embedded in the instruction. Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 60 -- 1 file changed, 29 insertions(+), 31 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index a94b1d7..e13138d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -499,6 +499,28 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) ON64(FOP3E(op##q, rax, rbx, cl)) \ FOP_END +/* Special case for SETcc - 1 instruction per cc */ +#define FOP_SETCC(op) ".align 4; " #op " %al; ret \n\t" + +FOP_START(setcc) +FOP_SETCC(seto) +FOP_SETCC(setno) +FOP_SETCC(setc) +FOP_SETCC(setnc) +FOP_SETCC(setz) +FOP_SETCC(setnz) +FOP_SETCC(setbe) +FOP_SETCC(setnbe) +FOP_SETCC(sets) +FOP_SETCC(setns) +FOP_SETCC(setp) +FOP_SETCC(setnp) +FOP_SETCC(setl) +FOP_SETCC(setnl) +FOP_SETCC(setle) +FOP_SETCC(setnle) +FOP_END; + #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \ do {\ unsigned long _tmp; \ @@ -939,39 +961,15 @@ static int read_descriptor(struct x86_emulate_ctxt *ctxt, return rc; } -static int test_cc(unsigned int condition, unsigned int flags) +static u8 test_cc(unsigned int condition, unsigned long flags) { - int rc = 0; + u8 rc; + void (*fop)(void) = (void *)em_setcc + 4 * (condition & 0xf); - switch ((condition & 15) >> 1) { - case 0: /* o */ - rc |= (flags & EFLG_OF); - break; - case 1: /* b/c/nae */ - rc |= (flags & EFLG_CF); - break; - case 2: /* z/e */ - rc |= (flags & EFLG_ZF); - break; - case 3: /* be/na */ - rc |= (flags & (EFLG_CF|EFLG_ZF)); - break; - case 4: /* s */ - rc |= (flags & EFLG_SF); - break; - case 5: /* p/pe */ - rc |= (flags & EFLG_PF); - break; - case 7: /* le/ng */ - rc |= (flags & EFLG_ZF); - /* fall through */ - case 6: /* l/nge */ - rc |= (!(flags & EFLG_SF) != !(flags & EFLG_OF)); - break; - } - - /* Odd condition identifiers (lsb == 1) have inverted sense. 
*/ - return (!!rc ^ (condition & 1)); + flags = (flags & EFLAGS_MASK) | X86_EFLAGS_IF; + asm("pushq %[flags]; popf; call *%[fastop]" + : "=a"(rc) : [fastop]"r"(fop), [flags]"r"(flags)); + return rc; } static void fetch_register_operand(struct operand *op) -- 1.8.0.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
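The deleted switch is worth keeping in mind when reading the fastop table: even/odd condition-code pairs test the same predicate, with the odd code inverting the result. An illustrative portable-C version (flag bit values from the x86 architecture, not kernel headers):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the condition-code logic the SETcc fastop table replaces.
 * Odd condition identifiers (lsb == 1) have inverted sense, so only the
 * predicate selected by bits 3:1 needs to be computed. */
#define FL_CF 0x0001
#define FL_PF 0x0004
#define FL_ZF 0x0040
#define FL_SF 0x0080
#define FL_OF 0x0800

static bool cc_holds(unsigned condition, unsigned long flags)
{
	bool r = false;

	switch ((condition & 15) >> 1) {
	case 0: r = flags & FL_OF; break;		/* o  */
	case 1: r = flags & FL_CF; break;		/* b  */
	case 2: r = flags & FL_ZF; break;		/* z  */
	case 3: r = flags & (FL_CF | FL_ZF); break;	/* be */
	case 4: r = flags & FL_SF; break;		/* s  */
	case 5: r = flags & FL_PF; break;		/* p  */
	case 7: r = flags & FL_ZF;			/* le = ZF or l */
		/* fall through */
	case 6: r |= !(flags & FL_SF) != !(flags & FL_OF); break; /* l */
	}
	return r ^ (condition & 1);	/* odd cc inverts the sense */
}
```

The fastop version instead jumps to `em_setcc + 4 * (condition & 0xf)` — one 4-byte-aligned setcc;ret stub per condition code, so SETcc's order in the stub table must match the hardware encoding (this is the "SETcc misordering" fixed in v2).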
[PATCH v2 4/8] KVM: x86 emulator: convert INC/DEC to fastop
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 24 +++- 1 file changed, 7 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e13138d..edb09e9c 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2055,6 +2055,8 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt) FASTOP1(not); FASTOP1(neg); +FASTOP1(inc); +FASTOP1(dec); FASTOP2CL(rol); FASTOP2CL(ror); @@ -2105,12 +2107,6 @@ static int em_grp45(struct x86_emulate_ctxt *ctxt) int rc = X86EMUL_CONTINUE; switch (ctxt->modrm_reg) { - case 0: /* inc */ - emulate_1op(ctxt, "inc"); - break; - case 1: /* dec */ - emulate_1op(ctxt, "dec"); - break; case 2: /* call near abs */ { long int old_eip; old_eip = ctxt->_eip; @@ -3735,14 +3731,14 @@ static const struct opcode group3[] = { }; static const struct opcode group4[] = { - I(ByteOp | DstMem | SrcNone | Lock, em_grp45), - I(ByteOp | DstMem | SrcNone | Lock, em_grp45), + F(ByteOp | DstMem | SrcNone | Lock, em_inc), + F(ByteOp | DstMem | SrcNone | Lock, em_dec), N, N, N, N, N, N, }; static const struct opcode group5[] = { - I(DstMem | SrcNone | Lock, em_grp45), - I(DstMem | SrcNone | Lock, em_grp45), + F(DstMem | SrcNone | Lock, em_inc), + F(DstMem | SrcNone | Lock, em_dec), I(SrcMem | Stack, em_grp45), I(SrcMemFAddr | ImplicitOps | Stack,em_call_far), I(SrcMem | Stack, em_grp45), @@ -3891,7 +3887,7 @@ static const struct opcode opcode_table[256] = { /* 0x38 - 0x3F */ F6ALU(NoWrite, em_cmp), N, N, /* 0x40 - 0x4F */ - X16(D(DstReg)), + X8(F(DstReg, em_inc)), X8(F(DstReg, em_dec)), /* 0x50 - 0x57 */ X8(I(SrcReg | Stack, em_push)), /* 0x58 - 0x5F */ @@ -4681,12 +4677,6 @@ special_insn: goto twobyte_insn; switch (ctxt->b) { - case 0x40 ... 0x47: /* inc r16/r32 */ - emulate_1op(ctxt, "inc"); - break; - case 0x48 ... 
0x4f: /* dec r16/r32 */ - emulate_1op(ctxt, "dec"); - break; case 0x63: /* movsxd */ if (ctxt->mode != X86EMUL_MODE_PROT64) goto cannot_emulate; -- 1.8.0.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/8] KVM: x86 emulator: convert shift/rotate instructions to fastop
SHL, SHR, ROL, ROR, RCL, RCR, SAR, SAL Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 72 ++ 1 file changed, 31 insertions(+), 41 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index a21773f..a94b1d7 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -478,6 +478,15 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) ON64(FOP2E(op##q, rax, rbx)) \ FOP_END +/* 2 operand, src is CL */ +#define FASTOP2CL(op) \ + FOP_START(op) \ + FOP2E(op##b, al, cl) \ + FOP2E(op##w, ax, cl) \ + FOP2E(op##l, eax, cl) \ + ON64(FOP2E(op##q, rax, cl)) \ + FOP_END + #define FOP3E(op, dst, src, src2) \ FOP_ALIGN #op " %" #src2 ", %" #src ", %" #dst " \n\t" FOP_RET @@ -2046,38 +2055,17 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static int em_grp2(struct x86_emulate_ctxt *ctxt) -{ - switch (ctxt->modrm_reg) { - case 0: /* rol */ - emulate_2op_SrcB(ctxt, "rol"); - break; - case 1: /* ror */ - emulate_2op_SrcB(ctxt, "ror"); - break; - case 2: /* rcl */ - emulate_2op_SrcB(ctxt, "rcl"); - break; - case 3: /* rcr */ - emulate_2op_SrcB(ctxt, "rcr"); - break; - case 4: /* sal/shl */ - case 6: /* sal/shl */ - emulate_2op_SrcB(ctxt, "sal"); - break; - case 5: /* shr */ - emulate_2op_SrcB(ctxt, "shr"); - break; - case 7: /* sar */ - emulate_2op_SrcB(ctxt, "sar"); - break; - } - return X86EMUL_CONTINUE; -} - FASTOP1(not); FASTOP1(neg); +FASTOP2CL(rol); +FASTOP2CL(ror); +FASTOP2CL(rcl); +FASTOP2CL(rcr); +FASTOP2CL(shl); +FASTOP2CL(shr); +FASTOP2CL(sar); + static int em_mul_ex(struct x86_emulate_ctxt *ctxt) { u8 ex = 0; @@ -3726,6 +3714,17 @@ static const struct opcode group1A[] = { I(DstMem | SrcNone | Mov | Stack, em_pop), N, N, N, N, N, N, N, }; +static const struct opcode group2[] = { + F(DstMem | ModRM, em_rol), + F(DstMem | ModRM, em_ror), + F(DstMem | ModRM, em_rcl), + F(DstMem | ModRM, em_rcr), + F(DstMem | ModRM, em_shl), + F(DstMem | ModRM, em_shr), + F(DstMem | ModRM, em_shl), + F(DstMem | 
ModRM, em_sar), +}; + static const struct opcode group3[] = { F(DstMem | SrcImm | NoWrite, em_test), F(DstMem | SrcImm | NoWrite, em_test), @@ -3949,7 +3948,7 @@ static const struct opcode opcode_table[256] = { /* 0xB8 - 0xBF */ X8(I(DstReg | SrcImm64 | Mov, em_mov)), /* 0xC0 - 0xC7 */ - D2bv(DstMem | SrcImmByte | ModRM), + G(ByteOp | Src2ImmByte, group2), G(Src2ImmByte, group2), I(ImplicitOps | Stack | SrcImmU16, em_ret_near_imm), I(ImplicitOps | Stack, em_ret), I(DstReg | SrcMemFAddr | ModRM | No64 | Src2ES, em_lseg), @@ -3961,7 +3960,8 @@ static const struct opcode opcode_table[256] = { D(ImplicitOps), DI(SrcImmByte, intn), D(ImplicitOps | No64), II(ImplicitOps, em_iret, iret), /* 0xD0 - 0xD7 */ - D2bv(DstMem | SrcOne | ModRM), D2bv(DstMem | ModRM), + G(Src2One | ByteOp, group2), G(Src2One, group2), + G(Src2CL | ByteOp, group2), G(Src2CL, group2), N, I(DstAcc | SrcImmByte | No64, em_aad), N, N, /* 0xD8 - 0xDF */ N, E(0, &escape_d9), N, E(0, &escape_db), N, E(0, &escape_dd), N, N, @@ -4713,9 +4713,6 @@ special_insn: case 8: ctxt->dst.val = (s32)ctxt->dst.val; break; } break; - case 0xc0 ... 0xc1: - rc = em_grp2(ctxt); - break; case 0xcc: /* int3 */ rc = emulate_int(ctxt, 3); break; @@ -4726,13 +4723,6 @@ special_insn: if (ctxt->eflags & EFLG_OF) rc = emulate_int(ctxt, 4); break; - case 0xd0 ... 0xd1: /* Grp2 */ - rc = em_grp2(ctxt); - break; - case 0xd2 ... 0xd3: /* Grp2 */ - ctxt->src.val = reg_read(ctxt, VCPU_REGS_RCX); - rc = em_grp2(ctxt); - break; case 0xe9: /* jmp rel */ case 0xeb: /* jmp rel short */ jmp_rel(ctxt, ctxt->src.val); -- 1.8.0.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
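The three opcode ranges routed through group2 differ only in where the shift count comes from (imm8 for 0xC0/0xC1, the constant 1 for 0xD0/0xD1, CL for 0xD2/0xD3), so a single table of stubs serves all of them. As a hedged C model of one member, ROL with the count masking 32-bit x86 shifts/rotates apply:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch, not the emulator code: 32-bit ROL.  The hardware
 * masks the count to 5 bits before rotating, so a count of 32 is a no-op. */
static uint32_t rol32(uint32_t v, unsigned count)
{
	count &= 31;			/* hardware masks the count */
	if (count == 0)
		return v;		/* avoid UB from a 32-bit shift */
	return (v << count) | (v >> (32 - count));
}
```

The Src2One/Src2ImmByte/Src2CL decode flags deliver the count into the same source slot, so the fastop stub never needs to know which encoding was used.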
[PATCH v2 0/8] More fastop patches
After this, only the difficult MUL and DIV cases remain.

Changes from v1:
 - make SHLD/SHRD more consistent with the others
 - fix SETcc misordering

Avi Kivity (8):
  KVM: x86 emulator: Convert SHLD, SHRD to fastop
  KVM: x86 emulator: convert shift/rotate instructions to fastop
  KVM: x86 emulator: covert SETCC to fastop
  KVM: x86 emulator: convert INC/DEC to fastop
  KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop
  KVM: x86 emulator: convert 2-operand IMUL to fastop
  KVM: x86 emulator: rearrange fastop definitions
  KVM: x86 emulator: convert a few freestanding emulations to fastop

 arch/x86/kvm/emulate.c | 311 +
 1 file changed, 136 insertions(+), 175 deletions(-)
-- 
1.8.0.2
[PATCH v2 1/8] KVM: x86 emulator: Convert SHLD, SHRD to fastop
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 33 + 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 619a33d..a21773f 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -454,6 +454,8 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) #define FOP_END \ ".popsection") +#define FOPNOP() FOP_ALIGN FOP_RET + #define FOP1E(op, dst) \ FOP_ALIGN #op " %" #dst " \n\t" FOP_RET @@ -476,6 +478,18 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) ON64(FOP2E(op##q, rax, rbx)) \ FOP_END +#define FOP3E(op, dst, src, src2) \ + FOP_ALIGN #op " %" #src2 ", %" #src ", %" #dst " \n\t" FOP_RET + +/* 3-operand, word-only, src2=cl */ +#define FASTOP3WCL(op) \ + FOP_START(op) \ + FOPNOP() \ + FOP3E(op##w, ax, bx, cl) \ + FOP3E(op##l, eax, ebx, cl) \ + ON64(FOP3E(op##q, rax, rbx, cl)) \ + FOP_END + #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \ do {\ unsigned long _tmp; \ @@ -3036,6 +3050,9 @@ FASTOP2(xor); FASTOP2(cmp); FASTOP2(test); +FASTOP3WCL(shld); +FASTOP3WCL(shrd); + static int em_xchg(struct x86_emulate_ctxt *ctxt) { /* Write back the register source. 
*/ @@ -4015,14 +4032,14 @@ static const struct opcode twobyte_table[256] = { /* 0xA0 - 0xA7 */ I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg), II(ImplicitOps, em_cpuid, cpuid), I(DstMem | SrcReg | ModRM | BitOp, em_bt), - D(DstMem | SrcReg | Src2ImmByte | ModRM), - D(DstMem | SrcReg | Src2CL | ModRM), N, N, + F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shld), + F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N, /* 0xA8 - 0xAF */ I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg), DI(ImplicitOps, rsm), I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts), - D(DstMem | SrcReg | Src2ImmByte | ModRM), - D(DstMem | SrcReg | Src2CL | ModRM), + F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd), + F(DstMem | SrcReg | Src2CL | ModRM, em_shrd), D(ModRM), I(DstReg | SrcMem | ModRM, em_imul), /* 0xB0 - 0xB7 */ I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg), @@ -4834,14 +4851,6 @@ twobyte_insn: case 0x90 ... 0x9f: /* setcc r/m8 */ ctxt->dst.val = test_cc(ctxt->b, ctxt->eflags); break; - case 0xa4: /* shld imm8, r, r/m */ - case 0xa5: /* shld cl, r, r/m */ - emulate_2op_cl(ctxt, "shld"); - break; - case 0xac: /* shrd imm8, r, r/m */ - case 0xad: /* shrd cl, r, r/m */ - emulate_2op_cl(ctxt, "shrd"); - break; case 0xae: /* clflush */ break; case 0xb6 ... 0xb7: /* movzx */ -- 1.8.0.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
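FASTOP3WCL reads as "3-operand, word-only, src2 = CL": SHLD/SHRD have no byte form, and the count can come from CL or an immediate. For readers unfamiliar with the instruction, a hedged C model of 32-bit SHLD (not the emulator code): the destination shifts left while the vacated low bits fill from the top of the source.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of SHLD on 32-bit operands: shift dst left by count,
 * filling the vacated bits from the high end of src.  Count is masked to
 * 5 bits as the hardware does for 32-bit operands. */
static uint32_t shld32(uint32_t dst, uint32_t src, unsigned count)
{
	count &= 31;			/* hardware masks the count */
	if (count == 0)
		return dst;		/* avoid UB from a 32-bit shift */
	return (dst << count) | (src >> (32 - count));
}
```

SHRD is the mirror image, filling from the low end of the source.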
[PATCH 5/8] KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 76 +- 1 file changed, 26 insertions(+), 50 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index d89e88f..7ff83d9 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -478,6 +478,15 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) ON64(FOP2E(op##q, rax, rbx)) \ FOP_END +/* 2 operand, word only */ +#define FASTOP2W(op) \ + FOP_START(op) \ + FOPNOP() \ + FOP2E(op##w, ax, bx) \ + FOP2E(op##l, eax, ebx) \ + ON64(FOP2E(op##q, rax, rbx)) \ + FOP_END + /* 2 operand, src is CL */ #define FASTOP2CL(op) \ FOP_START(op) \ @@ -2066,6 +2075,13 @@ FASTOP2CL(shl); FASTOP2CL(shr); FASTOP2CL(sar); +FASTOP2W(bsf); +FASTOP2W(bsr); +FASTOP2W(bt); +FASTOP2W(bts); +FASTOP2W(btr); +FASTOP2W(btc); + static int em_mul_ex(struct x86_emulate_ctxt *ctxt) { u8 ex = 0; @@ -3377,47 +3393,6 @@ static int em_sti(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static int em_bt(struct x86_emulate_ctxt *ctxt) -{ - /* Disable writeback. 
*/ - ctxt->dst.type = OP_NONE; - /* only subword offset */ - ctxt->src.val &= (ctxt->dst.bytes << 3) - 1; - - emulate_2op_SrcV_nobyte(ctxt, "bt"); - return X86EMUL_CONTINUE; -} - -static int em_bts(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "bts"); - return X86EMUL_CONTINUE; -} - -static int em_btr(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "btr"); - return X86EMUL_CONTINUE; -} - -static int em_btc(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "btc"); - return X86EMUL_CONTINUE; -} - -static int em_bsf(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "bsf"); - return X86EMUL_CONTINUE; -} - -static int em_bsr(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "bsr"); - return X86EMUL_CONTINUE; -} - static int em_cpuid(struct x86_emulate_ctxt *ctxt) { u32 eax, ebx, ecx, edx; @@ -3773,10 +3748,10 @@ static const struct group_dual group7 = { { static const struct opcode group8[] = { N, N, N, N, - I(DstMem | SrcImmByte, em_bt), - I(DstMem | SrcImmByte | Lock | PageTable, em_bts), - I(DstMem | SrcImmByte | Lock, em_btr), - I(DstMem | SrcImmByte | Lock | PageTable, em_btc), + F(DstMem | SrcImmByte | NoWrite,em_bt), + F(DstMem | SrcImmByte | Lock | PageTable, em_bts), + F(DstMem | SrcImmByte | Lock, em_btr), + F(DstMem | SrcImmByte | Lock | PageTable, em_btc), }; static const struct group_dual group9 = { { @@ -4025,28 +4000,29 @@ static const struct opcode twobyte_table[256] = { X16(D(ByteOp | DstMem | SrcNone | ModRM| Mov)), /* 0xA0 - 0xA7 */ I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg), - II(ImplicitOps, em_cpuid, cpuid), I(DstMem | SrcReg | ModRM | BitOp, em_bt), + II(ImplicitOps, em_cpuid, cpuid), + F(DstMem | SrcReg | ModRM | BitOp | NoWrite, em_bt), F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shld), F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N, /* 0xA8 - 0xAF */ I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg), DI(ImplicitOps, rsm), 
- I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts), + F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts), F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd), F(DstMem | SrcReg | Src2CL | ModRM, em_shrd), D(ModRM), I(DstReg | SrcMem | ModRM, em_imul), /* 0xB0 - 0xB7 */ I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg), I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg), - I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr), + F(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr), I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg), I(DstReg | SrcMemFAddr | ModRM | Src2GS, em_lseg), D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov), /* 0xB8 - 0xBF */ N, N, G(BitOp, group8), - I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc), - I(DstReg | SrcMem | ModRM, em_bsf), I(DstReg | SrcMem | ModRM, em_bsr), + F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc), + F(DstReg | SrcMem | ModRM, em_bsf), F(DstReg | SrcMem | ModRM, em_bsr), D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov), /* 0xC0 - 0xC7 */ D2bv(DstMem | SrcReg | ModRM | Lock), -- 1.8.0.1 -- To unsubscribe from this list: send the line "unsubscribe kvm
[PATCH 6/8] KVM: x86 emulator: convert 2-operand IMUL to fastop
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 7ff83d9..c7578d0 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -441,6 +441,8 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) } \ } while (0) +static int fastop(struct x86_emulate_ctxt *ctxt, void (*fop)(struct fastop *)); + #define FOP_ALIGN ".align " __stringify(FASTOP_SIZE) " \n\t" #define FOP_RET "ret \n\t" @@ -3051,6 +3053,8 @@ FASTOP2(test); FASTOP3WCL(shld); FASTOP3WCL(shrd); +FASTOP2W(imul); + static int em_xchg(struct x86_emulate_ctxt *ctxt) { /* Write back the register source. */ @@ -3063,16 +3067,10 @@ static int em_xchg(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static int em_imul(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV_nobyte(ctxt, "imul"); - return X86EMUL_CONTINUE; -} - static int em_imul_3op(struct x86_emulate_ctxt *ctxt) { ctxt->dst.val = ctxt->src2.val; - return em_imul(ctxt); + return fastop(ctxt, em_imul); } static int em_cwd(struct x86_emulate_ctxt *ctxt) @@ -4010,7 +4008,7 @@ static const struct opcode twobyte_table[256] = { F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts), F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd), F(DstMem | SrcReg | Src2CL | ModRM, em_shrd), - D(ModRM), I(DstReg | SrcMem | ModRM, em_imul), + D(ModRM), F(DstReg | SrcMem | ModRM, em_imul), /* 0xB0 - 0xB7 */ I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg), I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg), -- 1.8.0.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/8] KVM: x86 emulator: rearrange fastop definitions
Make fastop opcodes usable in other emulations. Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 70 +- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c7578d0..da2b903 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -972,6 +972,41 @@ static int read_descriptor(struct x86_emulate_ctxt *ctxt, return rc; } +FASTOP2(add); +FASTOP2(or); +FASTOP2(adc); +FASTOP2(sbb); +FASTOP2(and); +FASTOP2(sub); +FASTOP2(xor); +FASTOP2(cmp); +FASTOP2(test); + +FASTOP3WCL(shld); +FASTOP3WCL(shrd); + +FASTOP2W(imul); + +FASTOP1(not); +FASTOP1(neg); +FASTOP1(inc); +FASTOP1(dec); + +FASTOP2CL(rol); +FASTOP2CL(ror); +FASTOP2CL(rcl); +FASTOP2CL(rcr); +FASTOP2CL(shl); +FASTOP2CL(shr); +FASTOP2CL(sar); + +FASTOP2W(bsf); +FASTOP2W(bsr); +FASTOP2W(bt); +FASTOP2W(bts); +FASTOP2W(btr); +FASTOP2W(btc); + static u8 test_cc(unsigned int condition, unsigned long flags) { u8 rc; @@ -2064,26 +2099,6 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -FASTOP1(not); -FASTOP1(neg); -FASTOP1(inc); -FASTOP1(dec); - -FASTOP2CL(rol); -FASTOP2CL(ror); -FASTOP2CL(rcl); -FASTOP2CL(rcr); -FASTOP2CL(shl); -FASTOP2CL(shr); -FASTOP2CL(sar); - -FASTOP2W(bsf); -FASTOP2W(bsr); -FASTOP2W(bt); -FASTOP2W(bts); -FASTOP2W(btr); -FASTOP2W(btc); - static int em_mul_ex(struct x86_emulate_ctxt *ctxt) { u8 ex = 0; @@ -3040,21 +3055,6 @@ static int em_ret_near_imm(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -FASTOP2(add); -FASTOP2(or); -FASTOP2(adc); -FASTOP2(sbb); -FASTOP2(and); -FASTOP2(sub); -FASTOP2(xor); -FASTOP2(cmp); -FASTOP2(test); - -FASTOP3WCL(shld); -FASTOP3WCL(shrd); - -FASTOP2W(imul); - static int em_xchg(struct x86_emulate_ctxt *ctxt) { /* Write back the register source. */ -- 1.8.0.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/8] KVM: x86 emulator: convert SETCC to fastop
This is a bit of a special case since we don't have the usual byte/word/long/quad switch; instead we switch on the condition code embedded in the instruction. Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 60 -- 1 file changed, 29 insertions(+), 31 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index d641178..f6f615e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -499,6 +499,28 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) ON64(FOP3E(op, rax, rbx, cl)) \ FOP_END +/* Special case for SETcc - 1 instruction per cc */ +#define FOP_SETCC(op) ".align 4; " #op " %al; ret \n\t" + +FOP_START(setcc) +FOP_SETCC(seto) +FOP_SETCC(setc) +FOP_SETCC(setz) +FOP_SETCC(setbe) +FOP_SETCC(sets) +FOP_SETCC(setp) +FOP_SETCC(setl) +FOP_SETCC(setle) +FOP_SETCC(setno) +FOP_SETCC(setnc) +FOP_SETCC(setnz) +FOP_SETCC(setnbe) +FOP_SETCC(setns) +FOP_SETCC(setnp) +FOP_SETCC(setnl) +FOP_SETCC(setnle) +FOP_END; + #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \ do {\ unsigned long _tmp; \ @@ -939,39 +961,15 @@ static int read_descriptor(struct x86_emulate_ctxt *ctxt, return rc; } -static int test_cc(unsigned int condition, unsigned int flags) +static u8 test_cc(unsigned int condition, unsigned long flags) { - int rc = 0; + u8 rc; + void (*fop)(void) = (void *)em_setcc + 4 * (condition & 0xf); - switch ((condition & 15) >> 1) { - case 0: /* o */ - rc |= (flags & EFLG_OF); - break; - case 1: /* b/c/nae */ - rc |= (flags & EFLG_CF); - break; - case 2: /* z/e */ - rc |= (flags & EFLG_ZF); - break; - case 3: /* be/na */ - rc |= (flags & (EFLG_CF|EFLG_ZF)); - break; - case 4: /* s */ - rc |= (flags & EFLG_SF); - break; - case 5: /* p/pe */ - rc |= (flags & EFLG_PF); - break; - case 7: /* le/ng */ - rc |= (flags & EFLG_ZF); - /* fall through */ - case 6: /* l/nge */ - rc |= (!(flags & EFLG_SF) != !(flags & EFLG_OF)); - break; - } - - /* Odd condition identifiers (lsb == 1) have inverted sense. 
*/ - return (!!rc ^ (condition & 1)); + flags = (flags & EFLAGS_MASK) | X86_EFLAGS_IF; + asm("pushq %[flags]; popf; call *%[fastop]" + : "=a"(rc) : [fastop]"r"(fop), [flags]"r"(flags)); + return rc; } static void fetch_register_operand(struct operand *op) -- 1.8.0.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/8] KVM: x86 emulator: convert a few freestanding emulations to fastop
---
 arch/x86/kvm/emulate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index da2b903..1bb0af2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2209,7 +2209,7 @@ static int em_cmpxchg(struct x86_emulate_ctxt *ctxt)
 	/* Save real source value, then compare EAX against destination. */
 	ctxt->src.orig_val = ctxt->src.val;
 	ctxt->src.val = reg_read(ctxt, VCPU_REGS_RAX);
-	emulate_2op_SrcV(ctxt, "cmp");
+	fastop(ctxt, em_cmp);
 
 	if (ctxt->eflags & EFLG_ZF) {
 		/* Success: write back to memory. */
@@ -2977,7 +2977,7 @@ static int em_das(struct x86_emulate_ctxt *ctxt)
 	ctxt->src.type = OP_IMM;
 	ctxt->src.val = 0;
 	ctxt->src.bytes = 1;
-	emulate_2op_SrcV(ctxt, "or");
+	fastop(ctxt, em_or);
 	ctxt->eflags &= ~(X86_EFLAGS_AF | X86_EFLAGS_CF);
 	if (cf)
 		ctxt->eflags |= X86_EFLAGS_CF;
@@ -4816,7 +4816,7 @@ twobyte_insn:
 		(s16) ctxt->src.val;
 		break;
 	case 0xc0 ... 0xc1:	/* xadd */
-		emulate_2op_SrcV(ctxt, "add");
+		fastop(ctxt, em_add);
 		/* Write back the register source. */
 		ctxt->src.val = ctxt->dst.orig_val;
 		write_register_operand(&ctxt->src);
-- 
1.8.0.1
[PATCH 2/8] KVM: x86 emulator: convert shift/rotate instructions to fastop
SHL, SHR, ROL, ROR, RCL, RCR, SAR, SAL Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 72 ++ 1 file changed, 31 insertions(+), 41 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 2189c6a..d641178 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -478,6 +478,15 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) ON64(FOP2E(op##q, rax, rbx)) \ FOP_END +/* 2 operand, src is CL */ +#define FASTOP2CL(op) \ + FOP_START(op) \ + FOP2E(op##b, al, cl) \ + FOP2E(op##w, ax, cl) \ + FOP2E(op##l, eax, cl) \ + ON64(FOP2E(op##q, rax, cl)) \ + FOP_END + #define FOP3E(op, dst, src, src2) \ FOP_ALIGN #op " %" #src2 ", %" #src ", %" #dst " \n\t" FOP_RET @@ -2046,38 +2055,17 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static int em_grp2(struct x86_emulate_ctxt *ctxt) -{ - switch (ctxt->modrm_reg) { - case 0: /* rol */ - emulate_2op_SrcB(ctxt, "rol"); - break; - case 1: /* ror */ - emulate_2op_SrcB(ctxt, "ror"); - break; - case 2: /* rcl */ - emulate_2op_SrcB(ctxt, "rcl"); - break; - case 3: /* rcr */ - emulate_2op_SrcB(ctxt, "rcr"); - break; - case 4: /* sal/shl */ - case 6: /* sal/shl */ - emulate_2op_SrcB(ctxt, "sal"); - break; - case 5: /* shr */ - emulate_2op_SrcB(ctxt, "shr"); - break; - case 7: /* sar */ - emulate_2op_SrcB(ctxt, "sar"); - break; - } - return X86EMUL_CONTINUE; -} - FASTOP1(not); FASTOP1(neg); +FASTOP2CL(rol); +FASTOP2CL(ror); +FASTOP2CL(rcl); +FASTOP2CL(rcr); +FASTOP2CL(shl); +FASTOP2CL(shr); +FASTOP2CL(sar); + static int em_mul_ex(struct x86_emulate_ctxt *ctxt) { u8 ex = 0; @@ -3726,6 +3714,17 @@ static const struct opcode group1A[] = { I(DstMem | SrcNone | Mov | Stack, em_pop), N, N, N, N, N, N, N, }; +static const struct opcode group2[] = { + F(DstMem | ModRM, em_rol), + F(DstMem | ModRM, em_ror), + F(DstMem | ModRM, em_rcl), + F(DstMem | ModRM, em_rcr), + F(DstMem | ModRM, em_shl), + F(DstMem | ModRM, em_shr), + F(DstMem | ModRM, em_shl), + F(DstMem | 
ModRM, em_sar), +}; + static const struct opcode group3[] = { F(DstMem | SrcImm | NoWrite, em_test), F(DstMem | SrcImm | NoWrite, em_test), @@ -3949,7 +3948,7 @@ static const struct opcode opcode_table[256] = { /* 0xB8 - 0xBF */ X8(I(DstReg | SrcImm64 | Mov, em_mov)), /* 0xC0 - 0xC7 */ - D2bv(DstMem | SrcImmByte | ModRM), + G(ByteOp | Src2ImmByte, group2), G(Src2ImmByte, group2), I(ImplicitOps | Stack | SrcImmU16, em_ret_near_imm), I(ImplicitOps | Stack, em_ret), I(DstReg | SrcMemFAddr | ModRM | No64 | Src2ES, em_lseg), @@ -3961,7 +3960,8 @@ static const struct opcode opcode_table[256] = { D(ImplicitOps), DI(SrcImmByte, intn), D(ImplicitOps | No64), II(ImplicitOps, em_iret, iret), /* 0xD0 - 0xD7 */ - D2bv(DstMem | SrcOne | ModRM), D2bv(DstMem | ModRM), + G(Src2One | ByteOp, group2), G(Src2One, group2), + G(Src2CL | ByteOp, group2), G(Src2CL, group2), N, I(DstAcc | SrcImmByte | No64, em_aad), N, N, /* 0xD8 - 0xDF */ N, E(0, &escape_d9), N, E(0, &escape_db), N, E(0, &escape_dd), N, N, @@ -4713,9 +4713,6 @@ special_insn: case 8: ctxt->dst.val = (s32)ctxt->dst.val; break; } break; - case 0xc0 ... 0xc1: - rc = em_grp2(ctxt); - break; case 0xcc: /* int3 */ rc = emulate_int(ctxt, 3); break; @@ -4726,13 +4723,6 @@ special_insn: if (ctxt->eflags & EFLG_OF) rc = emulate_int(ctxt, 4); break; - case 0xd0 ... 0xd1: /* Grp2 */ - rc = em_grp2(ctxt); - break; - case 0xd2 ... 0xd3: /* Grp2 */ - ctxt->src.val = reg_read(ctxt, VCPU_REGS_RCX); - rc = em_grp2(ctxt); - break; case 0xe9: /* jmp rel */ case 0xeb: /* jmp rel short */ jmp_rel(ctxt, ctxt->src.val); -- 1.8.0.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] KVM: x86 emulator: convert INC/DEC to fastop
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 24 +++- 1 file changed, 7 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index f6f615e..d89e88f 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2055,6 +2055,8 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt) FASTOP1(not); FASTOP1(neg); +FASTOP1(inc); +FASTOP1(dec); FASTOP2CL(rol); FASTOP2CL(ror); @@ -2105,12 +2107,6 @@ static int em_grp45(struct x86_emulate_ctxt *ctxt) int rc = X86EMUL_CONTINUE; switch (ctxt->modrm_reg) { - case 0: /* inc */ - emulate_1op(ctxt, "inc"); - break; - case 1: /* dec */ - emulate_1op(ctxt, "dec"); - break; case 2: /* call near abs */ { long int old_eip; old_eip = ctxt->_eip; @@ -3735,14 +3731,14 @@ static const struct opcode group3[] = { }; static const struct opcode group4[] = { - I(ByteOp | DstMem | SrcNone | Lock, em_grp45), - I(ByteOp | DstMem | SrcNone | Lock, em_grp45), + F(ByteOp | DstMem | SrcNone | Lock, em_inc), + F(ByteOp | DstMem | SrcNone | Lock, em_dec), N, N, N, N, N, N, }; static const struct opcode group5[] = { - I(DstMem | SrcNone | Lock, em_grp45), - I(DstMem | SrcNone | Lock, em_grp45), + F(DstMem | SrcNone | Lock, em_inc), + F(DstMem | SrcNone | Lock, em_dec), I(SrcMem | Stack, em_grp45), I(SrcMemFAddr | ImplicitOps | Stack,em_call_far), I(SrcMem | Stack, em_grp45), @@ -3891,7 +3887,7 @@ static const struct opcode opcode_table[256] = { /* 0x38 - 0x3F */ F6ALU(NoWrite, em_cmp), N, N, /* 0x40 - 0x4F */ - X16(D(DstReg)), + X8(F(DstReg, em_inc)), X8(F(DstReg, em_dec)), /* 0x50 - 0x57 */ X8(I(SrcReg | Stack, em_push)), /* 0x58 - 0x5F */ @@ -4681,12 +4677,6 @@ special_insn: goto twobyte_insn; switch (ctxt->b) { - case 0x40 ... 0x47: /* inc r16/r32 */ - emulate_1op(ctxt, "inc"); - break; - case 0x48 ... 
0x4f: /* dec r16/r32 */ - emulate_1op(ctxt, "dec"); - break; case 0x63: /* movsxd */ if (ctxt->mode != X86EMUL_MODE_PROT64) goto cannot_emulate; -- 1.8.0.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] KVM: x86 emulator: Streamline SHLD, SHRD
Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 33 + 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 619a33d..2189c6a 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -454,6 +454,8 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) #define FOP_END \ ".popsection") +#define FOPNOP() FOP_ALIGN FOP_RET + #define FOP1E(op, dst) \ FOP_ALIGN #op " %" #dst " \n\t" FOP_RET @@ -476,6 +478,18 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) ON64(FOP2E(op##q, rax, rbx)) \ FOP_END +#define FOP3E(op, dst, src, src2) \ + FOP_ALIGN #op " %" #src2 ", %" #src ", %" #dst " \n\t" FOP_RET + +/* 3-operand, word-only, src2=cl */ +#define FASTOP3WCL(op) \ + FOP_START(op) \ + FOPNOP() \ + FOP3E(op, ax, bx, cl) \ + FOP3E(op, eax, ebx, cl) \ + ON64(FOP3E(op, rax, rbx, cl)) \ + FOP_END + #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \ do {\ unsigned long _tmp; \ @@ -3036,6 +3050,9 @@ FASTOP2(xor); FASTOP2(cmp); FASTOP2(test); +FASTOP3WCL(shld); +FASTOP3WCL(shrd); + static int em_xchg(struct x86_emulate_ctxt *ctxt) { /* Write back the register source. 
*/ @@ -4015,14 +4032,14 @@ static const struct opcode twobyte_table[256] = { /* 0xA0 - 0xA7 */ I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg), II(ImplicitOps, em_cpuid, cpuid), I(DstMem | SrcReg | ModRM | BitOp, em_bt), - D(DstMem | SrcReg | Src2ImmByte | ModRM), - D(DstMem | SrcReg | Src2CL | ModRM), N, N, + F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shld), + F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N, /* 0xA8 - 0xAF */ I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg), DI(ImplicitOps, rsm), I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts), - D(DstMem | SrcReg | Src2ImmByte | ModRM), - D(DstMem | SrcReg | Src2CL | ModRM), + F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd), + F(DstMem | SrcReg | Src2CL | ModRM, em_shrd), D(ModRM), I(DstReg | SrcMem | ModRM, em_imul), /* 0xB0 - 0xB7 */ I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg), @@ -4834,14 +4851,6 @@ twobyte_insn: case 0x90 ... 0x9f: /* setcc r/m8 */ ctxt->dst.val = test_cc(ctxt->b, ctxt->eflags); break; - case 0xa4: /* shld imm8, r, r/m */ - case 0xa5: /* shld cl, r, r/m */ - emulate_2op_cl(ctxt, "shld"); - break; - case 0xac: /* shrd imm8, r, r/m */ - case 0xad: /* shrd cl, r, r/m */ - emulate_2op_cl(ctxt, "shrd"); - break; case 0xae: /* clflush */ break; case 0xb6 ... 0xb7: /* movzx */ -- 1.8.0.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/8] More fastop patches
After this, only the difficult MUL and DIV cases remain.

Avi Kivity (8):
  KVM: x86 emulator: Streamline SHLD, SHRD
  KVM: x86 emulator: convert shift/rotate instructions to fastop
  KVM: x86 emulator: convert SETCC to fastop
  KVM: x86 emulator: convert INC/DEC to fastop
  KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop
  KVM: x86 emulator: convert 2-operand IMUL to fastop
  KVM: x86 emulator: rearrange fastop definitions
  KVM: x86 emulator: convert a few freestanding emulations to fastop

 arch/x86/kvm/emulate.c | 311 +
 1 file changed, 136 insertions(+), 175 deletions(-)

-- 
1.8.0.1
[PATCH v3 7/7] KVM: x86 emulator: convert basic ALU ops to fastop
Opcodes: TEST CMP ADD ADC SUB SBB XOR OR AND Signed-off-by: Avi Kivity --- arch/x86/kvm/emulate.c | 112 +++-- 1 file changed, 34 insertions(+), 78 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 3b5d4dd..619a33d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3026,59 +3026,15 @@ static int em_ret_near_imm(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } -static int em_add(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV(ctxt, "add"); - return X86EMUL_CONTINUE; -} - -static int em_or(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV(ctxt, "or"); - return X86EMUL_CONTINUE; -} - -static int em_adc(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV(ctxt, "adc"); - return X86EMUL_CONTINUE; -} - -static int em_sbb(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV(ctxt, "sbb"); - return X86EMUL_CONTINUE; -} - -static int em_and(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV(ctxt, "and"); - return X86EMUL_CONTINUE; -} - -static int em_sub(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV(ctxt, "sub"); - return X86EMUL_CONTINUE; -} - -static int em_xor(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV(ctxt, "xor"); - return X86EMUL_CONTINUE; -} - -static int em_cmp(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV(ctxt, "cmp"); - return X86EMUL_CONTINUE; -} - -static int em_test(struct x86_emulate_ctxt *ctxt) -{ - emulate_2op_SrcV(ctxt, "test"); - return X86EMUL_CONTINUE; -} +FASTOP2(add); +FASTOP2(or); +FASTOP2(adc); +FASTOP2(sbb); +FASTOP2(and); +FASTOP2(sub); +FASTOP2(xor); +FASTOP2(cmp); +FASTOP2(test); static int em_xchg(struct x86_emulate_ctxt *ctxt) { @@ -3711,9 +3667,9 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt) #define I2bvIP(_f, _e, _i, _p) \ IIP((_f) | ByteOp, _e, _i, _p), IIP(_f, _e, _i, _p) -#define I6ALU(_f, _e) I2bv((_f) | DstMem | SrcReg | ModRM, _e), \ - I2bv(((_f) | DstReg | SrcMem | ModRM) & ~Lock, _e), \ - I2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e) 
+#define F6ALU(_f, _e) F2bv((_f) | DstMem | SrcReg | ModRM, _e), \ + F2bv(((_f) | DstReg | SrcMem | ModRM) & ~Lock, _e), \ + F2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e) static const struct opcode group7_rm1[] = { DI(SrcNone | Priv, monitor), @@ -3739,14 +3695,14 @@ static const struct opcode group7_rm7[] = { }; static const struct opcode group1[] = { - I(Lock, em_add), - I(Lock | PageTable, em_or), - I(Lock, em_adc), - I(Lock, em_sbb), - I(Lock | PageTable, em_and), - I(Lock, em_sub), - I(Lock, em_xor), - I(NoWrite, em_cmp), + F(Lock, em_add), + F(Lock | PageTable, em_or), + F(Lock, em_adc), + F(Lock, em_sbb), + F(Lock | PageTable, em_and), + F(Lock, em_sub), + F(Lock, em_xor), + F(NoWrite, em_cmp), }; static const struct opcode group1A[] = { @@ -3754,8 +3710,8 @@ static const struct opcode group1A[] = { }; static const struct opcode group3[] = { - I(DstMem | SrcImm | NoWrite, em_test), - I(DstMem | SrcImm | NoWrite, em_test), + F(DstMem | SrcImm | NoWrite, em_test), + F(DstMem | SrcImm | NoWrite, em_test), F(DstMem | SrcNone | Lock, em_not), F(DstMem | SrcNone | Lock, em_neg), I(SrcMem, em_mul_ex), @@ -3897,29 +3853,29 @@ static const struct escape escape_dd = { { static const struct opcode opcode_table[256] = { /* 0x00 - 0x07 */ - I6ALU(Lock, em_add), + F6ALU(Lock, em_add), I(ImplicitOps | Stack | No64 | Src2ES, em_push_sreg), I(ImplicitOps | Stack | No64 | Src2ES, em_pop_sreg), /* 0x08 - 0x0F */ - I6ALU(Lock | PageTable, em_or), + F6ALU(Lock | PageTable, em_or), I(ImplicitOps | Stack | No64 | Src2CS, em_push_sreg), N, /* 0x10 - 0x17 */ - I6ALU(Lock, em_adc), + F6ALU(Lock, em_adc), I(ImplicitOps | Stack | No64 | Src2SS, em_push_sreg), I(ImplicitOps | Stack | No64 | Src2SS, em_pop_sreg), /* 0x18 - 0x1F */ - I6ALU(Lock, em_sbb), + F6ALU(Lock, em_sbb), I(ImplicitOps | Stack | No64 | Src2DS, em_push_sreg), I(ImplicitOps | Stack | No64 | Src2DS, em_pop_sreg), /* 0x20 - 0x27 */ - I6ALU(Lock | PageTable, em_and), N, N, + F6ALU(Lock | PageTable, em_and), N, N, /* 0x28 - 
0x2F */ - I6ALU(Lock, em_sub), N, I(ByteOp | DstAcc | No64, em_das), + F6ALU(Lock, em_sub), N, I(ByteOp | DstAcc | No64, em_das), /* 0x30 - 0x37 */ - I6ALU(Lock, em_xor
[PATCH v3 5/7] KVM: x86 emulator: convert NOT, NEG to fastop
Signed-off-by: Avi Kivity
---
 arch/x86/kvm/emulate.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 2af0c44..09dbdc5 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2050,17 +2050,8 @@ static int em_grp2(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }
 
-static int em_not(struct x86_emulate_ctxt *ctxt)
-{
-	ctxt->dst.val = ~ctxt->dst.val;
-	return X86EMUL_CONTINUE;
-}
-
-static int em_neg(struct x86_emulate_ctxt *ctxt)
-{
-	emulate_1op(ctxt, "neg");
-	return X86EMUL_CONTINUE;
-}
+FASTOP1(not);
+FASTOP1(neg);
 
 static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
 {
@@ -3753,8 +3744,8 @@ static const struct opcode group1A[] = {
 static const struct opcode group3[] = {
 	I(DstMem | SrcImm | NoWrite, em_test),
 	I(DstMem | SrcImm | NoWrite, em_test),
-	I(DstMem | SrcNone | Lock, em_not),
-	I(DstMem | SrcNone | Lock, em_neg),
+	F(DstMem | SrcNone | Lock, em_not),
+	F(DstMem | SrcNone | Lock, em_neg),
 	I(SrcMem, em_mul_ex),
 	I(SrcMem, em_imul_ex),
 	I(SrcMem, em_div_ex),
-- 
1.8.0.1
[PATCH v3 6/7] KVM: x86 emulator: add macros for defining 2-operand fastop emulation
Signed-off-by: Avi Kivity
---
 arch/x86/kvm/emulate.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 09dbdc5..3b5d4dd 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -465,6 +465,17 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
 	ON64(FOP1E(op##q, rax)) \
 	FOP_END
 
+#define FOP2E(op, dst, src) \
+	FOP_ALIGN #op " %" #src ", %" #dst " \n\t" FOP_RET
+
+#define FASTOP2(op) \
+	FOP_START(op) \
+	FOP2E(op##b, al, bl) \
+	FOP2E(op##w, ax, bx) \
+	FOP2E(op##l, eax, ebx) \
+	ON64(FOP2E(op##q, rax, rbx)) \
+	FOP_END
+
 #define __emulate_1op_rax_rdx(ctxt, _op, _suffix, _ex) \
 	do {\
 		unsigned long _tmp; \
@@ -3696,6 +3707,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 #define D2bv(_f) D((_f) | ByteOp), D(_f)
 #define D2bvIP(_f, _i, _p) DIP((_f) | ByteOp, _i, _p), DIP(_f, _i, _p)
 #define I2bv(_f, _e) I((_f) | ByteOp, _e), I(_f, _e)
+#define F2bv(_f, _e) F((_f) | ByteOp, _e), F(_f, _e)
 #define I2bvIP(_f, _e, _i, _p) \
 	IIP((_f) | ByteOp, _e, _i, _p), IIP(_f, _e, _i, _p)
-- 
1.8.0.1
[PATCH v3 3/7] KVM: x86 emulator: introduce NoWrite flag
Instead of disabling writeback via OP_NONE, just specify NoWrite.

Signed-off-by: Avi Kivity
---
 arch/x86/kvm/emulate.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 42c53c8..fe113fb 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -151,6 +151,7 @@
 #define Unaligned   ((u64)1 << 42) /* Explicitly unaligned (e.g. MOVDQU) */
 #define Avx         ((u64)1 << 43) /* Advanced Vector Extensions */
 #define Fastop      ((u64)1 << 44) /* Use opcode::u.fastop */
+#define NoWrite     ((u64)1 << 45) /* No writeback */
 
 #define X2(x...) x, x
 #define X3(x...) X2(x), x
@@ -1633,6 +1634,9 @@ static int writeback(struct x86_emulate_ctxt *ctxt)
 {
 	int rc;
 
+	if (ctxt->d & NoWrite)
+		return X86EMUL_CONTINUE;
+
 	switch (ctxt->dst.type) {
 	case OP_REG:
 		write_register_operand(&ctxt->dst);
-- 
1.8.0.1