Re: [PATCH] KVM/x86: remove WARN_ON() for when vm_munmap() fails

2018-02-01 Thread Radim Krčmář
2018-01-31 17:30-0800, Eric Biggers:
> From: Eric Biggers 
> 
> On x86, special KVM memslots such as the TSS region have anonymous
> memory mappings created on behalf of userspace, and these mappings are
> removed when the VM is destroyed.
> 
> It is however possible for removing these mappings via vm_munmap() to
> fail.  This can most easily happen if the thread receives SIGKILL while
> it's waiting to acquire ->mmap_sem.  This triggers the 'WARN_ON(r < 0)'
> in __x86_set_memory_region().  syzkaller was able to hit this, using
> 'exit()' to send the SIGKILL.  Note that while the vm_munmap() failure
> results in the mapping not being removed immediately, it is not leaked
> forever but rather will be freed when the process exits.
> 
> It's not really possible to handle this failure properly, so almost

We could check "r < 0 && r != -EINTR" to get rid of the easily
triggerable warning.
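
For illustration, that check would keep the hunk below and just tolerate
the interrupted case (an untested sketch):

	if (!size) {
		r = vm_munmap(old.userspace_addr, old.npages * PAGE_SIZE);
		/* SIGKILL while waiting on ->mmap_sem yields -EINTR,
		 * which is expected; keep warning about anything else. */
		WARN_ON(r < 0 && r != -EINTR);
	}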

> every other caller of vm_munmap() doesn't check the return value.  It's
> a limitation of having the kernel manage these mappings rather than
> userspace.
> 
> So just remove the WARN_ON() so that users can't spam the kernel log
> with this warning.
> 
> Fixes: f0d648bdf0a5 ("KVM: x86: map/unmap private slots in 
> __x86_set_memory_region")
> Reported-by: syzbot 
> Signed-off-by: Eric Biggers 
> ---

Removing it altogether doesn't sound that bad, though ...
Queued, thanks.

>  arch/x86/kvm/x86.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c53298dfbf50..53b57f18baec 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8272,10 +8272,8 @@ int __x86_set_memory_region(struct kvm *kvm, int id, 
> gpa_t gpa, u32 size)
>   return r;
>   }
>  
> - if (!size) {
> - r = vm_munmap(old.userspace_addr, old.npages * PAGE_SIZE);
> - WARN_ON(r < 0);
> - }
> + if (!size)
> + vm_munmap(old.userspace_addr, old.npages * PAGE_SIZE);
>  
>   return 0;
>  }
> -- 
> 2.16.0.rc1.238.g530d649a79-goog
> 


Re: linux-next: manual merge of the kvm tree with Linus' tree

2018-02-01 Thread Radim Krčmář
2018-02-01 09:21-0500, Paolo Bonzini:
> On 01/02/2018 08:22, Stephen Rothwell wrote:
> > Hi Christoffer,
> > 
> > On Thu, 1 Feb 2018 11:47:07 +0100 Christoffer Dall 
> >  wrote:
> >>
> >> While the suggested fix is functional it does result in some code
> >> duplication, and the better resolution is the following:
> > 
> > OK, I will use that resolution from tomorrow on.
> > 
> > Someone needs to remember to let Linus know when the pull request is
> > sent.
> 
> It should be fixed in the KVM tree before it reaches Linus (when we
> merge a topic branch that is common between x86/pti & KVM).

I wasn't sure if the pti topic branch is final, so I pulled the hyper-v
topic branch that also contains v4.15.  This and the SEV feature
conflicts should be gone now,

thanks.


Re: [PATCH v2 3/3] KVM: VMX: make MSR bitmaps per-VCPU

2018-01-31 Thread Radim Krčmář
2018-01-31 12:37-0500, Paolo Bonzini:
> On 30/01/2018 11:23, Radim Krčmář wrote:
> > 2018-01-27 09:50+0100, Paolo Bonzini:
> >> Place the MSR bitmap in struct loaded_vmcs, and update it in place
> >> every time the x2apic or APICv state can change.  This is rare and
> >> the loop can handle 64 MSRs per iteration, in a similar fashion as
> >> nested_vmx_prepare_msr_bitmap.
> >>
> >> This prepares for choosing, on a per-VM basis, whether to intercept
> >> the SPEC_CTRL and PRED_CMD MSRs.
> >>
> >> Suggested-by: Jim Mattson <jmatt...@google.com>
> >> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> >> ---
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> @@ -10022,7 +10043,7 @@ static inline bool 
> >> nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
> >>int msr;
> >>struct page *page;
> >>unsigned long *msr_bitmap_l1;
> >> -  unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.msr_bitmap;
> >> +  unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
> > 
> > The physical address of the nested msr_bitmap is never loaded into vmcs.
> > 
> > The resolution you provided had an extra hunk in prepare_vmcs02_full():
> > 
> > +   vmcs_write64(MSR_BITMAP, __pa(vmx->nested.vmcs02.msr_bitmap));
> > 
> > I have queued that as:
> > 
> > +   if (cpu_has_vmx_msr_bitmap())
> > +   vmcs_write64(MSR_BITMAP, __pa(vmx->nested.vmcs02.msr_bitmap));
> 
> Hmm you're right, it should be in prepare_vmcs02() here (4.15-based),
> and then moved to prepare_vmcs02_full() as part of the conflict resolution.

It also makes sense to have it in nested_get_vmcs12_pages, where we call
nested_vmx_prepare_msr_bitmap() and disable MSR bitmaps.
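Roughly like this (a sketch only, reusing names from this thread; the
surrounding enable/disable logic is assumed, not quoted from the series):

	if (cpu_has_vmx_msr_bitmap() &&
	    nested_vmx_prepare_msr_bitmap(vcpu, vmcs12))
		vmcs_write64(MSR_BITMAP,
			     __pa(vmx->nested.vmcs02.msr_bitmap));
	else
		vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
				CPU_BASED_USE_MSR_BITMAPS);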

> I'll send a v3.

Thanks.


[PATCH] KVM: nVMX: preserve SECONDARY_EXEC_DESC without UMIP

2018-01-31 Thread Radim Krčmář
L1 might want to use SECONDARY_EXEC_DESC, so we must not clear the VMCS
bit if UMIP is not being emulated.

We must still set the bit when emulating UMIP, as the feature can be
passed to L2 where L0 will do the emulation.  And because L2 can change
CR4 without a VM exit, we should clear the bit if UMIP is disabled.

Fixes: 0367f205a3b7 ("KVM: vmx: add support for emulating UMIP")
Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 I haven't tested emulated UMIP (yet) nor machines with UMIP, but at
 least kvm-unit-tests don't throw an error anymore.

 arch/x86/kvm/vmx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 438802d0b01d..b1e554a74b34 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4379,7 +4379,8 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL,
  SECONDARY_EXEC_DESC);
hw_cr4 &= ~X86_CR4_UMIP;
-   } else
+   } else if (!is_guest_mode(vcpu) ||
+  !nested_cpu_has2(get_vmcs12(vcpu), SECONDARY_EXEC_DESC))
vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL,
SECONDARY_EXEC_DESC);
 
-- 
2.15.0



Re: [PATCH v2 3/3] KVM: VMX: make MSR bitmaps per-VCPU

2018-01-30 Thread Radim Krčmář
2018-01-27 09:50+0100, Paolo Bonzini:
> Place the MSR bitmap in struct loaded_vmcs, and update it in place
> every time the x2apic or APICv state can change.  This is rare and
> the loop can handle 64 MSRs per iteration, in a similar fashion as
> nested_vmx_prepare_msr_bitmap.
> 
> This prepares for choosing, on a per-VM basis, whether to intercept
> the SPEC_CTRL and PRED_CMD MSRs.
> 
> Suggested-by: Jim Mattson 
> Signed-off-by: Paolo Bonzini 
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -10022,7 +10043,7 @@ static inline bool nested_vmx_merge_msr_bitmap(struct 
> kvm_vcpu *vcpu,
>   int msr;
>   struct page *page;
>   unsigned long *msr_bitmap_l1;
> - unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.msr_bitmap;
> + unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;

The physical address of the nested msr_bitmap is never loaded into vmcs.

The resolution you provided had an extra hunk in prepare_vmcs02_full():

+   vmcs_write64(MSR_BITMAP, __pa(vmx->nested.vmcs02.msr_bitmap));

I have queued that as:

+   if (cpu_has_vmx_msr_bitmap())
+   vmcs_write64(MSR_BITMAP, __pa(vmx->nested.vmcs02.msr_bitmap));

but it should be a part of the patch or a followup fix.

Is the branch already merged into PTI?

Thanks.

>  
>   /* This shortcut is ok because we support only x2APIC MSRs so far. */
>   if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
> @@ -11397,7 +11418,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu 
> *vcpu,
>   vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
>  
>   if (cpu_has_vmx_msr_bitmap())
> - vmx_set_msr_bitmap(vcpu);
> + vmx_update_msr_bitmap(vcpu);
>  
>   if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
>   vmcs12->vm_exit_msr_load_count))
> -- 
> 1.8.3.1
> 


Re: [PATCH] kvm: x86: remove efer_reload entry in kvm_vcpu_stat

2018-01-30 Thread Radim Krčmář
2018-01-26 17:34+0800, Longpeng(Mike):
> The efer_reload is never used since
> commit 26bb0981b3ff ("KVM: VMX: Use shared msr infrastructure"),
> so remove it.
> 
> Signed-off-by: Longpeng(Mike) 
> ---

Queued, thanks.


Re: [PATCH v4 5/7] x86/irq: Count Hyper-V reenlightenment interrupts

2018-01-30 Thread Radim Krčmář
2018-01-29 22:48+0100, Thomas Gleixner:
> On Wed, 24 Jan 2018, Radim Krčmář wrote:
> > 2018-01-24 14:23+0100, Vitaly Kuznetsov:
> > > Hyper-V reenlightenment interrupts arrive when the VM is migrated, we're
> > > not supposed to see many of them. However, it may be important to know
> > > that the event has happened in case we have L2 nested guests.
> > > 
> > > Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
> > > Reviewed-by: Thomas Gleixner <t...@linutronix.de>
> > > ---
> > 
> > Thomas,
> > 
> > I think the expectation is that this series will go through the KVM
> > tree.  Would you prefer a topic branch?
> 
> Is there any dependency outside of plain 4.15? If not, I'll put it into
> x86/hyperv and let KVM folks pull it over.

There isn't;  we'll wait for x86/hyperv, thanks.


Re: [PATCH] KVM:x86: AMD Processor Topology Information

2018-01-30 Thread Radim Krčmář
2018-01-29 11:39-0500, Babu Moger:
> From: Stanislav Lanci 
> 
> This patch allows enabling the x86 feature TOPOEXT. This is needed to provide
> information about SMT on AMD Zen CPUs to the guest.
> 
> Signed-off-by: Stanislav Lanci 
> Tested-by: Nick Sarnie 
> Reviewed-by: Paolo Bonzini 
> Signed-off-by: Babu Moger 
> ---
> 
> Rebased on top of linux-next.
> Maximum extended functions are already set to 0x8000001f after the commit
> 8765d75329a3 KVM: X86: Extend CPUID range to include new leaf
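
For reference, a guest can check the result itself; TOPOEXT is reported
in CPUID Fn8000_0001 ECX bit 22 on AMD, so a minimal userspace probe
(illustrative only) would be:

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* TOPOEXT: extended leaf 0x80000001, ECX bit 22 */
		__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
		printf("TOPOEXT: %s\n", (ecx & (1u << 22)) ? "yes" : "no");
		return 0;
	}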

Queued, thanks.


Re: [PATCH v2] x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested

2018-01-25 Thread Radim Krčmář
2018-01-25 19:16+0200, Michael S. Tsirkin:
> On Thu, Jan 25, 2018 at 04:37:07PM +0100, Vitaly Kuznetsov wrote:
> > I was investigating an issue with seabios >= 1.10 which stopped working
> > for nested KVM on Hyper-V. The problem appears to be in
> > handle_ept_violation() function: when we do fast mmio we need to skip
> > the instruction so we do kvm_skip_emulated_instruction(). This, however,
> > depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS.
> > However, this is not the case.
> > 
> > Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when
> > EPT MISCONFIG occurs. While on real hardware it was observed to be set,
> > some hypervisors follow the spec and don't set it; we end up advancing
> > IP with some random value.
> > 
> > I checked with Microsoft and they confirmed they don't fill
> > VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG.
> > 
> > Fix the issue by doing instruction skip through emulator when running
> > nested.
> > 
> > Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae
> > Suggested-by: Radim Krčmář <rkrc...@redhat.com>
> > Suggested-by: Paolo Bonzini <pbonz...@redhat.com>
> > Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
> 
> I would maybe also disable this when this is a kvm host
> running a nested *guest*, just in case.

You mean to keep the fast path when running on KVM hypervisor?
(We already skip the path for nested guests.)

I'd prefer not to make this any uglier.

> Acked-by: Michael S. Tsirkin <m...@redhat.com>
> 
> > ---
> > v1 -> v2:
> >inlay X86_FEATURE_HYPERVISOR case with EMULTYPE_SKIP optimization
> >[Paolo Bonzini, Radim Krčmář]

Queued, thanks.


[GIT PULL] KVM fixes for Linux 4.15(-rc10)

2018-01-25 Thread Radim Krčmář
Linus,

The following changes since commit 0c5b9b5d9adbad4b60491f9ba0d2af38904bb4b9:

  Linux 4.15-rc9 (2018-01-21 13:51:26 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm for-linus

for you to fetch changes up to bda646dd182a90ba4239fc62b71eb8b73126fa77:

  Merge tag 'kvm-s390-master-4.15-3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux (2018-01-24 
16:25:53 +0100)


KVM fixes for v4.15(-rc10)

Fix races and potential use after free in the s390 cmma migration code.


Christian Borntraeger (1):
  KVM: s390: add proper locking for CMMA migration bitmap

Radim Krčmář (1):
  Merge tag 'kvm-s390-master-4.15-3' of 
git://git.kernel.org/.../kvms390/linux

 arch/s390/kvm/kvm-s390.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)


Re: [PATCH] x86/kvm: disable fast MMIO when running nested

2018-01-25 Thread Radim Krčmář
2018-01-24 16:12+0100, Vitaly Kuznetsov:
> I was investigating an issue with seabios >= 1.10 which stopped working
> for nested KVM on Hyper-V. The problem appears to be in
> handle_ept_violation() function: when we do fast mmio we need to skip
> the instruction so we do kvm_skip_emulated_instruction(). This, however,
> depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS.
> However, this is not the case.
> 
> Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when
> EPT MISCONFIG occurs. While on real hardware it was observed to be set,
> some hypervisors follow the spec and don't set it; we end up advancing
> IP with some random value.
> 
> I checked with Microsoft and they confirmed they don't fill
> VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG.
> 
> Fix the issue by disabling fast mmio when running nested.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/vmx.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index c829d89e2e63..54afb446f38e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6558,9 +6558,16 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>   /*
>* A nested guest cannot optimize MMIO vmexits, because we have an
>* nGPA here instead of the required GPA.
> +  * Skipping instruction below depends on undefined behavior: Intel's
> +  * manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set in VMCS
> +  * when EPT MISCONFIG occurs and while on real hardware it was observed
> +  * to be set, other hypervisors (namely Hyper-V) don't set it, we end
> +  * up advancing IP with some random value. Disable fast mmio when
> +  * running nested and keep it for real hardware in hope that
> +  * VM_EXIT_INSTRUCTION_LEN will always be set correctly.
>*/
>   gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> - if (!is_guest_mode(vcpu) &&
> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR) && !is_guest_mode(vcpu) &&

I realized that Paolo kept a minor optimization while getting rid of the
undefined behavior (https://patchwork.kernel.org/patch/9903811/).
Please do the same trick that signals kvm_io_bus_write() before going to
x86_emulate_instruction(... EMULTYPE_SKIP ...), but add a branch to use
kvm_skip_emulated_instruction() for bare-metal,
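
Something along these lines, in other words (a sketch of the shape being
suggested; what was actually applied is v2 of this patch, queued above):

	if (!is_guest_mode(vcpu) &&
	    !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
		trace_kvm_fast_mmio(gpa);
		/* Only trust VM_EXIT_INSTRUCTION_LEN on bare metal. */
		if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
			return kvm_skip_emulated_instruction(vcpu);
		return x86_emulate_instruction(vcpu, gpa, EMULTYPE_SKIP,
					       NULL, 0) == EMULATE_DONE;
	}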

thanks.

>   !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
>   trace_kvm_fast_mmio(gpa);
>   return kvm_skip_emulated_instruction(vcpu);
> -- 
> 2.14.3
> 


Re: [PATCH] x86/kvm: disable fast MMIO when running nested

2018-01-25 Thread Radim Krčmář
2018-01-25 01:55-0800, Liran Alon:
> - vkuzn...@redhat.com wrote:
> > I was investigating an issue with seabios >= 1.10 which stopped
> > working
> > for nested KVM on Hyper-V. The problem appears to be in
> > handle_ept_violation() function: when we do fast mmio we need to skip
> > the instruction so we do kvm_skip_emulated_instruction(). This,
> > however,
> > depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS.
> > However, this is not the case.
> > 
> > Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when
> > EPT MISCONFIG occurs. While on real hardware it was observed to be
> > set,
> > some hypervisors follow the spec and don't set it; we end up
> > advancing
> > IP with some random value.
> > 
> > I checked with Microsoft and they confirmed they don't fill
> > VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG.
> > 
> > Fix the issue by disabling fast mmio when running nested.
> > 
> > Signed-off-by: Vitaly Kuznetsov 
> > ---
> >  arch/x86/kvm/vmx.c | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index c829d89e2e63..54afb446f38e 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -6558,9 +6558,16 @@ static int handle_ept_misconfig(struct kvm_vcpu
> > *vcpu)
> > /*
> >  * A nested guest cannot optimize MMIO vmexits, because we have an
> >  * nGPA here instead of the required GPA.
> > +* Skipping instruction below depends on undefined behavior: Intel's
> > +* manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set in VMCS
> > +* when EPT MISCONFIG occurs and while on real hardware it was observed
> > +* to be set, other hypervisors (namely Hyper-V) don't set it, we end
> > +* up advancing IP with some random value. Disable fast mmio when
> > +* running nested and keep it for real hardware in hope that
> > +* VM_EXIT_INSTRUCTION_LEN will always be set correctly.
> 
> If Intel manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set in VMCS
> on EPT_MISCONFIG, I don't think we should do this on real hardware as well.

Neither do I, but you can see the last discussion on this topic,
https://patchwork.kernel.org/patch/9903811/.  In short, we've agreed to
limit the hack to real hardware and wait for Intel or virtio changes.

Michael and Jason, any progress on implementing a fast virtio mechanism
that doesn't rely on undefined behavior?

(Encode the instruction length into the last 4 bits of the MMIO address,
 declare via a side channel that accesses to the MMIO area always use a
 certain instruction length, use a hypercall, ...)
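
The first option could look roughly like this on the host side (purely
illustrative; no such guest/host convention exists and the helper name
is made up):

	/* Hypothetical: the guest encodes the length of its MMIO
	 * instruction in the low 4 bits of the address it accesses,
	 * so the host can advance RIP without trusting
	 * VM_EXIT_INSTRUCTION_LEN. */
	static int fast_mmio_skip(struct kvm_vcpu *vcpu, gpa_t gpa)
	{
		unsigned int insn_len = gpa & 0xf;

		if (!insn_len)
			return 0;	/* fall back to the emulator */

		kvm_rip_write(vcpu, kvm_rip_read(vcpu) + insn_len);
		return 1;
	}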

Thanks.


Re: [PATCH v4 5/7] x86/irq: Count Hyper-V reenlightenment interrupts

2018-01-24 Thread Radim Krčmář
2018-01-24 14:23+0100, Vitaly Kuznetsov:
> Hyper-V reenlightenment interrupts arrive when the VM is migrated, we're
> not supposed to see many of them. However, it may be important to know
> that the event has happened in case we have L2 nested guests.
> 
> Signed-off-by: Vitaly Kuznetsov 
> Reviewed-by: Thomas Gleixner 
> ---

Thomas,

I think the expectation is that this series will go through the KVM
tree.  Would you prefer a topic branch?

Thanks.


Re: [PATCH 4/5] s390: define ISOLATE_BP to run tasks with modified branch prediction

2018-01-24 Thread Radim Krčmář
2018-01-24 07:36+0100, Martin Schwidefsky:
> On Tue, 23 Jan 2018 21:32:24 +0100
> Radim Krčmář <rkrc...@redhat.com> wrote:
> 
> > 2018-01-23 15:21+0100, Christian Borntraeger:
> > > Paolo, Radim,
> > > 
> > > this patch not only allows to isolate a userspace process, it also
> > > allows us to add a new interface for KVM that would allow us to
> > > isolate a KVM guest CPU to no longer being able to inject branches
> > > in any host or other guests (while at the same time QEMU and host
> > > kernel can run with full power).
> > > We just have to set the TIF bit TIF_ISOLATE_BP_GUEST for the thread
> > > that runs a given CPU. This would certainly be an addon patch on top
> > > of this patch at a later point in time.
> > 
> > I think that the default should be secure, so userspace will be
> > breaking the isolation instead of setting it up and having just one
> > place to screw up would be better -- the prctl could decide which
> > isolation mode to pick.
> 
> The prctl is one direction only. Once a task is "secured" there is no way 
> back.

Good point, I was thinking of reversing the direction and having
TIF_NOT_ISOLATE_BP_GUEST prctl, but allowing tasks to subvert security
would be even worse.

> If we start with a default of secure then *all* tasks will run with limited
> branch prediction.

Right, because all of them are untrusted.  What is the performance
impact of BP isolation?

This design seems very fragile to me -- we're forcing userspace to care
about an arcane hardware implementation detail, and isolation in the
system is broken if a task running malicious code doesn't opt in for any
reason.

> > Maybe we can change the conditions and break logical connection between
> > TIF_ISOLATE_BP and TIF_ISOLATE_BP_GUEST, to make a separate KVM
> > interface useful.
> 
> The thinking here is that you use TIF_ISOLATE_BP to make user space secure,
> but you need to close the loophole that you can use a KVM guest to get out of
> the secured mode. That is why you need to run the guest with isolated BP if
> TIF_ISOLATE_BP is set. But if you want to run qemu as always and only the
> KVM guest with isolated BP you need a second bit, thus TIF_ISOLATE_GUEST_BP.

I understand, I was following the misguided idea where we have reversed
logic and then use just TIF_NOT_ISOLATE_GUEST_BP for sie switches.

> > > Do you think something similar would be useful for other architectures as 
> > > well?  
> > 
> > It goes against my idea of virtualization, but there probably are users
> > that don't care about isolation and still use virtual machines ...
> > I expect most architectures to have a fairly similar resolution of
> > branch prediction leaks, so the idea should be easily abstractable on
> > all levels.  (At least x86 is.)
> 
> Yes.
> 
> > > In that case we should try to come up with a cross-architecture
> > > interface to enable that.
> > 
> > Makes me think of a generic VM control "prefer performance over
> > security", which would also take care of future problems and let arches
> > decide what is worth the code.
> 
> VM as in virtual machine or VM as in virtual memory?

Virtual machine.  (But could be anywhere really, especially the
kernel/user split slowed applications down for too long already. :])

> > A main drawback is that this will introduce dynamic branches to the
> > code, which are going to slow down the common case to speed up a niche.
> 
> Where would you place these additional branches? I don't quite get the idea.

The BP* macros contain a branch in them -- avoidable if we only had
isolated virtual machines.
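
To illustrate, the branch in question boils down to something like this
(pseudo-C with hypothetical helpers):

	if (test_tsk_thread_flag(tsk, TIF_ISOLATE_BP_GUEST))
		enable_bp_isolation();		/* hypothetical helper */
	else
		disable_bp_isolation();		/* hypothetical helper */

whereas a system where every guest is isolated could compile down to a
single unconditional path.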

Thanks.


Re: [PATCH 4/5] s390: define ISOLATE_BP to run tasks with modified branch prediction

2018-01-23 Thread Radim Krčmář
2018-01-23 15:21+0100, Christian Borntraeger:
> Paolo, Radim,
> 
> this patch not only allows to isolate a userspace process, it also allows us
> to add a new interface for KVM that would allow us to isolate a KVM guest CPU
> to no longer being able to inject branches in any host or other guests
> (while at the same time QEMU and host kernel can run with full power).
> We just have to set the TIF bit TIF_ISOLATE_BP_GUEST for the thread that
> runs a given CPU. This would certainly be an addon patch on top of this
> patch at a later point in time.

I think that the default should be secure, so userspace will be
breaking the isolation instead of setting it up and having just one
place to screw up would be better -- the prctl could decide which
isolation mode to pick.
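
For concreteness, such a prctl could be used like this -- the name and
arguments are hypothetical, nothing of the sort is defined upstream:

	#include <sys/prctl.h>

	/* hypothetical: a trusted task picks the fast isolation mode */
	if (prctl(PR_SET_BP_ISOLATION, BP_ISOLATION_PERFORMANCE, 0, 0, 0))
		perror("prctl");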

Maybe we can change the conditions and break logical connection between
TIF_ISOLATE_BP and TIF_ISOLATE_BP_GUEST, to make a separate KVM
interface useful.

> Do you think something similar would be useful for other architectures as 
> well?

It goes against my idea of virtualization, but there probably are users
that don't care about isolation and still use virtual machines ...
I expect most architectures to have a fairly similar resolution of
branch prediction leaks, so the idea should be easily abstractable on
all levels.  (At least x86 is.)

> In that case we should try to come up with a cross-architecture interface
> to enable that.

Makes me think of a generic VM control "prefer performance over
security", which would also take care of future problems and let arches
decide what is worth the code.

A main drawback is that this will introduce dynamic branches to the
code, which are going to slow down the common case to speed up a niche.


[GIT PULL] KVM fixes for Linux 4.15(-rc9)

2018-01-20 Thread Radim Krčmář
Linus,

the high amount of new code improves the situation around CPU vulnerabilities.

The following changes since commit a8750ddca918032d6349adbf9a4b6555e7db20da:

  Linux 4.15-rc8 (2018-01-14 15:32:30 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/for-linus

for you to fetch changes up to 35b3fde6203b932b2b1a5b53b3d8808abc9c4f60:

  KVM: s390: wire up bpb feature (2018-01-20 17:30:47 +0100)


KVM fixes for v4.15-rc9

ARM:
* fix incorrect huge page mappings on systems using the contiguous hint
  for hugetlbfs
* support alternative GICv4 init sequence
* correctly implement the ARM SMCCC for HVC and SMC handling

PPC:
* add KVM IOCTL for reporting vulnerability and workaround status

s390:
* provide userspace interface for branch prediction changes in firmware

x86:
* use correct macros for bits


Christian Borntraeger (1):
  KVM: s390: wire up bpb feature

Christoffer Dall (1):
  KVM: arm64: Fix GICv4 init when called from vgic_its_create

Marc Zyngier (1):
  arm64: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls

Paul Mackerras (1):
  KVM: PPC: Book3S: Provide information about hardware/firmware CVE 
workarounds

Punit Agrawal (1):
  KVM: arm/arm64: Check pagesize when allocating a hugepage at Stage 2

Radim Krčmář (2):
  Merge tag 'kvm-arm-fixes-for-v4.15-3-v2' of 
git://git.kernel.org/.../kvmarm/kvmarm
  Merge tag 'kvm-ppc-cve-4.15-2' of git://git.kernel.org/.../paulus/powerpc

Tianyu Lan (1):
  KVM/x86: Fix wrong macro references of X86_CR0_PG_BIT and X86_CR4_PAE_BIT 
in kvm_valid_sregs()

 Documentation/virtual/kvm/api.txt   |  46 +
 arch/arm64/kvm/handle_exit.c|   4 +-
 arch/powerpc/include/uapi/asm/kvm.h |  25 +++
 arch/powerpc/kvm/powerpc.c  | 131 
 arch/s390/include/asm/kvm_host.h|   3 +-
 arch/s390/include/uapi/asm/kvm.h|   5 +-
 arch/s390/kvm/kvm-s390.c|  12 
 arch/s390/kvm/vsie.c|  10 +++
 arch/x86/kvm/x86.c  |   4 +-
 include/uapi/linux/kvm.h|   4 ++
 virt/kvm/arm/mmu.c  |   2 +-
 virt/kvm/arm/vgic/vgic-init.c   |   8 ++-
 virt/kvm/arm/vgic/vgic-v4.c |   2 +-
 13 files changed, 245 insertions(+), 11 deletions(-)


Re: [RFC 0/6] Enlightened VMCS support for KVM on Hyper-V

2018-01-16 Thread Radim Krčmář
2018-01-15 18:30+0100, Vitaly Kuznetsov:
> Early RFC. I'll refer to this patchset in my DevConf/FOSDEM
> presentations.
> 
> When running nested KVM on Hyper-V it's possible to use so called
> 'Enlightened VMCS' and do normal memory reads/writes instead of
> doing VMWRITE/VMREAD instructions. Tests show that this speeds up
> tight CPUID loop almost 3 times:
> 
> Before:
> ./cpuid_tight
> 20459
> 
> After:
> ./cpuid_tight
> 7698

Nice!

> checkpatch.pl errors/warnings and possible 32bit brokenness are known
> things.
> 
> Main RFC questions I have are:
> - Do we want to have this per L2 VM or per L1 host?

IIUC, eVMCS replaces VMCS when enabled, hence doing it for all VMs would
be simplest -- we wouldn't need to setup VMCS nor reconfigure Hyper-V on
the fly.  (I'm thinking we could have a union in loaded_vmcs for
actually used type of VMCS.)

> - How can we achieve zero overhead for non-Hyper-V deployments? Use static
>   keys? But this will only work if we decide to do eVMCS per host.

Static keys seem like a good choice.
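
E.g. (a sketch; evmcs_read() is the accessor from this RFC, the wrapper
name is made up):

	static DEFINE_STATIC_KEY_FALSE(enable_evmcs);

	/* patched out entirely on non-Hyper-V hosts */
	static __always_inline u64 vmcs_read_any(unsigned long field)
	{
		if (static_branch_unlikely(&enable_evmcs))
			return evmcs_read(field);
		return vmcs_read64(field);
	}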

> - Can we do better than a big switch in evmcs_read()/evmcs_write()? And
>   probably don't use 'case' defines which checkpatch.pl hates.

I'd go for a separate mapping from Intel VMCS into its MS eVMCS and
dirty bit, something like vmcs_field_to_offset_table.
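
Something in this direction (a rough sketch; the struct, the offsets and
the lookup are made up for illustration, vmcs_field_to_offset_table is
only the existing analogue):

	/* map a VMCS field encoding to its location in the enlightened
	 * VMCS page and the clean-field bit to dirty on writes */
	struct evmcs_field {
		u32 encoding;	/* Intel VMCS field encoding */
		u16 offset;	/* offset into the eVMCS page */
		u16 clean_field;
	};

	static const struct evmcs_field evmcs_map[] = {
		{ 0x681e /* GUEST_RIP */, 0x01b8, 7 },	/* illustrative */
	};

	static u16 evmcs_field_offset(u32 encoding, u16 *clean_field)
	{
		size_t i;

		for (i = 0; i < ARRAY_SIZE(evmcs_map); i++) {
			if (evmcs_map[i].encoding != encoding)
				continue;
			*clean_field = evmcs_map[i].clean_field;
			return evmcs_map[i].offset;
		}
		return 0;	/* field not supported */
	}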

Thanks.


[GIT PULL] KVM fixes for Linux 4.15-rc7

2018-01-06 Thread Radim Krčmář
Linus,

The following changes since commit aa12f594f97efe50223611dbd13ecca4e8dafee6:

  tools/kvm_stat: sort '-f help' output (2017-12-21 13:03:32 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm for-linus

for you to fetch changes up to bb4945e60dd0b5afb0e92bc8006ce560948fbc39:

  Merge tag 'kvm-s390-master-4.15-2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux (2018-01-06 
17:26:37 +0100)


KVM fixes for v4.15-rc7

s390:
* Two fixes for potential bitmap overruns in the cmma migration code

x86:
* Clear guest provided GPRs to defeat the Project Zero PoC for CVE
  2017-5715


Christian Borntraeger (2):
  KVM: s390: fix cmma migration for multiple memory slots
  KVM: s390: prevent buffer overrun on memory hotplug during migration

Jim Mattson (1):
  kvm: vmx: Scrub hardware GPRs at VM-exit

Radim Krčmář (1):
  Merge tag 'kvm-s390-master-4.15-2' of 
git://git.kernel.org/.../kvms390/linux

 arch/s390/kvm/kvm-s390.c |  9 +
 arch/s390/kvm/priv.c |  2 +-
 arch/x86/kvm/svm.c   | 19 +++
 arch/x86/kvm/vmx.c   | 14 +-
 4 files changed, 38 insertions(+), 6 deletions(-)


[GIT PULL] KVM fixes for v4.15-rc3

2017-12-09 Thread Radim Krčmář
Linus,

The following changes since commit ae64f9bd1d3621b5e60d7363bc20afb46aede215:

  Linux 4.15-rc2 (2017-12-03 11:01:47 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm for-linus

for you to fetch changes up to b1394e745b9453dcb5b0671c205b770e87dedb87:

  KVM: x86: fix APIC page invalidation (2017-12-06 16:10:34 +0100)


KVM fixes for v4.15-rc3

ARM:
 * A number of issues in the vgic discovered using SMATCH
 * An off-by-one bit calculation in our stage base address mask (32-bit and
   64-bit)
 * Fixes to single-step debugging of instructions that trap for other
   reasons such as MMIO aborts
 * Printing unavailable hyp mode as error
 * Potential spinlock deadlock in the vgic
 * Avoid calling vgic vcpu free more than once
 * Broken bit calculation for big endian systems

s390:
 * SPDX tags
 * Fence storage key accesses from problem state
 * Make sure that irq_state.flags is not used in the future

x86:
 * Intercept port 0x80 accesses to prevent host instability (CVE)
 * Use userspace FPU context for guest FPU (mainly an optimization that
   fixes a double use of kernel FPU)
 * Do not leak one page per module load
 * Flush APIC page address cache from MMU invalidation notifiers


Alex Bennée (5):
  KVM: arm/arm64: debug: Introduce helper for single-step
  kvm: arm64: handle single-stepping trapped instructions
  kvm: arm64: handle single-step of userspace mmio instructions
  kvm: arm64: handle single-step during SError exceptions
  kvm: arm64: handle single-step of hyp emulated mmio instructions

Andre Przywara (1):
  KVM: arm/arm64: VGIC: extend !vgic_is_initialized guard

Andrew Honig (1):
  KVM: VMX: remove I/O port 0x80 bypass on Intel hosts

Andrew Jones (1):
  KVM: arm/arm64: kvm_arch_destroy_vm cleanups

Ard Biesheuvel (1):
  kvm: arm: don't treat unavailable HYP mode as an error

Christian Borntraeger (1):
  KVM: s390: mark irq_state.flags as non-usable

Christoffer Dall (3):
  KVM: arm/arm64: Don't enable/disable physical timer access on VHE
  KVM: arm/arm64: Avoid attempting to load timer vgic state without a vgic
  KVM: arm/arm64: Fix broken GICH_ELRSR big endian conversion

Greg Kroah-Hartman (2):
  KVM: s390: add SPDX identifiers to the remaining files
  KVM: s390: Remove redundant license text

Janosch Frank (1):
  KVM: s390: Fix skey emulation permission check

Jim Mattson (1):
  KVM: VMX: fix page leak in hardware_setup()

Kristina Martsenko (1):
  arm64: KVM: fix VTTBR_BADDR_MASK BUG_ON off-by-one

Marc Zyngier (7):
  KVM: arm/arm64: vgic-irqfd: Fix MSI entry allocation
  KVM: arm/arm64: vgic: Preserve the revious read from the pending table
  KVM: arm/arm64: vgic-its: Preserve the revious read from the pending table
  KVM: arm/arm64: vgic-its: Check result of allocation before use
  KVM: arm/arm64: vgic-v4: Only perform an unmap for valid vLPIs
  arm: KVM: Fix VTTBR_BADDR_MASK BUG_ON off-by-one
  KVM: arm/arm64: Fix spinlock acquisition in vgic_set_owner

Radim Krčmář (3):
  Merge tag 'kvm-arm-fixes-for-v4.15-1' of 
git://git.kernel.org/.../kvmarm/kvmarm
  Merge tag 'kvm-s390-master-4.15-1' of 
git://git.kernel.org/.../kvms390/linux
  KVM: x86: fix APIC page invalidation

Rik van Riel (2):
  x86,kvm: move qemu/guest FPU switching out to vcpu_run
  x86,kvm: remove KVM emulator get_fpu / put_fpu

 Documentation/virtual/kvm/api.txt  | 15 +++--
 arch/arm/include/asm/kvm_arm.h |  3 +-
 arch/arm/include/asm/kvm_host.h|  5 +++
 arch/arm64/include/asm/kvm_arm.h   |  3 +-
 arch/arm64/include/asm/kvm_host.h  |  1 +
 arch/arm64/kvm/debug.c | 21 +
 arch/arm64/kvm/handle_exit.c   | 57 +-
 arch/arm64/kvm/hyp/switch.c| 37 +-
 arch/s390/kvm/Makefile |  5 +--
 arch/s390/kvm/diag.c   |  5 +--
 arch/s390/kvm/gaccess.h|  5 +--
 arch/s390/kvm/guestdbg.c   |  5 +--
 arch/s390/kvm/intercept.c  |  5 +--
 arch/s390/kvm/interrupt.c  |  5 +--
 arch/s390/kvm/irq.h|  5 +--
 arch/s390/kvm/kvm-s390.c   | 11 +++
 arch/s390/kvm/kvm-s390.h   |  5 +--
 arch/s390/kvm/priv.c   | 16 ++
 arch/s390/kvm/sigp.c   |  5 +--
 arch/s390/kvm/vsie.c   |  5 +--
 arch/x86/include/asm/kvm_emulate.h |  2 --
 arch/x86/include/asm/kvm_host.h| 16 ++
 arch/x86/kvm/emulate.c | 24 ---
 arch/x86/kvm/vmx.c |  6 
 arch/x86/kvm/x86.c | 63 +++---
 include/kvm/arm_arch_timer.h   |  3 --
 include/linux/kvm_host.h   |  2 +-
 include/uapi/linux/kvm.h   |  4 +--
 virt/kvm/arm/arch_timer.c  | 11

Re: [PATCH v2] KVM: VMX: Cache IA32_DEBUGCTL in memory

2017-12-05 Thread Radim Krčmář
2017-11-29 01:31-0800, Wanpeng Li:
> From: Wanpeng Li <wanpeng...@hotmail.com>
> 
> MSR_IA32_DEBUGCTLMSR is zeroed on VMEXIT, so it is saved/restored 
> each time during world switch. Jim from Google pointed out that 
> when running schbench in L2, vmx_vcpu_run will occupy 4% cpu time, 
> and 25% of vmx_vcpu_run cpu time is occupied by get_debugctlmsr(). 
> This patch caches the host IA32_DEBUGCTL MSR and saves/restores 
> the host IA32_DEBUGCTL msr when guest/host switches, to avoid 
> saving/restoring it each time during world switch.
> 
> Suggested-by: Jim Mattson <jmatt...@google.com>
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: Radim Krčmář <rkrc...@redhat.com>
> Cc: Jim Mattson <jmatt...@google.com>
> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com>
> ---

Queued, thanks.

And there is another optimization loosely connected to the "[PATCH v3
00/16] Move vcpu_load and vcpu_put calls to arch code" series:
We only need to read the value for the KVM_RUN ioctl.
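
A minimal sketch of that idea (host_debugctlmsr is an assumed field
cached in struct vcpu_vmx; this illustrates the direction, not the
queued patch itself):

static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	/* ... existing per-pCPU setup ... */

	/* read the host MSR once per vcpu_load instead of every vmentry */
	vmx->host_debugctlmsr = get_debugctlmsr();
}

static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	/* ... VMLAUNCH/VMRESUME ... */

	/* MSR_IA32_DEBUGCTLMSR is zeroed on VM-exit, restore if non-zero */
	if (vmx->host_debugctlmsr)
		update_debugctlmsr(vmx->host_debugctlmsr);
}

Once vcpu_load is only done for the ioctls that need it, the rdmsr
would naturally happen just for KVM_RUN.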


Re: [PATCH RFC 5/6] x86/kvm: pass stable clocksource to guests when running nested on Hyper-V

2017-12-01 Thread Radim Krčmář
2017-12-01 14:13+0100, Vitaly Kuznetsov:
> Currently, KVM is able to work in 'masterclock' mode passing
> PVCLOCK_TSC_STABLE_BIT to guests when the clocksource we use on the host
> is TSC. When running nested on Hyper-V we normally use a different one:
> TSC page, which is resistant to TSC frequency changes on events like L1
> migration. Add support for it in KVM.
> 
> The only non-trivial change in the patch is in vgettsc(): when updating
> our gtod copy we now need to get both the clockread and tsc value.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -1374,6 +1375,11 @@ static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, 
> s64 kernel_ns)
> +static inline int gtod_cs_mode_good(int mode)

"good" isn't saying much; I'd like to express that TSC is the underlying
clock ...

What about "bool gtod_is_based_on_tsc()"?

> +{
> + return mode == VCLOCK_TSC || mode == VCLOCK_HVCLOCK;
> +}
> +
> @@ -1606,9 +1625,17 @@ static inline u64 vgettsc(u64 *cycle_now)
>   long v;
>   struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>  
> - *cycle_now = read_tsc();
> + if (gtod->clock.vclock_mode == VCLOCK_HVCLOCK) {
> + u64 tsc_pg_val;
> +
> + tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(), cycle_now);

This function might fail to update cycle_now and return -1.
I guess we should propagate the failure in that case.

> + v = (tsc_pg_val - gtod->clock.cycle_last) & gtod->clock.mask;
> + } else {
> + /* VCLOCK_TSC */
> + *cycle_now = read_tsc();
> + v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;

cycle_now is getting pretty confusing -- it still is TSC timestamp, but
now we also have the current cycle of gtod, which might be the TSC page
timestamp.  Please rename cycle_now to tsc_timestamp in the call tree,

thanks.
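
A rough sketch of both suggestions combined (the failure value U64_MAX
matches hv_read_tsc_page_tsc() returning -1 as u64; the int *mode
out-parameter is one assumed way to propagate the failure):

static inline bool gtod_is_based_on_tsc(int mode)
{
	return mode == VCLOCK_TSC || mode == VCLOCK_HVCLOCK;
}

static inline u64 vgettsc(u64 *tsc_timestamp, int *mode)
{
	long v;
	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;

	if (gtod->clock.vclock_mode == VCLOCK_HVCLOCK) {
		u64 tsc_pg_val;

		tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(),
						  tsc_timestamp);
		if (tsc_pg_val == U64_MAX) {
			/* TSC page invalid: make the caller fall back */
			*mode = VCLOCK_NONE;
			return 0;
		}
		*mode = VCLOCK_HVCLOCK;
		v = (tsc_pg_val - gtod->clock.cycle_last) &
			gtod->clock.mask;
	} else {
		*mode = VCLOCK_TSC;
		*tsc_timestamp = read_tsc();
		v = (*tsc_timestamp - gtod->clock.cycle_last) &
			gtod->clock.mask;
	}

	return v * gtod->clock.mult;
}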


[PATCH 1/2] KVM: x86: fix APIC page invalidation

2017-11-30 Thread Radim Krčmář
Implementation of the unpinned APIC page didn't update the VMCS address
cache when invalidation was done through range mmu notifiers.
This became a problem when the page notifier was removed.

Re-introduce the arch-specific helper and call it from ...range_start.

Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/x86.c  | 14 ++
 virt/kvm/kvm_main.c |  8 
 3 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 977de5fb968b..c16c3f924863 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1435,4 +1435,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define put_smstate(type, buf, offset, val)  \
*(type *)((buf) + (offset) - 0x7e00) = val
 
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+   unsigned long start, unsigned long end);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eee8e7faf1af..a219974cdb89 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6778,6 +6778,20 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
kvm_x86_ops->tlb_flush(vcpu);
 }
 
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+   unsigned long start, unsigned long end)
+{
+   unsigned long apic_address;
+
+   /*
+* The physical address of apic access page is stored in the VMCS.
+* Update it when it becomes invalid.
+*/
+   apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
+   if (start <= apic_address && apic_address < end)
+   kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+}
+
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
struct page *page = NULL;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c01cff064ec5..b7f4689e373f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,6 +135,11 @@ static void kvm_uevent_notify_change(unsigned int type, 
struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
+__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+   unsigned long start, unsigned long end)
+{
+}
+
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
if (pfn_valid(pfn))
@@ -360,6 +365,9 @@ static void kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
kvm_flush_remote_tlbs(kvm);
 
spin_unlock(&kvm->mmu_lock);
+
+   kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+
srcu_read_unlock(&kvm->srcu, idx);
 }
 
-- 
2.14.2



[PATCH 2/2] TESTING! KVM: x86: add invalidate_range mmu notifier

2017-11-30 Thread Radim Krčmář
Does roughly what kvm_mmu_notifier_invalidate_page did before.

I am not certain why this would be needed.  It might mean that we have
another bug with start/end or just that I missed something.

Please try just [1/2] first and apply this one only if [1/2] still bugs,
thanks!
---
 virt/kvm/kvm_main.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b7f4689e373f..0825ea624f16 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -342,6 +342,29 @@ static void kvm_mmu_notifier_change_pte(struct 
mmu_notifier *mn,
srcu_read_unlock(&kvm->srcu, idx);
 }
 
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+   struct mm_struct *mm,
+   unsigned long start,
+   unsigned long end)
+{
+   struct kvm *kvm = mmu_notifier_to_kvm(mn);
+   int need_tlb_flush = 0, idx;
+
+   idx = srcu_read_lock(&kvm->srcu);
+   spin_lock(&kvm->mmu_lock);
+   kvm->mmu_notifier_seq++;
+   need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+   need_tlb_flush |= kvm->tlbs_dirty;
+   if (need_tlb_flush)
+   kvm_flush_remote_tlbs(kvm);
+
+   spin_unlock(&kvm->mmu_lock);
+
+   kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+
+   srcu_read_unlock(&kvm->srcu, idx);
+}
+
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
@@ -476,6 +499,7 @@ static void kvm_mmu_notifier_release(struct mmu_notifier 
*mn,
 }
 
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+   .invalidate_range   = kvm_mmu_notifier_invalidate_range,
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end   = kvm_mmu_notifier_invalidate_range_end,
.clear_flush_young  = kvm_mmu_notifier_clear_flush_young,
-- 
2.14.2



Re: BSOD with [PATCH 00/13] mmu_notifier kill invalidate_page callback

2017-11-30 Thread Radim Krčmář
2017-11-30 12:20+0100, Paolo Bonzini:
> On 30/11/2017 10:33, Fabian Grünbichler wrote:
> > 
> > It was reverted in 785373b4c38719f4af6775845df6be1dfaea120f after which
> > the symptoms disappeared until this series was merged, which contains
> > 
> > 369ea8242c0fb5239b4ddf0dc568f694bd244de4 mm/rmap: update to new 
> > mmu_notifier semantic v2
> > 
> > We haven't bisected the individual commits of the series yet, but the
> > commit immediately preceding its merge exhibits no problems, while
> > everything after does. It is not known whether the bug is actually in
> > the series itself, or whether increasing the likelihood of triggering it
> > is just a side-effect. There is a similar report[2] concerning an
> > upgrade from 4.12.12 to 4.12.13, which does not contain this series in
> > any form AFAICT but might be worth another look as well.
> 
> I know of one issue in this series (invalidate_page was removed from KVM
> without reimplementing it as invalidate_range).  I'll try to prioritize
> the fix, but I don't think I can do it before Monday.

The series also dropped the reloading of the APIC access page and we
never had it in invalidate_range_start ... I'll look into it today.


Re: [PATCH v7 2/4] KVM: X86: Add Paravirt TLB Shootdown

2017-11-30 Thread Radim Krčmář
2017-11-29 22:01-0800, Wanpeng Li:
> From: Wanpeng Li 
> ---
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> @@ -498,6 +498,34 @@ static void __init kvm_apf_trap_init(void)
>   update_intr_gate(X86_TRAP_PF, async_page_fault);
>  }
>  
> +static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
> +
> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> + const struct flush_tlb_info *info)
> +{
> + u8 state;
> + int cpu;
> + struct kvm_steal_time *src;
> + struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_tlb_mask);
> +
> + cpumask_copy(flushmask, cpumask);

Is it impossible to call this function before the allocation?

I was guessing that early_initcall might allow us to avoid a (static)
condition as there is no point in calling when there are no others, but
expected the worst ...

thanks.
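
A rough sketch of the initcall-based late allocation (the function name
is made up and error unwinding is omitted; not a merged patch):

static int __init kvm_alloc_pv_tlb_masks(void)
{
	int cpu;

	if (!kvm_para_available())
		return 0;

	for_each_possible_cpu(cpu) {
		if (!zalloc_cpumask_var_node(&per_cpu(__pv_tlb_mask, cpu),
					     GFP_KERNEL, cpu_to_node(cpu)))
			return -ENOMEM;
	}

	return 0;
}
arch_initcall(kvm_alloc_pv_tlb_masks);

As the v6 discussion below notes, switching pv_mmu_ops.flush_tlb_others
itself this late doesn't take effect without re-patching the call
sites, so only the allocation is a good initcall candidate.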


Re: [PATCH v6 2/4] KVM: X86: Add Paravirt TLB Shootdown

2017-11-30 Thread Radim Krčmář
2017-11-30 14:24+0800, Wanpeng Li:
> 2017-11-30 0:21 GMT+08:00 Radim Krčmář <rkrc...@redhat.com>:
> > 2017-11-27 20:05-0800, Wanpeng Li:
> >> From: Wanpeng Li <wanpeng...@hotmail.com>
> >> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> >> @@ -498,6 +498,37 @@ static void __init kvm_apf_trap_init(void)
> >>   update_intr_gate(X86_TRAP_PF, async_page_fault);
> >>  }
> >>
> >> +static DEFINE_PER_CPU(cpumask_t, __pv_tlb_mask);
> >> +
> >> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> >> + const struct flush_tlb_info *info)
> >> +{
> >> + u8 state;
> >> + int cpu;
> >> + struct kvm_steal_time *src;
> >> + cpumask_t *flushmask = &per_cpu(__pv_tlb_mask, smp_processor_id());
> >> +
> >> + if (unlikely(!flushmask))
> >> + return;
> >
> > I don't see how this can be NULL and if it could, we'd have to call
> > native_flush_tlb_others() instead of returning anyway.
> >
> > Also, Peter mentioned that we're wasting memory (default is 1k per CPU)
> > when not running on KVM.  Hyper-V hijacks x86_platform.apic_post_init()
> > to achieve late allocation.  smp_ops.smp_prepare_cpus seems slightly
> > better for our purposes, but I don't really like either.
> >
> > Couldn't we use use arch_initcall(), or early_initcall() if there are
> > complications with allocating after smp_init()?
> 
> Done in v7. In addition, moving pv_mmu_ops.flush_tlb_others =
> kvm_flush_tlb_others to arch_initcall() fails to work even if I
> disable rodata through grub. So I continue to keep the callback
> replacement in kvm_guest_init() and the late allocation in
> arch_initcall().

I think it has to do with the patching -- you'd need to re-patch
flush_tlb_others callsites for the change to take effect or add a
hypervisor late init just before check_bugs(), where the patching is
currently done.

Not sure how either of those is acceptable, though.


[PATCH v2 3/3] KVM: x86: simplify kvm_mwait_in_guest()

2017-11-29 Thread Radim Krčmář
If Intel/AMD implements MWAIT, we expect that it works well and only
reject known bugs;  no reason to do it the other way around for minor
vendors.  (Not that they are relevant ATM.)

This allows further simplification of kvm_mwait_in_guest().
And use boot_cpu_has() instead of "cpu_has(&boot_cpu_data," while at it.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/x86.h | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d15859ec5e92..c69f973111cb 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -265,18 +265,8 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, 
u64 nsec)
 
 static inline bool kvm_mwait_in_guest(void)
 {
-   if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
-   return false;
-
-   switch (boot_cpu_data.x86_vendor) {
-   case X86_VENDOR_AMD:
-   /* All AMD CPUs have a working MWAIT implementation */
-   return true;
-   case X86_VENDOR_INTEL:
-   return !boot_cpu_has_bug(X86_BUG_MONITOR);
-   default:
-   return false;
-   }
+   return boot_cpu_has(X86_FEATURE_MWAIT) &&
+   !boot_cpu_has_bug(X86_BUG_MONITOR);
 }
 
 #endif
-- 
2.14.2



[PATCH v2 2/3] KVM: x86: drop bogus MWAIT check

2017-11-29 Thread Radim Krčmář
The check was added in some iteration while trying to fix a reported OS
X on Core 2 bug, but that bug is elsewhere.

The comment is misleading because the guest can call MWAIT with ECX = 0
even if we enforce CPUID5_ECX_INTERRUPT_BREAK;  the call would have
exactly the same effect as if the host didn't have the feature.

A problem is that a QEMU feature exposes CPUID5_ECX_INTERRUPT_BREAK on
CPUs that do not support it.  Removing the check changes behavior on
last Pentium 4 lines (Presler, Dempsey, and Tulsa, which had VMX and
MONITOR while missing INTERRUPT_BREAK) when running a guest OS that uses
MWAIT without checking for its presence (QEMU doesn't expose MONITOR).

The only known OS that ignores the MONITOR flag is old Mac OS X and we
allowed it to bug on Core 2 (MWAIT used to throw #UD and only that OS
noticed), so we can save another 20 lines letting it bug on even older
CPUs.  Alternatively, we can return MWAIT exiting by default and let
userspace toggle it.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/x86.h | 23 +--
 1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 81f5f50794f6..d15859ec5e92 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -265,8 +265,6 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 
nsec)
 
 static inline bool kvm_mwait_in_guest(void)
 {
-   unsigned int eax, ebx, ecx, edx;
-
if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
return false;
 
@@ -275,29 +273,10 @@ static inline bool kvm_mwait_in_guest(void)
/* All AMD CPUs have a working MWAIT implementation */
return true;
case X86_VENDOR_INTEL:
-   /* Handle Intel below */
-   break;
+   return !boot_cpu_has_bug(X86_BUG_MONITOR);
default:
return false;
}
-
-   if (boot_cpu_has_bug(X86_BUG_MONITOR))
-   return false;
-
-   /*
-* Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
-* they would allow guest to stop the CPU completely by disabling
-* interrupts then invoking MWAIT.
-*/
-   if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
-   return false;
-
-   cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
-
-   if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
-   return false;
-
-   return true;
 }
 
 #endif
-- 
2.14.2
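
For reference, the capability that the removed hunk was probing can be
inspected from userspace with a standalone sketch like this (leaf 5
ECX bit 1 is the "interrupt break-event" bit that the kernel calls
CPUID5_ECX_INTERRUPT_BREAK):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() checks the maximum supported leaf for us */
	if (!__get_cpuid(5, &eax, &ebx, &ecx, &edx)) {
		puts("CPUID leaf 5 (MONITOR/MWAIT) not available");
		return 1;
	}

	printf("MWAIT interrupt break-event: %s\n",
	       ecx & (1u << 1) ? "yes" : "no");
	return 0;
}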



[PATCH v2 1/3] KVM: x86: prevent MWAIT in guest with buggy MONITOR

2017-11-29 Thread Radim Krčmář
The bug prevents MWAIT from waking up after a write to the monitored
cache line.
KVM might emulate a CPU model that shouldn't have the bug, so the guest
would not employ a workaround and possibly miss wakeups.
Better to avoid the situation.

Reviewed-by: Alexander Graf <ag...@suse.de>
Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/x86.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d0b95b7a90b4..81f5f50794f6 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -281,6 +281,9 @@ static inline bool kvm_mwait_in_guest(void)
return false;
}
 
+   if (boot_cpu_has_bug(X86_BUG_MONITOR))
+   return false;
+
/*
 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
 * they would allow guest to stop the CPU completely by disabling
-- 
2.14.2



[PATCH v2 0/3] KVM: x86: kvm_mwait_in_guest() cleanup

2017-11-29 Thread Radim Krčmář
This is a rebased version of an old series that simplified
kvm_mwait_in_guest: https://www.spinics.net/lists/kvm/msg149238.html

AMD errata 400 patch was dropped thanks to Boris's review;
[2/3] got an expanded commit message and I didn't include Alexander's
r-b since the context changed when we didn't drop support for ancient
CPUs.

Radim Krčmář (3):
  KVM: x86: prevent MWAIT in guest with buggy MONITOR
  KVM: x86: drop bogus MWAIT check
  KVM: x86: simplify kvm_mwait_in_guest()

 arch/x86/kvm/x86.h | 32 ++--
 1 file changed, 2 insertions(+), 30 deletions(-)

-- 
2.14.2



Re: [PATCH v6 2/4] KVM: X86: Add Paravirt TLB Shootdown

2017-11-29 Thread Radim Krčmář
2017-11-27 20:05-0800, Wanpeng Li:
> From: Wanpeng Li <wanpeng...@hotmail.com>
> 
> Remote flushing APIs do a busy wait, which is fine in a bare-metal
> scenario. But within the guest, the vcpus might have been preempted
> or blocked. In this scenario, the initiator vcpu would end up
> busy-waiting for a long amount of time.
> 
> This patch set implements para-virt flush tlbs making sure that it
> does not wait for vcpus that are sleeping. And all the sleeping vcpus
> flush the tlb on guest enter.
> 
> The best result is achieved when we're overcommitting the host by running 
> multiple vCPUs on each pCPU. In this case PV tlb flush avoids touching 
> vCPUs which are not scheduled and avoids the wait on the main CPU.
> 
> Testing on a Xeon Gold 6142 2.6GHz 2 sockets, 32 cores, 64 threads,
> so 64 pCPUs, and each VM is 64 vCPUs.
> 
> ebizzy -M
>           vanilla    optimized    boost
> 1VM       46799      48670           4%
> 2VM       23962      42691          78%
> 3VM       16152      37539         132%
> 
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: Radim Krčmář <rkrc...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com>
> ---
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> @@ -498,6 +498,37 @@ static void __init kvm_apf_trap_init(void)
>   update_intr_gate(X86_TRAP_PF, async_page_fault);
>  }
>  
> +static DEFINE_PER_CPU(cpumask_t, __pv_tlb_mask);
> +
> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> + const struct flush_tlb_info *info)
> +{
> + u8 state;
> + int cpu;
> + struct kvm_steal_time *src;
> + cpumask_t *flushmask = &per_cpu(__pv_tlb_mask, smp_processor_id());
> +
> + if (unlikely(!flushmask))
> + return;

I don't see how this can be NULL and if it could, we'd have to call
native_flush_tlb_others() instead of returning anyway.

Also, Peter mentioned that we're wasting memory (default is 1k per CPU)
when not running on KVM.  Hyper-V hijacks x86_platform.apic_post_init()
to achieve late allocation.  smp_ops.smp_prepare_cpus seems slightly
better for our purposes, but I don't really like either.

Couldn't we use use arch_initcall(), or early_initcall() if there are
complications with allocating after smp_init()?

Thanks.
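
A sketch of the mechanism the commit message describes, filling in the
loop after the cpumask_copy() quoted above (KVM_VCPU_PREEMPTED and
KVM_VCPU_FLUSH_TLB are the flag names from this series; details may
differ from the merged code):

static void kvm_flush_tlb_others(const struct cpumask *cpumask,
				 const struct flush_tlb_info *info)
{
	u8 state;
	int cpu;
	struct kvm_steal_time *src;
	cpumask_t *flushmask = &per_cpu(__pv_tlb_mask, smp_processor_id());

	cpumask_copy(flushmask, cpumask);
	for_each_cpu(cpu, flushmask) {
		src = &per_cpu(steal_time, cpu);
		state = READ_ONCE(src->preempted);
		if ((state & KVM_VCPU_PREEMPTED) &&
		    cmpxchg(&src->preempted, state,
			    state | KVM_VCPU_FLUSH_TLB) == state)
			/* a preempted vCPU flushes on its next entry */
			__cpumask_clear_cpu(cpu, flushmask);
	}

	/* IPI only the vCPUs that are actually running */
	native_flush_tlb_others(flushmask, info);
}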


[GIT PULL] Trimmed second batch of KVM changes for Linux 4.15

2017-11-24 Thread Radim Krčmář
Linus,

The following changes since commit cf9b0772f2e410645fece13b749bd56505b998b8:

  Merge tag 'armsoc-drivers' of 
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc (2017-11-16 16:05:01 
-0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/kvm-4.15-2

for you to fetch changes up to d02fcf50779ec9d8eb7a81473fd76efe3f04b3a5:

  kvm: vmx: Allow disabling virtual NMI support (2017-11-17 13:20:07 +0100)


Trimmed second batch of KVM changes for Linux 4.15

* GICv4 Support for KVM/ARM

All ARM patches were in next-20171113.  I have postponed most x86 fixes
to 4.15-rc2 and UMIP to 4.16, but there are fixes that would be good to
have already in 4.15-rc1:

* re-introduce support for CPUs without virtual NMI (cc stable)
  and allow testing of KVM without virtual NMI on available CPUs

* fix long-standing performance issues with assigned devices on AMD
  (cc stable)


Christoffer Dall (3):
  Merge git://git.kernel.org/.../tip/tip.git irq/core
  KVM: arm/arm64: Fix GICv4 ITS initialization issues
  KVM: arm/arm64: Don't queue VLPIs on INV/INVALL

Eric Auger (2):
  KVM: arm/arm64: register irq bypass consumer on ARM/ARM64
  KVM: arm/arm64: vgic: restructure kvm_vgic_(un)map_phys_irq

Marc Zyngier (23):
  KVM: arm: Select ARM_GIC_V3 and ARM_GIC_V3_ITS
  KVM: arm/arm64: vgic: Move kvm_vgic_destroy call around
  KVM: arm/arm64: vITS: Add MSI translation helpers
  KVM: arm/arm64: vITS: Add a helper to update the affinity of an LPI
  KVM: arm/arm64: GICv4: Add property field and per-VM predicate
  KVM: arm/arm64: GICv4: Add init/teardown of the per-VM vPE irq domain
  KVM: arm/arm64: GICv4: Wire mapping/unmapping of VLPIs in VFIO irq bypass
  KVM: arm/arm64: GICv4: Handle INT command applied to a VLPI
  KVM: arm/arm64: GICv4: Unmap VLPI when freeing an LPI
  KVM: arm/arm64: GICv4: Propagate affinity changes to the physical ITS
  KVM: arm/arm64: GICv4: Handle CLEAR applied to a VLPI
  KVM: arm/arm64: GICv4: Handle MOVALL applied to a vPE
  KVM: arm/arm64: GICv4: Propagate property updates to VLPIs
  KVM: arm/arm64: GICv4: Handle INVALL applied to a vPE
  KVM: arm/arm64: GICv4: Use pending_last as a scheduling hint
  KVM: arm/arm64: GICv4: Add doorbell interrupt handling
  KVM: arm/arm64: GICv4: Use the doorbell interrupt as an unblocking source
  KVM: arm/arm64: GICv4: Hook vPE scheduling into vgic flush/sync
  KVM: arm/arm64: GICv4: Enable virtual cpuif if VLPIs can be delivered
  KVM: arm/arm64: GICv4: Prevent a VM using GICv4 from being saved
  KVM: arm/arm64: GICv4: Prevent userspace from changing doorbell affinity
  KVM: arm/arm64: GICv4: Enable VLPI support
  KVM: arm/arm64: GICv4: Theory of operations

Paolo Bonzini (4):
  Merge tag 'kvm-arm-gicv4-for-v4.15' of 
git://git.kernel.org/.../kvmarm/kvmarm into HEAD
  KVM: SVM: obey guest PAT
  kvm: vmx: Reinstate support for CPUs without virtual NMI
  kvm: vmx: Allow disabling virtual NMI support

 Documentation/admin-guide/kernel-parameters.txt|   4 +
 Documentation/virtual/kvm/devices/arm-vgic-its.txt |   2 +
 arch/arm/kvm/Kconfig   |   5 +
 arch/arm/kvm/Makefile  |   1 +
 arch/arm64/kvm/Kconfig |   3 +
 arch/arm64/kvm/Makefile|   1 +
 arch/x86/kvm/svm.c |   7 +
 arch/x86/kvm/vmx.c | 161 ++---
 include/kvm/arm_vgic.h |  41 ++-
 virt/kvm/arm/arch_timer.c  |  24 +-
 virt/kvm/arm/arm.c |  48 ++-
 virt/kvm/arm/hyp/vgic-v3-sr.c  |   9 +-
 virt/kvm/arm/vgic/vgic-init.c  |   7 +
 virt/kvm/arm/vgic/vgic-its.c   | 204 
 virt/kvm/arm/vgic/vgic-mmio-v3.c   |   5 +
 virt/kvm/arm/vgic/vgic-v3.c|  14 +
 virt/kvm/arm/vgic/vgic-v4.c| 364 +
 virt/kvm/arm/vgic/vgic.c   |  67 +++-
 virt/kvm/arm/vgic/vgic.h   |  10 +
 19 files changed, 819 insertions(+), 158 deletions(-)
 create mode 100644 virt/kvm/arm/vgic/vgic-v4.c


Re: VMs freezing when host is running 4.14

2017-11-23 Thread Radim Krčmář
2017-11-23 18:18+0200, Liran Alon:
> On 23/11/17 17:59, Radim Krčmář wrote:
> > Btw. there have already been many fixes from Liran Alon for that patch
> > and your case could be the one addressed in
> > https://www.spinics.net/lists/kvm/msg159158.html
> > 
> > The patch is incorrect, but you might be able to see only its benefits.
> 
> Actually I would first attempt to check this patch of mine:
> https://www.spinics.net/lists/kvm/msg159062.html
> It fixes a bug of a L2 exception accidentally being delivered into L1.

Marc's guest didn't have kvm/vbox/... module loaded, so I assumed that
we're not running L2.  Do you suspect it also fixes something else?

Thanks.


Re: VMs freezing when host is running 4.14

2017-11-23 Thread Radim Krčmář
2017-11-23 16:20+0100, Marc Haber:
> On Wed, Nov 22, 2017 at 05:43:13PM +0100, Radim Krčmář wrote:
> > 2017-11-22 16:52+0100, Marc Haber:
> > > On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> > > > So all guest kernels are 4.14, or also other older kernel?
> > > 
> > > Guest kernels are also 4.14, but the issue disappears when the host is
> > > downgraded to an older kernel. I therefore reckoned that the guest
> > > kernel doesn't matter, but that was before I saw the trace in the log.
> > 
> > The two most suspicious patches since 4.13 (which I assume works) are
> > 
> >   664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been
> >   injected")
> 
> That one does not revert cleanly, the line in questions seems to have
> been removed a bit later.
> 
> Reject is:
> 141 [24/5001]mh@fan:~/linux/git/linux ((v4.14.1) %) $ cat arch/x86/kvm/vmx.c.rej
> --- arch/x86/kvm/vmx.c
> +++ arch/x86/kvm/vmx.c
> @@ -2516,7 +2516,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> unsigned nr = vcpu->arch.exception.nr;
> bool has_error_code = vcpu->arch.exception.has_error_code;
> -   bool reinject = vcpu->arch.exception.injected;
> +   bool reinject = vcpu->arch.exception.reinject;
> u32 error_code = vcpu->arch.exception.error_code;
> u32 intr_info = nr | INTR_INFO_VALID_MASK;

This line can simply be deleted, as reinject isn't used in the function.

Btw. there have already been many fixes from Liran Alon for that patch
and your case could be the one addressed in
https://www.spinics.net/lists/kvm/msg159158.html

The patch is incorrect, but you might be able to see only its benefits.

> > and
> > 
> >   9a6e7c39810e ("KVM: async_pf: Fix #DF due to inject "Page not Present"
> >   and "Page Ready" exceptions simultaneously")
> > 
> > please try reverting them to see if it helps,
> 
> That one reverted cleanly. I am now running the new kernel on the
> affected machine, and I think that a second machine has joined the
> market of being affected.

That one had much lower chances of being the culprit.

> Would this matter on the host only or on the guests as well?

Only on the host.

Thanks.


Re: VMs freezing when host is running 4.14

2017-11-22 Thread Radim Krčmář
2017-11-22 16:52+0100, Marc Haber:
> On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
> > So all guest kernels are 4.14, or also other older kernel?
> 
> Guest kernels are also 4.14, but the issue disappears when the host is
> downgraded to an older kernel. I therefore reckoned that the guest
> kernel doesn't matter, but that was before I saw the trace in the log.

The two most suspicious patches since 4.13 (which I assume works) are

  664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been
  injected")

and

  9a6e7c39810e ("KVM: async_pf: Fix #DF due to inject "Page not Present"
  and "Page Ready" exceptions simultaneously")

please try reverting them to see if it helps,

thanks.


Re: [PATCH v3] KVM: X86: Fix softlockup when get the current kvmclock

2017-11-16 Thread Radim Krčmář
2017-11-15 09:17+0800, Wanpeng Li:
> Ping, :)

Ah, sorry, I got distracted while learning about the hotplug mechanism.
Indeed we cannot move the callback earlier because the cpufreq
driver kvm uses on crappy hardware gets set in CPUHP_AP_ONLINE_DYN, which
is way too late.

> 2017-11-09 10:52 GMT+08:00 Wanpeng Li <kernel...@gmail.com>:
> > From: Wanpeng Li <wanpeng...@hotmail.com>
> >
> >  watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [qemu-system-x86:10185]
> >  CPU: 6 PID: 10185 Comm: qemu-system-x86 Tainted: G   OE   
> > 4.14.0-rc4+ #4
> >  RIP: 0010:kvm_get_time_scale+0x4e/0xa0 [kvm]
> >  Call Trace:
> >   ? get_kvmclock_ns+0xa3/0x140 [kvm]
> >   get_time_ref_counter+0x5a/0x80 [kvm]
> >   kvm_hv_process_stimers+0x120/0x5f0 [kvm]
> >   ? kvm_hv_process_stimers+0x120/0x5f0 [kvm]
> >   ? preempt_schedule+0x27/0x30
> >   ? ___preempt_schedule+0x16/0x18
> >   kvm_arch_vcpu_ioctl_run+0x4b4/0x1690 [kvm]
> >   ? kvm_arch_vcpu_load+0x47/0x230 [kvm]
> >   kvm_vcpu_ioctl+0x33a/0x620 [kvm]
> >   ? kvm_vcpu_ioctl+0x33a/0x620 [kvm]
> >   ? kvm_vm_ioctl_check_extension_generic+0x3b/0x40 [kvm]
> >   ? kvm_dev_ioctl+0x279/0x6c0 [kvm]
> >   do_vfs_ioctl+0xa1/0x5d0
> >   ? __fget+0x73/0xa0
> >   SyS_ioctl+0x79/0x90
> >   entry_SYSCALL_64_fastpath+0x1e/0xa9
> >
> > This can be reproduced when running kvm-unit-tests/hyperv_stimer.flat and
> > cpu-hotplug stress simultaneously. __this_cpu_read(cpu_tsc_khz) returns 0
> > (set in kvmclock_cpu_down_prep()) when the pCPU is hot-unplugged, which
> > results in kvm_get_time_scale() getting into an infinite loop.
> >
> > This patch fixes it by treating the hot-unplugged pCPU as not using the
> > master clock.
> >
> > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > Cc: Radim Krčmář <rkrc...@redhat.com>
> > Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com>
> > ---
> >  arch/x86/kvm/x86.c | 11 +++
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 03869eb..d61dcce3 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1795,10 +1795,13 @@ u64 get_kvmclock_ns(struct kvm *kvm)
> > /* both __this_cpu_read() and rdtsc() should be on the same cpu */
> > get_cpu();
> >
> > -   kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 
> > 1000LL,
> > -  &hv_clock.tsc_shift,
> > -  &hv_clock.tsc_to_system_mul);
> > -   ret = __pvclock_read_cycles(&hv_clock, rdtsc());
> > +   if (__this_cpu_read(cpu_tsc_khz)) {
> > +   kvm_get_time_scale(NSEC_PER_SEC, 
> > __this_cpu_read(cpu_tsc_khz) * 1000LL,

It would be safer to read __this_cpu_read(cpu_tsc_khz) only once, but I
think it works for now, as the unplug thread must be scheduled and
get_cpu() prevents changes.

> > +  &hv_clock.tsc_shift,
> > +  &hv_clock.tsc_to_system_mul);
> > +   ret = __pvclock_read_cycles(&hv_clock, rdtsc());
> > +   } else
> > +   ret = ktime_get_boot_ns() + ka->kvmclock_offset;

Not pretty, but gets the job done ...
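
Reading the rate once would look roughly like this (a sketch against the
hunk above, untested):

	unsigned long this_tsc_khz = __this_cpu_read(cpu_tsc_khz);

	if (this_tsc_khz) {
		kvm_get_time_scale(NSEC_PER_SEC, this_tsc_khz * 1000LL,
				   &hv_clock.tsc_shift,
				   &hv_clock.tsc_to_system_mul);
		ret = __pvclock_read_cycles(&hv_clock, rdtsc());
	} else {
		/* no stable per-cpu rate: fall back to the boot clock */
		ret = ktime_get_boot_ns() + ka->kvmclock_offset;
	}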

Reviewed-by: Radim Krčmář <rkrc...@redhat.com>


Re: [PATCH] KVM: x86: inject exceptions produced by x86_decode_insn

2017-11-16 Thread Radim Krčmář
2017-11-13 09:32+0100, Paolo Bonzini:
> On 13/11/2017 08:15, Wanpeng Li wrote:
> > 2017-11-10 17:49 GMT+08:00 Paolo Bonzini :
> >> Sometimes, a processor might execute an instruction while another
> >> processor is updating the page tables for that instruction's code page,
> >> but before the TLB shootdown completes.  The interesting case happens
> >> if the page is in the TLB.
> >>
> >> In general, the processor will succeed in executing the instruction and
> >> nothing bad happens.  However, what if the instruction is an MMIO access?
> >> If *that* happens, KVM invokes the emulator, and the emulator gets the
> >> updated page tables.  If the update side had marked the code page as non
> >> present, the page table walk then will fail and so will x86_decode_insn.
> >>
> >> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> >> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> >> a fatal error if the instruction cannot simply be reexecuted (as is the
> >> case for MMIO).  And this in fact happened sometimes when rebooting
> >> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> >> the exception if true is enough to fix the case.
> > 
> > I found the only place which can set ctxt->have_exception is in the
> > function x86_emulate_insn(), and x86_decode_insn() will not set
> > ctxt->have_exception even if kvm_fetch_guest_virt() returns
> > X86_EMUL_PROPAGATE_FAULT.
> 
> Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
> this patch! :(

I have dropped this patch in the meantime.
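
For reference, the idea was the following check after a failed decode (a
sketch of the dropped patch; as Wanpeng notes, ctxt->have_exception never
gets set on this path, which is why it doesn't work as posted):

	r = x86_decode_insn(ctxt, insn, insn_len);
	if (r != EMULATION_OK) {
		if (emulation_type & EMULTYPE_TRAP_UD)
			return EMULATE_FAIL;
		/* propagate the #PF raised while fetching the instruction */
		if (ctxt->have_exception && inject_emulated_exception(vcpu))
			return EMULATE_DONE;
		/* ... */
	}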


[GIT PULL] First batch of KVM changes for 4.15

2017-11-16 Thread Radim Krčmář
:
  KVM: x86: introduce ISA specific SMM entry/exit callbacks
  KVM: x86: introduce ISA specific smi_allowed callback
  KVM: nVMX: set IDTR and GDTR limits when loading L1 host state
  KVM: nVMX: fix SMI injection in guest mode
  KVM: nSVM: refactor nested_svm_vmrun
  KVM: nSVM: fix SMI injection in guest mode
  KVM: SVM: detect opening of SMI window using STGI intercept

Marc Zyngier (1):
  KVM: arm/arm64: Unify 32bit fault injection

Markus Elfring (1):
  KVM: PPC: Book3S HV: Delete an error message for a failed memory 
allocation in kvmppc_allocate_hpt()

Michael Ellerman (1):
  KVM: PPC: Tie KVM_CAP_PPC_HTM to the user-visible TM feature

Michael Mueller (2):
  KVM: s390: abstract conversion between isc and enum irq_types
  KVM: s390: clear_io_irq() requests are not expected for adapter interrupts

Nicholas Piggin (1):
  KVM: PPC: Book3S: Fix gas warning due to using r0 as immediate 0

Paolo Bonzini (4):
  KVM: SVM: unconditionally wake up VCPU on IOMMU interrupt
  KVM: SVM: limit kvm_handle_page_fault to #PF handling
  KVM: x86: extend usage of RET_MMIO_PF_* constants
  Merge branch 'kvm-ppc-next' of git://git.kernel.org/.../paulus/powerpc 
into HEAD

Paul Mackerras (13):
  KVM: PPC: Book3S HV: Handle unexpected interrupts better
  KVM: PPC: Book3S HV: Explicitly disable HPT operations on radix guests
  Revert "KVM: PPC: Book3S HV: POWER9 does not require secondary thread 
management"
  KVM: PPC: Book3S HV: Don't call real-mode XICS hypercall handlers if not 
enabled
  Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into 
kvm-ppc-next
  KVM: PPC: Book3S HV: Don't rely on host's page size information
  KVM: PPC: Book3S HV: Rename hpte_setup_done to mmu_ready
  KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix
  KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix 
host
  KVM: PPC: Book3S HV: Allow for running POWER9 host in single-threaded mode
  KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts
  Merge branch 'kvm-ppc-fixes' into kvm-ppc-next
  KVM: PPC: Book3S HV: Cosmetic post-merge cleanups

Radim Krčmář (6):
  KVM: x86: handle 0 write to TSC_DEADLINE MSR
  KVM: x86: really disarm lapic timer when clearing TMICT
  KVM: x86: thoroughly disarm LAPIC timer around TSC deadline switch
  Merge tag 'kvm-arm-for-v4.15' of git://git.kernel.org/.../kvmarm/kvmarm 
into next
  Merge tag 'kvm-ppc-next-4.15-2' of git://git.kernel.org/.../paulus/powerpc
  Merge tag 'kvm-s390-next-4.15-1' of git://git.kernel.org/.../kvms390/linux

Shakeel Butt (1):
  kvm, mm: account kvm related kmem slabs to kmemcg

Thomas Meyer (2):
  KVM: PPC: Book3S HV: Use ARRAY_SIZE macro
  KVM: PPC: BookE: Use vma_pages function

Tim Hansen (1):
  arch/x86: remove redundant null checks before kmem_cache_destroy

Tony Krowiak (1):
  KVM: s390: SIE considerations for AP Queue virtualization

Wanpeng Li (10):
  KVM: VMX: Don't expose PLE enable if there is no hardware support
  KVM: LAPIC: Fix lapic timer mode transition
  KVM: LAPIC: Introduce limit_periodic_timer_frequency
  KVM: LAPIC: Keep timer running when switching between one-shot and 
periodic mode
  KVM: LAPIC: Apply change to TDCR right away to the timer
  KVM: X86: Processor States following Reset or INIT
  KVM: VMX: Don't expose unrestricted_guest is enabled if ept is disabled
  KVM: nVMX: Fix EPT switching advertising
  KVM: VMX: Fix VPID capability detection
  KVM: X86: #GP when guest attempts to write MCi_STATUS register w/o 0

wanghaibin (1):
  KVM: arm/arm64: vgic-its: New helper functions to free the caches

 Documentation/virtual/kvm/api.txt  |  13 +
 Documentation/virtual/kvm/devices/arm-vgic-its.txt |  20 +
 Documentation/virtual/kvm/devices/s390_flic.txt|   5 +
 arch/arm/include/asm/kvm_asm.h |   2 +
 arch/arm/include/asm/kvm_emulate.h |  38 +-
 arch/arm/include/asm/kvm_hyp.h |   4 +-
 arch/arm/include/uapi/asm/kvm.h|   7 +
 arch/arm/kvm/emulate.c | 137 ---
 arch/arm/kvm/hyp/switch.c  |   7 +-
 arch/arm64/include/asm/arch_timer.h|   8 +-
 arch/arm64/include/asm/kvm_asm.h   |   2 +
 arch/arm64/include/asm/kvm_emulate.h   |   5 +-
 arch/arm64/include/asm/kvm_hyp.h   |   4 +-
 arch/arm64/include/asm/timex.h |   2 +-
 arch/arm64/include/uapi/asm/kvm.h  |   7 +
 arch/arm64/kvm/hyp/switch.c|   6 +-
 arch/arm64/kvm/inject_fault.c  |  88 +---
 arch/arm64/kvm/sys_regs.c  |  41 +-
 arch/powerpc/include/asm/kvm_book3s.h  |   3 +-
 arch/powerpc/include/asm/kvm_book3s_64.h   | 140 ++-

Re: [PATCH] KVM: x86: inject exceptions produced by x86_decode_insn

2017-11-10 Thread Radim Krčmář
2017-11-10 10:49+0100, Paolo Bonzini:
> Sometimes, a processor might execute an instruction while another
> processor is updating the page tables for that instruction's code page,
> but before the TLB shootdown completes.  The interesting case happens
> if the page is in the TLB.
> 
> In general, the processor will succeed in executing the instruction and
> nothing bad happens.  However, what if the instruction is an MMIO access?
> If *that* happens, KVM invokes the emulator, and the emulator gets the
> updated page tables.  If the update side had marked the code page as non
> present, the page table walk then will fail and so will x86_decode_insn.
> 
> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> a fatal error if the instruction cannot simply be reexecuted (as is the
> case for MMIO).  And this in fact happened sometimes when rebooting
> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> the exception if true is enough to fix the case.
> 
> Thanks to Eduardo Habkost for helping in the debugging of this issue.
> 
> Reported-by: Yanan Fu 
> Cc: Eduardo Habkost 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Paolo Bonzini 
> ---

Applied, thanks.


Re: [PATCH v6 1/3] KVM: X86: Fix operand/address-size during instruction decoding

2017-11-10 Thread Radim Krčmář
Applied all three, thanks.


Re: [PATCH 0/2] kvm: vmx: CPUs without virtual NMIs

2017-11-10 Thread Radim Krčmář
2017-11-06 13:31+0100, Paolo Bonzini:
> It turns out that Core 2 Duo machines only had virtual NMIs in some SKUs.
> Patch 1 adds back emulation of the NMI window, patch 2 allows testing
> it on modern processors as well.  One eventinj.flat test (NMI after iret)
> fails as expected.

Applied, thanks.  (And already looking forward to removing it again. :])


Re: [PATCH] KVM: SVM: obey guest PAT

2017-11-10 Thread Radim Krčmář
2017-10-26 09:13+0200, Paolo Bonzini:
> For many years some users of assigned devices have reported worse
> performance on AMD processors with NPT than on AMD without NPT,
> Intel or bare metal.
> 
> The reason turned out to be that SVM discards the guest PAT
> setting and uses the default (PA0=PA4=WB, PA1=PA5=WT, PA2=PA6=UC-,
> PA3=UC).  The guest might be using a different setting, and
> especially might want write combining but isn't getting it
> (instead getting slow UC or UC- accesses).
> 
> Thanks a lot to ge...@hostfission.com for noticing the relation
> to the g_pat setting.  The patch has been tested also by a bunch
> of people on VFIO users forums.
> 
> Fixes: 709ddebf81cb40e3c36c6109a7892e8b93a09464
> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196409
> Cc: sta...@vger.kernel.org
> Signed-off-by: Paolo Bonzini 

Applied, thanks.
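
For the record, the core of the applied change is to route the guest's
PAT MSR into the VMCB instead of ignoring it, roughly (a sketch of the
svm_set_msr() hunk from memory; see the commit for the exact diff):

	case MSR_IA32_CR_PAT:
		if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
			return 1;
		vcpu->arch.pat = data;
		svm->vmcb->save.g_pat = data;	/* NPT now honors guest memory types */
		mark_dirty(svm->vmcb, VMCB_NPT);
		break;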


[GIT PULL] KVM fix for v4.14(-rc9)

2017-11-10 Thread Radim Krčmář
Linus,

The following changes since commit 39dae59d66acd86d1de24294bd2f343fd5e7a625:

  Linux 4.14-rc8 (2017-11-05 13:05:14 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm for-linus

for you to fetch changes up to d850a255d5b961fbe085f06ddd116910378b78f1:

  Merge tag 'kvm-ppc-fixes-4.14-2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc (2017-11-08 
14:08:59 +0100)


KVM fix for v4.14(-rc9)

Fix PPC HV host crash that can occur as a result of resizing the guest
hashed page table.


Paul Mackerras (1):
  KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT 
updates

Radim Krčmář (1):
  Merge tag 'kvm-ppc-fixes-4.14-2' of 
git://git.kernel.org/.../paulus/powerpc

 arch/powerpc/kvm/book3s_64_mmu_hv.c | 10 ++
 arch/powerpc/kvm/book3s_hv.c| 29 +++--
 2 files changed, 29 insertions(+), 10 deletions(-)


Re: [PATCH RESEND 2/3] KVM: Add paravirt remote TLB flush

2017-11-09 Thread Radim Krčmář
2017-11-08 18:02-0800, Wanpeng Li:
> From: Wanpeng Li <wanpeng...@hotmail.com>
> 
> Remote flushing APIs do a busy wait, which is fine in a bare-metal
> scenario. But within the guest, the vcpus might have been pre-empted
> or blocked. In this scenario, the initiator vcpu would end up
> busy-waiting for a long amount of time.
> 
> This patch set implements para-virt TLB flushes, making sure that the
> initiator does not wait for vcpus that are sleeping; all the sleeping
> vcpus flush the tlb on guest entry instead.
> 
> The best result is achieved when we're overcommitting the host by running
> multiple vCPUs on each pCPU. In this case PV tlb flush avoids touching
> vCPUs which are not scheduled and avoids the wait on the main CPU.
> 
> Test on a Haswell i7 desktop 4 cores (2HT), so 8 pCPUs, running ebizzy in 
> one linux guest.
> 
> ebizzy -M 
>  vanilla  optimized  boost
>  8 vCPUs   10152  10083  -0.68%
> 16 vCPUs1224   4866  297.5%
> 24 vCPUs1109   3871  249%
> 32 vCPUs1025   3375  229.3%
> 
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: Radim Krčmář <rkrc...@redhat.com>
> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com>
> ---
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> @@ -465,6 +465,33 @@ static void __init kvm_apf_trap_init(void)
>   update_intr_gate(X86_TRAP_PF, async_page_fault);
>  }
>  
> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> + const struct flush_tlb_info *info)
> +{
> + u8 state;
> + int cpu;
> + struct kvm_steal_time *src;
> + cpumask_t flushmask;
> +
> +
> + cpumask_copy(&flushmask, cpumask);
> + /*
> +  * We have to call flush only on online vCPUs. And
> +  * queue flush_on_enter for pre-empted vCPUs
> +  */
> + for_each_cpu(cpu, cpumask) {
> + src = &per_cpu(steal_time, cpu);
> + state = src->preempted;
> + if ((state & KVM_VCPU_PREEMPTED)) {
> + if (cmpxchg(&src->preempted, state, state | 1 <<
> + KVM_VCPU_SHOULD_FLUSH))

We won't be flushing unless the last argument reads 'state |
KVM_VCPU_SHOULD_FLUSH', and the result will be the original value, which
should be compared with state to avoid a race that would drop a running
VCPU:

  if (cmpxchg(&src->preempted, state, state | KVM_VCPU_SHOULD_FLUSH) == state)

> + cpumask_clear_cpu(cpu, &flushmask);
> + }
> + }
> +
> + native_flush_tlb_others(&flushmask, info);
> +}
> +
>  void __init kvm_guest_init(void)
>  {
>   int i;
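
Putting the suggestion together, the loop would then look roughly like
this (a sketch, untested):

	for_each_cpu(cpu, cpumask) {
		src = &per_cpu(steal_time, cpu);
		state = READ_ONCE(src->preempted);
		if ((state & KVM_VCPU_PREEMPTED) &&
		    cmpxchg(&src->preempted, state,
			    state | KVM_VCPU_SHOULD_FLUSH) == state)
			/* the flush is queued for guest entry, skip the IPI */
			cpumask_clear_cpu(cpu, &flushmask);
	}

	native_flush_tlb_others(&flushmask, info);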


Re: [PATCHv3 1/1] locking/qspinlock/x86: Avoid test-and-set when PV_DEDICATED is set

2017-11-09 Thread Radim Krčmář
2017-11-09 00:55-0800, Eduardo Valentin:
> Hello,
> 
> On Wed, Nov 08, 2017 at 06:36:52PM +0100, Radim Krčmář wrote:
> > 2017-11-06 12:26-0800, Eduardo Valentin:
> > > Currently, the existing qspinlock implementation will fall back to
> > > test-and-set if the hypervisor has not set the PV_UNHALT flag.
> > > 
> > > This patch gives the opportunity to guest kernels to select
> > > between test-and-set and the regular queue fair lock implementation
> > > based on the PV_DEDICATED KVM feature flag. When the PV_DEDICATED
> > > flag is not set, the code will still fall back to test-and-set,
> > > but when the PV_DEDICATED flag is set, the code will use
> > > the regular queue spinlock implementation.
> > > 
> > > With this patch, when in autoselect mode, the guest will
> > > use the default spinlock implementation based on host feature
> > > flags as follows:
> > > 
> > > PV_DEDICATED = 1, PV_UNHALT = anything: default is qspinlock
> > > PV_DEDICATED = 0, PV_UNHALT = 1: default is pvqspinlock
> > > PV_DEDICATED = 0, PV_UNHALT = 0: default is tas
> > > 
> > > Cc: Paolo Bonzini 
> > > Cc: "Radim Krčmář" 
> > > Cc: Jonathan Corbet 
> > > Cc: Thomas Gleixner 
> > > Cc: Ingo Molnar 
> > > Cc: "H. Peter Anvin" 
> > > Cc: x...@kernel.org
> > > Cc: Peter Zijlstra 
> > > Cc: Waiman Long 
> > > Cc: k...@vger.kernel.org
> > > Cc: linux-...@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: Jan H. Schoenherr 
> > > Cc: Anthony Liguori 
> > > Suggested-by: Matt Wilson 
> > > Signed-off-by: Eduardo Valentin 
> > > ---
> > > V3:
> > >  - When PV_DEDICATED is set (1), qspinlock is selected,
> > >regardless of the value of PV_UNHAULT. Suggested by Paolo Bonzini. 
> > >   regardless of the value of PV_UNHALT. Suggested by Paolo Bonzini.
> > > V2:
> > >  - rebase on top of tip/master
> > > 
> > >  Documentation/virtual/kvm/cpuid.txt  | 6 ++
> > >  arch/x86/include/asm/qspinlock.h | 4 
> > >  arch/x86/include/uapi/asm/kvm_para.h | 1 +
> > >  arch/x86/kernel/kvm.c| 2 ++
> > >  4 files changed, 13 insertions(+)
> > > 
> > > diff --git a/Documentation/virtual/kvm/cpuid.txt 
> > > b/Documentation/virtual/kvm/cpuid.txt
> > > index 3c65feb..117066a 100644
> > > --- a/Documentation/virtual/kvm/cpuid.txt
> > > +++ b/Documentation/virtual/kvm/cpuid.txt
> > > @@ -54,6 +54,12 @@ KVM_FEATURE_PV_UNHALT  || 7 || guest checks this feature bit
> > > ||   || before enabling paravirtualized
> > > ||   || spinlock support.
> > >  
> > > --
> > > +KVM_FEATURE_PV_DEDICATED   || 8 || guest checks this feature bit
> > > +   ||   || to determine if they run on
> > > +   ||   || dedicated vCPUs, allowing opti-
> > > +   ||   || mizations such as usage of
> > > +   ||   || qspinlocks.
> > > +--
> > >  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side
> > > ||   || per-cpu warps are expected in
> > > ||   || kvmclock.
> > > diff --git a/arch/x86/include/asm/qspinlock.h 
> > > b/arch/x86/include/asm/qspinlock.h
> > > index 5e16b5d..de42694 100644
> > > --- a/arch/x86/include/asm/qspinlock.h
> > > +++ b/arch/x86/include/asm/qspinlock.h
> > > @@ -3,6 +3,8 @@
> > >  #define _ASM_X86_QSPINLOCK_H
> > >  
> > >  #include 
> > > +#include 
> > > +
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -58,6 +60,8 @@ static inline bool virt_spin_lock(struct qspinlock 
> > > *lock)
> > >   if (!static_branch_likely(&virt_spin_lock_key))
> > >   return false;
> > >  
> > > + if (kvm_para_has_feature(KVM_FEATURE_PV_DEDICATED))
> > > + return false;
> > 
> > Hm, every spinlock slowpath calls cpuid, which causes a VM exit, so I
> > wouldn't expect it to be faster than the existing implementations.

Re: [PATCH 1/1] locking/qspinlock/x86: Avoid test-and-set when PV_DEDICATED is set

2017-11-08 Thread Radim Krčmář
2017-10-31 10:02-0700, Eduardo Valentin:
> Hello Radim,
> 
> On Tue, Oct 24, 2017 at 01:18:59PM +0200, Radim Krčmář wrote:
> > 2017-10-23 17:44-0700, Eduardo Valentin:
> > > Currently, the existing qspinlock implementation will fall back to
> > > test-and-set if the hypervisor has not set the PV_UNHALT flag.
> > 
> > Where have you detected the main source of overhead with pinned VCPUs?
> > Makes me wonder if we couldn't improve general PV_UNHALT,
> 
> This is essentially for cases of non-overcommitted vCPUs in which we want 
> the instance vCPUs to run uninterrupted as much as possible. Here by disabling
> the PV_UNHALT,  we avoid the accounting needed to properly do the PV_UNHALT 
> hypercall, as the lock holder won't be preempted anyway for the 1:1 pin case.

Right, I would expect that the scenario should very rarely go into the
halt/kick path -- is SPIN_THRESHOLD too low?

We could also try abolishing the SPIN_THRESHOLD completely and only use
vcpu_is_preempted() and state of the previous lock holder to enter the
halt/kick path.

(The drawback is that vcpu_is_preempted() currently gets set even when
 dropping into userspace.)
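
Something along these lines, i.e. spin only while the previous holder's
vCPU is actually running (a rough sketch of the idea; prev_cpu and the
surrounding structure are made up for illustration):

	/* no SPIN_THRESHOLD: the preemption state decides when to halt */
	while (READ_ONCE(*ptr) == val) {
		if (vcpu_is_preempted(prev_cpu)) {
			kvm_wait(ptr, val);	/* halt until kicked */
			break;
		}
		cpu_relax();
	}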

> > > This patch gives the opportunity to guest kernels to select
> > > between test-and-set and the regular queueu fair lock implementation
> > > based on the PV_DEDICATED KVM feature flag. When the PV_DEDICATED
> > > flag is not set, the code will still fall back to test-and-set,
> > > but when the PV_DEDICATED flag is set, the code will use
> > > the regular queue spinlock implementation.
> > 
> > Some flag makes sense and we do want to make sure that userspaces don't
> > enable it in pass-through-cpuid mode.
> 
> Did you mean something like:
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 0099e10..8ceb503 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -211,7 +211,8 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
> }
> for (i = 0; i < cpuid->nent; i++) {
> vcpu->arch.cpuid_entries[i].function = 
> cpuid_entries[i].function;
> -   vcpu->arch.cpuid_entries[i].eax = cpuid_entries[i].eax;
> +   vcpu->arch.cpuid_entries[i].eax = cpuid_entries[i].eax &
> +   ~KVM_FEATURE_PV_DEDICATED;
> vcpu->arch.cpuid_entries[i].ebx = cpuid_entries[i].ebx;
> vcpu->arch.cpuid_entries[i].ecx = cpuid_entries[i].ecx;
> vcpu->arch.cpuid_entries[i].edx = cpuid_entries[i].edx;
> 
> 
> But I do not see any other KVM_FEATURE_* being enforced (e.g. PV_UNHALT).
> Do you mind elaborating a bit here?

Sorry, nothing is needed.  I somehow thought that we need to expose this
to the userspace through CPUID, but KVM just needs to consider the flag
as reserved.


Re: [PATCHv3 1/1] locking/qspinlock/x86: Avoid test-and-set when PV_DEDICATED is set

2017-11-08 Thread Radim Krčmář
2017-11-06 12:26-0800, Eduardo Valentin:
> Currently, the existing qspinlock implementation will fall back to
> test-and-set if the hypervisor has not set the PV_UNHALT flag.
> 
> This patch gives the opportunity to guest kernels to select
> between test-and-set and the regular queue fair lock implementation
> based on the PV_DEDICATED KVM feature flag. When the PV_DEDICATED
> flag is not set, the code will still fall back to test-and-set,
> but when the PV_DEDICATED flag is set, the code will use
> the regular queue spinlock implementation.
> 
> With this patch, when in autoselect mode, the guest will
> use the default spinlock implementation based on host feature
> flags as follows:
> 
> PV_DEDICATED = 1, PV_UNHALT = anything: default is qspinlock
> PV_DEDICATED = 0, PV_UNHALT = 1: default is pvqspinlock
> PV_DEDICATED = 0, PV_UNHALT = 0: default is tas
> 
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: "Radim Krčmář" <rkrc...@redhat.com>
> Cc: Jonathan Corbet <cor...@lwn.net>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: "H. Peter Anvin" <h...@zytor.com>
> Cc: x...@kernel.org
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Waiman Long <long...@redhat.com>
> Cc: k...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Jan H. Schoenherr <jscho...@amazon.de>
> Cc: Anthony Liguori <aligu...@amazon.com>
> Suggested-by: Matt Wilson <m...@amazon.com>
> Signed-off-by: Eduardo Valentin <edu...@amazon.com>
> ---
> V3:
>  - When PV_DEDICATED is set (1), qspinlock is selected,
>   regardless of the value of PV_UNHALT. Suggested by Paolo Bonzini.
>  - Refreshed on top of tip/master.
> V2:
>  - rebase on top of tip/master
> 
>  Documentation/virtual/kvm/cpuid.txt  | 6 ++
>  arch/x86/include/asm/qspinlock.h | 4 
>  arch/x86/include/uapi/asm/kvm_para.h | 1 +
>  arch/x86/kernel/kvm.c| 2 ++
>  4 files changed, 13 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/cpuid.txt 
> b/Documentation/virtual/kvm/cpuid.txt
> index 3c65feb..117066a 100644
> --- a/Documentation/virtual/kvm/cpuid.txt
> +++ b/Documentation/virtual/kvm/cpuid.txt
> @@ -54,6 +54,12 @@ KVM_FEATURE_PV_UNHALT  || 7 || guest checks this feature bit
> ||   || before enabling paravirtualized
> ||   || spinlock support.
>  
> --
> +KVM_FEATURE_PV_DEDICATED   || 8 || guest checks this feature bit
> +   ||   || to determine if they run on
> +   ||   || dedicated vCPUs, allowing opti-
> +   ||   || mizations such as usage of
> +   ||   || qspinlocks.
> +--
>  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side
> ||   || per-cpu warps are expected in
> ||   || kvmclock.
> diff --git a/arch/x86/include/asm/qspinlock.h 
> b/arch/x86/include/asm/qspinlock.h
> index 5e16b5d..de42694 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -3,6 +3,8 @@
>  #define _ASM_X86_QSPINLOCK_H
>  
>  #include 
> +#include 
> +
>  #include 
>  #include 
>  #include 
> @@ -58,6 +60,8 @@ static inline bool virt_spin_lock(struct qspinlock *lock)
>   if (!static_branch_likely(&virt_spin_lock_key))
>   return false;
>  
> + if (kvm_para_has_feature(KVM_FEATURE_PV_DEDICATED))
> + return false;

Hm, every spinlock slowpath calls cpuid, which causes a VM exit, so I
wouldn't expect it to be faster than the existing implementations.
(Using the static key would be better.)
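
With a static key, the cpuid test happens once at init and the lock path
becomes a patched jump, roughly (a sketch; the key name is made up):

	static DEFINE_STATIC_KEY_FALSE(kvm_pv_dedicated);

	/* once, during guest setup, e.g. from kvm_guest_init() */
	if (kvm_para_has_feature(KVM_FEATURE_PV_DEDICATED))
		static_branch_enable(&kvm_pv_dedicated);

	/* in virt_spin_lock(): a patched branch instead of a cpuid VM exit */
	if (static_branch_likely(&kvm_pv_dedicated))
		return false;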

How does this patch perform compared to user-forced qspinlock and hybrid
pvqspinlock?

Thanks.


Re: [PATCH v2] KVM: X86: Fix softlockup when get the current kvmclock timestamp

2017-11-08 Thread Radim Krčmář
2017-11-06 04:17-0800, Wanpeng Li:
> From: Wanpeng Li <wanpeng...@hotmail.com>
> 
>  watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [qemu-system-x86:10185]
>  CPU: 6 PID: 10185 Comm: qemu-system-x86 Tainted: G   OE   
> 4.14.0-rc4+ #4
>  RIP: 0010:kvm_get_time_scale+0x4e/0xa0 [kvm]
>  Call Trace:
>   ? get_kvmclock_ns+0xa3/0x140 [kvm]
>   get_time_ref_counter+0x5a/0x80 [kvm]
>   kvm_hv_process_stimers+0x120/0x5f0 [kvm]
>   ? kvm_hv_process_stimers+0x120/0x5f0 [kvm]
>   ? preempt_schedule+0x27/0x30
>   ? ___preempt_schedule+0x16/0x18
>   kvm_arch_vcpu_ioctl_run+0x4b4/0x1690 [kvm]
>   ? kvm_arch_vcpu_load+0x47/0x230 [kvm]
>   kvm_vcpu_ioctl+0x33a/0x620 [kvm]
>   ? kvm_vcpu_ioctl+0x33a/0x620 [kvm]
>   ? kvm_vm_ioctl_check_extension_generic+0x3b/0x40 [kvm]
>   ? kvm_dev_ioctl+0x279/0x6c0 [kvm]
>   do_vfs_ioctl+0xa1/0x5d0
>   ? __fget+0x73/0xa0
>   SyS_ioctl+0x79/0x90
>   entry_SYSCALL_64_fastpath+0x1e/0xa9
> 
> This can be reproduced when running kvm-unit-tests/hyperv_stimer.flat and 
> cpu-hotplug stress simultaneously. __this_cpu_read(cpu_tsc_khz) returns 0 
> (set in kvmclock_cpu_down_prep()) when the pCPU is unhotplug which results 
> in kvm_get_time_scale() gets into an infinite loop.
> 
> This patch fixes it by skipping the hv_clock update when the pCPU is offline.
> 
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: Radim Krčmář <rkrc...@redhat.com>
> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com>
> ---
> v1 -> v2:
>  * avoid infinite loop
> 
>  arch/x86/kvm/x86.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03869eb..d2507c6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1259,6 +1259,9 @@ static void kvm_get_time_scale(uint64_t scaled_hz, 
> uint64_t base_hz,
>   uint64_t tps64;
>   uint32_t tps32;
>  
> + if (unlikely(base_hz == 0))
> + return;

This is a sensible thing to do and will prevent the loop, but KVM will
still have a minor bug:  get_kvmclock_ns() passes uninitialized stack
values with the expectation that kvm_get_time_scale() will set them, but
returning here would result in __pvclock_read_cycles() reading random
data and injecting timer interrupts early (if not worse).

I think it would be best if kvm_get_time_scale() wasn't executing when
cpu_tsc_khz is 0, by clearing cpu_tsc_khz later and setting it earlier;
do you see any problems with moving CPUHP_AP_X86_KVM_CLK_ONLINE
before CPUHP_AP_ONLINE?
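
Something along these lines in include/linux/cpuhotplug.h (a sketch of
the intent only; the neighboring entries are illustrative, not the
exact list in the tree):

enum cpuhp_state {
	/* ... bring-up states that run early on the hotplugged CPU ... */
	CPUHP_AP_X86_KVM_CLK_ONLINE,	/* moved: cpu_tsc_khz is then set
					 * before any online-section work
					 * can run, and cleared only after
					 * that work has been torn down */
	CPUHP_AP_ONLINE,
	/* ... the rest of the online states, unchanged ... */
	CPUHP_AP_ONLINE_DYN,
};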

Thanks.


[GIT PULL] KVM fixes for v4.14-rc7

2017-10-24 Thread Radim Krčmář
Linus,

The following changes since commit 33d930e59a98fa10a0db9f56c7fa2f21a4aef9b9:

  Linux 4.14-rc5 (2017-10-15 21:01:12 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/for-linus

for you to fetch changes up to cc9085b6875323fd0c935ee7176583bb572821ee:

  Merge branch 'kvm-ppc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc (2017-10-19 14:42:09 +0200)


KVM fixes for v4.14-rc7

PPC fixes for potential host oops and hangs.


Alexey Kardashevskiy (1):
  KVM: PPC: Book3S: Protect kvmppc_gpa_to_ua() with SRCU

Benjamin Herrenschmidt (1):
  KVM: PPC: Book3S HV: Add more barriers in XIVE load/unload code

Greg Kurz (1):
  KVM: PPC: Fix oops when checking KVM_CAP_PPC_HTM

Nicholas Piggin (1):
  KVM: PPC: Book3S HV: POWER9 more doorbell fixes

Radim Krčmář (1):
  Merge branch 'kvm-ppc-fixes' of git://git.kernel.org/.../paulus/powerpc

 arch/powerpc/kvm/book3s_64_vio.c| 23 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 ++---
 arch/powerpc/kvm/powerpc.c  |  3 +--
 3 files changed, 25 insertions(+), 14 deletions(-)


Re: [PATCH 1/1] locking/qspinlock/x86: Avoid test-and-set when PV_DEDICATED is set

2017-10-24 Thread Radim Krčmář
2017-10-23 17:44-0700, Eduardo Valentin:
> Currently, the existing qspinlock implementation will fall back to
> test-and-set if the hypervisor has not set the PV_UNHALT flag.

Where have you detected the main source of overhead with pinned VCPUs?
Makes me wonder if we couldn't improve general PV_UNHALT,

thanks.

> This patch gives guest kernels the opportunity to select
> between test-and-set and the regular queued fair lock implementation
> based on the PV_DEDICATED KVM feature flag. When the PV_DEDICATED
> flag is not set, the code will still fall back to test-and-set,
> but when the PV_DEDICATED flag is set, the code will use
> the regular queued spinlock implementation.

Some flag like this makes sense, and we do want to make sure that
userspaces don't enable it in pass-through-cpuid mode.
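
One way to do that (entirely hypothetical code -- the opt-in knob and
the helper do not exist in KVM): keep the bit out of the default
feature set and only advertise it when userspace explicitly asks:

static u32 kvm_default_pv_features(struct kvm *kvm)
{
	u32 features = (1 << KVM_FEATURE_CLOCKSOURCE2) |
		       (1 << KVM_FEATURE_ASYNC_PF) |
		       (1 << KVM_FEATURE_PV_UNHALT);

	/* Assumed opt-in knob; a userspace that blindly mirrors
	 * KVM_GET_SUPPORTED_CPUID would then never see PV_DEDICATED. */
	if (kvm->arch.pv_dedicated_opt_in)
		features |= 1 << KVM_FEATURE_PV_DEDICATED;

	return features;
}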


Re: [PATCH] KVM: LAPIC: Level-sensitive interrupts are not supported for LINT1

2017-10-13 Thread Radim Krčmář
2017-10-12 21:27-0700, Wanpeng Li:
> From: Wanpeng Li <wanpeng...@hotmail.com>
> 
> SDM 10.5.1 mentions:
>  Software should always set the trigger mode in the LVT LINT1 register to 0
>  (edge sensitive). Level-sensitive interrupts are not supported from LINT1.
> 
> I have observed that Linux, Windows 7, and Windows 2016 guests on my machine
> all set the level-sensitive trigger mode in the LVT LINT1 register during boot.

And there is no problem with that; software can do it, delivery
through LINT1 is simply undefined in that case (the most likely
behaviors are: deliver as edge, or don't deliver at all).

> This patch prevents the guest software from setting the level-sensitive
> trigger mode in the LVT LINT1 register.

The software should see the value it writes, though, so the current
behavior is better.
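
With the patch, the guest would instead observe a silently modified
register, roughly (an illustrative guest-side sketch, not code from
any particular OS):

	u32 v = APIC_DM_NMI | APIC_LVT_LEVEL_TRIGGER;

	apic_write(APIC_LVT1, v);
	if (apic_read(APIC_LVT1) != v)		/* would now trigger */
		pr_warn("LVT1 write was silently modified\n");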

Do we hit a KVM bug if the software uses APIC_LVT_LEVEL_TRIGGER?

Thanks.

> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: Radim Krčmář <rkrc...@redhat.com>
> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com>
> ---
>  arch/x86/kvm/lapic.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index a778f1a..26593c7 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1758,6 +1758,8 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>   val |= APIC_LVT_MASKED;
>  
>   val &= apic_lvt_mask[(reg - APIC_LVTT) >> 4];
> + if (reg == APIC_LVT1)
> + val &= ~APIC_LVT_LEVEL_TRIGGER;
>   kvm_lapic_set_reg(apic, reg, val);
>  
>   break;
> -- 
> 2.7.4
> 

