[PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint

2019-09-03 Thread Radim Krčmář
The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
custom data gathering points to better understand what was going on in
the scheduler.
Red Hat adapted one of them for the tracepoint framework and created a
tool to plot a heatmap of nr_running, where the sched_update_nr_running
tracepoint is used for fine-grained monitoring of scheduling
imbalance.
The tool is available from https://github.com/jirvoz/plot-nr-running.

The best place for the tracepoint is inside add_nr_running/sub_nr_running,
which requires some shenanigans to make it work, as they are defined
inside sched.h.
The tracepoint has to be included from sched.h, which means that
CREATE_TRACE_POINTS has to be defined for the whole header.  This might
cause problems if tree-wide headers expose tracepoints in sched.h
dependencies, but I'd argue that would be the other side's misuse of
tracepoints.

Moving the sched.h include lower would require fixes in s390 and ppc
headers, because they don't include their dependencies properly and expect
sched.h to do it, so it is simpler to keep sched.h where it is and
preventively undefine CREATE_TRACE_POINTS right after it.

Exports of the pelt tracepoints remain because they don't need to be
protected by CREATE_TRACE_POINTS and moving them closer would be
unsightly.

Tested-by: Jirka Hladký 
Tested-by: Jiří Vozár 
Signed-off-by: Radim Krčmář 
---
 include/trace/events/sched.h | 22 ++
 kernel/sched/core.c  |  7 ++-
 kernel/sched/fair.c  |  2 --
 kernel/sched/sched.h |  7 +++
 4 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e56e55..1527fc695609 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -625,6 +625,28 @@ DECLARE_TRACE(sched_overutilized_tp,
TP_PROTO(struct root_domain *rd, bool overutilized),
TP_ARGS(rd, overutilized));
 
+TRACE_EVENT(sched_update_nr_running,
+
+   TP_PROTO(int cpu, int change, unsigned int nr_running),
+
+   TP_ARGS(cpu, change, nr_running),
+
+   TP_STRUCT__entry(
+   __field(int,  cpu)
+   __field(int,  change)
+   __field(unsigned int, nr_running)
+   ),
+
+   TP_fast_assign(
+   __entry->cpu= cpu;
+   __entry->change = change;
+   __entry->nr_running = nr_running;
+   ),
+
+   TP_printk("cpu=%u nr_running=%u (%d)",
+   __entry->cpu, __entry->nr_running, __entry->change)
+);
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71981ce84231..31ac37b9f6f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6,7 +6,9 @@
  *
  *  Copyright (C) 1991-2002  Linus Torvalds
  */
+#define CREATE_TRACE_POINTS
 #include "sched.h"
+#undef CREATE_TRACE_POINTS
 
 #include 
 
@@ -20,9 +22,6 @@
 
 #include "pelt.h"
 
-#define CREATE_TRACE_POINTS
-#include 
-
 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
  * associated with them) to allow external modules to probe them.
@@ -7618,5 +7617,3 @@ const u32 sched_prio_to_wmult[40] = {
  /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
  /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
 };
-
-#undef CREATE_TRACE_POINTS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84959d3285d1..421236d39902 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -22,8 +22,6 @@
  */
 #include "sched.h"
 
-#include 
-
 /*
  * Targeted preemption latency for CPU-bound tasks:
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c4915f46035a..b89d7786109a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -75,6 +75,8 @@
 #include "cpupri.h"
 #include "cpudeadline.h"
 
+#include 
+
 #ifdef CONFIG_SCHED_DEBUG
 # define SCHED_WARN_ON(x)  WARN_ONCE(x, #x)
 #else
@@ -1887,6 +1889,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
rq->nr_running = prev_nr + count;
 
+   trace_sched_update_nr_running(cpu_of(rq), count, rq->nr_running);
+
 #ifdef CONFIG_SMP
if (prev_nr < 2 && rq->nr_running >= 2) {
if (!READ_ONCE(rq->rd->overload))
@@ -1900,6 +1904,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 static inline void sub_nr_running(struct rq *rq, unsigned count)
 {
rq->nr_running -= count;
+
+   trace_sched_update_nr_running(cpu_of(rq), -count, rq->nr_running);
+
/* Check if we still need preemption */
sched_update_tick_dependency(rq);
 }
-- 
2.23.0



[PATCH 0/2] sched/debug: add sched_update_nr_running tracepoint

2019-09-03 Thread Radim Krčmář
Add a tracepoint for monitoring nr_running values, which is helpful in
discovering scheduling imbalances.

More information and most of the code is in [2/2], while [1/2] fixes a
build issue that popped up because CREATE_TRACE_POINTS is now defined
for several includes.

Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: Steven Rostedt 
Cc: "H. Peter Anvin" 
Cc: Andy Lutomirski 
Cc: Jirka Hladký 
Cc: Jiří Vozár 
Cc: x...@kernel.org


Radim Krčmář (2):
  x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h
  sched/debug: add sched_update_nr_running tracepoint

 arch/x86/include/asm/mmu_context.h |  2 --
 arch/x86/mm/tlb.c  |  2 ++
 include/trace/events/sched.h   | 22 ++
 kernel/sched/core.c|  7 ++-
 kernel/sched/fair.c|  2 --
 kernel/sched/sched.h   |  7 +++
 6 files changed, 33 insertions(+), 9 deletions(-)

-- 
2.23.0



[PATCH 1/2] x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h

2019-09-03 Thread Radim Krčmář
asm/mmu_context.h is a tree-wide include that will unnecessarily break
the build if CREATE_TRACE_POINTS is defined when including it.

Signed-off-by: Radim Krčmář 
---
 arch/x86/include/asm/mmu_context.h | 2 --
 arch/x86/mm/tlb.c  | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 16ae821483c8..e59f230ff981 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -7,8 +7,6 @@
 #include 
 #include 
 
-#include 
-
 #include 
 #include 
 #include 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e6a9edc5baaf..83cd66a0db99 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -18,6 +18,8 @@
 
 #include "mm_internal.h"
 
+#include 
+
 /*
  * TLB flushing, formerly SMP-only
  * c/o Linus Torvalds.
-- 
2.23.0



[GIT PULL] KVM fixes for Linux 5.3-rc7

2019-08-30 Thread Radim Krčmář
Linus,

The following changes since commit a55aa89aab90fae7c815b0551b07be37db359d76:

  Linux 5.3-rc6 (2019-08-25 12:01:23 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/for-linus

for you to fetch changes up to 75ee23b30dc712d80d2421a9a547e7ab6e379b44:

  KVM: x86: Don't update RIP or do single-step on faulting emulation 
(2019-08-27 20:59:04 +0200)


KVM fixes for 5.3-rc7

PPC:
- Fix bug which could leave locks locked in the host on return to a
  guest.

x86:
- Prevent infinitely looping emulation of a failing syscall while single
  stepping.
- Do not crash the host when nesting is disabled.


Alexey Kardashevskiy (1):
  KVM: PPC: Book3S: Fix incorrect guest-to-user-translation error handling

Radim Krčmář (1):
  Merge tag 'kvm-ppc-fixes-5.3-1' of git://git.kernel.org/.../paulus/powerpc

Sean Christopherson (1):
  KVM: x86: Don't update RIP or do single-step on faulting emulation

Vitaly Kuznetsov (1):
  KVM: x86: hyper-v: don't crash on KVM_GET_SUPPORTED_HV_CPUID when 
kvm_intel.nested is disabled

 arch/powerpc/kvm/book3s_64_vio.c| 6 --
 arch/powerpc/kvm/book3s_64_vio_hv.c | 6 --
 arch/x86/kvm/hyperv.c   | 5 -
 arch/x86/kvm/svm.c  | 8 +---
 arch/x86/kvm/vmx/vmx.c  | 1 +
 arch/x86/kvm/x86.c  | 9 +
 6 files changed, 19 insertions(+), 16 deletions(-)


Re: [PATCH 1/3] KVM: x86: hyper-v: don't crash on KVM_GET_SUPPORTED_HV_CPUID when kvm_intel.nested is disabled

2019-08-27 Thread Radim Krčmář
2019-08-27 18:04+0200, Vitaly Kuznetsov:
> If kvm_intel is loaded with nested=0 parameter an attempt to perform
> KVM_GET_SUPPORTED_HV_CPUID results in OOPS as nested_get_evmcs_version hook
> in kvm_x86_ops is NULL (we assign it in nested_vmx_hardware_setup() and
> this only happens in case nested is enabled).
> 
> Check that kvm_x86_ops->nested_get_evmcs_version is not NULL before
> calling it. With this, we can remove the stub from svm as it is no
> longer needed.
> 

Added

Cc: 

> Fixes: e2e871ab2f02 ("x86/kvm/hyper-v: Introduce nested_get_evmcs_version() 
> helper")
> Signed-off-by: Vitaly Kuznetsov 

and applied, thanks.


Re: [PATCH] KVM: x86: Don't update RIP or do single-step on faulting emulation

2019-08-27 Thread Radim Krčmář
2019-08-23 13:55-0700, Sean Christopherson:
> Don't advance RIP or inject a single-step #DB if emulation signals a
> fault.  This logic applies to all state updates that are conditional on
> clean retirement of the emulation instruction, e.g. updating RFLAGS was
> previously handled by commit 38827dbd3fb85 ("KVM: x86: Do not update
> EFLAGS on faulting emulation").
> 
> Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with
> ctxt->_eip until emulation "retires" anyways.  Skipping #DB injection
> fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to
> invalid state with RFLAGS.RF=1 would loop indefinitely due to emulation
> overwriting the #UD with #DB and thus restarting the bad SYSCALL over
> and over.
> 
> Cc: Nadav Amit 
> Cc: sta...@vger.kernel.org
> Reported-by: Andy Lutomirski 
> Fixes: 663f4c61b803 ("KVM: x86: handle singlestep during emulation")
> Signed-off-by: Sean Christopherson 
> ---
> 
> Note, this has minor conflict with my recent series to cleanup the
> emulator return flows[*].  The end result should look something like:
> 
> if (!ctxt->have_exception ||
> exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
> kvm_rip_write(vcpu, ctxt->eip);
> if (r && ctxt->tf)
> r = kvm_vcpu_do_singlestep(vcpu);
> __kvm_set_rflags(vcpu, ctxt->eflags);
> }
> 
> [*] 
> https://lkml.kernel.org/r/20190823010709.24879-1-sean.j.christopher...@intel.com
> 
>  arch/x86/kvm/x86.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b4cfd786d0b6..d2962671c3d3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6611,12 +6611,13 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>   unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
>   toggle_interruptibility(vcpu, ctxt->interruptibility);
>   vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
> - kvm_rip_write(vcpu, ctxt->eip);
> - if (r == EMULATE_DONE && ctxt->tf)
> - kvm_vcpu_do_singlestep(vcpu, );
>   if (!ctxt->have_exception ||
> - exception_type(ctxt->exception.vector) == EXCPT_TRAP)
> + exception_type(ctxt->exception.vector) == EXCPT_TRAP) {

Hm, EXCPT_TRAP is either #OF, #BP, or another #DB, none of which we want
to override.  The first two disable TF, and the last one is the same #DB
anyway, while its fault variant must take the other path, so it works out
in the end...

I've fixed the RF in the commit message when applying, thanks.

---
We still seem to have at least a minor problem with single stepping:

SDM, Interrupt 1—Debug Exception (#DB):

  The following items detail the treatment of debug exceptions on the
  instruction boundary following execution of the MOV or the POP
  instruction that loads the SS register:
• If EFLAGS.TF is 1, no single-step trap is generated.

I think a check for KVM_X86_SHADOW_INT_MOV_SS in
kvm_vcpu_do_singlestep() is missing.
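
Something along these lines, perhaps (untested sketch; I'm assuming the
existing get_interrupt_shadow() hook is usable at that point):

	/*
	 * Skip the single-step #DB if the emulated instruction retired
	 * under a MOV/POP SS interrupt shadow; per the SDM, no trap is
	 * generated on that instruction boundary.
	 */
	if (kvm_x86_ops->get_interrupt_shadow(vcpu) & KVM_X86_SHADOW_INT_MOV_SS)
		return;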


Re: [PATCH v2] KVM: LAPIC: Periodically revaluate to get conservative lapic_timer_advance_ns

2019-08-27 Thread Radim Krčmář
2019-08-15 12:03+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> Even if for realtime CPUs, cache line bounces, frequency scaling, presence 
> of higher-priority RT tasks, etc can still cause different response. These 
> interferences should be considered and periodically revaluate whether 
> or not the lapic_timer_advance_ns value is the best, do nothing if it is,
> otherwise recaluate again. Set lapic_timer_advance_ns to the minimal 
> conservative value from all the estimated values.

IIUC, this patch is looking for the minimal timer_advance_ns because it
provides the best throughput:
When every code path runs as fast as possible, we don't have to wait
for the timer to expire, but still arrive exactly at the point when it
would have expired.
We always arrive late if something delayed the execution, which
means higher latencies, but RT shouldn't be using an adaptive algorithm
anyway, so that is not an issue.

The computed conservative timer_advance_ns will converge to the minimal
measured timer_advance_ns as time progresses, because it can only go
down and will do so repeatedly by small steps as even one smaller
measurement sufficiently close is enough to decrease it.

With that in mind, wouldn't the following patch (completely untested)
give you about the same numbers?

The idea is that if we have to wait, we are wasting time and therefore
decrease timer_advance_ns to eliminate the time spent waiting.

The first run is special and just sets timer_advance_ns to the latency
we measured, regardless of what it is -- it deals with the possibility
that the default was too low.

This algorithm is also likely prone to turbo boost making few runs
faster than what is then achieved during a more common workload, but
we'd need to have a sliding window or some other more sophisticated
algorithm in order to deal with that.

I also like Paolo's idea of smoothing -- if we use a moving average
based on advance_expire_delta, we wouldn't even have to convert it into
ns unless it reached a given threshold, which could make it fast enough
to be run every time.

Something like

   moving_average = (apic->lapic_timer.moving_average * (weight - 1) + advance_expire_delta) / weight

   if (moving_average > threshold)
  recompute timer_advance_ns

   apic->lapic_timer.moving_average = moving_average

where weight would be a power of 2 to make the operation fast.

This kind of moving average gives less value to old inputs and the
weight allows us to control the reaction speed of the approximation.
(A small number like 4 or 8 seems about right.)
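
Spelled out (untested sketch, the names are made up):

	#define LAPIC_TIMER_ADVANCE_AVG_WEIGHT	8	/* power of 2 */

	static inline s64 lapic_timer_avg_update(s64 avg, s64 advance_expire_delta)
	{
		/* weight - 1 parts of the old average, one part of the new sample */
		return (avg * (LAPIC_TIMER_ADVANCE_AVG_WEIGHT - 1) +
			advance_expire_delta) / LAPIC_TIMER_ADVANCE_AVG_WEIGHT;
	}

With a power-of-2 weight the division stays cheap, so this could run on
every expiry.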

I don't have any information on the latency, though.
Do you think that the added overhead isn't worth the smoothing?

Thanks.

---8<---
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e904ff06a83d..d7f2af2eb3ce 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1491,23 +1491,20 @@ static inline void adjust_lapic_timer_advance(struct kvm_vcpu *vcpu,
if (advance_expire_delta < 0) {
ns = -advance_expire_delta * 100ULL;
do_div(ns, vcpu->arch.virtual_tsc_khz);
-   timer_advance_ns -= min((u32)ns,
-   timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+   timer_advance_ns -= (u32)ns;
} else {
/* too late */
+   /* This branch can only be taken on the initial calibration. */
+   if (apic->lapic_timer.timer_advance_adjust_done)
+   pr_err_once("kvm: broken expectation in lapic timer_advance_ns");
+
ns = advance_expire_delta * 100ULL;
do_div(ns, vcpu->arch.virtual_tsc_khz);
-   timer_advance_ns += min((u32)ns,
-   timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+   timer_advance_ns += (u32)ns;
}
 
-   if (abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
-   apic->lapic_timer.timer_advance_adjust_done = true;
-   if (unlikely(timer_advance_ns > 5000)) {
-   timer_advance_ns = LAPIC_TIMER_ADVANCE_ADJUST_INIT;
-   apic->lapic_timer.timer_advance_adjust_done = false;
-   }
apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+   apic->lapic_timer.timer_advance_adjust_done = true;
 }
 
 static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
@@ -1526,7 +1523,7 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
if (guest_tsc < tsc_deadline)
__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
 
-   if (unlikely(!apic->lapic_timer.timer_advance_adjust_done))
+   if (unlikely(!apic->lapic_timer.timer_advance_adjust_done) || guest_tsc < tsc_deadline)
	adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
 }
 


Re: [PATCH v4 5/5] KVM: LAPIC: add advance timer support to pi_inject_timer

2019-06-17 Thread Radim Krčmář
2019-06-17 19:24+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> Wait before calling posted-interrupt deliver function directly to add 
> advance timer support to pi_inject_timer.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Marcelo Tosatti 
> Signed-off-by: Wanpeng Li 
> ---

Please merge this patch with [2/5], so bisection doesn't break.

>  arch/x86/kvm/lapic.c   | 6 --
>  arch/x86/kvm/lapic.h   | 2 +-
>  arch/x86/kvm/svm.c | 2 +-
>  arch/x86/kvm/vmx/vmx.c | 2 +-
>  4 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 1a31389..1a31ba5 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1462,6 +1462,8 @@ static void apic_timer_expired(struct kvm_lapic *apic, 
> bool can_pi_inject)
>   return;
>  
>   if (can_pi_inject && posted_interrupt_inject_timer(apic->vcpu)) {
> + if (apic->lapic_timer.timer_advance_ns)
> + kvm_wait_lapic_expire(vcpu, true);

From where does kvm_wait_lapic_expire() take
apic->lapic_timer.expired_tscdeadline?

(I think it would be best to take the functional core of
 kvm_wait_lapic_expire() and make it into a function that takes the
 expired_tscdeadline as an argument.)

Thanks.


Re: [PATCH v2] KVM: x86: clean up conditions for asynchronous page fault handling

2019-06-17 Thread Radim Krčmář
2019-06-13 19:22+0200, Paolo Bonzini:
> Even when asynchronous page fault is disabled, KVM does not want to pause
> the host if a guest triggers a page fault; instead it will put it into
> an artificial HLT state that allows running other host processes while
> allowing interrupt delivery into the guest.
> 
> However, the way this feature is triggered is a bit confusing.
> First, it is not used for page faults while a nested guest is
> running: but this is not an issue since the artificial halt
> is completely invisible to the guest, either L1 or L2.  Second,
> it is used even if kvm_halt_in_guest() returns true; in this case,
> the guest probably should not pay the additional latency cost of the
> artificial halt, and thus we should handle the page fault in a
> completely synchronous way.
> 
> By introducing a new function kvm_can_deliver_async_pf, this patch
> commonizes the code that chooses whether to deliver an async page fault
> (kvm_arch_async_page_not_present) and the code that chooses whether a
> page fault should be handled synchronously (kvm_can_do_async_pf).
> 
> Signed-off-by: Paolo Bonzini 
> ---

Reviewed-by: Radim Krčmář 


Re: [PATCH 22/43] KVM: nVMX: Don't dump VMCS if virtual APIC page can't be mapped

2019-06-17 Thread Radim Krčmář
2019-06-13 19:03+0200, Paolo Bonzini:
> From: Sean Christopherson 
> 
> ... as a malicious userspace can run a toy guest to generate invalid
> virtual-APIC page addresses in L1, i.e. flood the kernel log with error
> messages.
> 
> Fixes: 690908104e39d ("KVM: nVMX: allow tests to use bad virtual-APIC page 
> address")
> Cc: sta...@vger.kernel.org
> Cc: Paolo Bonzini 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Paolo Bonzini 
> ---

Makes me wonder why it looks like this in kvm/queue. :)

  commit 1971a835297f9098ce5a735d38916830b8313a65
  Author: Sean Christopherson 
  AuthorDate: Tue May 7 09:06:26 2019 -0700
  Commit: Paolo Bonzini 
  CommitDate: Thu Jun 13 16:23:13 2019 +0200
  
  KVM: nVMX: Don't dump VMCS if virtual APIC page can't be mapped
  
  ... as a malicious userspace can run a toy guest to generate invalid
  virtual-APIC page addresses in L1, i.e. flood the kernel log with error
  messages.
  
  Fixes: 690908104e39d ("KVM: nVMX: allow tests to use bad virtual-APIC 
page address")
  Cc: stable@xxx
  Cc: Paolo Bonzini 
  Signed-off-by: Sean Christopherson 
  Signed-off-by: Paolo Bonzini 


Re: [PATCH RESEND v3 2/3] KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL

2019-06-17 Thread Radim Krčmář
2019-06-17 14:31+0800, Xiaoyao Li:
> On 6/17/2019 11:32 AM, Xiaoyao Li wrote:
> > On 6/16/2019 5:55 PM, Tao Xu wrote:
> > > +    if (vmx->msr_ia32_umwait_control != host_umwait_control)
> > > +    add_atomic_switch_msr(vmx, MSR_IA32_UMWAIT_CONTROL,
> > > +  vmx->msr_ia32_umwait_control,
> > > +  host_umwait_control, false);
> > 
> > The bit 1 is reserved, at least, we need to do below to ensure not
> > modifying the reserved bit:
> > 
> >  guest_val = (vmx->msr_ia32_umwait_control & ~BIT_ULL(1)) |
> >      (host_val & BIT_ULL(1))
> > 
> 
> I find a better solution to ensure reserved bit 1 not being modified in
> vmx_set_msr() as below:
> 
>   if((data ^ umwait_control_cached) & BIT_ULL(1))
>   return 1;

We could just be checking

if (data & BIT_ULL(1))

because the guest cannot change its visible reserved value and KVM
currently initializes the value to 0.

The arch/x86/kernel/cpu/umwait.c series assumes that the reserved bit
is 0 (hopefully deliberately) and I would do the same in KVM as it
simplifies the logic.  (We don't have to even think about migrations
between machines with a different reserved value and making it play
nicely with possible future implementations of that bit.)
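
i.e. in vmx_set_msr() something like (untested sketch, reusing the field
name from your patch):

	case MSR_IA32_UMWAIT_CONTROL:
		/* bit 1 is reserved and KVM keeps it at 0 */
		if (data & BIT_ULL(1))
			return 1;

		vmx->msr_ia32_umwait_control = data;
		break;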

Thanks.


Re: [PATCH] KVM: x86: clean up conditions for asynchronous page fault handling

2019-06-13 Thread Radim Krčmář
2019-06-13 13:03+0200, Paolo Bonzini:
> Even when asynchronous page fault is disabled, KVM does not want to pause
> the host if a guest triggers a page fault; instead it will put it into
> an artificial HLT state that allows running other host processes while
> allowing interrupt delivery into the guest.
> 
> However, the way this feature is triggered is a bit confusing.
> First, it is not used for page faults while a nested guest is
> running: but this is not an issue since the artificial halt
> is completely invisible to the guest, either L1 or L2.  Second,
> it is used even if kvm_halt_in_guest() returns true; in this case,
> the guest probably should not pay the additional latency cost of the
> artificial halt, and thus we should handle the page fault in a
> completely synchronous way.

The same reasoning would apply to kvm_mwait_in_guest(), so I would
disable APF with it as well.

> By introducing a new function kvm_can_deliver_async_pf, this patch
> commonizes the code that chooses whether to deliver an async page fault
> (kvm_arch_async_page_not_present) and the code that chooses whether a
> page fault should be handled synchronously (kvm_can_do_async_pf).
> 
> Signed-off-by: Paolo Bonzini 
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -9775,6 +9775,36 @@ static int apf_get_user(struct kvm_vcpu *vcpu, u32 
> *val)
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> +{
> + if (unlikely(!lapic_in_kernel(vcpu) ||
> +  kvm_event_needs_reinjection(vcpu) ||
> +  vcpu->arch.exception.pending))
> + return false;
> +
> + if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
> + return false;
> +
> + /*
> +  * If interrupts are off we cannot even use an artificial
> +  * halt state.

Can't we?  The artificial halt state would be canceled by the host page
fault handler.

> +  */
> + return kvm_x86_ops->interrupt_allowed(vcpu);
> @@ -9783,19 +9813,26 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu 
> *vcpu,
>   trace_kvm_async_pf_not_present(work->arch.token, work->gva);
>   kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
>  
> - if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
> - (vcpu->arch.apf.send_user_only &&
> -  kvm_x86_ops->get_cpl(vcpu) == 0))
> + if (!kvm_can_deliver_async_pf(vcpu) ||
> + apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
> + /*
> +  * It is not possible to deliver a paravirtualized asynchronous
> +  * page fault, but putting the guest in an artificial halt state
> +  * can be beneficial nevertheless: if an interrupt arrives, we
> +  * can deliver it timely and perhaps the guest will schedule
> +  * another process.  When the instruction that triggered a page
> +  * fault is retried, hopefully the page will be ready in the 
> host.
> +  */
>   kvm_make_request(KVM_REQ_APF_HALT, vcpu);

A return is missing here, to prevent the delivery of PV APF.
(I'd probably keep the if/else.)
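
I.e. something like (sketch based on the hunk above):

	if (kvm_can_deliver_async_pf(vcpu) &&
	    !apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
		fault.vector = PF_VECTOR;
		fault.error_code_valid = true;
		fault.error_code = 0;
		fault.nested_page_fault = false;
		fault.address = work->arch.token;
		fault.async_page_fault = true;
		kvm_inject_page_fault(vcpu, &fault);
	} else {
		/*
		 * Fall back to the artificial halt state instead of
		 * delivering a paravirtualized async page fault.
		 */
		kvm_make_request(KVM_REQ_APF_HALT, vcpu);
	}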

Thanks.

> - else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
> - fault.vector = PF_VECTOR;
> - fault.error_code_valid = true;
> - fault.error_code = 0;
> - fault.nested_page_fault = false;
> - fault.address = work->arch.token;
> - fault.async_page_fault = true;
> - kvm_inject_page_fault(vcpu, &fault);

>   }
> +
> + fault.vector = PF_VECTOR;
> + fault.error_code_valid = true;
> + fault.error_code = 0;
> + fault.nested_page_fault = false;
> + fault.address = work->arch.token;
> + fault.async_page_fault = true;
> + kvm_inject_page_fault(vcpu, &fault);
>  }
>  
>  void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> -- 
> 1.8.3.1
> 


Re: [PATCH] KVM: nVMX: use correct clean fields when copying from eVMCS

2019-06-13 Thread Radim Krčmář
2019-06-13 13:35+0200, Vitaly Kuznetsov:
> Unfortunately, a couple of mistakes were made while implementing
> Enlightened VMCS support, in particular, wrong clean fields were
> used in copy_enlightened_to_vmcs12():
> - exception_bitmap is covered by CONTROL_EXCPN;
> - vm_exit_controls/pin_based_vm_exec_control/secondary_vm_exec_control
>   are covered by CONTROL_GRP1.
> 
> Fixes: 945679e301ea0 ("KVM: nVMX: add enlightened VMCS state")
> Signed-off-by: Vitaly Kuznetsov 
> ---

Reviewed-by: Radim Krčmář 


Re: [PATCH v3 1/2] KVM: LAPIC: Optimize timer latency consider world switch time

2019-06-12 Thread Radim Krčmář
2019-06-12 21:22+0200, Radim Krčmář:
> 2019-06-12 08:14-0700, Sean Christopherson:
> > On Wed, Jun 12, 2019 at 05:40:18PM +0800, Wanpeng Li wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > @@ -145,6 +145,12 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | 
> > > S_IWUSR);
> > >  static int __read_mostly lapic_timer_advance_ns = -1;
> > >  module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR);
> > >  
> > > +/*
> > > + * lapic timer vmentry advance (tscdeadline mode only) in nanoseconds.
> > > + */
> > > +u32 __read_mostly vmentry_advance_ns = 300;
> > 
> > Enabling this by default makes me nervous, e.g. nothing guarantees that
> > future versions of KVM and/or CPUs will continue to have 300ns of overhead
> > between wait_lapic_expire() and VM-Enter.
> > 
> > If we want it enabled by default so that it gets tested, the default
> > value should be extremely conservative, e.g. set the default to a small
> > percentage (25%?) of the latency of VM-Enter itself on modern CPUs,
> > VM-Enter latency being the min between VMLAUNCH and VMLOAD+VMRUN+VMSAVE.
> 
> I share the sentiment.  We definitely must not enter the guest before
> the deadline has expired and CPUs are approaching 5 GHz (in turbo), so
> 300 ns would be too much even today.
> 
> I wrote a simple testcase for rough timing and there are 267 cycles
> (111 ns @ 2.4 GHz) between doing rdtsc() right after
> kvm_wait_lapic_expire() [1] and doing rdtsc() in the guest as soon as
> possible (see the attached kvm-unit-test).

I forgot to attach it, pasting here as a patch for kvm-unit-tests.

---
diff --git a/x86/Makefile.common b/x86/Makefile.common
index e612dbe..ceed648 100644
--- a/x86/Makefile.common
+++ b/x86/Makefile.common
@@ -58,7 +58,7 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
$(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat \
$(TEST_DIR)/hyperv_synic.flat $(TEST_DIR)/hyperv_stimer.flat \
$(TEST_DIR)/hyperv_connections.flat \
-   $(TEST_DIR)/umip.flat
+   $(TEST_DIR)/umip.flat $(TEST_DIR)/vmentry_latency.flat
 
 ifdef API
 tests-api = api/api-sample api/dirty-log api/dirty-log-perf
diff --git a/x86/vmentry_latency.c b/x86/vmentry_latency.c
new file mode 100644
index 000..3859f09
--- /dev/null
+++ b/x86/vmentry_latency.c
@@ -0,0 +1,45 @@
+#include "x86/vm.h"
+
+static u64 get_last_hypervisor_tsc_delta(void)
+{
+   u64 a = 0, b, c, d;
+   u64 tsc;
+
+   /*
+* The first vmcall is there to force a vm exit just before measuring.
+*/
+   asm volatile ("vmcall" : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
+
+   tsc = rdtsc();
+
+   /*
+* The second hypercall recovers the value that was stored on the vm
+* entry right before the rdtsc() above.
+*/
+   a = 11;
+   asm volatile ("vmcall" : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
+
+   return tsc - a;
+}
+
+static void vmentry_latency(void)
+{
+   unsigned i = 100;
+   u64 min = -1;
+
+   while (i--) {
+   u64 latency = get_last_hypervisor_tsc_delta();
+   if (latency < min)
+   min = latency;
+   }
+
+   printf("vm entry latency is %"PRIu64" TSC cycles\n", min);
+}
+
+int main(void)
+{
+   setup_vm();
+   vmentry_latency();
+
+   return 0;
+}


Re: [PATCH v3 1/2] KVM: LAPIC: Optimize timer latency consider world switch time

2019-06-12 Thread Radim Krčmář
2019-06-12 08:14-0700, Sean Christopherson:
> On Wed, Jun 12, 2019 at 05:40:18PM +0800, Wanpeng Li wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > @@ -145,6 +145,12 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | 
> > S_IWUSR);
> >  static int __read_mostly lapic_timer_advance_ns = -1;
> >  module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR);
> >  
> > +/*
> > + * lapic timer vmentry advance (tscdeadline mode only) in nanoseconds.
> > + */
> > +u32 __read_mostly vmentry_advance_ns = 300;
> 
> Enabling this by default makes me nervous, e.g. nothing guarantees that
> future versions of KVM and/or CPUs will continue to have 300ns of overhead
> between wait_lapic_expire() and VM-Enter.
> 
> If we want it enabled by default so that it gets tested, the default
> value should be extremely conservative, e.g. set the default to a small
> percentage (25%?) of the latency of VM-Enter itself on modern CPUs,
> VM-Enter latency being the min between VMLAUNCH and VMLOAD+VMRUN+VMSAVE.

I share the sentiment.  We definitely must not enter the guest before
the deadline has expired and CPUs are approaching 5 GHz (in turbo), so
300 ns would be too much even today.

I wrote a simple testcase for rough timing and there are 267 cycles
(111 ns @ 2.4 GHz) between doing rdtsc() right after
kvm_wait_lapic_expire() [1] and doing rdtsc() in the guest as soon as
possible (see the attached kvm-unit-test).

That is on a Haswell, where vmexit.flat reports 2120 cycles for a
vmcall.  This would linearly (likely incorrect method in this case)
translate to 230 cycles on a machine with 1800 cycles for a vmcall,
which is less than 50 ns @ 5 GHz.

I wouldn't go above 25 ns for a hard-coded default.

(We could also do a similar measurement when initializing KVM and have a
 dynamic default, but I'm thinking it's going to be way too much code
 for the benefit.)

---
1: This is how the TSC is read in KVM.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index da24f1858acc..a7251ac0109b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6449,6 +6449,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.apic->lapic_timer.timer_advance_ns)
kvm_wait_lapic_expire(vcpu);
 
+   vcpu->last_seen_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+
/*
 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
 * it's non-zero. Since vmentry is serialising on affected CPUs, there
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6200d5a51f13..5e0ce8ca31e7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7201,6 +7201,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_SEND_IPI:
ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
break;
+   case KVM_HC_LAST_SEEN_TSC:
+   ret = vcpu->last_seen_tsc;
+   break;
default:
ret = -KVM_ENOSYS;
break;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index abafddb9fe2c..7f70fe7a28b1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -323,6 +323,8 @@ struct kvm_vcpu {
bool preempted;
struct kvm_vcpu_arch arch;
struct dentry *debugfs_dentry;
+
+   u64 last_seen_tsc;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 6c0ce49931e5..dfbc6e9ad7a1 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -28,6 +28,7 @@
 #define KVM_HC_MIPS_CONSOLE_OUTPUT 8
 #define KVM_HC_CLOCK_PAIRING   9
 #define KVM_HC_SEND_IPI10
+#define KVM_HC_LAST_SEEN_TSC   11
 
 /*
  * hypercalls use architecture specific


Re: [PATCH v3 2/4] KVM: LAPIC: lapic timer interrupt is injected by posted interrupt

2019-06-12 Thread Radim Krčmář
2019-06-12 09:48+0800, Wanpeng Li:
> On Wed, 12 Jun 2019 at 04:39, Marcelo Tosatti  wrote:
> > On Tue, Jun 11, 2019 at 08:17:07PM +0800, Wanpeng Li wrote:
> > > From: Wanpeng Li 
> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > @@ -133,6 +133,12 @@ inline bool 
> > > posted_interrupt_inject_timer_enabled(struct kvm_vcpu *vcpu)
> > >  }
> > >  EXPORT_SYMBOL_GPL(posted_interrupt_inject_timer_enabled);
> > >
> > > +static inline bool can_posted_interrupt_inject_timer(struct kvm_vcpu 
> > > *vcpu)
> > > +{
> > > + return posted_interrupt_inject_timer_enabled(vcpu) &&
> > > + kvm_hlt_in_guest(vcpu->kvm);
> > > +}
> >
> > Hi Li,
> 
> Hi Marcelo,
> 
> >
> > Don't think its necessary to depend on kvm_hlt_in_guest: Can also use
> > exitless injection if the guest is running (think DPDK style workloads
> > that busy-spin on network card).

I agree.

> There are some discussions here.
> 
> https://lkml.org/lkml/2019/6/11/424
> https://lkml.org/lkml/2019/6/5/436

Paolo wants to disable the APF synthetic halt first, which I think is
unrelated to the timer implementation.
The synthetic halt happens when the VCPU cannot progress because the
host swapped out its memory and any asynchronous event should unhalt it,
because we assume that the interrupt path wasn't swapped out.

The posted interrupt does a swake_up_one (part of vcpu kick), which is
everything that the non-posted path does after setting a KVM request --
it's a bug if we later handle the PIR differently from the KVM request,
so the guest is going to be woken up on any halt blocking in KVM (even
synthetic APF halt).

Paolo, have I missed the point?

Thanks.


Re: [PATCH v2 1/3] KVM: LAPIC: Make lapic timer unpinned when timer is injected by posted-interrupt

2019-06-10 Thread Radim Krčmář
2019-06-06 13:31+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> Make lapic timer unpinned when timer is injected by posted-interrupt,
> the emulated timer can be offload to the housekeeping cpus.
> 
> The host admin should fine tuned, e.g. dedicated instances scenario 
> w/ nohz_full cover the pCPUs which vCPUs resident, several pCPUs 
> surplus for housekeeping, disable mwait/hlt/pause vmexits to occupy 
> the pCPUs, fortunately preemption timer is disabled after mwait is 
> exposed to guest which makes emulated timer offload can be possible. 
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/lapic.c| 20 
>  arch/x86/kvm/x86.c  |  5 +
>  arch/x86/kvm/x86.h  |  2 ++
>  include/linux/sched/isolation.h |  2 ++
>  kernel/sched/isolation.c|  6 ++
>  5 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index fcf42a3..09b7387 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -127,6 +127,12 @@ static inline u32 kvm_x2apic_id(struct kvm_lapic *apic)
>   return apic->vcpu->vcpu_id;
>  }
>  
> +static inline bool posted_interrupt_inject_timer_enabled(struct kvm_vcpu 
> *vcpu)
> +{
> + return pi_inject_timer && kvm_vcpu_apicv_active(vcpu) &&
> + kvm_mwait_in_guest(vcpu->kvm);

I'm torn about the mwait dependency.  It covers a lot of the targeted
user base, but the relation is convoluted and doesn't fit perfectly.

What do you think about making posted_interrupt_inject_timer_enabled()
just

pi_inject_timer && kvm_vcpu_apicv_active(vcpu)

and disarming the vmx preemption timer when
posted_interrupt_inject_timer_enabled(), just like we do with mwait now?

Thanks.


Re: [PATCH v2 2/3] KVM: LAPIC: lapic timer interrupt is injected by posted interrupt

2019-06-10 Thread Radim Krčmář
2019-06-06 13:31+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> Dedicated instances are currently disturbed by unnecessary jitter due 
> to the emulated lapic timers fire on the same pCPUs which vCPUs resident.
> There is no hardware virtual timer on Intel for guest like ARM. Both 
> programming timer in guest and the emulated timer fires incur vmexits.
> This patch tries to avoid vmexit which is incurred by the emulated 
> timer fires in dedicated instance scenario. 
> 
> When nohz_full is enabled in dedicated instances scenario, the emulated 
> timers can be offload to the nearest busy housekeeping cpus since APICv 
> is really common in recent years. The guest timer interrupt is injected 
> by posted-interrupt which is delivered by housekeeping cpu once the emulated 
> timer fires. 
> 
> 3%~5% redis performance benefit can be observed on Skylake server.
> 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/lapic.c | 32 +---
>  arch/x86/kvm/x86.h   |  5 +
>  2 files changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 09b7387..c08e5a8 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -133,6 +133,12 @@ static inline bool 
> posted_interrupt_inject_timer_enabled(struct kvm_vcpu *vcpu)
>   kvm_mwait_in_guest(vcpu->kvm);
>  }
>  
> +static inline bool can_posted_interrupt_inject_timer(struct kvm_vcpu *vcpu)
> +{
> + return posted_interrupt_inject_timer_enabled(vcpu) &&
> + !vcpu_halt_in_guest(vcpu);

It would make more sense to have a condition for general blocking in
KVM, but keep in mind that we're not running on the same cpu anymore, so
any code like that has to be properly protected against VM entries under
our hands.  (The VCPU could appear halted here, but before we make
the timer pending, the VCPU would enter and potentially never check the
interrupt.)

I think we should be able to simply do

  if (posted_interrupt_inject_timer_enabled(vcpu))
kvm_inject_apic_timer_irqs();

directly in apic_timer_expired(), as the injection will wake up the
target if necessary.  It's going to be a bit slow for the timer callback in
those cases (too slow to warrant special handling?), but there hopefully
aren't any context restrictions in place.


Thanks.


Re: [PATCH v3 0/3] KVM: Yield to IPI target if necessary

2019-06-10 Thread Radim Krčmář
2019-05-30 09:05+0800, Wanpeng Li:
> The idea is from Xen, when sending a call-function IPI-many to vCPUs, 
> yield if any of the IPI target vCPUs was preempted. 17% performance 
> increasement of ebizzy benchmark can be observed in an over-subscribe 
> environment. (w/ kvm-pv-tlb disabled, testing TLB flush call-function 
> IPI-many since call-function is not easy to be trigged by userspace 
> workload).

Have you checked if we could gain performance by having the yield as an
extension to our PV IPI call?

It would allow us to skip the VM entry/exit overhead on the caller.
(The benefit of that might be negligible and it also poses a
 complication when splitting the target mask into several PV IPI
 hypercalls.)

Thanks.


Re: [PATCH v3 2/3] KVM: X86: Implement PV sched yield hypercall

2019-06-10 Thread Radim Krčmář
2019-05-30 09:05+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> The target vCPUs are in runnable state after vcpu_kick and suitable 
> as a yield target. This patch implements the sched yield hypercall.
> 
> 17% performance increasement of ebizzy benchmark can be observed in an 
> over-subscribe environment. (w/ kvm-pv-tlb disabled, testing TLB flush 
> call-function IPI-many since call-function is not easy to be trigged 
> by userspace workload).
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Liran Alon 
> Signed-off-by: Wanpeng Li 
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -7172,6 +7172,28 @@ void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu)
>   kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
>  }
>  
> +static void kvm_sched_yield(struct kvm *kvm, unsigned long dest_id)
> +{
> + struct kvm_vcpu *target = NULL;
> + struct kvm_apic_map *map = NULL;
> +
> + rcu_read_lock();
> + map = rcu_dereference(kvm->arch.apic_map);
> +
> + if (unlikely(!map) || dest_id > map->max_apic_id)
> + goto out;
> +
> + if (map->phys_map[dest_id]->vcpu) {

This should check for map->phys_map[dest_id].

> + target = map->phys_map[dest_id]->vcpu;
> + rcu_read_unlock();
> + kvm_vcpu_yield_to(target);
> + }
> +
> +out:
> + if (!target)
> + rcu_read_unlock();

Also, I find the following logic clearer

  {
struct kvm_vcpu *target = NULL;
struct kvm_apic_map *map;

rcu_read_lock();
map = rcu_dereference(kvm->arch.apic_map);

if (likely(map) && dest_id <= map->max_apic_id && 
map->phys_map[dest_id])
target = map->phys_map[dest_id]->vcpu;

rcu_read_unlock();

if (target)
kvm_vcpu_yield_to(target);
  }

thanks.


Re: [RESEND PATCH v3] KVM: x86: Add Intel CPUID.1F cpuid emulation support

2019-06-03 Thread Radim Krčmář
2019-05-26 21:30+0800, Like Xu:
> Add support to expose Intel V2 Extended Topology Enumeration Leaf for
> some new systems with multiple software-visible die within each package.
> 
> Per Intel's SDM, when CPUID executes with EAX set to 1FH, the processor
> returns information about extended topology enumeration data. Software
> must detect the presence of CPUID leaf 1FH by verifying (a) the highest
> leaf index supported by CPUID is >= 1FH, and (b) CPUID.1FH:EBX[15:0]
> reports a non-zero value.
> 
> Co-developed-by: Xiaoyao Li 
> Signed-off-by: Xiaoyao Li 
> Signed-off-by: Like Xu 
> Reviewed-by: Sean Christopherson 
> ---
> ==changelog==
> v3:
> - Refine commit message and comment
> 
> v2: https://lkml.org/lkml/2019/4/25/1246
> 
> - Apply cpuid.1f check rule on Intel SDM page 3-222 Vol.2A
> - Add comment to handle 0x1f anf 0xb in common code
> - Reduce check time in a descending-break style
> 
> v1: https://lkml.org/lkml/2019/4/22/28
> 
>  arch/x86/kvm/cpuid.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 80a642a0143d..f9b41f0103b3 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -426,6 +426,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
> *entry, u32 function,
>  
>   switch (function) {
>   case 0:
> + /* Check if the cpuid leaf 0x1f is actually implemented */
> + if (entry->eax >= 0x1f && (cpuid_ebx(0x1f) & 0xffff)) {
> + entry->eax = 0x1f;

Sorry for my late reply, but I think this check does more harm than
good.

We'll need to change it if we ever add a leaf above 0x1f, which also puts
a burden on the new submitter to check that exposing it by an unrelated
feature doesn't break anything.  Just like you had to check whether the
leaf 0x14 is still ok when exposing it without f_intel_pt.

Also, I don't see anything that would make 0x1f worthy of protection
when enabling it also exposes unimplemented leaves (0x14;0x1f).
Zeroing 0x1f.ebx disables it and that is implicitly being done if the
presence check above would fail.

> + break;
> + }
>   entry->eax = min(entry->eax, (u32)(f_intel_pt ? 0x14 : 0xd));

Similarly in the existing code.  If we don't have f_intel_pt, then we
should make sure that leaf 0x14 is not being filled, but we don't really
have to limit the maximal index.

Adding a single clamping like

/* Limited to the highest leaf implemented in KVM. */
entry->eax = min(entry->eax, 0x1f);

seems sufficient.

(Passing the hardware value is ok in theory, but it is a cheap way to
 avoid future leaves that cannot be simply zeroed for some weird reason.)

Thanks.


Re: Linux in KVM guest segfaults when hosts runs Linux 5.1

2019-05-13 Thread Radim Krčmář
2019-05-12 13:53+0200, Marc Haber:
> since updating my home desktop machine to kernel 5.1.1, KVM guests
> started on that machine segfault after booting:
[...]
> Any idea short of bisecting?

It has also been spotted by Borislav and the fix [1] should land in the
next kernel update, thanks for the report.

---
1: https://patchwork.kernel.org/patch/10936271/


Re: [PATCH] KVM: X86: Enable IA32_MSIC_ENABLE MONITOR bit when exposing mwait/monitor

2019-05-13 Thread Radim Krčmář
2019-05-13 17:46+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> MSR IA32_MSIC_ENABLE bit 18, according to SDM:
> 
>  | When this bit is set to 0, the MONITOR feature flag is not set 
> (CPUID.01H:ECX[bit 3] = 0). 
>  | This indicates that MONITOR/MWAIT are not supported.
>  | 
>  | Software attempts to execute MONITOR/MWAIT will cause #UD when this bit is 
> 0.
>  | 
>  | When this bit is set to 1 (default), MONITOR/MWAIT are supported 
> (CPUID.01H:ECX[bit 3] = 1). 
> 
> This bit should be set to 1, if BIOS enables MONITOR/MWAIT support on host 
> and 
> we intend to expose mwait/monitor to the guest.

The CPUID.01H:ECX[bit 3] ought to mirror the value of the MSR bit and
as userspace has control of them both, I'd argue that it is userspace's
job to configure both bits to match on the initial setup.

Also, CPUID.01H:ECX[bit 3] is a better guard than kvm_mwait_in_guest().
kvm_mwait_in_guest() affects the behavior of MONITOR/MWAIT, not its
guest visibility.
Some weird migration cases might want MONITOR in CPUID without
kvm_mwait_in_guest() and the MSR should be correct there as well.

Missing the MSR bit shouldn't be a big problem for guests, so I am in
favor of fixing the userspace code.

Thanks.

(For extra correctness in KVM, we could implement toggling of the CPUID
 bit based on guest writes to the MSR.)
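
Untested sketch of what I mean, in kvm_set_msr_common() (assuming the
usual MSR_IA32_MISC_ENABLE_MWAIT define from msr-index.h):

	case MSR_IA32_MISC_ENABLE: {
		struct kvm_cpuid_entry2 *best = kvm_find_cpuid_entry(vcpu, 0x1, 0);

		/* mirror MISC_ENABLE.ENABLE_MWAIT into CPUID.01H:ECX[3] */
		if (best) {
			if (data & MSR_IA32_MISC_ENABLE_MWAIT)
				best->ecx |= (1 << 3);
			else
				best->ecx &= ~(1 << 3);
		}
		vcpu->arch.ia32_misc_enable_msr = data;
		break;
	}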


[PATCH] Revert "KVM: doc: Document the life cycle of a VM and its resources"

2019-04-29 Thread Radim Krčmář
This reverts commit 919f6cd8bb2fe7151f8aecebc3b3d1ca2567396e.

The patch was applied twice.
The first commit is eca6be566d47029f945a5f8e1c94d374e31df2ca.

Reported-by: Cornelia Huck 
Signed-off-by: Radim Krčmář 
---
 Documentation/virtual/kvm/api.txt | 17 -
 1 file changed, 17 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index b62ad0d94234..26dc1280b49b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -69,23 +69,6 @@ by and on behalf of the VM's process may not be 
freed/unaccounted when
 the VM is shut down.
 
 
-It is important to note that althought VM ioctls may only be issued from
-the process that created the VM, a VM's lifecycle is associated with its
-file descriptor, not its creator (process).  In other words, the VM and
-its resources, *including the associated address space*, are not freed
-until the last reference to the VM's file descriptor has been released.
-For example, if fork() is issued after ioctl(KVM_CREATE_VM), the VM will
-not be freed until both the parent (original) process and its child have
-put their references to the VM's file descriptor.
-
-Because a VM's resources are not freed until the last reference to its
-file descriptor is released, creating additional references to a VM via
-via fork(), dup(), etc... without careful consideration is strongly
-discouraged and may have unwanted side effects, e.g. memory allocated
-by and on behalf of the VM's process may not be freed/unaccounted when
-the VM is shut down.
-
-
 3. Extensions
 -
 
-- 
2.20.1



[GIT PULL] KVM fixes for Linux 5.0-rc2

2019-01-12 Thread Radim Krčmář
Linus,

The following changes since commit bfeffd155283772bbe78c6a05dec7c0128ee500c:

  Linux 5.0-rc1 (2019-01-06 17:08:20 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/for-linus

for you to fetch changes up to 826c1362e79abcd36f99092acd083b5a2d576676:

  x86/kvm/nVMX: don't skip emulated instruction twice when vmptr address is not 
backed (2019-01-11 18:41:53 +0100)


KVM fixes for 5.0-rc2

Minor fixes for new code, corner cases, and documentation.
Patches are isolated and sufficiently described by the shortlog.


Christophe de Dinechin (1):
  Documentation/virtual/kvm: Update URL for AMD SEV API specification

David Rientjes (1):
  kvm: sev: Fail KVM_SEV_INIT if already initialized

Gustavo A. R. Silva (1):
  KVM: x86: Fix bit shifting in update_intel_pt_cfg

Lan Tianyu (1):
  KVM/VMX: Avoid return error when flush tlb successfully in the 
hv_remote_flush_tlb_with_range()

Tomas Bortoli (1):
  KVM: validate userspace input in kvm_clear_dirty_log_protect()

Vitaly Kuznetsov (1):
  x86/kvm/nVMX: don't skip emulated instruction twice when vmptr address is 
not backed

 Documentation/virtual/kvm/amd-memory-encryption.rst | 2 +-
 arch/x86/kvm/svm.c  | 3 +++
 arch/x86/kvm/vmx/nested.c   | 3 +--
 arch/x86/kvm/vmx/vmx.c  | 4 ++--
 virt/kvm/kvm_main.c | 9 +++--
 5 files changed, 14 insertions(+), 7 deletions(-)


Re: [PATCH] kvm: Use struct_size() in kmalloc()

2019-01-11 Thread Radim Krčmář
2019-01-04 10:29-0600, Gustavo A. R. Silva:
> One of the more common cases of allocation size calculations is finding
> the size of a structure that has a zero-sized array at the end, along
> with memory for some number of elements for that array. For example:
> 
> struct foo {
> int stuff;
> void *entry[];
> };
> 
> instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
> 
> Instead of leaving these open-coded and prone to type mistakes, we can
> now use the new struct_size() helper:
> 
> instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
> 
> This code was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva 
> ---

Queued, thanks.


Re: [PATCH] KVM/VMX: Avoid return error when flush tlb successfully in the hv_remote_flush_tlb_with_range()

2019-01-11 Thread Radim Krčmář
2019-01-04 15:20+0800, lantianyu1...@gmail.com:
> From: Lan Tianyu 
> 
> The "ret" is initialized to be ENOTSUPP. The return value of
> __hv_remote_flush_tlb_with_range() will be Or with "ret" when ept
> table potiners are mismatched. This will cause return ENOTSUPP even if
> flush tlb successfully. This patch is to fix the issue and set
> "ret" to 0.
> 
> Fix: a5c214da("KVM/VMX: Change hv flush logic when ept tables are 
> mismatched.")
> Signed-off-by: Lan Tianyu 
> ---

Applied, thanks.


Re: [patch] kvm: sev: Fail KVM_SEV_INIT if already initialized

2019-01-11 Thread Radim Krčmář
2019-01-02 12:56-0800, David Rientjes:
> By code inspection, it was found that multiple calls to KVM_SEV_INIT
> could deplete asid bits and overwrite kvm_sev_info's regions_list.
> 
> Multiple calls to KVM_SVM_INIT is not likely to occur with QEMU, but this
> should likely be fixed anyway.
> 
> This code is serialized by kvm->lock.
> 
> Fixes: 1654efcbc431 ("KVM: SVM: Add KVM_SEV_INIT command")
> Reported-by: Cfir Cohen 
> Signed-off-by: David Rientjes 
> ---

Applied, thanks.


Re: [PATCH] KVM: validate userspace input in kvm_clear_dirty_log_protect()

2019-01-11 Thread Radim Krčmář
2019-01-08 17:28+0100, Tomas Bortoli:
> Hi Paolo,
> 
> On 1/7/19 11:42 PM, Paolo Bonzini wrote:
> > On 02/01/19 18:29, Tomas Bortoli wrote:
> >>n = kvm_dirty_bitmap_bytes(memslot);
> >> +
> >> +  if (n << 3 < log->num_pages || log->first_page > log->num_pages)
> >> +  return -EINVAL;
> >> +
> > 
> > This should be
> > 
> > if (log->first_page > memslot->npages ||

(Wouldn't this be clearer with a >= instead?)

> > log->num_pages > memslot->npages - log->first_page)
> > return -EINVAL;
> > 
> > i.e. the comparison should check the last page in the range, not the
> > number of pages.  In addition, using "n" is unnecessary since we do have
> > the memslot.  I'll do the changes myself if you prefer, but an ack would
> > be nice.
> > 
> > 
> 
> 
> Yeah, I agree. Thanks for the reply and sure you can do the changes, np :)

Done that and applied, thanks.


Re: [PATCH] kvm/eventfd : unnecessory conversion to bool

2019-01-11 Thread Radim Krčmář
2018-12-27 14:22+0800, Peng Hao:
> Conversion to bool is not needed in ioeventfd_in_range.
> 
> Signed-off-by: Peng Hao 
> ---

Fixed the typo in subject and queued, thanks.


Re: [PATCH][next] KVM: x86: Fix bit shifting in update_intel_pt_cfg

2019-01-11 Thread Radim Krčmář
2018-12-26 14:40-0600, Gustavo A. R. Silva:
> ctl_bitmask in pt_desc is of type u64. When an integer like 0xf is
> being left shifted more than 32 bits, the behavior is undefined.
> 
> Fix this by adding suffix ULL to integer 0xf.
> 
> Addresses-Coverity-ID: 1476095 ("Bad bit shift operation")
> Fixes: 6c0f0bba85a0 ("KVM: x86: Introduce a function to initialize the PT 
> configuration")
> Signed-off-by: Gustavo A. R. Silva 
> ---

Applied, thanks.


Re: [RESEND PATCH] kvm/x86: propagate fetch fault into guest

2019-01-11 Thread Radim Krčmář
2018-12-24 20:00+0800, Peng Hao:
> When handling ept misconfig exit, it will call emulate instruction
> with insn_len = 0. The decode instruction function may return a fetch
> fault and should propagate to guest.
> 
> The problem will result to emulation fail.
> KVM internal error. Suberror: 1
> emulation failure
> EAX=f81a0024 EBX=f6a07000 ECX=f6a0737c EDX=f8be0118
> ESI=f6a0737c EDI=0021 EBP=f6929f98 ESP=f6929f98
> EIP=f8bdd141 EFL=00010086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =007b   00c0f300 DPL=3 DS   [-WA]
> CS =0060   00c09b00 DPL=0 CS32 [-RA]
> SS =0068   00c09300 DPL=0 DS   [-WA]
> DS =007b   00c0f300 DPL=3 DS   [-WA]
> FS =00d8 2c044000  00809300 DPL=0 DS16 [-WA]
> GS =0033 081a44c8 01000fff 00d0f300 DPL=3 DS   [-WA]
> LDT=   
> TR =0080 f6ea0c80 206b 8b00 DPL=0 TSS32-busy
> GDT= f6e99000 00ff
> IDT= fffbb000 07ff
> CR0=80050033 CR2=b757d000 CR3=35d31000 CR4=001406d0

Do you have a test case for this?

> Signed-off-by: Peng Hao 
> ---
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> @@ -5114,8 +5114,11 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, 
> void *insn, int insn_len)
>   memcpy(ctxt->fetch.data, insn, insn_len);
>   else {
>   rc = __do_insn_fetch_bytes(ctxt, 1);
> - if (rc != X86EMUL_CONTINUE)
> + if (rc != X86EMUL_CONTINUE) {
> + if (rc == X86EMUL_PROPAGATE_FAULT)
> + ctxt->have_exception = true;
>   return rc;

(Ugh, the caller expects EMULATION_FAILED instead of rc.)

> + }
>   }
>  
>   switch (mode) {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -6333,8 +6333,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>   if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
>   emulation_type))
>   return EMULATE_DONE;
> - if (ctxt->have_exception && 
> inject_emulated_exception(vcpu))

I don't understand what that return value check was supposed to do, but
your version seems good.

I have queued it for rc3 to get some extra testing,

thanks.


Re: [PATCH v3] x86/kvmclock : convert to SPDX identifiers

2018-12-20 Thread Radim Krčmář
2018-11-02 17:05+0800, Peng Hao:
> Update the verbose license text with the matching SPDX 
> license identifier.
> 
> Signed-off-by: Peng Hao 
> ---
>  arch/x86/kernel/kvmclock.c | 15 +--
>  1 files changed, 1 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 1e67646..a59325e 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -1,19 +1,6 @@
> +// SPDX-License-Identifier: GPL-2.0+

GPL-2.0+ is deprecated in favor of GPL-2.0-or-later, so I have changed
it to that one and queued,

thanks.


Re: [PATCH] selftests: kvm: report failed stage when exit reason is unexpected

2018-12-20 Thread Radim Krčmář
2018-12-19 12:15+0100, Vitaly Kuznetsov:
> When we get a report like
> 
>  Test Assertion Failure 
>   x86_64/state_test.c:157: run->exit_reason == KVM_EXIT_IO
>   pid=955 tid=955 - Success
>  10x00401350: main at state_test.c:154
>  20x7fc31c9e9412: ?? ??:0
>  30x0040159d: _start at ??:?
>   Unexpected exit reason: 8 (SHUTDOWN),
> 
> it is not obvious which particular stage failed. Add the info.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---

Queued, thanks.


Re: [PATCH] KVM: x86: svm: report MSR_IA32_MCG_EXT_CTL as unsupported

2018-12-20 Thread Radim Krčmář
2018-12-19 12:06+0100, Vitaly Kuznetsov:
> AMD doesn't seem to implement MSR_IA32_MCG_EXT_CTL and svm code in kvm
> knows nothing about it, however, this MSR is among emulated_msrs and
> thus returned with KVM_GET_MSR_INDEX_LIST. The consequent KVM_GET_MSRS,
> of course, fails.
> 
> Report the MSR as unsupported to not confuse userspace.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/svm.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 2acb42b74a51..dfdf7d0b7f88 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -5845,6 +5845,13 @@ static bool svm_cpu_has_accelerated_tpr(void)
>  
>  static bool svm_has_emulated_msr(int index)
>  {
> + switch (index) {
> + case MSR_IA32_MCG_EXT_CTL:
> + return false;
> + default:
> + break;

Queued, thanks.

Btw, I would prefer this without the

  default: break;

as I don't think we'll ever add anything there.
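
I.e. simply:

	switch (index) {
	case MSR_IA32_MCG_EXT_CTL:
		return false;
	}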


Re: [PATCH] KVM: MMU: Introduce single thread to zap collapsible sptes

2018-12-20 Thread Radim Krčmář
2018-12-06 15:58+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> Last year guys from huawei reported that the call of 
> memory_global_dirty_log_start/stop() 
> takes 13s for 4T memory and cause guest freeze too long which increases the 
> unacceptable 
> migration downtime. [1] [2]
> 
> Guangrong pointed out:
> 
> | collapsible_sptes zaps 4k mappings to make memory-read happy, it is not
> | required by the semanteme of KVM_SET_USER_MEMORY_REGION and it is not
> | urgent for vCPU's running, it could be done in a separate thread and use
> | lock-break technology.
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05249.html
> [2] https://www.mail-archive.com/qemu-devel@nongnu.org/msg449994.html
> 
> Several TB memory guest is common now after NVDIMM is deployed in cloud 
> environment.
> This patch utilizes worker thread to zap collapsible sptes in order to lazy 
> collapse 
> small sptes into large sptes during roll-back after live migration fails.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Signed-off-by: Wanpeng Li 
> ---
> @@ -5679,14 +5679,41 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm 
> *kvm,
>   return need_tlb_flush;
>  }
>  
> +void zap_collapsible_sptes_fn(struct work_struct *work)
> +{
> + struct kvm_memory_slot *memslot;
> + struct kvm_memslots *slots;
> + struct delayed_work *dwork = to_delayed_work(work);
> + struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> +kvm_mmu_zap_collapsible_sptes_work);
> + struct kvm *kvm = container_of(ka, struct kvm, arch);
> + int i;
> +
> + mutex_lock(&kvm->slots_lock);
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + spin_lock(&kvm->mmu_lock);
> + slots = __kvm_memslots(kvm, i);
> + kvm_for_each_memslot(memslot, slots) {
> + slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
> + kvm_mmu_zap_collapsible_spte, true);
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> + cond_resched_lock(&kvm->mmu_lock);

I think we shouldn't zap all memslots when kvm_mmu_zap_collapsible_sptes
only wanted to zap a specific one.
Please add a list of memslots to be zapped; delete from the list here
and add to it in kvm_mmu_zap_collapsible_sptes().

> + }
> + spin_unlock(&kvm->mmu_lock);
> + }
> + kvm->arch.zap_in_progress = false;
> + mutex_unlock(&kvm->slots_lock);
> +}
> +
> +#define KVM_MMU_ZAP_DELAYED (60 * HZ)
>  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  const struct kvm_memory_slot *memslot)
>  {
> - /* FIXME: const-ify all uses of struct kvm_memory_slot.  */
> - spin_lock(&kvm->mmu_lock);
> - slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
> -  kvm_mmu_zap_collapsible_spte, true);
> - spin_unlock(&kvm->mmu_lock);
> + if (!kvm->arch.zap_in_progress) {

The list can also serve in place of zap_in_progress -- if there were any
elements in it, then there is no need to schedule the work again.
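
A rough sketch of what I have in mind (completely untested; zap_list and
zap_link are made-up names):

	/* kvm_mmu_zap_collapsible_sptes(): queue the slot for the worker */
	spin_lock(&kvm->mmu_lock);
	if (list_empty(&kvm->arch.zap_list))
		schedule_delayed_work(&kvm->arch.kvm_mmu_zap_collapsible_sptes_work,
				      KVM_MMU_ZAP_DELAYED);
	list_add_tail(&memslot->zap_link, &kvm->arch.zap_list);
	spin_unlock(&kvm->mmu_lock);

The worker then pops entries from kvm->arch.zap_list under mmu_lock and
calls slot_handle_leaf() on each, so only the slots that actually changed
get zapped and the pending list itself tells us whether the work still
needs to be scheduled.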

Thanks.


Re: [PATCH V3] kvm:x86 :remove unnecessary recalculate_apic_map

2018-12-13 Thread Radim Krčmář
2018-12-04 17:42+0800, Peng Hao:
> In the previous code, the variable apic_sw_disabled influences
> recalculate_apic_map. But in "KVM: x86: simplify kvm_apic_map"
> (commit:3b5a5ffa928a3f875b0d5dd284eeb7c322e1688a),
> the access to apic_sw_disabled in recalculate_apic_map has been
> deleted.
> 
> Signed-off-by: Peng Hao 
> ---

Reviewed-by: Radim Krčmář 


Re: [PATCH] KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned

2018-10-24 Thread Radim Krčmář
2018-10-24 04:01-0700, Sean Christopherson:
> On Sat, Oct 20, 2018 at 11:42:59PM +0200, KarimAllah Ahmed wrote:
> > The spec only requires the posted interrupt descriptor address to be
> > 64-bytes aligned (i.e. bits[0:5] == 0). Using page_address_valid also
> > forces the address to be page aligned.
> > 
> > Only validate that the address does not cross the maximum physical address
> > without enforcing a page alignment.
> > 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: H. Peter Anvin 
> > Cc: x...@kernel.org
> > Cc: k...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Fixes: 6de84e581c0 ("nVMX x86: check posted-interrupt descriptor addresss 
> > on vmentry of L2")
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> >  arch/x86/kvm/vmx.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 30bf860..47962f2 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -11668,7 +11668,7 @@ static int nested_vmx_check_apicv_controls(struct 
> > kvm_vcpu *vcpu,
> > !nested_exit_intr_ack_set(vcpu) ||
> > (vmcs12->posted_intr_nv & 0xff00) ||
> > (vmcs12->posted_intr_desc_addr & 0x3f) ||
> > -   (!page_address_valid(vcpu, vmcs12->posted_intr_desc_addr
> > +   (vmcs12->posted_intr_desc_addr >> cpuid_maxphyaddr(vcpu)))
> > return -EINVAL;
> 
> Can you update the comment for this code block?  It has a stale blurb
> about "the descriptor address has been already checked in
> nested_get_vmcs12_pages" and it'd be nice to state why bits[5:0] must
> be zero (your changelog is much more helpful than the current comment).
> 
> With that:
> 
> Reviewed-by: Sean Christopherson 

I have just sent a pull request with the stale comment. :(


[GIT PULL] KVM updates for Linux 4.20-rc1

2018-10-24 Thread Radim Krčmář
 nested_vmx_check_pml_controls() concise

Kristina Martsenko (1):
  vgic: Add support for 52bit guest physical address

Ladi Prosek (1):
  KVM: hyperv: define VP assist page helpers

Lan Tianyu (1):
  KVM/VMX: Change hv flush logic when ept tables are mismatched.

Liran Alon (4):
  KVM: nVMX: Flush TLB entries tagged by dest EPTP on L1<->L2 transitions
  KVM: nVMX: Use correct VPID02 when emulating L1 INVVPID
  KVM: nVMX: Flush linear and combined mappings on VPID02 related flushes
  KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 uses VPID and EPT

Marc Zyngier (2):
  KVM: arm/arm64: Rename kvm_arm_config_vm to kvm_arm_setup_stage2
  KVM: arm64: Drop __cpu_init_stage2 on the VHE path

Mark Rutland (1):
  KVM: arm64: Fix caching of host MDCR_EL2 value

Michael Ellerman (1):
  Merge branch 'kvm-ppc-fixes' of paulus/powerpc into topic/ppc-kvm

Paolo Bonzini (9):
  Merge tag 'kvm-s390-next-4.20-1' of 
git://git.kernel.org/.../kvms390/linux into HEAD
  Merge tag 'kvm-ppc-next-4.20-1' of 
git://git.kernel.org/.../paulus/powerpc into HEAD
  Merge tag 'kvm-s390-next-4.20-2' of 
git://git.kernel.org/.../kvms390/linux into HEAD
  kvm/x86: return meaningful value from KVM_SIGNAL_MSI
  kvm: x86: optimize dr6 restore
  x86/kvm/mmu: get rid of redundant kvm_mmu_setup()
  KVM: VMX: enable nested virtualization by default
  Merge tag 'kvmarm-for-v4.20' of git://git.kernel.org/.../kvmarm/kvmarm 
into HEAD
  Merge tag 'kvm-ppc-next-4.20-2' of 
git://git.kernel.org/.../paulus/powerpc into HEAD

Paul Mackerras (27):
  KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the 
same VM
  powerpc: Turn off CPU_FTR_P9_TM_HV_ASSIST in non-hypervisor mode
  KVM: PPC: Book3S: Simplify external interrupt handling
  KVM: PPC: Book3S HV: Remove left-over code in XICS-on-XIVE emulation
  KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code
  KVM: PPC: Book3S HV: Extract PMU save/restore operations as C-callable 
functions
  KVM: PPC: Book3S HV: Simplify real-mode interrupt handling
  KVM: PPC: Book3S: Rework TM save/restore code and make it C-callable
  KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlocked
  KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix 
guests
  KVM: PPC: Book3S HV: Handle hypervisor instruction faults better
  KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings
  KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct
  KVM: PPC: Book3S HV: Use kvmppc_unmap_pte() in kvm_unmap_radix()
  KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
  KVM: PPC: Book3S HV: Nested guest entry via hypercall
  KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested 
hypervisor
  KVM: PPC: Book3S HV: Handle hypercalls correctly when nested
  KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested
  KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested
  KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register
  KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode
  KVM: PPC: Book3S HV: Add nested shadow page tables to debugfs
  Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into 
kvm-ppc-next
  KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization
  KVM: PPC: Book3S HV: Add NO_HASH flag to GET_SMMU_INFO ioctl result
  KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 
chips

Peng Hao (3):
  kvm/x86 : fix some typo
  kvm/x86 : add document for coalesced mmio
  kvm/x86 : add coalesced pio support

Pierre Morel (11):
  KVM: s390: Clear Crypto Control Block when using vSIE
  KVM: s390: vsie: Do the CRYCB validation first
  KVM: s390: vsie: Make use of CRYCB FORMAT2 clear
  KVM: s390: vsie: Allow CRYCB FORMAT-2
  KVM: s390: vsie: allow CRYCB FORMAT-1
  KVM: s390: vsie: allow CRYCB FORMAT-0
  KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1
  KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2
  KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2
  KVM: s390: Tracing APCB changes
  s390: vfio-ap: setup APCB mask using KVM dedicated function

Punit Agrawal (1):
  KVM: arm/arm64: Ensure only THP is candidate for adjustment

Radim Krčmář (1):
  Revert "kvm: x86: optimize dr6 restore"

Sean Christopherson (22):
  KVM: vmx: rename KVM_GUEST_CR0_MASK tp KVM_VM_CR0_ALWAYS_OFF
  KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail
  KVM: nVMX: move host EFER consistency checks to VMFail path
  KVM: nVMX: move vmcs12 EPTP consistency check to check_vmentry_prereqs()
  KVM: nVMX: use vm_exit_controls_init() to write exit controls for vmcs02
  KVM: nVMX: reset cache/shadows when switching loaded V


[GIT PULL] KVM fixes for Linux 4.19-rc3

2018-09-08 Thread Radim Krčmář
Linus,

The following changes since commit 57361846b52bc686112da6ca5368d11210796804:

  Linux 4.19-rc2 (2018-09-02 14:37:30 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/for-linus

for you to fetch changes up to bdf7ffc89922a52a4f08a12f7421ea24bb7626a0:

  KVM: LAPIC: Fix pv ipis out-of-bounds access (2018-09-07 18:38:43 +0200)


KVM fixes for 4.19-rc3

ARM:
 - Fix a VFP corruption in 32-bit guest
 - Add missing cache invalidation for CoW pages
 - Two small cleanups

s390:
 - Fallout from the hugetlbfs support: pfmf interpretation and locking
 - VSIE: fix keywrapping for nested guests

PPC:
 - Fix a bug where pages might not get marked dirty, causing
   guest memory corruption on migration,
 - Fix a bug causing reads from guest memory to use the wrong guest
   real address for very large HPT guests (>256G of memory), leading to
   failures in instruction emulation.

x86:
 - Fix out of bound access from malicious pv ipi hypercalls (introduced
   in rc1)
 - Fix delivery of pending interrupts when entering a nested guest,
   preventing arbitrarily late injection
 - Sanitize kvm_stat output after destroying a guest
 - Fix infinite loop when emulating a nested guest page fault
   and improve the surrounding emulation code
 - Two minor cleanups


Colin Ian King (1):
  KVM: SVM: remove unused variable dst_vaddr_end

Janosch Frank (2):
  KVM: s390: Fix pfmf and conditional skey emulation
  KVM: s390: Properly lock mm context allow_gmap_hpage_1m setting

Liran Alon (1):
  KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2

Marc Zyngier (3):
  KVM: arm/arm64: Clean dcache to PoC when changing PTE due to CoW
  arm64: KVM: Only force FPEXC32_EL2.EN if trapping FPSIMD
  KVM: Remove obsolete kvm_unmap_hva notifier backend

Paul Mackerras (2):
  KVM: PPC: Book3S HV: Use correct pagesize in kvm_unmap_radix()
  KVM: PPC: Book3S HV: Don't truncate HPTE index in xlate function

Pierre Morel (1):
  KVM: s390: vsie: copy wrapping keys to right place

Radim Krčmář (3):
  Merge tag 'kvm-ppc-fixes-4.19-1' of 
git://git.kernel.org/.../paulus/powerpc
  Merge tag 'kvm-s390-master-4.19-1' of 
git://git.kernel.org/.../kvms390/linux
  Merge tag 'kvm-arm-fixes-for-v4.19-v2' of 
git://git.kernel.org/.../kvmarm/kvmarm

Sean Christopherson (8):
  KVM: VMX: Do not allow reexecute_instruction() when skipping MMIO instr
  KVM: x86: SVM: Set EMULTYPE_NO_REEXECUTE for RSM emulation
  KVM: x86: Invert emulation re-execute behavior to make it opt-in
  KVM: x86: Merge EMULTYPE_RETRY and EMULTYPE_ALLOW_REEXECUTE
  KVM: x86: Default to not allowing emulation retry in kvm_mmu_page_fault
  KVM: x86: Do not re-{try,execute} after failed emulation in L2
  KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction()
  KVM: x86: Unexport x86_emulate_instruction()

Stefan Raspl (7):
  tools/kvm_stat: fix python3 issues
  tools/kvm_stat: fix handling of invalid paths in debugfs provider
  tools/kvm_stat: fix updates for dead guests
  tools/kvm_stat: don't reset stats when setting PID filter for debugfs
  tools/kvm_stat: handle guest removals more gracefully
  tools/kvm_stat: indicate dead guests as such
  tools/kvm_stat: re-animate display of dead guests

Steven Price (1):
  arm64: KVM: Remove pgd_lock

Vitaly Kuznetsov (1):
  KVM: nVMX: avoid redundant double assignment of nested_run_pending

Wanpeng Li (1):
  KVM: LAPIC: Fix pv ipis out-of-bounds access

 arch/arm/include/asm/kvm_host.h|  1 -
 arch/arm64/include/asm/kvm_host.h  |  4 +--
 arch/arm64/kvm/hyp/switch.c|  9 --
 arch/mips/include/asm/kvm_host.h   |  1 -
 arch/mips/kvm/mmu.c| 10 --
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  6 ++--
 arch/s390/include/asm/mmu.h|  8 -
 arch/s390/kvm/kvm-s390.c   |  2 ++
 arch/s390/kvm/priv.c   | 30 ++---
 arch/s390/kvm/vsie.c   |  3 +-
 arch/x86/include/asm/kvm_host.h| 22 -
 arch/x86/kvm/lapic.c   | 27 
 arch/x86/kvm/mmu.c | 26 ---
 arch/x86/kvm/svm.c | 19 ++-
 arch/x86/kvm/vmx.c | 43 ++---
 arch/x86/kvm/x86.c | 28 +---
 arch/x86/kvm/x86.h |  2 ++
 tools/kvm/kvm_stat/kvm_stat| 59 --
 virt/kvm/arm/mmu.c | 21 +---
 virt/kvm/arm/trace.h   | 15 -
 21 files changed, 204 insertions(+), 134 deletions(-)


Re: [PATCH v2] KVM: LAPIC: Fix pv ipis out-of-bounds access

2018-08-30 Thread Radim Krčmář
2018-08-30 10:03+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> Dan Carpenter reported that the untrusted data returns from 
> kvm_register_read()
> results in the following static checker warning:
>   arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
>   error: buffer underflow 'map->phys_map' 's32min-s32max'
> 
> KVM guest can easily trigger this by executing the following assembly 
> sequence 
> in Ring0:
> 
> mov $10, %rax
> mov $0x, %rbx
> mov $0x, %rdx
> mov $0, %rsi
> vmcall
> 
> As this will cause KVM to execute the following code-path:
> vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> 
> kvm_pv_send_ipi()
> which will reach out-of-bounds access.
> 
> This patch fixes it by adding a check to kvm_pv_send_ipi() against 
> map->max_apic_id, 
> ignoring destinations that are not present and delivering the rest. We also 
> check 
> whether or not map->phys_map[min + i] is NULL since the max_apic_id is set to 
> the 
> max apic id, some phys_map maybe NULL when apic id is sparse, especially kvm 
> unconditionally set max_apic_id to 255 to reserve enough space for any xAPIC 
> ID.
> 
> Reported-by: Dan Carpenter 
> Reviewed-by: Liran Alon 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Liran Alon 
> Cc: Dan Carpenter 
> Signed-off-by: Wanpeng Li 
> ---
> v1 -> v2:
>  * add min > map->max_apic_id check
>  * change min to u32
>  * add min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> @@ -548,7 +548,7 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct 
> kvm_lapic_irq *irq,
>  }
>  
>  int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
> - unsigned long ipi_bitmap_high, int min,
> + unsigned long ipi_bitmap_high, u32 min,
>   unsigned long icr, int op_64_bit)
>  {
>   int i;
> @@ -571,18 +571,27 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long 
> ipi_bitmap_low,
>   rcu_read_lock();
>   map = rcu_dereference(kvm->arch.apic_map);
>  
> + if (min > map->max_apic_id)
> + goto out;
>   /* Bits above cluster_size are masked in the caller.  */
> - for_each_set_bit(i, &ipi_bitmap_low, BITS_PER_LONG) {
> - vcpu = map->phys_map[min + i]->vcpu;
> - count += kvm_apic_set_irq(vcpu, &irq, NULL);
> + for_each_set_bit(i, &ipi_bitmap_low,
> + min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) {
> + if (map->phys_map[min + i]) {
> + vcpu = map->phys_map[min + i]->vcpu;
> + count += kvm_apic_set_irq(vcpu, &irq, NULL);
> + }
>   }
>  
>   min += cluster_size;

We need a second

	if (min > map->max_apic_id)
		goto out;

here.  I will add it while applying if there are no other change
requests.

> - for_each_set_bit(i, &ipi_bitmap_high, BITS_PER_LONG) {
> - vcpu = map->phys_map[min + i]->vcpu;
> - count += kvm_apic_set_irq(vcpu, &irq, NULL);
> + for_each_set_bit(i, &ipi_bitmap_high,
> + min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) {
> + if (map->phys_map[min + i]) {
> + vcpu = map->phys_map[min + i]->vcpu;
> + count += kvm_apic_set_irq(vcpu, &irq, NULL);
> + }
>   }
>  
> +out:
>   rcu_read_unlock();
>   return count;
>  }
> -- 
> 2.7.4
> 


Re: [PATCH V3 1/2] kvm/x86 : add coalesced pio support

2018-08-29 Thread Radim Krčmář
2018-08-24 19:20+0800, Peng Hao:
> Signed-off-by: Peng Hao 
> ---
>  include/uapi/linux/kvm.h  | 5 +++--
>  virt/kvm/coalesced_mmio.c | 8 +---
>  virt/kvm/kvm_main.c   | 2 ++
>  3 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index b6270a3..9cc56d3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -420,13 +420,13 @@ struct kvm_run {
>  struct kvm_coalesced_mmio_zone {
>   __u64 addr;
>   __u32 size;
> - __u32 pad;
> + __u32 pio;

I would prefer to have this as a slightly more compatible

	union {
		__u32 pad;
		__u32 pio;
	};

>  };
>  
>  struct kvm_coalesced_mmio {
>   __u64 phys_addr;
>   __u32 len;
> - __u32 pad;
> + __u32 pio;

Also, please add a check that "pio <= 1".

This would catch most cases where userspace passed garbage in that field
and we'd also make the remaining bits available for future features.
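
E.g. roughly, in kvm_vm_ioctl_register_coalesced_mmio():

	if (zone->pio > 1)
		return -EINVAL;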

Thanks.


Re: [PATCH v3 2/6] KVM: X86: Implement PV IPIs in linux guest

2018-07-19 Thread Radim Krčmář
2018-07-19 18:47+0200, Paolo Bonzini:
> On 19/07/2018 18:28, Radim Krčmář wrote:
> >> +
> >> +  kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap_low, ipi_bitmap_high, 
> >> vector);
> > and
> > 
> > kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap[0], ipi_bitmap[1], vector);
> > 
> > Still, the main problem is that we can only address 128 APICs.
> > 
> > A simple improvement would reuse the vector field (as we need only 8
> > bits) and put a 'offset' in the rest.  The offset would say which
> > cluster of 128 are we addressing.  24 bits of offset results in 2^31
> > total addressable CPUs (we probably should even use that many bits).
> > The downside of this is that we can only address 128 at a time.
> > 
> > It's basically the same as x2apic cluster mode, only with 128 cluster
> > size instead of 16, so the code should be a straightforward port.
> > And because x2apic code doesn't seem to use any division by the cluster
> > size, we could even try to use kvm_hypercall4, add ipi_bitmap[2], and
> > make the cluster size 192. :)
> 
> I did suggest an offset earlier in the discussion.
> 
> The main problem is that consecutive CPU ids do not map to consecutive
> APIC ids.  But still, we could do an hypercall whenever the total range
> exceeds 64.  Something like

Right, the cluster x2apic implementation came with a second mapping to do
this in linear time and send as few IPIs as possible:

	/* Collapse cpus in a cluster so a single IPI per cluster is sent */
	for_each_cpu(cpu, tmpmsk) {
		struct cluster_mask *cmsk = per_cpu(cluster_masks, cpu);

		dest = 0;
		for_each_cpu_and(clustercpu, tmpmsk, &cmsk->mask)
			dest |= per_cpu(x86_cpu_to_logical_apicid, clustercpu);

		if (!dest)
			continue;

		__x2apic_send_IPI_dest(dest, vector, apic->dest_logical);
		/* Remove cluster CPUs from tmpmask */
		cpumask_andnot(tmpmsk, tmpmsk, &cmsk->mask);
	}

I think that the extra memory consumption would be excusable.


Re: [PATCH v3 5/6] KVM: X86: Add NMI support to PV IPIs

2018-07-19 Thread Radim Krčmář
2018-07-03 14:21+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> The NMI delivery mode of ICR is used to deliver an NMI to the processor, 
> and the vector information is ignored.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Vitaly Kuznetsov 
> Signed-off-by: Wanpeng Li 
> ---
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> @@ -479,7 +479,16 @@ static int __send_ipi_mask(const struct cpumask *mask, 
> int vector)
>   }
>   }
>  
> - ret = kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap_low, ipi_bitmap_high, 
> vector);
> + switch (vector) {
> + default:
> + icr = APIC_DM_FIXED | vector;
> + break;
> + case NMI_VECTOR:
> + icr = APIC_DM_NMI;

I think it would be better to say that KVM interprets NMI_VECTOR and
sends the interrupt as APIC_DM_NMI.
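
I.e. keep passing the raw vector in the hypercall and handle it on the
host side, roughly (untested sketch):

	/* in kvm_pv_send_ipi() */
	irq.vector = vector;
	irq.delivery_mode = vector == NMI_VECTOR ? APIC_DM_NMI : APIC_DM_FIXED;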

> + break;
> + }
> +
> + ret = kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap_low, ipi_bitmap_high, 
> icr);
>  


Re: [PATCH v3 2/6] KVM: X86: Implement PV IPIs in linux guest

2018-07-19 Thread Radim Krčmář
2018-07-03 14:21+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> Implement paravirtual apic hooks to enable PV IPIs.
> 
> apic->send_IPI_mask
> apic->send_IPI_mask_allbutself
> apic->send_IPI_allbutself
> apic->send_IPI_all
> 
> The PV IPIs supports maximal 128 vCPUs VM, it is big enough for cloud 
> environment currently, supporting more vCPUs needs to introduce more 
> complex logic, in the future this might be extended if needed.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Vitaly Kuznetsov 
> Signed-off-by: Wanpeng Li 
> ---
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> @@ -454,6 +454,71 @@ static void __init sev_map_percpu_data(void)
>  }
>  
>  #ifdef CONFIG_SMP
> +
> +#ifdef CONFIG_X86_64
> +static void __send_ipi_mask(const struct cpumask *mask, int vector)
> +{
> + unsigned long flags, ipi_bitmap_low = 0, ipi_bitmap_high = 0;
> + int cpu, apic_id;
> +
> + if (cpumask_empty(mask))
> + return;
> +
> + local_irq_save(flags);
> +
> + for_each_cpu(cpu, mask) {
> + apic_id = per_cpu(x86_cpu_to_apicid, cpu);
> + if (apic_id < BITS_PER_LONG)
> + __set_bit(apic_id, &ipi_bitmap_low);
> + else if (apic_id < 2 * BITS_PER_LONG)
> + __set_bit(apic_id - BITS_PER_LONG, &ipi_bitmap_high);

It'd be nicer with 'unsigned long ipi_bitmap[2]' and a single

__set_bit(apic_id, ipi_bitmap);
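
In full, the loop would be roughly:

	unsigned long ipi_bitmap[2] = {};
	...
	for_each_cpu(cpu, mask) {
		apic_id = per_cpu(x86_cpu_to_apicid, cpu);
		if (apic_id < 2 * BITS_PER_LONG)
			__set_bit(apic_id, ipi_bitmap);
	}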

> + }
> +
> + kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap_low, ipi_bitmap_high, 
> vector);

and

kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap[0], ipi_bitmap[1], vector);

Still, the main problem is that we can only address 128 APICs.

A simple improvement would reuse the vector field (as we need only 8
bits) and put a 'offset' in the rest.  The offset would say which
cluster of 128 are we addressing.  24 bits of offset results in 2^31
total addressable CPUs (we probably should even use that many bits).
The downside of this is that we can only address 128 at a time.

It's basically the same as x2apic cluster mode, only with 128 cluster
size instead of 16, so the code should be a straightforward port.
And because x2apic code doesn't seem to use any division by the cluster
size, we could even try to use kvm_hypercall4, add ipi_bitmap[2], and
make the cluster size 192. :)
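
A sketch of the hypercall interface this implies (hypothetical encoding,
not part of any ABI):

	/* bits 0-7: vector, higher bits: index of the 128-APIC cluster */
	kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap[0], ipi_bitmap[1],
		       (cluster_offset << 8) | vector);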

But because it is very similar to x2apic, I'd really need some real
performance data to see if this benefits a real workload.
Hardware could further optimize LAPIC (apicv, vapic) in the future,
which we'd lose by using paravirt.

e.g. AMD's acceleration should be superior to this when using < 8 VCPUs
as they can use logical xAPIC and send without VM exits (when all VCPUs
are running).

> +
> + local_irq_restore(flags);
> +}
> +
> +static void kvm_send_ipi_mask(const struct cpumask *mask, int vector)
> +{
> + __send_ipi_mask(mask, vector);
> +}
> +
> +static void kvm_send_ipi_mask_allbutself(const struct cpumask *mask, int 
> vector)
> +{
> + unsigned int this_cpu = smp_processor_id();
> + struct cpumask new_mask;
> + const struct cpumask *local_mask;
> +
> + cpumask_copy(&new_mask, mask);
> + cpumask_clear_cpu(this_cpu, &new_mask);
> + local_mask = &new_mask;
> + __send_ipi_mask(local_mask, vector);
> +}
> +
> +static void kvm_send_ipi_allbutself(int vector)
> +{
> + kvm_send_ipi_mask_allbutself(cpu_online_mask, vector);
> +}
> +
> +static void kvm_send_ipi_all(int vector)
> +{
> + __send_ipi_mask(cpu_online_mask, vector);

These should be faster when using the native APIC shorthand -- is this
the "Broadcast" in your tests?

> +}
> +
> +/*
> + * Set the IPI entry points
> + */
> +static void kvm_setup_pv_ipi(void)
> +{
> + apic->send_IPI_mask = kvm_send_ipi_mask;
> + apic->send_IPI_mask_allbutself = kvm_send_ipi_mask_allbutself;
> + apic->send_IPI_allbutself = kvm_send_ipi_allbutself;
> + apic->send_IPI_all = kvm_send_ipi_all;
> + pr_info("KVM setup pv IPIs\n");
> +}
> +#endif
> +
>  static void __init kvm_smp_prepare_cpus(unsigned int max_cpus)
>  {
>   native_smp_prepare_cpus(max_cpus);
> @@ -626,6 +691,11 @@ static uint32_t __init kvm_detect(void)
>  
>  static void __init kvm_apic_init(void)
>  {
> +#if defined(CONFIG_SMP) && defined(CONFIG_X86_64)
> + if (kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI) &&
> + num_possible_cpus() <= 2 * BITS_PER_LONG)

It looks like num_possible_cpus() is actually NR_CPUS, so the feature
would never be used on a standard Linux distro.
And we're using APIC_ID, which can be higher even if the maximum CPU
number is lower.  Just remove the check.
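
Presumably leaving just:

	if (kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI))
		kvm_setup_pv_ipi();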


Re: [PATCH v5 2/2] kvm: nVMX: Introduce KVM_CAP_NESTED_STATE

2018-07-18 Thread Radim Krčmář
2018-07-10 11:27+0200, KarimAllah Ahmed:
> From: Jim Mattson 
> 
> For nested virtualization L0 KVM is managing a bit of state for L2 guests,
> this state can not be captured through the currently available IOCTLs. In
> fact the state captured through all of these IOCTLs is usually a mix of L1
> and L2 state. It is also dependent on whether the L2 guest was running at
> the moment when the process was interrupted to save its state.
> 
> With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
> and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
> that is in VMX operation.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: H. Peter Anvin 
> Cc: x...@kernel.org
> Cc: k...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Jim Mattson 
> [karahmed@ - rename structs and functions and make them ready for AMD and
>  address previous comments.
>- handle nested.smm state.
>- rebase & a bit of refactoring.
>- Merge 7/8 and 8/8 into one patch. ]
> Signed-off-by: KarimAllah Ahmed 
> ---
> v4 -> v5:
> - Drop the update to KVM_REQUEST_ARCH_BASE in favor of a patch to switch to
>   u64 instead.
> - Fix commit message.
> - Handle nested.smm state as well.
> - rebase
> 
> v3 -> v4:
> - Rename function to have _nested
> 
> v2 -> v3:
> - Remove the forced VMExit from L2 after reading the kvm_state. The actual
>   problem is solved.
> - Rebase again!
> - Set nested_run_pending during restore (not sure if it makes sense yet or
>   not).
> - Reduce KVM_REQUEST_ARCH_BASE to 7 instead of 8 (the other alternative is
>   to switch everything to u64)
> 
> v1 -> v2:
> - Rename structs and functions and make them ready for AMD and address
>   previous comments.
> - Rebase & a bit of refactoring.
> - Merge 7/8 and 8/8 into one patch.
> - Force a VMExit from L2 after reading the kvm_state to avoid mixed state
>   between L1 and L2 on resurrecting the instance.
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -12976,6 +12977,197 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
> +static int set_vmcs_cache(struct kvm_vcpu *vcpu,
> +   struct kvm_nested_state __user *user_kvm_nested_state,
> +   struct kvm_nested_state *kvm_state)
> +
> +{
> [...]
> +
> + if (kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING)
> + vmx->nested.nested_run_pending = 1;
> +
> + if (check_vmentry_prereqs(vcpu, vmcs12) ||
> + check_vmentry_postreqs(vcpu, vmcs12, &exit_qual))
> + return -EINVAL;
> +
> + ret = enter_vmx_non_root_mode(vcpu);
> + if (ret)
> + return ret;
> +
> + /*
> +  * The MMU is not initialized to point at the right entities yet and
> +  * "get pages" would need to read data from the guest (i.e. we will
> +  * need to perform gpa to hpa translation). So, This request will
> +  * result in a call to nested_get_vmcs12_pages before the next
> +  * VM-entry.
> +  */
> + kvm_make_request(KVM_REQ_GET_VMCS12_PAGES, vcpu);
> +
> + vmx->nested.nested_run_pending = 1;

This is not necessary.  We're only copying state and do not add anything
that would be lost on a nested VM exit without prior VM entry.

> +

Halting the VCPU should probably be done here, just like at the end of
nested_vmx_run().
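
From memory, that is something along the lines of:

	if (vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT)
		return kvm_vcpu_halt(vcpu);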

> + return 0;
> +}
> +
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> @@ -963,6 +963,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_GET_MSR_FEATURES 153
>  #define KVM_CAP_HYPERV_EVENTFD 154
>  #define KVM_CAP_HYPERV_TLBFLUSH 155
> +#define KVM_CAP_STATE 156

KVM_CAP_NESTED_STATE

(good documentation makes code worse. :])


Re: [PATCH v5 1/2] KVM: Switch 'requests' to be 64-bit (explicitly)

2018-07-18 Thread Radim Krčmář
2018-07-10 11:27+0200, KarimAllah Ahmed:
> Switch 'requests' to be explicitly 64-bit and update BUILD_BUG_ON check to
> use the size of "requests" instead of the hard-coded '32'.
> 
> That gives us a bit more room again for arch-specific requests as we
> already ran out of space for x86 due to the hard-coded check.
> 
> The only exception here is ARM32 as it is still 32-bits.

What do you mean?

I think we're just going to slow down kvm_request_pending() on 32 bit
architectures.
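
(For reference, kvm_request_pending() is currently just

	static inline bool kvm_request_pending(struct kvm_vcpu *vcpu)
	{
		return READ_ONCE(vcpu->requests);
	}

and a 64-bit READ_ONCE would no longer be a single load there.)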

> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: k...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Jim Mattson 
> Signed-off-by: KarimAllah Ahmed 
> ---
> v1 -> v2:
> - Use FIELD_SIZEOF
> ---
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> @@ -130,7 +130,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQUEST_ARCH_BASE 8

Now that the base is easily movable, we could also lower it to 4 and
get a few more arch flags.

Bumping requests to 64 bits is probably inevitable and this patch looks
good.
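
For reference, the room works out as:

	before: 32 - KVM_REQUEST_ARCH_BASE = 32 - 8 = 24 arch requests (0..23)
	after:  FIELD_SIZEOF(struct kvm_vcpu, requests) * 8 - 8 = 64 - 8 = 56 arch requests (0..55)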

In v4, you proposed the bitmap-array solution that would easily
allow more than 64 requests -- was the problem that possible
implementations of kvm_request_pending() were not as efficient for the
current number of requests?

Thanks.

>  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> - BUILD_BUG_ON((unsigned)(nr) >= 32 - KVM_REQUEST_ARCH_BASE); \
> + BUILD_BUG_ON((unsigned)(nr) >= (FIELD_SIZEOF(struct kvm_vcpu, requests) 
> * 8) - KVM_REQUEST_ARCH_BASE); \
>   (unsigned)(((nr) + KVM_REQUEST_ARCH_BASE) | (flags)); \
>  })
>  #define KVM_ARCH_REQ(nr)   KVM_ARCH_REQ_FLAGS(nr, 0)


Re: [PATCH v2] KVM: Add coalesced PIO support

2018-07-18 Thread Radim Krčmář
2018-07-12 09:59+0800, Wanpeng Li:
> From: Peng Hao 
> 
> Windows I/O, such as the real-time clock. The address register (port
> 0x70 in the RTC case) can use coalesced I/O, cutting the number of
> userspace exits by half when reading or writing the RTC.
> 
> The guest accesses the RTC like this: it writes the register index to port
> 0x70, then writes or reads the data at port 0x71.  The write to port 0x70
> only selects the index and does nothing else, so coalesced MMIO can handle
> this pattern and reduce the VM-exit overhead.
> 
> In our environment, 12 windows guests running on a Skylake server:
> 
> Before patch:
> 
> IO Port Access   Samples   Samples%   Time%    Avg time
> 
> 0x70:POUT          20675     46.04%    92.72%   67.15us ( +-  7.93% )
> 
> After patch:
> 
> IO Port Access   Samples   Samples%   Time%    Avg time
> 
> 0x70:POUT          17509     45.42%    42.08%    6.37us ( +- 20.37% )
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Eduardo Habkost 
> Cc: Peng Hao 
> Signed-off-by: Peng Hao 
> Signed-off-by: Wanpeng Li 
> ---
> v1 -> v2:
>  * add the original author
> 
>  Documentation/virtual/kvm/00-INDEX |  2 ++
>  Documentation/virtual/kvm/api.txt  |  7 +++
>  Documentation/virtual/kvm/coalesced-io.txt | 17 +
>  include/uapi/linux/kvm.h   |  5 +++--
>  virt/kvm/coalesced_mmio.c  | 16 +---
>  virt/kvm/kvm_main.c|  2 ++
>  6 files changed, 44 insertions(+), 5 deletions(-)
>  create mode 100644 Documentation/virtual/kvm/coalesced-io.txt
> 
> diff --git a/Documentation/virtual/kvm/00-INDEX 
> b/Documentation/virtual/kvm/00-INDEX
> index 3492458..4160620 100644
> --- a/Documentation/virtual/kvm/00-INDEX
> +++ b/Documentation/virtual/kvm/00-INDEX
> @@ -9,6 +9,8 @@ arm
>   - internal ABI between the kernel and HYP (for arm/arm64)
>  cpuid.txt
>   - KVM-specific cpuid leaves (x86).
> +coalesced-io.txt
> + - Coalesced MMIO and coalesced PIO.
>  devices/
>   - KVM_CAP_DEVICE_CTRL userspace API.
>  halt-polling.txt
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index d10944e..4190796 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4618,3 +4618,10 @@ This capability indicates that KVM supports 
> paravirtualized Hyper-V TLB Flush
>  hypercalls:
>  HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx,
>  HvFlushVirtualAddressList, HvFlushVirtualAddressListEx.
> +
> +8.19 KVM_CAP_COALESCED_PIO
> +
> +Architectures: x86, s390, ppc, arm64
> +
> +This Capability indicates that kvm supports writing to a coalesced-pio region
> +is not reported to userspace until the next non-coalesced pio is issued.
> diff --git a/Documentation/virtual/kvm/coalesced-io.txt 
> b/Documentation/virtual/kvm/coalesced-io.txt
> new file mode 100644
> index 000..4a96eaf
> --- /dev/null
> +++ b/Documentation/virtual/kvm/coalesced-io.txt
> @@ -0,0 +1,17 @@
> +
> +Coalesced MMIO and coalesced PIO can be used to optimize writes to
> +simple device registers. Writes to a coalesced-I/O region are not
> +reported to userspace until the next non-coalesced I/O is issued,
> +in a similar fashion to write combining hardware.  In KVM, coalesced
> +writes are handled in the kernel without exits to userspace, and
> +are thus several times faster.
> +
> +Examples of devices that can benefit from coalesced I/O include:
> +
> +- devices whose memory is accessed with many consecutive writes, for
> +  example the EGA/VGA video RAM.
> +
> +- windows I/O, such as the real-time clock. The address register (port
> +  0x70 in the RTC case) can use coalesced I/O, cutting the number of
> +  userspace exits by half when reading or writing the RTC.
> +
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index b6270a3..9cc56d3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -420,13 +420,13 @@ struct kvm_run {
>  struct kvm_coalesced_mmio_zone {
>   __u64 addr;
>   __u32 size;
> - __u32 pad;
> + __u32 pio;

Paolo, do you think we can rename the field without breaking userspace
builds?
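
For context, the worry is presumably about userspace sources that still name
the old field; a user of the new capability would do something roughly like
this (a sketch, not taken from the patch; vm_fd is an assumed VM file
descriptor):

	struct kvm_coalesced_mmio_zone zone = {
		.addr = 0x70,	/* RTC index port */
		.size = 1,
		.pio  = 1,
	};

	if (ioctl(vm_fd, KVM_REGISTER_COALESCED_MMIO, &zone) < 0)
		perror("KVM_REGISTER_COALESCED_MMIO");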

>  };
>  
>  struct kvm_coalesced_mmio {
>   __u64 phys_addr;
>   __u32 len;
> - __u32 pad;
> + __u32 pio;
>   __u8  data[8];
>  };
>  
> diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
> @@ -149,8 +150,12 @@ int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
>   dev->zone = *zone;
>  
>   mutex_lock(&kvm->slots_lock);
> - ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, zone->addr,
> - 


Re: [PATCH] KVM: VMX: modify macro definition 'R' to 'R ' because of gcc-5+

2018-06-27 Thread Radim Krčmář
2018-06-26 20:59+0800, LiuYang:
> GCC 5.4.0 enables raw strings by default and they have higher priority
> than macros, thus R is interpreted incorrectly.
> Fix it by putting a space between macro R and a string literal.
> 
> Signed-off-by: LiuYang 
> ---

This got fixed in 2012 by b188c81f2e1a ("KVM: VMX: Make use of asm.h").

Please refresh the tree, thanks.


[GIT PULL] KVM fixes for Linux 4.18-rc2

2018-06-22 Thread Radim Krčmář
Linus,

The following changes since commit ce397d215ccd07b8ae3f71db689aedb85d56ab40:

  Linux 4.18-rc1 (2018-06-17 08:04:49 +0900)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/for-linus

for you to fetch changes up to 2ddc649810133fcf8e5282eea898ee7ececf161e:

  KVM: fix KVM_CAP_HYPERV_TLBFLUSH paragraph number (2018-06-22 17:30:20 +0200)


KVM fixes for 4.18-rc2

ARM:
 - Lazy FPSIMD switching fixes
 - Really disable compat ioctls on architectures that don't want it
 - Disable compat on arm64 (it was never implemented...)
 - Rely on architectural requirements for GICV on GICv3
 - Detect bad alignments in unmap_stage2_range

x86:
 - Add nested VM entry checks to avoid broken error recovery path
 - Minor documentation fix


Ard Biesheuvel (1):
  KVM: arm/arm64: Drop resource size check for GICV window

Dave Martin (3):
  KVM: arm64: Don't mask softirq with IRQs disabled in vcpu_put()
  KVM: arm64/sve: Fix SVE trap restoration for non-current tasks
  KVM: arm64: Avoid mistaken attempts to save SVE state for vcpus

Jia He (1):
  KVM: arm/arm64: add WARN_ON if size is not PAGE_SIZE aligned in 
unmap_stage2_range

Marc Orr (1):
  kvm: vmx: Nested VM-entry prereqs for event inj.

Marc Zyngier (2):
  KVM: Enforce error in ioctl for compat tasks when !KVM_COMPAT
  KVM: arm64: Prevent KVM_COMPAT from being selected

Mark Rutland (1):
  arm64: Introduce sysreg_clear_set()

Radim Krčmář (1):
  Merge tag 'kvmarm-fixes-for-4.18-1' of 
git://git.kernel.org/.../kvmarm/kvmarm

Vitaly Kuznetsov (1):
  KVM: fix KVM_CAP_HYPERV_TLBFLUSH paragraph number

 Documentation/virtual/kvm/api.txt |  2 +-
 arch/arm64/include/asm/kvm_host.h |  1 +
 arch/arm64/include/asm/sysreg.h   | 11 +++
 arch/arm64/kvm/fpsimd.c   | 36 +++--
 arch/x86/include/asm/vmx.h|  3 ++
 arch/x86/kvm/vmx.c| 67 +++
 arch/x86/kvm/x86.h|  9 ++
 virt/kvm/Kconfig  |  2 +-
 virt/kvm/arm/mmu.c|  2 ++
 virt/kvm/arm/vgic/vgic-v3.c   |  5 ---
 virt/kvm/kvm_main.c   | 19 ++-
 11 files changed, 131 insertions(+), 26 deletions(-)


Re: [PATCH 2/3] x86/kvm: Implement MSR_HWCR support

2018-06-22 Thread Radim Krčmář
2018-06-22 21:09+0200, Borislav Petkov:
> On Fri, Jun 22, 2018 at 08:52:38PM +0200, Radim Krčmář wrote:
> > msr_info->host_initiated is always going to return true, so it would be
> > better to put it outside of __set_mci_status.
> > 
> > Maybe we could just write the whole logic inline, otherwise I'd call it
> > something like mci_status_is_writeable.
> > 
> > >  static int set_msr_mce(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >  {
> > >   u64 mcg_cap = vcpu->arch.mcg_cap;
> > > @@ -2176,9 +2200,13 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, 
> > > struct msr_data *msr_info)
> > >   if ((offset & 0x3) == 0 &&
> > >   data != 0 && (data | (1 << 10)) != ~(u64)0)
> > >   return -1;
> > > - if (!msr_info->host_initiated &&
> > > - (offset & 0x3) == 1 && data != 0)
> > > - return -1;
> > > +
> > > + /* MCi_STATUS */
> > > + if ((offset & 0x3) == 1) {
> > > + if (!__set_mci_status(vcpu, msr_info))
> > > + return -1;
> > > + }
> > 
> > if (!msr_info->host_initiated &&
> > (offset & 0x3) == 1 && data != 0) {
> > struct msr_data tmp = {.index = MSR_K7_HWCR};
> > 
> > if (!guest_cpuid_is_amd(vcpu) ||
> > !kvm_x86_ops->get_msr(vcpu, &tmp) ||
> > !(tmp.data & BIT_ULL(18)))
> > return -1;
> 
> Don't you feel it is cleaner if all the MCi_STATUS checking is done in
> a separate function? The indentation level and the bunch of checks in
> set_msr_mce() make it hard to read while having a separate function
> separates it and makes it easier to follow.

Yes, I feel the same.

> I mean, you're the maintainer but if I may give a suggestion, moving the
> whole logic into a separate function would be more readable.
> 
> And then do:
> 
>   if (!msr_info->host_initiated) {
>   if (check_mci_status(...))
>   return -1;
>   }
> 
> Something like that...

Much better, thanks.


Re: [PATCH 3/3] KVM: x86: hyperv: implement PV IPI send hypercalls

2018-06-22 Thread Radim Krčmář
2018-06-22 16:56+0200, Vitaly Kuznetsov:
> Using hypercall for sending IPIs is faster because this allows to specify
> any number of vCPUs (even > 64 with sparse CPU set), the whole procedure
> will take only one VMEXIT.
> 
> Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi
> hypercall can't be 'fast' (passing parameters through registers) but
> apparently this is not true, Windows always uses it as 'fast' so we need
> to support that.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> @@ -1357,6 +1357,108 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu 
> *current_vcpu, u64 ingpa,
>   ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
>  }
>  
> +static u64 kvm_hv_send_ipi(struct kvm_vcpu *current_vcpu, u64 ingpa, u64 
> outgpa,
> +bool ex, bool fast)
> +{
> + struct kvm *kvm = current_vcpu->kvm;
> + struct hv_send_ipi_ex send_ipi_ex;
> + struct hv_send_ipi send_ipi;
> + struct kvm_vcpu *vcpu;
> + unsigned long valid_bank_mask = 0;
> + u64 sparse_banks[64];
> + int sparse_banks_len, i;
> + struct kvm_lapic_irq irq = {0};
> + bool all_cpus;
> +
> + if (!ex) {
> + if (!fast) {
> + if (unlikely(kvm_read_guest(kvm, ingpa, &send_ipi,
> + sizeof(send_ipi))))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + sparse_banks[0] = send_ipi.cpu_mask;
> + irq.vector = send_ipi.vector;
> + } else {
> + /* 'reserved' part of hv_send_ipi should be 0 */
> + if (unlikely(ingpa >> 32 != 0))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + sparse_banks[0] = outgpa;
> + irq.vector = (u32)ingpa;
> + }
> + all_cpus = false;
> +
> + trace_kvm_hv_send_ipi(irq.vector, sparse_banks[0]);
> + } else {
> + if (unlikely(kvm_read_guest(kvm, ingpa, &send_ipi_ex,
> + sizeof(send_ipi_ex))))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + trace_kvm_hv_send_ipi_ex(send_ipi_ex.vector,
> +  send_ipi_ex.vp_set.format,
> +  send_ipi_ex.vp_set.valid_bank_mask);
> +
> + irq.vector = send_ipi_ex.vector;
> + valid_bank_mask = send_ipi_ex.vp_set.valid_bank_mask;
> + sparse_banks_len = bitmap_weight(&valid_bank_mask, 64) *
> + sizeof(sparse_banks[0]);
> + all_cpus = send_ipi_ex.vp_set.format !=
> + HV_GENERIC_SET_SPARSE_4K;

This would be much more readable as

  send_ipi_ex.vp_set.format == HV_GENERIC_SET_ALL

And if Microsoft ever adds more formats, none of them will mean all VCPUs,
so we're future-proofing as well.

> +
> + if (!sparse_banks_len)
> + goto ret_success;
> +
> + if (!all_cpus &&
> + kvm_read_guest(kvm,
> +ingpa + offsetof(struct hv_send_ipi_ex,
> + vp_set.bank_contents),
> +sparse_banks,
> +sparse_banks_len))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + }
> +
> + if ((irq.vector < HV_IPI_LOW_VECTOR) ||
> + (irq.vector > HV_IPI_HIGH_VECTOR))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + irq.delivery_mode = APIC_DM_FIXED;

I'd set this during variable definition.

APIC_DM_FIXED is 0 anyway and the compiler probably can't optimize the store
here because of the function calls with side effects since the definition.
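
For illustration, the suggestion amounts to something like this at the top of
the function (a sketch):

	struct kvm_lapic_irq irq = {
		.delivery_mode = APIC_DM_FIXED,
	};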

> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
> + int bank = hv->vp_index / 64, sbank = 0;
> +
> + if (!all_cpus) {
> + /* Banks >64 can't be represented */
> + if (bank >= 64)
> + continue;
> +
> + /* Non-ex hypercalls can only address first 64 vCPUs */
> + if (!ex && bank)
> + continue;
> +
> + if (ex) {
> + /*
> +  * Check is the bank of this vCPU is in sparse
> +  * set and get the sparse bank number.
> +  */
> + sbank = get_sparse_bank_no(valid_bank_mask,
> +bank);
> +
> + if (sbank < 0)
> + continue;
> + }
> +
> + if (!(sparse_banks[sbank] & BIT_ULL(hv->vp_index % 64)))
> +


Re: [PATCH 2/3] x86/kvm: Implement MSR_HWCR support

2018-06-22 Thread Radim Krčmář
2018-06-22 11:51+0200, Borislav Petkov:
> From: Borislav Petkov 
> 
> The hardware configuration register has some useful bits which can be
> used by guests. Implement McStatusWrEn which can be used by guests when
> injecting MCEs with the in-kernel mce-inject module.
> 
> For that, we need to set bit 18 - McStatusWrEn - first, before writing
> the MCi_STATUS registers (otherwise we #GP).
> 
> Add the required machinery to do so.
> 
> Signed-off-by: Borislav Petkov 
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -2146,6 +2146,30 @@ static void kvmclock_sync_fn(struct work_struct *work)
>   KVMCLOCK_SYNC_PERIOD);
>  }
>  
> +/*
> + * On AMD, HWCR[McStatusWrEn] controls whether setting MCi_STATUS results in 
> #GP.
> + */
> +static bool __set_mci_status(struct kvm_vcpu *vcpu, struct msr_data 
> *msr_info)
> +{
> + if (guest_cpuid_is_amd(vcpu)) {
> + struct msr_data tmp;
> +
> + tmp.index = MSR_K7_HWCR;
> +
> + if (kvm_x86_ops->get_msr(vcpu, &tmp))
> + return false;
> +
> + /* McStatusWrEn enabled? */
> + if (tmp.data & BIT_ULL(18))
> + return true;
> + }
> +
> + if (!msr_info->host_initiated && msr_info->data != 0)
> + return false;

msr_info->host_initiated is always going to return true, so it would be
better to put it outside of __set_mci_status.

Maybe we could just write the whole logic inline, otherwise I'd call it
something like mci_status_is_writeable.

>  static int set_msr_mce(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  {
>   u64 mcg_cap = vcpu->arch.mcg_cap;
> @@ -2176,9 +2200,13 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   if ((offset & 0x3) == 0 &&
>   data != 0 && (data | (1 << 10)) != ~(u64)0)
>   return -1;
> - if (!msr_info->host_initiated &&
> - (offset & 0x3) == 1 && data != 0)
> - return -1;
> +
> + /* MCi_STATUS */
> + if ((offset & 0x3) == 1) {
> + if (!__set_mci_status(vcpu, msr_info))
> + return -1;
> + }

if (!msr_info->host_initiated &&
(offset & 0x3) == 1 && data != 0) {
struct msr_data tmp = {.index = MSR_K7_HWCR};

if (!guest_cpuid_is_amd(vcpu) ||
!kvm_x86_ops->get_msr(vcpu, &tmp) ||
!(tmp.data & BIT_ULL(18)))
return -1;
}

> +
>   vcpu->arch.mce_banks[offset] = data;
>   break;
>   }
> -- 
> 2.17.0.582.gccdcbd54c
> 


Re: [PATCH 3/3] x86/kvm: Handle all MCA banks

2018-06-22 Thread Radim Krčmář
2018-06-22 20:24+0200, Borislav Petkov:
> On Fri, Jun 22, 2018 at 08:16:04PM +0200, Radim Krčmář wrote:
> > 2018-06-22 11:51+0200, Borislav Petkov:
> > > From: Borislav Petkov 
> > > 
> > > Extend the range of MCA banks which get passed to set/get_msr_mce() to
> > > include all the MSRs of the last bank too.
> > > 
> > > Signed-off-by: Borislav Petkov 
> > > ---
> > >  arch/x86/kvm/x86.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 80452b0f0e8c..a7d344823356 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2466,7 +2466,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, 
> > > struct msr_data *msr_info)
> > >  
> > >   case MSR_IA32_MCG_CTL:
> > >   case MSR_IA32_MCG_STATUS:
> > > - case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
> > > + case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_MISC(KVM_MAX_MCE_BANKS) - 1:
> > 
> > It was correct before.  We have 32 banks (KVM_MAX_MCE_BANKS), so the
> > last usable bank has index 31 and the "- 1" is going to roll over from the
> > first MSR of bank 32 to the last MSR of the last bank.
> > 
> > Another way of writing it would be:
> > 
> >  case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_MISC(KVM_MAX_MCE_BANKS - 1):
> 
> Huh?
> 
> This is what I did:
> 
> +   case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_MISC(KVM_MAX_MCE_BANKS) - 1:
> 
> It needs to be MISC because it is the last MSR in the MCA bank and thus
> the highest.

The last MSR is the original "MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1".

"MSR_IA32_MCx_MISC(KVM_MAX_MCE_BANKS) - 1" also covers

  MSR_IA32_MC32_CTL, MSR_IA32_MC32_STATUS, and MSR_IA32_MC32_ADDR

but the maximal valid MSR is MSR_IA32_MC31_MISC.
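
For reference, with the usual layout of four MSRs per bank starting at
MSR_IA32_MC0_CTL = 0x400:

	MSR_IA32_MCx_CTL(n)  = 0x400 + 4 * n
	MSR_IA32_MCx_MISC(n) = 0x403 + 4 * n

	MSR_IA32_MCx_CTL(32) - 1  = 0x47f = MSR_IA32_MC31_MISC  (last valid MSR)
	MSR_IA32_MCx_MISC(32) - 1 = 0x482 = MSR_IA32_MC32_ADDR  (three MSRs past the end)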


Re: [PATCH 3/3] x86/kvm: Handle all MCA banks

2018-06-22 Thread Radim Krčmář
2018-06-22 11:51+0200, Borislav Petkov:
> From: Borislav Petkov 
> 
> Extend the range of MCA banks which get passed to set/get_msr_mce() to
> include all the MSRs of the last bank too.
> 
> Signed-off-by: Borislav Petkov 
> ---
>  arch/x86/kvm/x86.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 80452b0f0e8c..a7d344823356 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2466,7 +2466,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>  
>   case MSR_IA32_MCG_CTL:
>   case MSR_IA32_MCG_STATUS:
> - case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
> + case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_MISC(KVM_MAX_MCE_BANKS) - 1:

It was correct before.  We have 32 banks (KVM_MAX_MCE_BANKS), so the
last usable bank has index 31 and the "- 1" is going to roll over from the
first MSR of bank 32 to the last MSR of the last bank.

Another way of writing it would be:

 case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_MISC(KVM_MAX_MCE_BANKS - 1):

>   return set_msr_mce(vcpu, msr_info);
>  
>   case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
> @@ -2588,9 +2588,10 @@ static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, 
> u64 *pdata)
>   case MSR_IA32_MCG_STATUS:
>   data = vcpu->arch.mcg_status;
>   break;
> +
>   default:
>   if (msr >= MSR_IA32_MC0_CTL &&
> - msr < MSR_IA32_MCx_CTL(bank_num)) {
> + msr < MSR_IA32_MCx_MISC(bank_num)) {

Similar logic here.

I think it would be best just to keep the current code,

thanks.


Re: [PATCH 1/3] KVM: fix KVM_CAP_HYPERV_TLBFLUSH paragraph number

2018-06-22 Thread Radim Krčmář
2018-06-22 16:56+0200, Vitaly Kuznetsov:
> KVM_CAP_HYPERV_TLBFLUSH collided with KVM_CAP_S390_PSW-BPB, its paragraph
> number should now be 8.18.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  Documentation/virtual/kvm/api.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 495b7742ab58..d10944e619d3 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4610,7 +4610,7 @@ This capability indicates that kvm will implement the 
> interfaces to handle
>  reset, migration and nested KVM for branch prediction blocking. The stfle
>  facility 82 should not be provided to the guest without this capability.
>  
> -8.14 KVM_CAP_HYPERV_TLBFLUSH
> +8.18 KVM_CAP_HYPERV_TLBFLUSH

Taking this one early, thanks.


Re: [PATCH] KVM: VMX: Optimize tscdeadline timer latency

2018-05-29 Thread Radim Krčmář
2018-05-29 16:23+0200, Radim Krčmář:
> 2018-05-29 14:53+0800, Wanpeng Li:
> > From: Wanpeng Li 
> > 
> > 'Commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline 
> > hrtimer expiration")' advances the tscdeadline (the timer is emulated 
> > by hrtimer) expiration in order that the latency which is incurred 
> > by hypervisor (apic_timer_fn -> vmentry) can be avoided. This patch 
> > adds the advance tscdeadline expiration support to which the tscdeadline 
> > timer is emulated by VMX preemption timer to reduce the hypervisor 
> > latency (handle_preemption_timer -> vmentry). The clockevents infrastructure
> > can program a minimum delay if the hrtimer feeds an expiration in the past;
> > similarly, we set delta_tsc to 1 (which will be converted to 0 before vmentry),
> > which can lead to an immediate vmexit when delta_tsc is not bigger
> > than advance ns. 
> > 
> > This patch can reduce ~63% latency (~4450 cycles to ~1660 cycles on 
> > a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
> > busy waits.
> > 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Signed-off-by: Wanpeng Li 
> > ---
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > @@ -12444,6 +12444,12 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, 
> > u64 guest_deadline_tsc)
> > tscl = rdtsc();
> > guest_tscl = kvm_read_l1_tsc(vcpu, tscl);
> > delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
> > +   lapic_timer_advance_cycles = nsec_to_cycles(vcpu, 
> > lapic_timer_advance_ns);
> > +   if (delta_tsc > lapic_timer_advance_cycles)
> > +   delta_tsc -= lapic_timer_advance_cycles;
> > +   else
> > +   delta_tsc = 1;
> 
> Why don't we just "return 1" to say that the timer has expired?

This case might be rare, so setting delta_tsc = 0 would be safer.


Re: [PATCH] KVM: VMX: Optimize tscdeadline timer latency

2018-05-29 Thread Radim Krčmář
2018-05-29 14:53+0800, Wanpeng Li:
> From: Wanpeng Li 
> 
> 'Commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline 
> hrtimer expiration")' advances the tscdeadline (the timer is emulated 
> by hrtimer) expiration in order that the latency which is incurred 
> by hypervisor (apic_timer_fn -> vmentry) can be avoided. This patch 
> adds the advance tscdeadline expiration support to which the tscdeadline 
> timer is emulated by VMX preemption timer to reduce the hypervisor 
> lantency (handle_preemption_timer -> vmentry). clockevents infrastruture 
> can program minimum delay if hrtimer feeds a expiration in the past, 
> we set delta_tsc to 1(which will be converted to 0 before vmentry) 
> which can lead to an immediately vmexit when delta_tsc is not bigger 
> than advance ns. 
> 
> This patch can reduce ~63% latency (~4450 cycles to ~1660 cycles on 
> a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
> busy waits.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Signed-off-by: Wanpeng Li 
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -12444,6 +12444,12 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, 
> u64 guest_deadline_tsc)
>   tscl = rdtsc();
>   guest_tscl = kvm_read_l1_tsc(vcpu, tscl);
>   delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
> + lapic_timer_advance_cycles = nsec_to_cycles(vcpu, 
> lapic_timer_advance_ns);
> + if (delta_tsc > lapic_timer_advance_cycles)
> + delta_tsc -= lapic_timer_advance_cycles;
> + else
> + delta_tsc = 1;

Why don't we just "return 1" to say that the timer has expired?
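
For illustration, that would turn the quoted hunk into something roughly like
this (a sketch; the caller would then have to treat a non-zero return as "the
deadline has already expired"):

	delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
	lapic_timer_advance_cycles = nsec_to_cycles(vcpu, lapic_timer_advance_ns);
	if (delta_tsc <= lapic_timer_advance_cycles)
		return 1;	/* deadline (effectively) already passed */
	delta_tsc -= lapic_timer_advance_cycles;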

I think "delta_tsc = 1" would just force an immediate VM exit and
a re-entry, which seems wasteful as we could just be delaying the entry
until the deadline has really passed,

thanks.


[GIT PULL] KVM fixes for Linux 4.17-rc7

2018-05-26 Thread Radim Krčmář
Linus,

The following changes since commit 633711e82878dc29083fc5d2605166755e25b57a:

  kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME (2018-05-17 19:12:13 
+0200)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/for-linus

for you to fetch changes up to 696ca779a928d0e93d61c38ffc3a4d8914a9b9a0:

  KVM: x86: fix #UD address of failed Hyper-V hypercalls (2018-05-25 21:33:31 
+0200)


KVM fixes for v4.17-rc7

PPC:
 - Close a hole which could possibly lead to the host timebase getting
   out of sync.

 - Three fixes relating to PTEs and TLB entries for radix guests.

 - Fix a bug which could lead to an interrupt never getting delivered
   to the guest, if it is pending for a guest vCPU when the vCPU gets
   offlined.

s390:
 - Fix false negatives in VSIE validity check (Cc stable)

x86:
 - Fix time drift of VMX preemption timer when a guest uses LAPIC timer
   in periodic mode (Cc stable)

 - Unconditionally expose CPUID.IA32_ARCH_CAPABILITIES to allow
   migration from hosts that don't need retpoline mitigation (Cc stable)

 - Fix guest crashes on reboot by properly coupling CR4.OSXSAVE and
   CPUID.OSXSAVE (Cc stable)

 - Report correct RIP after Hyper-V hypercall #UD (introduced in -rc6)


Benjamin Herrenschmidt (1):
  KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority 
change

David Hildenbrand (1):
  KVM: s390: vsie: fix < 8k check for the itdba

David Vrabel (1):
  x86/kvm: fix LAPIC timer drift when guest uses periodic mode

Jim Mattson (1):
  kvm: x86: IA32_ARCH_CAPABILITIES is always supported

Nicholas Piggin (2):
  KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in 
kvmppc_radix_tlbie_page
  KVM: PPC: Book3S HV: Make radix clear pte when unmapping

Paolo Bonzini (1):
  Merge tag 'kvm-s390-master-4.17-1' of 
git://git.kernel.org/.../kvms390/linux into kvm-master

Paul Mackerras (2):
  KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry
  KVM: PPC: Book 3S HV: Do ptesync in radix guest exit path

Radim Krčmář (2):
  Merge tag 'kvm-ppc-fixes-4.17-1' of 
git://git.kernel.org/.../paulus/powerpc
  KVM: x86: fix #UD address of failed Hyper-V hypercalls

Wei Huang (1):
  KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed

 arch/powerpc/include/asm/kvm_book3s.h   |   1 +
 arch/powerpc/kernel/asm-offsets.c   |   1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |   6 +-
 arch/powerpc/kvm/book3s_hv.c|   1 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  97 +++-
 arch/powerpc/kvm/book3s_xive_template.c | 108 +---
 arch/s390/kvm/vsie.c|   2 +-
 arch/x86/kvm/cpuid.c|   5 ++
 arch/x86/kvm/hyperv.c   |  19 +++---
 arch/x86/kvm/lapic.c|  16 -
 arch/x86/kvm/x86.c  |  17 +++--
 11 files changed, 198 insertions(+), 75 deletions(-)


Re: [PATCH v4 5/8] KVM: introduce kvm_make_vcpus_request_mask() API

2018-05-26 Thread Radim Krčmář
2018-05-16 17:21+0200, Vitaly Kuznetsov:
> Hyper-V style PV TLB flush hypercalls inmplementation will use this API.
> To avoid memory allocation in CONFIG_CPUMASK_OFFSTACK case add
> cpumask_var_t argument.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> -bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
> +bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
> +  unsigned long *vcpu_bitmap, cpumask_var_t tmp)
>  {
>   int i, cpu, me;
> - cpumask_var_t cpus;
> - bool called;
>   struct kvm_vcpu *vcpu;
> -
> - zalloc_cpumask_var(&cpus, GFP_ATOMIC);
> + bool called;
>  
>   me = get_cpu();
> +

Two optimizations come to mind: First is to use for_each_set_bit instead
of kvm_for_each_vcpu to improve the sparse case.

>   kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (!test_bit(i, vcpu_bitmap))
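
For illustration, the for_each_set_bit variant of the quoted loop could look
roughly like this (a sketch; using KVM_MAX_VCPUS as the bitmap length is an
assumption):

	for_each_set_bit(i, vcpu_bitmap, KVM_MAX_VCPUS) {
		vcpu = kvm_get_vcpu(kvm, i);
		if (!vcpu)
			continue;
		/* ... the rest of the loop body stays the same ... */
	}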

And the second is to pass vcpu_bitmap = NULL instead of building the
bitmap with all VCPUs.  Doesn't look too good in the end, though:

#define kvm_for_each_vcpu_bitmap(idx, vcpup, kvm, bitmap, len) \
for (idx = (bitmap ? find_first_bit(bitmap, len) : 0); \
 idx < len && (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
 idx = (bitmap ? find_next_bit(bitmap, len, idx + 1) : idx + 1))


Re: [PATCH v4 0/8] KVM: x86: hyperv: PV TLB flush for Windows guests

2018-05-26 Thread Radim Krčmář
2018-05-16 17:21+0200, Vitaly Kuznetsov:
> Changes since v3 [Radim Krcmar]:
> - PATCH2 fixing 'HV_GENERIC_SET_SPARCE_4K' typo added.
> - PATCH5 introducing kvm_make_vcpus_request_mask() API added.
> - Fix undefined behavior for hv->vp_index >= 64.
> - Merge kvm_hv_flush_tlb() and kvm_hv_flush_tlb_ex()
> - For -ex case preload all banks with a single kvm_read_guest().

I've pulled tip/hyperv into kvm/queue and applied on top, thanks.


Re: [PATCH] KVM: X86: prevent integer overflows in KVM_MEMORY_ENCRYPT_REG_REGION

2018-05-26 Thread Radim Krčmář
2018-05-21 10:52-0500, Brijesh Singh:
> On 05/19/2018 01:01 AM, Dan Carpenter wrote:
> > This is a fix from reviewing the code, but it looks like it might be
> > able to lead to an Oops.  It affects 32bit systems.
> > 
> 
> Please note that SEV is not available on 32bit systems.

Added this note and queued, thanks.


Re: [PATCH] KVM: x86: VMX: fix building without CONFIG_HYPERV

2018-05-26 Thread Radim Krčmář
2018-05-25 17:36+0200, Arnd Bergmann:
> The global ms_hyperv variable is part of the hyperv support, so
> we get a link error from accessing it in kernels that have this
> turned off:
> 
> arch/x86/kvm/vmx.o: In function `alloc_loaded_vmcs':
> vmx.c:(.text+0x1654a): undefined reference to `ms_hyperv'
> vmx.c:(.text+0x1657a): undefined reference to `ms_hyperv'
> 
> This changes the condition to first check the compile-time
> configuration symbol to avoid the link error.
> 
> Fixes: ceef7d10dfb6 ("KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap support")
> Signed-off-by: Arnd Bergmann 

Queued, thanks.


Re: [PATCH] x86/KVM: Fix incorrect SSBD bit in kvm_cpuid_7_0_edx_x86_features

2018-05-25 Thread Radim Krčmář
2018-05-25 13:16-0400, Waiman Long:
> As the SSBD bit in kvm_cpuid_7_0_edx_x86_features has been renamed to
> SPEC_CTRL_SSBD in the commit 52817587e706 ("x86/cpufeatures: Disentangle
> SSBD enumeration"). The corresponding name change needed to be made in
> the KVM code as well.
> 
> Fixes: 52817587e706 ("x86/cpufeatures: Disentangle SSBD enumeration")
> 
> Signed-off-by: Waiman Long 
> ---

Thomas, are you taking this one?

>  arch/x86/kvm/cpuid.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index ced8511..598461e 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -407,8 +407,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
> *entry, u32 function,
>  
>   /* cpuid 7.0.edx*/
>   const u32 kvm_cpuid_7_0_edx_x86_features =
> - F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) | F(SSBD) |
> - F(ARCH_CAPABILITIES);
> + F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
> + F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES);
>  
>   /* all calls to cpuid_count() should be made on the same cpu */
>   get_cpu();
> -- 
> 1.8.3.1
> 


Re: [PATCH v3 5/6] KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation

2018-05-10 Thread Radim Krčmář
2018-04-16 13:08+0200, Vitaly Kuznetsov:
> Implement HvFlushVirtualAddress{List,Space}Ex hypercalls in a simplistic
> way: do full TLB flush with KVM_REQ_TLB_FLUSH and kick vCPUs which are
> currently IN_GUEST_MODE.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> @@ -1301,6 +1301,108 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu 
> *current_vcpu, u64 ingpa,
>   ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
>  }
>  
> +static __always_inline int get_sparse_bank_no(u64 valid_bank_mask, int 
> bank_no)
> +{
> + int i = 0, j;
> +
> + if (!(valid_bank_mask & BIT_ULL(bank_no)))
> + return -1;
> +
> + for (j = 0; j < bank_no; j++)
> + if (valid_bank_mask & BIT_ULL(j))
> + i++;
> +
> + return i;
> +}
> +
> +static __always_inline int load_bank_guest(struct kvm *kvm, u64 ingpa,
> +   int sparse_bank, u64 *bank_contents)
> +{
> + int offset;
> +
> + offset = offsetof(struct hv_tlb_flush_ex, hv_vp_set.bank_contents) +
> + sizeof(u64) * sparse_bank;
> +
> + if (unlikely(kvm_read_guest(kvm, ingpa + offset,
> + bank_contents, sizeof(u64))))
> + return 1;
> +
> + return 0;
> +}
> +
> +static int kvm_hv_flush_tlb_ex(struct kvm_vcpu *current_vcpu, u64 ingpa,
> +u16 rep_cnt)
> +{
> + struct kvm *kvm = current_vcpu->kvm;
> + struct kvm_vcpu_hv *hv_current = &current_vcpu->arch.hyperv;
> + struct hv_tlb_flush_ex flush;
> + struct kvm_vcpu *vcpu;
> + u64 bank_contents, valid_bank_mask;
> + int i, cpu, me, current_sparse_bank = -1;
> + u64 ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
> + return ret;
> +
> + valid_bank_mask = flush.hv_vp_set.valid_bank_mask;
> +
> + trace_kvm_hv_flush_tlb_ex(valid_bank_mask, flush.hv_vp_set.format,
> +   flush.address_space, flush.flags);
> +
> + cpumask_clear(&hv_current->tlb_lush);
> +
> + me = get_cpu();
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
> + int bank = hv->vp_index / 64, sparse_bank;
> +
> + if (flush.hv_vp_set.format == HV_GENERIC_SET_SPARCE_4K) {
 ^
  typo in the define

> + /* Check if the bank of this vCPU is in the sparse set */
> + sparse_bank = get_sparse_bank_no(valid_bank_mask, bank);
> + if (sparse_bank < 0)
> + continue;
> +
> + /*
> +  * Assume hv->vp_index is in ascending order and we can
> +  * optimize by not reloading bank contents for every
> +  * vCPU.
> +  */

Since sparse_bank is packed, we could compute how many bank_contents entries
we need to load and fetch them all with one kvm_read_guest() into a local
array; it would be faster even if hv->vp_index were in ascending order and
wouldn't take that much memory (up to 512 B).  A rough sketch follows the
quoted hunk below.

> + if (sparse_bank != current_sparse_bank) {
> + if (load_bank_guest(kvm, ingpa, sparse_bank,
> + &bank_contents))
> + return ret;
> + current_sparse_bank = sparse_bank;
> + }
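
A rough sketch of the single-read idea (untested; sparse_banks[] and nbanks
are names I made up for illustration, and the bound of 64 banks comes from
valid_bank_mask being a u64):

	u64 sparse_banks[64];
	int nbanks = hweight64(valid_bank_mask);

	if (unlikely(kvm_read_guest(kvm,
			ingpa + offsetof(struct hv_tlb_flush_ex,
					 hv_vp_set.bank_contents),
			sparse_banks, nbanks * sizeof(u64))))
		return ret;

get_sparse_bank_no() could then index sparse_banks[] directly and
load_bank_guest() would go away.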


Re: [PATCH v3 4/6] KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation

2018-05-10 Thread Radim Krčmář
2018-04-16 13:08+0200, Vitaly Kuznetsov:
> Implement HvFlushVirtualAddress{List,Space} hypercalls in a simplistic way:
> do full TLB flush with KVM_REQ_TLB_FLUSH and kick vCPUs which are currently
> IN_GUEST_MODE.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> @@ -1242,6 +1242,65 @@ int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 
> msr, u64 *pdata)
>   return kvm_hv_get_msr(vcpu, msr, pdata);
>  }
>  
> +static void ack_flush(void *_completed)
> +{
> +}
> +
> +static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
> + u16 rep_cnt)
> +{
> + struct kvm *kvm = current_vcpu->kvm;
> + struct kvm_vcpu_hv *hv_current = &current_vcpu->arch.hyperv;
> + struct hv_tlb_flush flush;
> + struct kvm_vcpu *vcpu;
> + int i, cpu, me;
> +
> + if (unlikely(kvm_read_guest(kvm, ingpa, &flush, sizeof(flush))))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + trace_kvm_hv_flush_tlb(flush.processor_mask, flush.address_space,
> +flush.flags);
> +
> + cpumask_clear(&hv_current->tlb_lush);
> +
> + me = get_cpu();
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
> +
> + if (!(flush.flags & HV_FLUSH_ALL_PROCESSORS) &&

Please add a check to prevent undefined behavior in C:

(hv->vp_index >= 64 ||

> + !(flush.processor_mask & BIT_ULL(hv->vp_index)))
> + continue;

It would also fail in the wild as shl only considers the bottom 5 bits.
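
Something like this (a sketch of the combined condition only, not the final
wording of the patch):

	/*
	 * vp_index >= 64 cannot be covered by the 64-bit processor_mask,
	 * so skip such vCPUs instead of shifting by an out-of-range count.
	 */
	if (!(flush.flags & HV_FLUSH_ALL_PROCESSORS) &&
	    (hv->vp_index >= 64 ||
	     !(flush.processor_mask & BIT_ULL(hv->vp_index))))
		continue;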

> + /*
> +  * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we
> +  * can't analyze it here, flush TLB regardless of the specified
> +  * address space.
> +  */
> + kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> +
> + /*
> +  * It is possible that vCPU will migrate and we will kick wrong
> +  * CPU but vCPU's TLB will anyway be flushed upon migration as
> +  * we already made KVM_REQ_TLB_FLUSH request.
> +  */
> + cpu = vcpu->cpu;
> + if (cpu != -1 && cpu != me && cpu_online(cpu) &&
> + kvm_arch_vcpu_should_kick(vcpu))
> + cpumask_set_cpu(cpu, &hv_current->tlb_lush);
> + }
> +
> + if (!cpumask_empty(&hv_current->tlb_lush))
> + smp_call_function_many(&hv_current->tlb_lush, ack_flush,
> +NULL, true);

Hm, quite a lot of code duplication with EX hypercall and also
kvm_make_all_cpus_request ... I'm thinking about making something like

  kvm_make_some_cpus_request(struct kvm *kvm, unsigned int req,
 bool (*predicate)(struct kvm_vcpu *vcpu))

or to implement a vp_index -> vcpu mapping and using

  kvm_vcpu_request_mask(struct kvm *kvm, unsigned int req, long *vcpu_bitmap)

The latter would probably simplify logic of the EX hypercall.
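
As a rough illustration of the latter (the bitmap building below is mine and
only shows the shape; the helper name follows the sketch above and is not
code from this series):

	DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);

	bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS);
	kvm_for_each_vcpu(i, vcpu, kvm)
		if (flush.flags & HV_FLUSH_ALL_PROCESSORS ||
		    (vcpu->arch.hyperv.vp_index < 64 &&
		     flush.processor_mask & BIT_ULL(vcpu->arch.hyperv.vp_index)))
			__set_bit(i, vcpu_bitmap);

	kvm_vcpu_request_mask(kvm, KVM_REQ_TLB_FLUSH, vcpu_bitmap);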

What do you think?


Re: [PATCH v3 1/6] x86/hyper-v: move struct hv_flush_pcpu{,ex} definitions to common header

2018-05-10 Thread Radim Krčmář
2018-04-16 13:08+0200, Vitaly Kuznetsov:
> Hyper-V TLB flush hypercalls definitions will be required for KVM so move
> them hyperv-tlfs.h. Structures also need to be renamed as '_pcpu' suffix is
> invalid for a general-purpose definition.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/hyperv/mmu.c  | 40 
> ++
>  arch/x86/include/asm/hyperv-tlfs.h | 20 +++
>  2 files changed, 30 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -703,4 +703,24 @@ struct hv_enlightened_vmcs {
>  #define HV_STIMER_AUTOENABLE (1ULL << 3)
>  #define HV_STIMER_SINT(config)   (__u8)(((config) >> 16) & 0x0F)
>  
> +/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
> +struct hv_tlb_flush {
> + u64 address_space;
> + u64 flags;
> + u64 processor_mask;
> + u64 gva_list[];
> +};
> +
> +/* HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressListEx hypercalls */
> +struct hv_tlb_flush_ex {
> + u64 address_space;
> + u64 flags;
> + struct {
> + u64 format;
> + u64 valid_bank_mask;
> + u64 bank_contents[];
> + } hv_vp_set;
> + u64 gva_list[];

Why is the gva_list there?


[GIT PULL] KVM fixes for Linux 4.17-rc4

2018-05-06 Thread Radim Krčmář
Linus,

The following changes since commit 6da6c0db5316275015e8cc2959f12a17584aeb64:

  Linux v4.17-rc3 (2018-04-29 14:17:42 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/for-linus

for you to fetch changes up to ecf08dad723d3e000aecff6c396f54772d124733:

  KVM: x86: remove APIC Timer periodic/oneshot spikes (2018-05-05 23:09:39 
+0200)


KVM fixes for v4.17-rc4

ARM:
 - Fix proxying of GICv2 CPU interface accesses
 - Fix crash when switching to BE
 - Track source vcpu for GICv2 SGIs
 - Fix an outdated bit of documentation

x86:
 - Speed up injection of expired timers (for stable)


Anthoine Bourgeois (1):
  KVM: x86: remove APIC Timer periodic/oneshot spikes

James Morse (2):
  KVM: arm64: Fix order of vcpu_write_sys_reg() arguments
  arm64: vgic-v2: Fix proxying of cpuif access

Marc Zyngier (1):
  KVM: arm/arm64: vgic: Fix source vcpu issues for GICv2 SGI

Radim Krčmář (1):
  Merge tag 'kvmarm-fixes-for-4.17-2' of 
git://git.kernel.org/.../kvmarm/kvmarm

Valentin Schneider (1):
  KVM: arm/arm64: vgic_init: Cleanup reference to process_maintenance

 arch/arm64/include/asm/kvm_emulate.h |  2 +-
 arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c | 24 
 arch/x86/kvm/lapic.c | 37 +---
 include/kvm/arm_vgic.h   |  1 +
 virt/kvm/arm/vgic/vgic-init.c|  2 +-
 virt/kvm/arm/vgic/vgic-mmio.c| 10 +--
 virt/kvm/arm/vgic/vgic-v2.c  | 38 ++---
 virt/kvm/arm/vgic/vgic-v3.c  | 49 +++-
 virt/kvm/arm/vgic/vgic.c | 30 +--
 virt/kvm/arm/vgic/vgic.h | 14 +
 10 files changed, 122 insertions(+), 85 deletions(-)


[GIT PULL] KVM fixes for Linux 4.17-rc3

2018-04-27 Thread Radim Krčmář
Linus,

The following changes since commit 6d08b06e67cd117f6992c46611dfb4ce267cd71e:

  Linux 4.17-rc2 (2018-04-22 19:20:09 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm tags/for-linus

for you to fetch changes up to 5e62493f1a70e7f13059544daaee05e40e8548e2:

  x86/headers/UAPI: Move DISABLE_EXITS KVM capability bits to the UAPI 
(2018-04-27 18:37:17 +0200)


KVM fixes for v4.17-rc3

ARM:
 - PSCI selection API, a leftover from 4.16 (for stable)
 - Kick vcpu on active interrupt affinity change
 - Plug a VMID allocation race on oversubscribed systems
 - Silence debug messages
 - Update Christoffer's email address (linaro -> arm)

x86:
 - Expose userspace-relevant bits of a newly added feature
 - Fix TLB flushing on VMX with VPID, but without EPT


Andre Przywara (1):
  KVM: arm/arm64: vgic: Kick new VCPU on interrupt migration

Christoffer Dall (1):
  MAINTAINERS: Update e-mail address for Christoffer Dall

Junaid Shahid (1):
  kvm: apic: Flush TLB after APIC mode/address change if VPIDs are in use

KarimAllah Ahmed (1):
  x86/headers/UAPI: Move DISABLE_EXITS KVM capability bits to the UAPI

Marc Zyngier (3):
  KVM: arm/arm64: Close VMID generation race
  arm64: KVM: Demote SVE and LORegion warnings to debug only
  arm/arm64: KVM: Add PSCI version selection API

Radim Krčmář (1):
  Merge tag 'kvmarm-fixes-for-4.17-1' of 
git://git.kernel.org/.../kvmarm/kvmarm

 Documentation/virtual/kvm/api.txt  |  9 -
 Documentation/virtual/kvm/arm/psci.txt | 30 +
 MAINTAINERS|  4 +--
 arch/arm/include/asm/kvm_host.h|  3 ++
 arch/arm/include/uapi/asm/kvm.h|  6 
 arch/arm/kvm/guest.c   | 13 
 arch/arm64/include/asm/kvm_host.h  |  3 ++
 arch/arm64/include/uapi/asm/kvm.h  |  6 
 arch/arm64/kvm/guest.c | 14 +++-
 arch/arm64/kvm/sys_regs.c  |  6 ++--
 arch/x86/kvm/vmx.c | 14 +++-
 arch/x86/kvm/x86.h |  7 
 include/kvm/arm_psci.h | 16 +++--
 include/uapi/linux/kvm.h   |  7 
 virt/kvm/arm/arm.c | 15 ++---
 virt/kvm/arm/psci.c| 60 ++
 virt/kvm/arm/vgic/vgic.c   |  8 +
 17 files changed, 189 insertions(+), 32 deletions(-)
 create mode 100644 Documentation/virtual/kvm/arm/psci.txt

