Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-19 Thread David Matlack
On 2024-04-19 01:47 PM, James Houghton wrote:
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack  wrote:
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> >
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> >
> > return young;
> > }
> >
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> >
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> >
> > return young;
> 
> 
> Yeah I think this is the right thing to do. Given your other
> suggestions (on patch 3), I think this will look something like this
> -- let me know if I've misunderstood something:
> 
> bool check_rmap = !bitmap && kvm_memslots_have_rmaps(kvm);
> 
> if (check_rmap)
>   KVM_MMU_LOCK(kvm);
> 
> rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?
> 
> if (check_rmap)
>   kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)
> 
> if (tdp_mmu_enabled)
>   kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe
> 
> rcu_read_unlock();
> if (check_rmap)
>   KVM_MMU_UNLOCK(kvm);

I was thinking about this a little differently. If you follow my suggestion to first
make the TDP MMU aging lockless, you'll end up with something like this
prior to adding bitmap support (note: the comments are just for
demonstrative purposes):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool young = false;

	/* Shadow MMU aging holds write-lock. */
	if (kvm_memslots_have_rmaps(kvm)) {
		write_lock(&kvm->mmu_lock);
		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
		write_unlock(&kvm->mmu_lock);
	}

	/* TDP MMU aging is lockless. */
	if (tdp_mmu_enabled)
		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

	return young;
}

Then when you add bitmap support it would look something like this:

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	unsigned long *bitmap = range->arg.metadata->bitmap;
	bool young = false;

	/* Shadow MMU aging holds write-lock and does not support the bitmap. */
	if (kvm_memslots_have_rmaps(kvm) && !bitmap) {
		write_lock(&kvm->mmu_lock);
		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
		write_unlock(&kvm->mmu_lock);
	}

	/* TDP MMU aging is lockless and supports the bitmap. */
	if (tdp_mmu_enabled)
		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

	return young;
}

rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range().
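
Roughly what I have in mind for the TDP MMU side, just a sketch to show
where the RCU protection would live (not tested; kvm_tdp_mmu_handle_gfn()
would also need the lockless root walk discussed elsewhere in this thread):

bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool young;

	/* Protects the lockless walk of tdp_mmu_roots and the SPTEs. */
	rcu_read_lock();
	young = kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
	rcu_read_unlock();

	return young;
}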

That brings up a question I've been wondering about. If KVM only
advertises support for the bitmap lookaround when shadow roots are not
allocated, does that mean MGLRU will be blind to accesses made by L2
when nested virtualization is enabled? And does that mean the Linux MM
will think all L2 memory is cold (i.e. good candidate for swapping)
because it isn't seeing accesses made by L2?



Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-12 Thread David Matlack
On 2024-04-01 11:29 PM, James Houghton wrote:
> Only handle the TDP MMU case for now. In other cases, if a bitmap was
> not provided, fallback to the slowpath that takes mmu_lock, or, if a
> bitmap was provided, inform the caller that the bitmap is unreliable.

I think this patch will trigger a lockdep assert in

  kvm_tdp_mmu_age_gfn_range
kvm_tdp_mmu_handle_gfn
  for_each_tdp_mmu_root
__for_each_tdp_mmu_root
  kvm_lockdep_assert_mmu_lock_held

... because it walks tdp_mmu_roots without holding mmu_lock.

Yu's patch[1] added a lockless walk to the TDP MMU. We'd need something
similar here and also update the comment above tdp_mmu_roots describing
how tdp_mmu_roots can be read locklessly.

[1] https://lore.kernel.org/kvmarm/zitx64bbx5vdj...@google.com/
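
Something along these lines, maybe (hypothetical sketch, not Yu's actual
patch, and assuming tdp_mmu_roots can be walked under RCU, which is what
Yu's patch relied on):

#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id)				\
	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)	\
		if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) {	\
		} else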



Re: [PATCH v3 3/7] KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young

2024-04-12 Thread David Matlack
On 2024-04-01 11:29 PM, James Houghton wrote:
> Add kvm_arch_prepare_bitmap_age() for architectures to indicate that
> they support bitmap-based aging in kvm_mmu_notifier_test_clear_young()
> and that they do not need KVM to grab the MMU lock for writing. This
> function allows architectures to do other locking or other preparatory
> work that they need.

There's a lot going on here. I know it's extra work, but I think the
series would be easier to understand, and simpler, if you introduced the
KVM support for lockless test/clear_young() first and then introduced
support for the bitmap-based look-around.

Specifically:

 1. Make all test/clear_young() notifiers lockless. i.e. Move the
    mmu_lock into the architecture-specific code (kvm_age_gfn() and
    kvm_test_age_gfn()). See the sketch below.

 2. Convert KVM/x86's kvm_{test,}_age_gfn() to be lockless for the TDP
    MMU.

 3. Convert KVM/arm64's kvm_{test,}_age_gfn() to hold the mmu_lock in
    read-mode.

 4. Add bitmap-based look-around support to KVM/x86 and KVM/arm64
    (probably 2-3 patches).
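
To illustrate step 1, kvm_age_gfn() on x86 would end up looking roughly
like this (sketch only; step 2 then makes the TDP MMU part lockless):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool young = false;

	write_lock(&kvm->mmu_lock);

	if (kvm_memslots_have_rmaps(kvm))
		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);

	if (tdp_mmu_enabled)
		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

	write_unlock(&kvm->mmu_lock);

	return young;
}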

> 
> If an architecture does not implement kvm_arch_prepare_bitmap_age() or
> is unable to do bitmap-based aging at runtime (and marks the bitmap as
> unreliable):
>  1. If a bitmap was provided, we inform the caller that the bitmap is
> unreliable (MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE).
>  2. If a bitmap was not provided, fall back to the old logic.
> 
> Also add logic for architectures to easily use the provided bitmap if
> they are able. The expectation is that the architecture's implementation
> of kvm_gfn_test_age() will use kvm_gfn_record_young(), and
> kvm_gfn_age() will use kvm_gfn_should_age().
> 
> Suggested-by: Yu Zhao 
> Signed-off-by: James Houghton 
> ---
>  include/linux/kvm_host.h | 60 ++
>  virt/kvm/kvm_main.c  | 92 +---
>  2 files changed, 127 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1800d03a06a9..5862fd7b5f9b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1992,6 +1992,26 @@ extern const struct _kvm_stats_desc 
> kvm_vm_stats_desc[];
>  extern const struct kvm_stats_header kvm_vcpu_stats_header;
>  extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
>  
> +/*
> + * Architectures that support using bitmaps for kvm_age_gfn() and
> + * kvm_test_age_gfn should return true for kvm_arch_prepare_bitmap_age()
> + * and do any work they need to prepare. The subsequent walk will not
> + * automatically grab the KVM MMU lock, so some architectures may opt
> + * to grab it.
> + *
> + * If true is returned, a subsequent call to kvm_arch_finish_bitmap_age() is
> + * guaranteed.
> + */
> +#ifndef kvm_arch_prepare_bitmap_age
> +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)

I find the name of these architecture callbacks misleading/confusing.
The lockless path is used even when a bitmap is not provided. i.e.
bitmap can be NULL in between kvm_arch_prepare/finish_bitmap_age().

> +{
> + return false;
> +}
> +#endif
> +#ifndef kvm_arch_finish_bitmap_age
> +static inline void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn) {}
> +#endif

kvm_arch_finish_bitmap_age() seems unnecessary. I think the KVM/arm64
code could acquire/release the mmu_lock in read-mode in
kvm_test_age_gfn() and kvm_age_gfn() right?
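
i.e. something like this on the arm64 side (rough sketch from memory, the
exact callee may differ):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	u64 size = (range->end - range->start) << PAGE_SHIFT;
	bool young;

	if (!kvm->arch.mmu.pgt)
		return false;

	read_lock(&kvm->mmu_lock);
	young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
						    range->start << PAGE_SHIFT,
						    size, true);
	read_unlock(&kvm->mmu_lock);

	return young;
}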

> +
>  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  {
> @@ -2076,9 +2096,16 @@ static inline bool 
> mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
>   return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
>  }
>  
> +struct test_clear_young_metadata {
> + unsigned long *bitmap;
> + unsigned long bitmap_offset_end;

bitmap_offset_end is unused.

> + unsigned long end;
> + bool unreliable;
> +};
>  union kvm_mmu_notifier_arg {
>   pte_t pte;
>   unsigned long attributes;
> + struct test_clear_young_metadata *metadata;

nit: Maybe s/metadata/test_clear_young/ ?

>  };
>  
>  struct kvm_gfn_range {
> @@ -2087,11 +2114,44 @@ struct kvm_gfn_range {
>   gfn_t end;
>   union kvm_mmu_notifier_arg arg;
>   bool may_block;
> + bool lockless;

Please document this as it's somewhat subtle. A reader might think this
implies the entire operation runs without taking the mmu_lock.
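
e.g. something like (the wording is just a suggestion):

	/*
	 * True when this range is being processed without the common MMU
	 * notifier code taking mmu_lock; the arch code is responsible for
	 * whatever locking it still needs (e.g. for the shadow MMU).
	 */
	bool lockless;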

>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +static inline void kvm_age_set_unreliable(struct kvm_gfn_range *range)
> +{
> + struct test_clear_young_metadata *args = range->arg.metadata;
> +
> + args->unreliable = true;
> +}
> +static inline unsigned long kvm_young_bitmap_offset(struct kvm_gfn_range 
> *range,
> + 

Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-12 Thread David Matlack
On 2024-04-01 11:29 PM, James Houghton wrote:
> The bitmap is provided for secondary MMUs to use if they support it. For
> test_young(), after it returns, the bitmap represents the pages that
> were young in the interval [start, end). For clear_young, it represents
> the pages that we wish the secondary MMU to clear the accessed/young bit
> for.
> 
> If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
> should be unchanged except that if young PTEs are found and the
> architecture supports passing in a bitmap, instead of returning 1,
> MMU_NOTIFIER_YOUNG_FAST is returned.
> 
> This allows MGLRU's look-around logic to work faster, resulting in a 4%
> improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
> to indicate to main mm that doing look-around is likely to be
> beneficial.
> 
> If the secondary MMU doesn't support the bitmap, it must return
> an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> 
> [1]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/
> 
> Suggested-by: Yu Zhao 
> Signed-off-by: James Houghton 
> ---
>  include/linux/mmu_notifier.h | 93 +---
>  include/trace/events/kvm.h   | 13 +++--
>  mm/mmu_notifier.c| 20 +---
>  virt/kvm/kvm_main.c  | 19 ++--
>  4 files changed, 123 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index f349e08a9dfe..daaa9db625d3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>  
> +#define MMU_NOTIFIER_YOUNG   (1 << 0)
> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)

MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE appears to be unused by all callers
of test/clear_young(). I would vote to remove it.

> +#define MMU_NOTIFIER_YOUNG_FAST  (1 << 2)

Instead of MMU_NOTIFIER_YOUNG_FAST, how about
MMU_NOTIFIER_YOUNG_LOOK_AROUND? i.e. The secondary MMU is returning
saying it recommends doing a look-around and passing in a bitmap?

That would avoid the whole "what does FAST really mean" confusion.
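
i.e. something like (sketch):

#define MMU_NOTIFIER_YOUNG			(1 << 0)
/* Secondary MMU recommends the caller do bitmap-based look-around. */
#define MMU_NOTIFIER_YOUNG_LOOK_AROUND		(1 << 2)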

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fb49c2a60200..ca4b1ef9dfc2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -917,10 +917,15 @@ static int kvm_mmu_notifier_clear_flush_young(struct 
> mmu_notifier *mn,
>  static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
>   struct mm_struct *mm,
>   unsigned long start,
> - unsigned long end)
> + unsigned long end,
> + unsigned long *bitmap)
>  {
>   trace_kvm_age_hva(start, end);
>  
> + /* We don't support bitmaps. Don't test or clear anything. */
> + if (bitmap)
> + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

Wouldn't it be a bug to get a bitmap here? The main MM is only supposed
to pass in a bitmap if the secondary MMU returns
MMU_NOTIFIER_YOUNG_FAST, which KVM does not do at this point.

Put another way, this check seems unnecessary.

> +
>   /*
>* Even though we do not flush TLB, this will still adversely
>* affect performance on pre-Haswell Intel EPT, where there is
> @@ -939,11 +944,17 @@ static int kvm_mmu_notifier_clear_young(struct 
> mmu_notifier *mn,
>  
>  static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
>  struct mm_struct *mm,
> -unsigned long address)
> +unsigned long start,
> +unsigned long end,
> +unsigned long *bitmap)
>  {
> - trace_kvm_test_age_hva(address);
> + trace_kvm_test_age_hva(start, end);
> +
> + /* We don't support bitmaps. Don't test or clear anything. */
> + if (bitmap)
> + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

Same thing here.



Re: [PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting

2024-04-12 Thread David Matlack
On 2024-04-01 11:29 PM, James Houghton wrote:
> This patchset adds a fast path in KVM to test and clear access bits on
> sptes without taking the mmu_lock. It also adds support for using a
> bitmap to (1) test the access bits for many sptes in a single call to
> mmu_notifier_test_young, and to (2) clear the access bits for many ptes
> in a single call to mmu_notifier_clear_young.

How much improvement would we get if we _just_ made test/clear_young
lockless on x86 and held the read-lock on arm64? And then how much
benefit does the bitmap look-around add on top of that?



Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack
On Thu, Apr 11, 2024 at 11:00 AM David Matlack  wrote:
>
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack  wrote:
> >
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > > >
> > > > Suggested-by: Yu Zhao 
> > > > Signed-off-by: James Houghton 
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h | 14 ++
> > > >  arch/x86/kvm/mmu/mmu.c  | 16 ++--
> > > >  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
> > > >  3 files changed, 37 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h 
> > > > b/arch/x86/include/asm/kvm_host.h
> > > > index 3b58e2306621..c30918d0887e 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot 
> > > > *slot, unsigned long npages);
> > > >   */
> > > >  #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> > > >
> > > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > > +{
> > > > +   /*
> > > > +* Indicate that we support bitmap-based aging when using the TDP 
> > > > MMU
> > > > +* and the accessed bit is available in the TDP page tables.
> > > > +*
> > > > +* We have no other preparatory work to do here, so we do not need 
> > > > to
> > > > +* redefine kvm_arch_finish_bitmap_age().
> > > > +*/
> > > > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > > +&& shadow_accessed_mask;
> > > > +}
> > > > +
> > > >  #endif /* _ASM_X86_KVM_HOST_H */
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 992e651540e8..fae1a75750bb 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct 
> > > > kvm_gfn_range *range)
> > > >  {
> > > > bool young = false;
> > > >
> > > > -   if (kvm_memslots_have_rmaps(kvm))
> > > > +   if (kvm_memslots_have_rmaps(kvm)) {
> > > > +   if (range->lockless) {
> > > > +   kvm_age_set_unreliable(range);
> > > > +   return false;
> > > > +   }
> > >
> > > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > > virtualization, MGLRU will effectively be blind to all accesses made by
> > > the VM.
> > >
> > > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > > return false immediately and indicate the bitmap is unreliable because a
> > > shadow root is allocated. The notifier will then return
> > > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>
> Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
> return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

Nope, wrong again. Just ignore me while I try to figure out how this
actually works :)



Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack
On Thu, Apr 11, 2024 at 10:28 AM David Matlack  wrote:
>
> On 2024-04-11 10:08 AM, David Matlack wrote:
> > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > >
> > > Suggested-by: Yu Zhao 
> > > Signed-off-by: James Houghton 
> > > ---
> > >  arch/x86/include/asm/kvm_host.h | 14 ++
> > >  arch/x86/kvm/mmu/mmu.c  | 16 ++--
> > >  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
> > >  3 files changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h 
> > > b/arch/x86/include/asm/kvm_host.h
> > > index 3b58e2306621..c30918d0887e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot 
> > > *slot, unsigned long npages);
> > >   */
> > >  #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> > >
> > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > +{
> > > +   /*
> > > +* Indicate that we support bitmap-based aging when using the TDP MMU
> > > +* and the accessed bit is available in the TDP page tables.
> > > +*
> > > +* We have no other preparatory work to do here, so we do not need to
> > > +* redefine kvm_arch_finish_bitmap_age().
> > > +*/
> > > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > +&& shadow_accessed_mask;
> > > +}
> > > +
> > >  #endif /* _ASM_X86_KVM_HOST_H */
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 992e651540e8..fae1a75750bb 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct 
> > > kvm_gfn_range *range)
> > >  {
> > > bool young = false;
> > >
> > > -   if (kvm_memslots_have_rmaps(kvm))
> > > +   if (kvm_memslots_have_rmaps(kvm)) {
> > > +   if (range->lockless) {
> > > +   kvm_age_set_unreliable(range);
> > > +   return false;
> > > +   }
> >
> > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > virtualization, MGLRU will effectively be blind to all accesses made by
> > the VM.
> >
> > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > return false immediately and indicate the bitmap is unreliable because a
> > shadow root is allocated. The notifier will then return
> > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

The overall control flow, and the naming of the functions and macros, is
confusing. args.unreliable vs. MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE, for
one. Also, I now realize kvm_arch_prepare/finish_bitmap_age() are used
even when the bitmap is _not_ provided, so those names are also
misleading.



Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack
On 2024-04-11 10:08 AM, David Matlack wrote:
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > bitmap was provided, inform the caller that the bitmap is unreliable.
> > 
> > Suggested-by: Yu Zhao 
> > Signed-off-by: James Houghton 
> > ---
> >  arch/x86/include/asm/kvm_host.h | 14 ++
> >  arch/x86/kvm/mmu/mmu.c  | 16 ++--
> >  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
> >  3 files changed, 37 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 3b58e2306621..c30918d0887e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
> > unsigned long npages);
> >   */
> >  #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> >  
> > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > +{
> > +   /*
> > +* Indicate that we support bitmap-based aging when using the TDP MMU
> > +* and the accessed bit is available in the TDP page tables.
> > +*
> > +* We have no other preparatory work to do here, so we do not need to
> > +* redefine kvm_arch_finish_bitmap_age().
> > +*/
> > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > +&& shadow_accessed_mask;
> > +}
> > +
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 992e651540e8..fae1a75750bb 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct 
> > kvm_gfn_range *range)
> >  {
> > bool young = false;
> >  
> > -   if (kvm_memslots_have_rmaps(kvm))
> > +   if (kvm_memslots_have_rmaps(kvm)) {
> > +   if (range->lockless) {
> > +   kvm_age_set_unreliable(range);
> > +   return false;
> > +   }
> 
> If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> virtualization, MGLRU will effectively be blind to all accesses made by
> the VM.
> 
> kvm_arch_prepare_bitmap_age() will return true indicating that the
> bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> return false immediately and indicate the bitmap is unreliable because a
> > shadow root is allocated. The notifier will then return
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> 
> Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
> consumed or used. So I think MGLRU will assume all memory is
> unaccessed?
> 
> One way to improve the situation would be to re-order the TDP MMU
> function first and return young instead of false, so that way MGLRU at
> least has visibility into accesses made by L1 (and L2 if EPT is disabled
> in L2). But that still means MGLRU is blind to accesses made by L2.
> 
> What about grabbing the mmu_lock if there's a shadow root allocated and
> getting rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?
> 
>   if (kvm_memslots_have_rmaps(kvm)) {
>   write_lock(&kvm->mmu_lock);
>   young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>   write_unlock(&kvm->mmu_lock);
>   }
> 
> The TDP MMU walk would still be lockless. KVM only has to take the
> mmu_lock to collect accesses made by L2.
> 
> kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
> as well, but that seems relatively simple with the helper functions.

Wait, even simpler, just check kvm_memslots_have_rmaps() in
kvm_arch_prepare_bitmap_age() and skip the shadow MMU when processing a
bitmap request.

i.e.

static inline bool kvm_arch_prepare_bitmap_age(struct kvm *kvm,
					       struct mmu_notifier *mn)
{
	/*
	 * Indicate that we support bitmap-based aging when using the TDP MMU
	 * and the accessed bit is available in the TDP page tables.
	 *
	 * We have no other preparatory work to do here, so we do not need to
	 * redefine kvm_arch_finish_bitmap_age().
	 */
	return IS_ENABLED(CONFIG_X86_64)
		&& tdp_mmu_enabled
		&& shadow_accessed_mask
		&& !kvm_memslots_have_rmaps(kvm);
}

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gf

Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack
On 2024-04-01 11:29 PM, James Houghton wrote:
> Only handle the TDP MMU case for now. In other cases, if a bitmap was
> not provided, fallback to the slowpath that takes mmu_lock, or, if a
> bitmap was provided, inform the caller that the bitmap is unreliable.
> 
> Suggested-by: Yu Zhao 
> Signed-off-by: James Houghton 
> ---
>  arch/x86/include/asm/kvm_host.h | 14 ++
>  arch/x86/kvm/mmu/mmu.c  | 16 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
>  3 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3b58e2306621..c30918d0887e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
> unsigned long npages);
>   */
>  #define KVM_EXIT_HYPERCALL_MBZ   GENMASK_ULL(31, 1)
>  
> +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> +{
> + /*
> +  * Indicate that we support bitmap-based aging when using the TDP MMU
> +  * and the accessed bit is available in the TDP page tables.
> +  *
> +  * We have no other preparatory work to do here, so we do not need to
> +  * redefine kvm_arch_finish_bitmap_age().
> +  */
> + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> +  && shadow_accessed_mask;
> +}
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 992e651540e8..fae1a75750bb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range 
> *range)
>  {
>   bool young = false;
>  
> - if (kvm_memslots_have_rmaps(kvm))
> + if (kvm_memslots_have_rmaps(kvm)) {
> + if (range->lockless) {
> + kvm_age_set_unreliable(range);
> + return false;
> + }

If a VM has TDP MMU enabled, supports A/D bits, and is using nested
virtualization, MGLRU will effectively be blind to all accesses made by
the VM.

kvm_arch_prepare_bitmap_age() will return true indicating that the
bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
return false immediately and indicate the bitmap is unreliable because a
shadow root is allocated. The notifier will then return
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
consumed or used. So I think MGLRU will assume all memory is
unaccessed?

One way to improve the situation would be to re-order the TDP MMU
function first and return young instead of false, so that way MGLRU at
least has visibility into accesses made by L1 (and L2 if EPT is disabled
in L2). But that still means MGLRU is blind to accesses made by L2.

What about grabbing the mmu_lock if there's a shadow root allocated and
getting rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?

	if (kvm_memslots_have_rmaps(kvm)) {
		write_lock(&kvm->mmu_lock);
		young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
		write_unlock(&kvm->mmu_lock);
	}

The TDP MMU walk would still be lockless. KVM only has to take the
mmu_lock to collect accesses made by L2.

kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
as well, but that seems relatively simple with the helper functions.
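
For example, something along these lines (hypothetical sketch; assumes the
gfn_range gets plumbed into the rmap handler so kvm_gfn_should_age() from
this series can consult the bitmap):

static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
			 struct kvm_memory_slot *slot, gfn_t gfn, int level,
			 struct kvm_gfn_range *range)
{
	struct rmap_iterator iter;
	bool young = false;
	u64 *sptep;

	/* Skip gfns the caller did not ask us to age. */
	if (!kvm_gfn_should_age(range, gfn))
		return false;

	for_each_rmap_spte(rmap_head, &iter, sptep)
		young |= mmu_spte_age(sptep);

	return young;
}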



Re: [PATCH v3 2/6] KVM: X86: Implement PV IPIs in linux guest

2018-07-19 Thread David Matlack
On Mon, Jul 2, 2018 at 11:23 PM Wanpeng Li  wrote:
>
> From: Wanpeng Li 
>
> Implement paravirtual apic hooks to enable PV IPIs.

Very cool. Thanks for working on this!

>
> apic->send_IPI_mask
> apic->send_IPI_mask_allbutself
> apic->send_IPI_allbutself
> apic->send_IPI_all
>
> The PV IPIs support a maximum of 128 vCPUs per VM, which is big enough for
> cloud environments currently,

From the cloud perspective, 128 vCPUs is already obsolete. GCE's
n1-ultramem-160 VMs have 160 vCPUs, where the maximum APIC ID is 231.
I'd definitely prefer an approach that scales to higher APIC IDs, like
Paolo's offset idea.

To Radim's point of real world performance testing, do you know what
is the primary source of multi-target IPIs? If it's TLB shootdowns we
might get a bigger bang for our buck with a PV TLB Shootdown.

> supporting more vCPUs would need more complex logic; this might be
> extended in the future if needed.
>
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Vitaly Kuznetsov 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/include/uapi/asm/kvm_para.h |  1 +
>  arch/x86/kernel/kvm.c| 70 
> 
>  include/uapi/linux/kvm_para.h|  1 +
>  3 files changed, 72 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
> b/arch/x86/include/uapi/asm/kvm_para.h
> index 0ede697..19980ec 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -28,6 +28,7 @@
>  #define KVM_FEATURE_PV_UNHALT  7
>  #define KVM_FEATURE_PV_TLB_FLUSH   9
>  #define KVM_FEATURE_ASYNC_PF_VMEXIT10
> +#define KVM_FEATURE_PV_SEND_IPI11
>
>  #define KVM_HINTS_REALTIME  0
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 591bcf2..2fe1420 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -454,6 +454,71 @@ static void __init sev_map_percpu_data(void)
>  }
>
>  #ifdef CONFIG_SMP
> +
> +#ifdef CONFIG_X86_64
> +static void __send_ipi_mask(const struct cpumask *mask, int vector)
> +{
> +   unsigned long flags, ipi_bitmap_low = 0, ipi_bitmap_high = 0;
> +   int cpu, apic_id;
> +
> +   if (cpumask_empty(mask))
> +   return;
> +
> +   local_irq_save(flags);
> +
> +   for_each_cpu(cpu, mask) {
> +   apic_id = per_cpu(x86_cpu_to_apicid, cpu);
> +   if (apic_id < BITS_PER_LONG)
> +   __set_bit(apic_id, &ipi_bitmap_low);
> +   else if (apic_id < 2 * BITS_PER_LONG)
> +   __set_bit(apic_id - BITS_PER_LONG, &ipi_bitmap_high);
> +   }
> +
> +   kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap_low, ipi_bitmap_high, 
> vector);
> +
> +   local_irq_restore(flags);
> +}
> +
> +static void kvm_send_ipi_mask(const struct cpumask *mask, int vector)
> +{
> +   __send_ipi_mask(mask, vector);
> +}
> +
> +static void kvm_send_ipi_mask_allbutself(const struct cpumask *mask, int 
> vector)
> +{
> +   unsigned int this_cpu = smp_processor_id();
> +   struct cpumask new_mask;
> +   const struct cpumask *local_mask;
> +
> +   cpumask_copy(&new_mask, mask);
> +   cpumask_clear_cpu(this_cpu, &new_mask);
> +   local_mask = &new_mask;
> +   __send_ipi_mask(local_mask, vector);
> +}
> +
> +static void kvm_send_ipi_allbutself(int vector)
> +{
> +   kvm_send_ipi_mask_allbutself(cpu_online_mask, vector);
> +}
> +
> +static void kvm_send_ipi_all(int vector)
> +{
> +   __send_ipi_mask(cpu_online_mask, vector);
> +}
> +
> +/*
> + * Set the IPI entry points
> + */
> +static void kvm_setup_pv_ipi(void)
> +{
> +   apic->send_IPI_mask = kvm_send_ipi_mask;
> +   apic->send_IPI_mask_allbutself = kvm_send_ipi_mask_allbutself;
> +   apic->send_IPI_allbutself = kvm_send_ipi_allbutself;
> +   apic->send_IPI_all = kvm_send_ipi_all;
> +   pr_info("KVM setup pv IPIs\n");
> +}
> +#endif
> +
>  static void __init kvm_smp_prepare_cpus(unsigned int max_cpus)
>  {
> native_smp_prepare_cpus(max_cpus);
> @@ -626,6 +691,11 @@ static uint32_t __init kvm_detect(void)
>
>  static void __init kvm_apic_init(void)
>  {
> +#if defined(CONFIG_SMP) && defined(CONFIG_X86_64)
> +   if (kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI) &&
> +   num_possible_cpus() <= 2 * BITS_PER_LONG)
> +   kvm_setup_pv_ipi();
> +#endif
>  }
>
>  static void __init kvm_init_platform(void)
> diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
> index dcf629d..84f8fe3 100644
> --- a/include/uapi/linux/kvm_para.h
> +++ b/include/uapi/linux/kvm_para.h
> @@ -26,6 +26,7 @@
>  #define KVM_HC_MIPS_EXIT_VM7
>  #define KVM_HC_MIPS_CONSOLE_OUTPUT 8
>  #define KVM_HC_CLOCK_PAIRING   9
> +#define KVM_HC_SEND_IPI10
>
>  /*
>   * hypercalls use architecture specific
> --
> 2.7.4
>


Re: [PATCH] KVM: nVMX: do not pin the VMCS12

2017-07-27 Thread David Matlack
On Thu, Jul 27, 2017 at 6:54 AM, Paolo Bonzini  wrote:
> Since the current implementation of VMCS12 does a memcpy in and out
> of guest memory, we do not need current_vmcs12 and current_vmcs12_page
> anymore.  current_vmptr is enough to read and write the VMCS12.

This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
VMCS12 page by using kvm_write_guest. nested_release_page() only marks
the struct page dirty.

>
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/vmx.c | 23 ++-
>  1 file changed, 6 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index b37161808352..142f16ebdca2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -416,9 +416,6 @@ struct nested_vmx {
>
> /* The guest-physical address of the current VMCS L1 keeps for L2 */
> gpa_t current_vmptr;
> -   /* The host-usable pointer to the above */
> -   struct page *current_vmcs12_page;
> -   struct vmcs12 *current_vmcs12;
> /*
>  * Cache of the guest's VMCS, existing outside of guest memory.
>  * Loaded from guest memory during VMPTRLD. Flushed to guest
> @@ -7183,10 +7180,6 @@ static inline void nested_release_vmcs12(struct 
> vcpu_vmx *vmx)
> if (vmx->nested.current_vmptr == -1ull)
> return;
>
> -   /* current_vmptr and current_vmcs12 are always set/reset together */
> -   if (WARN_ON(vmx->nested.current_vmcs12 == NULL))
> -   return;
> -
> if (enable_shadow_vmcs) {
> /* copy to memory all shadowed fields in case
>they were modified */
> @@ -7199,13 +7192,11 @@ static inline void nested_release_vmcs12(struct 
> vcpu_vmx *vmx)
> vmx->nested.posted_intr_nv = -1;
>
> /* Flush VMCS12 to guest memory */
> -   memcpy(vmx->nested.current_vmcs12, vmx->nested.cached_vmcs12,
> -  VMCS12_SIZE);
> +   kvm_vcpu_write_guest_page(&vmx->vcpu,
> + vmx->nested.current_vmptr >> PAGE_SHIFT,
> + vmx->nested.cached_vmcs12, 0, VMCS12_SIZE);

Have you hit any "suspicious RCU usage" error messages during VM
teardown with this patch? We did when we replaced memcpy with
kvm_write_guest a while back. IIRC it was due to kvm->srcu not being
held in one of the teardown paths. kvm_write_guest() expects it to be
held in order to access memslots.
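
For context, kvm_write_guest() and friends expect the caller to hold the
SRCU read lock around the memslot access, i.e. something like this
(illustrative only, variable names made up):

	int idx = srcu_read_lock(&kvm->srcu);

	kvm_vcpu_write_guest_page(vcpu, gpa >> PAGE_SHIFT, data, 0, len);

	srcu_read_unlock(&kvm->srcu, idx);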

We fixed this by skipping the VMCS12 flush during VMXOFF. I'll send
that patch along with a few other nVMX dirty tracking related patches
I've been meaning to get upstreamed.

>
> -   kunmap(vmx->nested.current_vmcs12_page);
> -   nested_release_page(vmx->nested.current_vmcs12_page);
> vmx->nested.current_vmptr = -1ull;
> -   vmx->nested.current_vmcs12 = NULL;
>  }
>
>  /*
> @@ -7623,14 +7614,13 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
> }
>
> nested_release_vmcs12(vmx);
> -   vmx->nested.current_vmcs12 = new_vmcs12;
> -   vmx->nested.current_vmcs12_page = page;
> /*
>  * Load VMCS12 from guest memory since it is not already
>  * cached.
>  */
> -   memcpy(vmx->nested.cached_vmcs12,
> -  vmx->nested.current_vmcs12, VMCS12_SIZE);
> +   memcpy(vmx->nested.cached_vmcs12, new_vmcs12, VMCS12_SIZE);
> +   kunmap(page);

+ nested_release_page_clean(page);

> +
> set_current_vmptr(vmx, vmptr);
> }
>
> @@ -9354,7 +9344,6 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm 
> *kvm, unsigned int id)
>
> vmx->nested.posted_intr_nv = -1;
> vmx->nested.current_vmptr = -1ull;
> -   vmx->nested.current_vmcs12 = NULL;
>
> vmx->msr_ia32_feature_control_valid_bits = FEATURE_CONTROL_LOCKED;
>
> --
> 1.8.3.1
>


Re: [PATCH] KVM: x86: remove code for lazy FPU handling

2017-02-16 Thread David Matlack
On Thu, Feb 16, 2017 at 1:33 AM, Paolo Bonzini  wrote:
>
> The FPU is always active now when running KVM.
>
> Signed-off-by: Paolo Bonzini 

Reviewed-by: David Matlack 

Glad to see this cleanup! Thanks for doing it.

> ---
>  arch/x86/include/asm/kvm_host.h |   3 --
>  arch/x86/kvm/cpuid.c|   2 -
>  arch/x86/kvm/svm.c  |  43 ++-
>  arch/x86/kvm/vmx.c  | 112 
> ++--
>  arch/x86/kvm/x86.c  |   7 +--
>  include/linux/kvm_host.h|   1 -
>  6 files changed, 19 insertions(+), 149 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e4f13e714bcf..74ef58c8ff53 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -55,7 +55,6 @@
>  #define KVM_REQ_TRIPLE_FAULT  10
>  #define KVM_REQ_MMU_SYNC  11
>  #define KVM_REQ_CLOCK_UPDATE  12
> -#define KVM_REQ_DEACTIVATE_FPU13
>  #define KVM_REQ_EVENT 14
>  #define KVM_REQ_APF_HALT  15
>  #define KVM_REQ_STEAL_UPDATE  16
> @@ -936,8 +935,6 @@ struct kvm_x86_ops {
> unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
> void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
> u32 (*get_pkru)(struct kvm_vcpu *vcpu);
> -   void (*fpu_activate)(struct kvm_vcpu *vcpu);
> -   void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
>
> void (*tlb_flush)(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c0e2036217ad..1d155cc56629 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -123,8 +123,6 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
> if (best && (best->eax & (F(XSAVES) | F(XSAVEC
> best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
>
> -   kvm_x86_ops->fpu_activate(vcpu);
> -
> /*
>  * The existing code assumes virtual address is 48-bit in the 
> canonical
>  * address checks; exit if it is ever changed.
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 4e5905a1ce70..d1efe2c62b3f 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1157,7 +1157,6 @@ static void init_vmcb(struct vcpu_svm *svm)
> struct vmcb_control_area *control = &svm->vmcb->control;
> struct vmcb_save_area *save = &svm->vmcb->save;
>
> -   svm->vcpu.fpu_active = 1;
> svm->vcpu.arch.hflags = 0;
>
> set_cr_intercept(svm, INTERCEPT_CR0_READ);
> @@ -1899,15 +1898,12 @@ static void update_cr0_intercept(struct vcpu_svm *svm)
> ulong gcr0 = svm->vcpu.arch.cr0;
> u64 *hcr0 = &svm->vmcb->save.cr0;
>
> -   if (!svm->vcpu.fpu_active)
> -   *hcr0 |= SVM_CR0_SELECTIVE_MASK;
> -   else
> -   *hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
> -   | (gcr0 & SVM_CR0_SELECTIVE_MASK);
> +   *hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
> +   | (gcr0 & SVM_CR0_SELECTIVE_MASK);
>
> mark_dirty(svm->vmcb, VMCB_CR);
>
> -   if (gcr0 == *hcr0 && svm->vcpu.fpu_active) {
> +   if (gcr0 == *hcr0) {
> clr_cr_intercept(svm, INTERCEPT_CR0_READ);
> clr_cr_intercept(svm, INTERCEPT_CR0_WRITE);
> } else {
> @@ -1938,8 +1934,6 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned 
> long cr0)
> if (!npt_enabled)
> cr0 |= X86_CR0_PG | X86_CR0_WP;
>
> -   if (!vcpu->fpu_active)
> -   cr0 |= X86_CR0_TS;
> /*
>  * re-enable caching here because the QEMU bios
>  * does not do it - this results in some delay at
> @@ -2158,22 +2152,6 @@ static int ac_interception(struct vcpu_svm *svm)
> return 1;
>  }
>
> -static void svm_fpu_activate(struct kvm_vcpu *vcpu)
> -{
> -   struct vcpu_svm *svm = to_svm(vcpu);
> -
> -   clr_exception_intercept(svm, NM_VECTOR);
> -
> -   svm->vcpu.fpu_active = 1;
> -   update_cr0_intercept(svm);
> -}
> -
> -static int nm_interception(struct vcpu_svm *svm)
> -{
> -   svm_fpu_activate(&svm->vcpu);
> -   return 1;
> -}
> -
>  static bool is_erratum_383(void)
>  {
> int err, i;
> @@ -2571,9 +2549,6 @@ static int nested_svm_exit_special(struct vcpu_svm *svm)
> if (!npt_enabled && svm->apf_reason == 0)
> return NESTED_EXIT_HOST;
> break;
> -   case SVM_EXIT_EXCP_BASE + NM_VECTOR:
> -

Re: [PATCH v2 3/5] KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12

2016-12-19 Thread David Matlack
On Tue, Nov 29, 2016 at 12:40 PM, Kyle Huey  wrote:
> We can't return both the pass/fail boolean for the vmcs and the upcoming
> continue/exit-to-userspace boolean for skip_emulated_instruction out of
> nested_vmx_check_vmcs, so move skip_emulated_instruction out of it instead.
>
> Additionally, VMENTER/VMRESUME only trigger singlestep exceptions when
> they advance the IP to the following instruction, not when they a) succeed,
> b) fail MSR validation or c) throw an exception. Add a separate call to
> skip_emulated_instruction that will later not be converted to the variant
> that checks the singlestep flag.
>
> Signed-off-by: Kyle Huey 
> ---
>  arch/x86/kvm/vmx.c | 53 +
>  1 file changed, 33 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index f2f9cf5..f4f6304 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7319,33 +7319,36 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx 
> *vmx)
>   * VMX instructions which assume a current vmcs12 (i.e., that VMPTRLD was
>   * used before) all generate the same failure when it is missing.
>   */
>  static int nested_vmx_check_vmcs12(struct kvm_vcpu *vcpu)
>  {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> if (vmx->nested.current_vmptr == -1ull) {
> nested_vmx_failInvalid(vcpu);
> -   skip_emulated_instruction(vcpu);
> return 0;
> }
> return 1;
>  }
>
>  static int handle_vmread(struct kvm_vcpu *vcpu)
>  {
> unsigned long field;
> u64 field_value;
> unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> gva_t gva = 0;
>
> -   if (!nested_vmx_check_permission(vcpu) ||
> -   !nested_vmx_check_vmcs12(vcpu))
> +   if (!nested_vmx_check_permission(vcpu))
> +   return 1;
> +
> +   if (!nested_vmx_check_vmcs12(vcpu)) {
> +   skip_emulated_instruction(vcpu);
> return 1;
> +   }
>
> /* Decode instruction info and find the field to read */
> field = kvm_register_readl(vcpu, (((vmx_instruction_info) >> 28) & 
> 0xf));
> /* Read the field, zero-extended to a u64 field_value */
> if (vmcs12_read_any(vcpu, field, &field_value) < 0) {
> nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
> skip_emulated_instruction(vcpu);
> return 1;
> @@ -7383,20 +7386,24 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
>  * mode, and eventually we need to write that into a field of several
>  * possible lengths. The code below first zero-extends the value to 64
>  * bit (field_value), and then copies only the appropriate number of
>  * bits into the vmcs12 field.
>  */
> u64 field_value = 0;
> struct x86_exception e;
>
> -   if (!nested_vmx_check_permission(vcpu) ||
> -   !nested_vmx_check_vmcs12(vcpu))
> +   if (!nested_vmx_check_permission(vcpu))
> return 1;
>
> +   if (!nested_vmx_check_vmcs12(vcpu)) {
> +   skip_emulated_instruction(vcpu);
> +   return 1;
> +   }
> +
> if (vmx_instruction_info & (1u << 10))
> field_value = kvm_register_readl(vcpu,
> (((vmx_instruction_info) >> 3) & 0xf));
> else {
> if (get_vmx_mem_address(vcpu, exit_qualification,
> vmx_instruction_info, false, &gva))
> return 1;
> if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva,
> @@ -10041,21 +10048,22 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, 
> bool launch)
>  {
> struct vmcs12 *vmcs12;
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> int cpu;
> struct loaded_vmcs *vmcs02;
> bool ia32e;
> u32 msr_entry_idx;
>
> -   if (!nested_vmx_check_permission(vcpu) ||
> -   !nested_vmx_check_vmcs12(vcpu))
> +   if (!nested_vmx_check_permission(vcpu))
> return 1;
>
> -   skip_emulated_instruction(vcpu);
> +   if (!nested_vmx_check_vmcs12(vcpu))
> +   goto out;
> +
> vmcs12 = get_vmcs12(vcpu);
>
> if (enable_shadow_vmcs)
> copy_shadow_to_vmcs12(vmx);
>
> /*
>  * The nested entry process starts with enforcing various 
> prerequisites
>  * on vmcs12 as required by the Intel SDM, and act appropriately when
> @@ -10065,43 +10073,43 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, 
> bool launch)
>  * To speed up the normal (success) code path, we should avoid 
> checking
>  * for misconfigurations which will anyway be caught by the processor
>  * when using the merged vmcs02.
>  */
> if (vmcs12->launch_state == launch) {
>

[PATCH 2/2] KVM: x86: flush pending lapic jump label updates on module unload

2016-12-16 Thread David Matlack
KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
These are implemented with delayed_work structs which can still be
pending when the KVM module is unloaded. We've seen this cause kernel
panics when the kvm_intel module is quickly reloaded.

Use the new static_key_deferred_flush() API to flush pending updates on
module unload.

Signed-off-by: David Matlack 
---
 arch/x86/kvm/lapic.c | 6 ++
 arch/x86/kvm/lapic.h | 1 +
 arch/x86/kvm/x86.c   | 1 +
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 34a66b2..1b80fa3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2426,3 +2426,9 @@ void kvm_lapic_init(void)
jump_label_rate_limit(&apic_hw_disabled, HZ);
jump_label_rate_limit(&apic_sw_disabled, HZ);
 }
+
+void kvm_lapic_exit(void)
+{
+   static_key_deferred_flush(&apic_hw_disabled);
+   static_key_deferred_flush(&apic_sw_disabled);
+}
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index e0c8023..ff8039d 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -110,6 +110,7 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct 
kvm_vcpu *vcpu)
 
 int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
 void kvm_lapic_init(void);
+void kvm_lapic_exit(void);
 
 #define VEC_POS(v) ((v) & (32 - 1))
 #define REG_POS(v) (((v) >> 5) << 4)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f86c0c..da386bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6007,6 +6007,7 @@ int kvm_arch_init(void *opaque)
 
 void kvm_arch_exit(void)
 {
+   kvm_lapic_exit();
perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
-- 
2.8.0.rc3.226.g39d4020



[PATCH 1/2] jump_labels: API for flushing deferred jump label updates

2016-12-16 Thread David Matlack
Modules that use static_key_deferred need a way to synchronize with
any delayed work that is still pending when the module is unloaded.
Introduce static_key_deferred_flush() which flushes any pending
jump label updates.

Signed-off-by: David Matlack 
---
 include/linux/jump_label_ratelimit.h | 5 +
 kernel/jump_label.c  | 7 +++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/jump_label_ratelimit.h 
b/include/linux/jump_label_ratelimit.h
index 089f70f..23da3af 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -14,6 +14,7 @@ struct static_key_deferred {
 
 #ifdef HAVE_JUMP_LABEL
 extern void static_key_slow_dec_deferred(struct static_key_deferred *key);
+extern void static_key_deferred_flush(struct static_key_deferred *key);
 extern void
 jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl);
 
@@ -26,6 +27,10 @@ static inline void static_key_slow_dec_deferred(struct 
static_key_deferred *key)
STATIC_KEY_CHECK_USE();
static_key_slow_dec(&key->key);
 }
+static inline void static_key_deferred_flush(struct static_key_deferred *key)
+{
+   STATIC_KEY_CHECK_USE();
+}
 static inline void
 jump_label_rate_limit(struct static_key_deferred *key,
unsigned long rl)
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 93ad6c1..a9b8cf5 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -182,6 +182,13 @@ void static_key_slow_dec_deferred(struct 
static_key_deferred *key)
 }
 EXPORT_SYMBOL_GPL(static_key_slow_dec_deferred);
 
+void static_key_deferred_flush(struct static_key_deferred *key)
+{
+   STATIC_KEY_CHECK_USE();
+   flush_delayed_work(&key->work);
+}
+EXPORT_SYMBOL_GPL(static_key_deferred_flush);
+
 void jump_label_rate_limit(struct static_key_deferred *key,
unsigned long rl)
 {
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v3 3/5] KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation

2016-11-30 Thread David Matlack
On Wed, Nov 30, 2016 at 2:33 PM, Paolo Bonzini  wrote:
> - Original Message -
>> From: "Radim Krčmář" 
>> To: "David Matlack" 
>> Cc: k...@vger.kernel.org, linux-kernel@vger.kernel.org, jmatt...@google.com, 
>> pbonz...@redhat.com
>> Sent: Wednesday, November 30, 2016 10:52:35 PM
>> Subject: Re: [PATCH v3 3/5] KVM: nVMX: fix checks on CR{0,4} during virtual 
>> VMX operation
>>
>> 2016-11-29 18:14-0800, David Matlack:
>> > KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
>> > all CR0 and CR4 bits are allowed to be 1 during VMX operation.
>> >
>> > This does not match real hardware, which disallows the high 32 bits of
>> > CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
>> > which are defined in the SDM but missing according to CPUID). A guest
>> > can induce a VM-entry failure by setting these bits in GUEST_CR0 and
>> > GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
>> > valid.
>> >
>> > Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
>> > checks on these registers do not verify must-be-0 bits. Fix these checks
>> > to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.
>> >
>> > This patch should introduce no change in behavior in KVM, since these
>> > MSRs are still -1ULL.
>> >
>> > Signed-off-by: David Matlack 
>> > ---
>> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> > @@ -4104,6 +4110,40 @@ static void ept_save_pdptrs(struct kvm_vcpu *vcpu)
>> > +static bool nested_guest_cr0_valid(struct kvm_vcpu *vcpu, unsigned long
>> > val)
>> > +{
>> > +   u64 fixed0 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed0;
>> > +   u64 fixed1 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed1;
>> > +   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>> > +
>> > +   if (to_vmx(vcpu)->nested.nested_vmx_secondary_ctls_high &
>> > +   SECONDARY_EXEC_UNRESTRICTED_GUEST &&
>> > +   nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST))
>> > +   fixed0 &= ~(X86_CR0_PE | X86_CR0_PG);
>>
>> These bits also seem to be guaranteed in fixed1 ... complicated
>> dependencies.
>
> Bits that are set in fixed0 must be set in fixed1 too.  Since patch 4
> always sets CR0_FIXED1 to all-ones (matching bare metal), this is okay.
>
>> There is another exception, SDM 26.3.1.1 (Checks on Guest Control
>> Registers, Debug Registers, and MSRs):
>>
>>   Bit 29 (corresponding to CR0.NW) and bit 30 (CD) are never checked
>>   because the values of these bits are not changed by VM entry; see
>>   Section 26.3.2.1.
>
> Same here, we never check them anyway.
>
>> And another check:
>>
>>   If bit 31 in the CR0 field (corresponding to PG) is 1, bit 0 in that
>>   field (PE) must also be 1.
>
> This should not be a problem, a failed vmentry is reflected into L1
> anyway.  We only need to check insofar as we could have a more restrictive
> check than what the processor does.

I had the same thought when I was first writing this patch, Radim.
Maybe we should add a comment here. E.g.

/*
 * CR0.PG && !CR0.PE is also invalid but caught by the CPU
 * during VM-entry to L2.
 */
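
For anyone following along, the fixed0/fixed1 check being discussed boils
down to this helper from the patch (roughly): a bit must be 1 if it is 1
in fixed0, and may only be 1 if it is 1 in fixed1.

static bool fixed_bits_valid(u64 val, u64 fixed0, u64 fixed1)
{
	return ((val & fixed1) | fixed0) == val;
}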

>
> Paolo
>
>> > +
>> > +   return fixed_bits_valid(val, fixed0, fixed1);
>> > +}
>> > +
>> > +static bool nested_host_cr0_valid(struct kvm_vcpu *vcpu, unsigned long
>> > val)
>> > +{
>> > +   u64 fixed0 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed0;
>> > +   u64 fixed1 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed1;
>> > +
>> > +   return fixed_bits_valid(val, fixed0, fixed1);
>> > +}
>> > +
>> > +static bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
>> > +{
>> > +   u64 fixed0 = to_vmx(vcpu)->nested.nested_vmx_cr4_fixed0;
>> > +   u64 fixed1 = to_vmx(vcpu)->nested.nested_vmx_cr4_fixed1;
>> > +
>> > +   return fixed_bits_valid(val, fixed0, fixed1);
>> > +}
>> > +
>> > +/* No difference in the restrictions on guest and host CR4 in VMX
>> > operation. */
>> > +#define nested_guest_cr4_valid nested_cr4_valid
>> > +#define nested_host_cr4_valid  nested_cr4_valid
>>
>> We should use cr0 and cr4 checks also in handle_vmon().
>>
>> I've applied this series to kvm/queue for early testing.
>> Please send replacement patch or patch(es) on top of this series.
>>
>> Thanks.
>>


Re: [PATCH v3 0/5] VMX Capability MSRs

2016-11-30 Thread David Matlack
On Wed, Nov 30, 2016 at 3:22 AM, Paolo Bonzini  wrote:
>
>
> On 30/11/2016 03:14, David Matlack wrote:
>> This patchset adds support setting the VMX capability MSRs from userspace.
>> This is required for migration of nested-capable VMs to different CPUs and
>> KVM versions.
>>
>> Patch 1 generates the non-true VMX MSRs using the true MSRs, which allows
>> userspace to skip restoring them.
>>
>> Patch 2 adds support for restoring the VMX capability MSRs.
>>
>> Patches 3 and 4 make KVM's emulation of MSR_IA32_VMX_CR{0,4}_FIXED1 more
>> accurate.
>>
>> Patch 5 fixes a bug in emulated VM-entry that came up when testing patches
>> 3 and 4.
>>
>> Changes since v2:
>>   * Generate CR0_FIXED1 in addition to CR4_FIXED1
>>   * Generate "non-true" capability MSRs from the "true" versions and remove
>> "non-true" MSRs from struct nested_vmx.
>>   * Disallow restore of CR{0,4}_FIXED1 and "non-true" MSRs since they are
>> generated.
>>
>> Changes since v1:
>>   * Support restoring less-capable versions of MSR_IA32_VMX_BASIC,
>> MSR_IA32_VMX_CR{0,4}_FIXED{0,1}.
>>   * Include VMX_INS_OUTS in MSR_IA32_VMX_BASIC initial value.
>>
>> David Matlack (5):
>>   KVM: nVMX: generate non-true VMX MSRs based on true versions
>>   KVM: nVMX: support restore of VMX capability MSRs
>>   KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
>>   KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
>>   KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
>>
>>  arch/x86/include/asm/vmx.h |  31 +++
>>  arch/x86/kvm/vmx.c | 479 
>> +
>>  2 files changed, 427 insertions(+), 83 deletions(-)
>
> Just a small nit that can be fixed on applying.  Thanks!

Thanks for the thorough review!

>
> Paolo


Re: [PATCH v3 1/5] KVM: nVMX: generate non-true VMX MSRs based on true versions

2016-11-30 Thread David Matlack
On Wed, Nov 30, 2016 at 3:16 AM, Paolo Bonzini  wrote:
> On 30/11/2016 03:14, David Matlack wrote:
>>
>>   /* secondary cpu-based controls */
>> @@ -2868,36 +2865,32 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, 
>> u32 msr_index, u64 *pdata)
>>   *pdata = vmx_control_msr(
>>   vmx->nested.nested_vmx_pinbased_ctls_low,
>>   vmx->nested.nested_vmx_pinbased_ctls_high);
>> + if (msr_index == MSR_IA32_VMX_PINBASED_CTLS)
>> + *pdata |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
>
> Almost: PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR must be
> added to both the low and high parts.  Likewise below.
> I guess you can use vmx_control_msr to generate it, too.

SGTM.

Although that would mean the true MSRs indicate a bit must-be-0 while
the non-true MSRs are indicating it must-be-1, which seems odd.
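
i.e. roughly (sketch of the suggestion, not the final patch):

	case MSR_IA32_VMX_PINBASED_CTLS:
	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
		*pdata = vmx_control_msr(
			vmx->nested.nested_vmx_pinbased_ctls_low,
			vmx->nested.nested_vmx_pinbased_ctls_high);
		if (msr_index == MSR_IA32_VMX_PINBASED_CTLS)
			*pdata |= vmx_control_msr(
				PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR,
				PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR);
		break;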


[PATCH v3 2/5] KVM: nVMX: support restore of VMX capability MSRs

2016-11-29 Thread David Matlack
The VMX capability MSRs advertise the set of features the KVM virtual
CPU can support. This set of features varies across different host CPUs
and KVM versions. This patch aims to address both sources of
differences, allowing VMs to be migrated across CPUs and KVM versions
without guest-visible changes to these MSRs. Note that cross-KVM-
version migration is only supported from this point forward.

When the VMX capability MSRs are restored, they are audited to check
that the set of features advertised is a subset of what KVM and the
CPU support.

Since the VMX capability MSRs are read-only, they do not need to be on
the default MSR save/restore lists. The userspace hypervisor can set
the values of these MSRs or read them from KVM at VCPU creation time,
and restore the same value after every save/restore.

Signed-off-by: David Matlack 
---
 arch/x86/include/asm/vmx.h |  31 +
 arch/x86/kvm/vmx.c | 290 +
 2 files changed, 297 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index a002b07..a4ca897 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -25,6 +25,7 @@
 #define VMX_H
 
 
+#include 
 #include 
 #include 
 
@@ -110,6 +111,36 @@
 #define VMX_MISC_SAVE_EFER_LMA 0x0020
 #define VMX_MISC_ACTIVITY_HLT  0x0040
 
+static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
+{
+   return vmx_basic & GENMASK_ULL(30, 0);
+}
+
+static inline u32 vmx_basic_vmcs_size(u64 vmx_basic)
+{
+   return (vmx_basic & GENMASK_ULL(44, 32)) >> 32;
+}
+
+static inline int vmx_misc_preemption_timer_rate(u64 vmx_misc)
+{
+   return vmx_misc & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;
+}
+
+static inline int vmx_misc_cr3_count(u64 vmx_misc)
+{
+   return (vmx_misc & GENMASK_ULL(24, 16)) >> 16;
+}
+
+static inline int vmx_misc_max_msr(u64 vmx_misc)
+{
+   return (vmx_misc & GENMASK_ULL(27, 25)) >> 25;
+}
+
+static inline int vmx_misc_mseg_revid(u64 vmx_misc)
+{
+   return (vmx_misc & GENMASK_ULL(63, 32)) >> 32;
+}
+
 /* VMCS Encodings */
 enum vmcs_field {
VIRTUAL_PROCESSOR_ID= 0x,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0beb56a..01a2b9e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -465,6 +465,12 @@ struct nested_vmx {
u32 nested_vmx_misc_high;
u32 nested_vmx_ept_caps;
u32 nested_vmx_vpid_caps;
+   u64 nested_vmx_basic;
+   u64 nested_vmx_cr0_fixed0;
+   u64 nested_vmx_cr0_fixed1;
+   u64 nested_vmx_cr4_fixed0;
+   u64 nested_vmx_cr4_fixed1;
+   u64 nested_vmx_vmcs_enum;
 };
 
 #define POSTED_INTR_ON  0
@@ -2826,6 +2832,36 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE |
VMX_MISC_ACTIVITY_HLT;
vmx->nested.nested_vmx_misc_high = 0;
+
+   /*
+* This MSR reports some information about VMX support. We
+* should return information about the VMX we emulate for the
+* guest, and the VMCS structure we give it - not about the
+* VMX support of the underlying hardware.
+*/
+   vmx->nested.nested_vmx_basic =
+   VMCS12_REVISION |
+   VMX_BASIC_TRUE_CTLS |
+   ((u64)VMCS12_SIZE << VMX_BASIC_VMCS_SIZE_SHIFT) |
+   (VMX_BASIC_MEM_TYPE_WB << VMX_BASIC_MEM_TYPE_SHIFT);
+
+   if (cpu_has_vmx_basic_inout())
+   vmx->nested.nested_vmx_basic |= VMX_BASIC_INOUT;
+
+   /*
+* These MSRs specify bits which the guest must keep fixed (on or off)
+* while L1 is in VMXON mode (in L1's root mode, or running an L2).
+* We picked the standard core2 setting.
+*/
+#define VMXON_CR0_ALWAYSON (X86_CR0_PE | X86_CR0_PG | X86_CR0_NE)
+#define VMXON_CR4_ALWAYSON X86_CR4_VMXE
+   vmx->nested.nested_vmx_cr0_fixed0 = VMXON_CR0_ALWAYSON;
+   vmx->nested.nested_vmx_cr0_fixed1 = -1ULL;
+   vmx->nested.nested_vmx_cr4_fixed0 = VMXON_CR4_ALWAYSON;
+   vmx->nested.nested_vmx_cr4_fixed1 = -1ULL;
+
+   /* highest index: VMX_PREEMPTION_TIMER_VALUE */
+   vmx->nested.nested_vmx_vmcs_enum = 0x2e;
 }
 
 static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
@@ -2841,24 +2877,233 @@ static inline u64 vmx_control_msr(u32 low, u32 high)
return low | ((u64)high << 32);
 }
 
-/* Returns 0 on success, non-0 otherwise. */
-static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)
+{
+   superset &= mask;
+   subset &= mask;
+
+   return (superset | subset) == superset;
+}
+
+static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
+{
+   const u64 feature_and_reserved =
+   

[PATCH v3 5/5] KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry

2016-11-29 Thread David Matlack
vmx_set_cr0() modifies GUEST_EFER and "IA-32e mode guest" in the current
VMCS. Call vmx_set_efer() after vmx_set_cr0() so that emulated VM-entry
is more faithful to VMCS12.

This patch correctly causes VM-entry to fail when "IA-32e mode guest" is
1 and GUEST_CR0.PG is 0. Previously this configuration would succeed and
"IA-32e mode guest" would silently be disabled by KVM.

Signed-off-by: David Matlack 
---
 arch/x86/kvm/vmx.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 49270c4..776dc67 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10386,15 +10386,6 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
nested_ept_init_mmu_context(vcpu);
}
 
-   if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
-   vcpu->arch.efer = vmcs12->guest_ia32_efer;
-   else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
-   vcpu->arch.efer |= (EFER_LMA | EFER_LME);
-   else
-   vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
-   /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
-   vmx_set_efer(vcpu, vcpu->arch.efer);
-
/*
 * This sets GUEST_CR0 to vmcs12->guest_cr0, with possibly a modified
 * TS bit (for lazy fpu) and bits which we consider mandatory enabled.
@@ -10409,6 +10400,15 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
vmx_set_cr4(vcpu, vmcs12->guest_cr4);
vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
 
+   if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
+   vcpu->arch.efer = vmcs12->guest_ia32_efer;
+   else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
+   vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+   else
+   vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+   /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
+   vmx_set_efer(vcpu, vcpu->arch.efer);
+
/* shadow page tables on either EPT or shadow page tables */
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
-- 
2.8.0.rc3.226.g39d4020



[PATCH v3 4/5] KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID

2016-11-29 Thread David Matlack
MSR_IA32_CR{0,4}_FIXED1 define which bits in CR0 and CR4 are allowed to
be 1 during VMX operation. Since the set of allowed-1 bits is the same
in and out of VMX operation, we can generate these MSRs entirely from
the guest's CPUID. This lets userspace avoid having to save/restore
these MSRs.

This patch also initializes MSR_IA32_CR{0,4}_FIXED1 from the CPU's MSRs
by default. This is saner than the current default of -1ull, which
includes bits that the host CPU does not support.
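
As a rough standalone sketch of the CPUID-driven generation (not the kernel
code in the diff that follows, and with only two illustrative bits), each CR4
bit is reported allowed-1 only when the corresponding guest CPUID feature is
present, so anything the guest's CPUID lacks stays must-be-0:

    #include <stdint.h>
    #include <stdio.h>

    #define X86_CR4_VMXE (1u << 13)
    #define X86_CR4_SMEP (1u << 20)

    struct guest_cpuid { int has_vmx; int has_smep; };

    /* Build an allowed-1 mask purely from guest CPUID, as the patch does. */
    static uint64_t cr4_fixed1_from_cpuid(const struct guest_cpuid *c)
    {
        uint64_t fixed1 = 0;

        if (c->has_vmx)
            fixed1 |= X86_CR4_VMXE;
        if (c->has_smep)
            fixed1 |= X86_CR4_SMEP;
        return fixed1;
    }

    int main(void)
    {
        struct guest_cpuid c = { .has_vmx = 1, .has_smep = 0 };

        /* SMEP is absent from CPUID, so CR4.SMEP stays must-be-0. */
        printf("cr4_fixed1 = %#llx\n",
               (unsigned long long)cr4_fixed1_from_cpuid(&c));
        return 0;
    }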

Signed-off-by: David Matlack 
---
 arch/x86/kvm/vmx.c | 55 +++---
 1 file changed, 52 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b414513..49270c4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2849,16 +2849,18 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_basic |= VMX_BASIC_INOUT;
 
/*
-* These MSRs specify bits which the guest must keep fixed (on or off)
+* These MSRs specify bits which the guest must keep fixed on
 * while L1 is in VMXON mode (in L1's root mode, or running an L2).
 * We picked the standard core2 setting.
 */
 #define VMXON_CR0_ALWAYSON (X86_CR0_PE | X86_CR0_PG | X86_CR0_NE)
 #define VMXON_CR4_ALWAYSON X86_CR4_VMXE
vmx->nested.nested_vmx_cr0_fixed0 = VMXON_CR0_ALWAYSON;
-   vmx->nested.nested_vmx_cr0_fixed1 = -1ULL;
vmx->nested.nested_vmx_cr4_fixed0 = VMXON_CR4_ALWAYSON;
-   vmx->nested.nested_vmx_cr4_fixed1 = -1ULL;
+
+   /* These MSRs specify bits which the guest must keep fixed off. */
+   rdmsrl(MSR_IA32_VMX_CR0_FIXED1, vmx->nested.nested_vmx_cr0_fixed1);
+   rdmsrl(MSR_IA32_VMX_CR4_FIXED1, vmx->nested.nested_vmx_cr4_fixed1);
 
/* highest index: VMX_PREEMPTION_TIMER_VALUE */
vmx->nested.nested_vmx_vmcs_enum = 0x2e;
@@ -9547,6 +9549,50 @@ static void vmcs_set_secondary_exec_control(u32 new_ctl)
 (new_ctl & ~mask) | (cur_ctl & mask));
 }
 
+/*
+ * Generate MSR_IA32_VMX_CR{0,4}_FIXED1 according to CPUID. Only set bits
+ * (indicating "allowed-1") if they are supported in the guest's CPUID.
+ */
+static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct kvm_cpuid_entry2 *entry;
+
+   vmx->nested.nested_vmx_cr0_fixed1 = 0xffffffff;
+   vmx->nested.nested_vmx_cr4_fixed1 = X86_CR4_PCE;
+
+#define cr4_fixed1_update(_cr4_mask, _reg, _cpuid_mask) do {   \
+   if (entry && (entry->_reg & (_cpuid_mask))) \
+   vmx->nested.nested_vmx_cr4_fixed1 |= (_cr4_mask);   \
+} while (0)
+
+   entry = kvm_find_cpuid_entry(vcpu, 0x1, 0);
+   cr4_fixed1_update(X86_CR4_VME,edx, bit(X86_FEATURE_VME));
+   cr4_fixed1_update(X86_CR4_PVI,edx, bit(X86_FEATURE_VME));
+   cr4_fixed1_update(X86_CR4_TSD,edx, bit(X86_FEATURE_TSC));
+   cr4_fixed1_update(X86_CR4_DE, edx, bit(X86_FEATURE_DE));
+   cr4_fixed1_update(X86_CR4_PSE,edx, bit(X86_FEATURE_PSE));
+   cr4_fixed1_update(X86_CR4_PAE,edx, bit(X86_FEATURE_PAE));
+   cr4_fixed1_update(X86_CR4_MCE,edx, bit(X86_FEATURE_MCE));
+   cr4_fixed1_update(X86_CR4_PGE,edx, bit(X86_FEATURE_PGE));
+   cr4_fixed1_update(X86_CR4_OSFXSR, edx, bit(X86_FEATURE_FXSR));
+   cr4_fixed1_update(X86_CR4_OSXMMEXCPT, edx, bit(X86_FEATURE_XMM));
+   cr4_fixed1_update(X86_CR4_VMXE,   ecx, bit(X86_FEATURE_VMX));
+   cr4_fixed1_update(X86_CR4_SMXE,   ecx, bit(X86_FEATURE_SMX));
+   cr4_fixed1_update(X86_CR4_PCIDE,  ecx, bit(X86_FEATURE_PCID));
+   cr4_fixed1_update(X86_CR4_OSXSAVE,ecx, bit(X86_FEATURE_XSAVE));
+
+   entry = kvm_find_cpuid_entry(vcpu, 0x7, 0);
+   cr4_fixed1_update(X86_CR4_FSGSBASE,   ebx, bit(X86_FEATURE_FSGSBASE));
+   cr4_fixed1_update(X86_CR4_SMEP,   ebx, bit(X86_FEATURE_SMEP));
+   cr4_fixed1_update(X86_CR4_SMAP,   ebx, bit(X86_FEATURE_SMAP));
+   cr4_fixed1_update(X86_CR4_PKE,ecx, bit(X86_FEATURE_PKU));
+   /* TODO: Use X86_CR4_UMIP and X86_FEATURE_UMIP macros */
+   cr4_fixed1_update(bit(11),ecx, bit(2));
+
+#undef cr4_fixed1_update
+}
+
 static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
@@ -9588,6 +9634,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
else
to_vmx(vcpu)->msr_ia32_feature_control_valid_bits &=
~FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
+
+   if (nested_vmx_allowed(vcpu))
+   nested_vmx_cr_fixed1_bits_update(vcpu);
 }
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
-- 
2.8.0.rc3.226.g39d4020



[PATCH v3 3/5] KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation

2016-11-29 Thread David Matlack
KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
all CR0 and CR4 bits are allowed to be 1 during VMX operation.

This does not match real hardware, which disallows the high 32 bits of
CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
which are defined in the SDM but missing according to CPUID). A guest
can induce a VM-entry failure by setting these bits in GUEST_CR0 and
GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
valid.

Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
checks on these registers do not verify must-be-0 bits. Fix these checks
to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.

This patch should introduce no change in behavior in KVM, since these
MSRs are still -1ULL.
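
For illustration, a minimal standalone sketch of the check this introduces.
The fixed0 value below matches VMXON_CR0_ALWAYSON (PE | NE | PG), and the
fixed1 value models hardware that allows only the low 32 bits of CR0 to be 1,
so a CR0 value with bit 32 set is rejected:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * if fixed0[i] == 1: val[i] must be 1
     * if fixed1[i] == 0: val[i] must be 0
     */
    static bool fixed_bits_valid(uint64_t val, uint64_t fixed0, uint64_t fixed1)
    {
        return ((val & fixed1) | fixed0) == val;
    }

    int main(void)
    {
        uint64_t cr0_fixed0 = 0x80000021;   /* PG | NE | PE, i.e. VMXON_CR0_ALWAYSON */
        uint64_t cr0_fixed1 = 0xffffffff;   /* high 32 bits of CR0 must be 0 */

        printf("%d\n", fixed_bits_valid(0x80000031ull, cr0_fixed0, cr0_fixed1));  /* 1 */
        printf("%d\n", fixed_bits_valid(0x180000031ull, cr0_fixed0, cr0_fixed1)); /* 0: bit 32 set */
        return 0;
    }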

Signed-off-by: David Matlack 
---
 arch/x86/kvm/vmx.c | 77 +-
 1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 01a2b9e..b414513 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2864,12 +2864,18 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_vmcs_enum = 0x2e;
 }
 
+/*
+ * if fixed0[i] == 1: val[i] must be 1
+ * if fixed1[i] == 0: val[i] must be 0
+ */
+static inline bool fixed_bits_valid(u64 val, u64 fixed0, u64 fixed1)
+{
+   return ((val & fixed1) | fixed0) == val;
+}
+
 static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
 {
-   /*
-* Bits 0 in high must be 0, and bits 1 in low must be 1.
-*/
-   return ((control & high) | low) == control;
+   return fixed_bits_valid(control, low, high);
 }
 
 static inline u64 vmx_control_msr(u32 low, u32 high)
@@ -4104,6 +4110,40 @@ static void ept_save_pdptrs(struct kvm_vcpu *vcpu)
  (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
+static bool nested_guest_cr0_valid(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   u64 fixed0 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed0;
+   u64 fixed1 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed1;
+   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+   if (to_vmx(vcpu)->nested.nested_vmx_secondary_ctls_high &
+   SECONDARY_EXEC_UNRESTRICTED_GUEST &&
+   nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST))
+   fixed0 &= ~(X86_CR0_PE | X86_CR0_PG);
+
+   return fixed_bits_valid(val, fixed0, fixed1);
+}
+
+static bool nested_host_cr0_valid(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   u64 fixed0 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed0;
+   u64 fixed1 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed1;
+
+   return fixed_bits_valid(val, fixed0, fixed1);
+}
+
+static bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   u64 fixed0 = to_vmx(vcpu)->nested.nested_vmx_cr4_fixed0;
+   u64 fixed1 = to_vmx(vcpu)->nested.nested_vmx_cr4_fixed1;
+
+   return fixed_bits_valid(val, fixed0, fixed1);
+}
+
+/* No difference in the restrictions on guest and host CR4 in VMX operation. */
+#define nested_guest_cr4_valid nested_cr4_valid
+#define nested_host_cr4_valid  nested_cr4_valid
+
 static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
@@ -4232,8 +4272,8 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
if (!nested_vmx_allowed(vcpu))
return 1;
}
-   if (to_vmx(vcpu)->nested.vmxon &&
-   ((cr4 & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON))
+
+   if (to_vmx(vcpu)->nested.vmxon && !nested_cr4_valid(vcpu, cr4))
return 1;
 
vcpu->arch.cr4 = cr4;
@@ -5852,18 +5892,6 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char 
*hypercall)
hypercall[2] = 0xc1;
 }
 
-static bool nested_cr0_valid(struct kvm_vcpu *vcpu, unsigned long val)
-{
-   unsigned long always_on = VMXON_CR0_ALWAYSON;
-   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-
-   if (to_vmx(vcpu)->nested.nested_vmx_secondary_ctls_high &
-   SECONDARY_EXEC_UNRESTRICTED_GUEST &&
-   nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST))
-   always_on &= ~(X86_CR0_PE | X86_CR0_PG);
-   return (val & always_on) == always_on;
-}
-
 /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
 static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
 {
@@ -5882,7 +5910,7 @@ static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned 
long val)
val = (val & ~vmcs12->cr0_guest_host_mask) |
(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask);
 
-   if (!nested_cr0_valid(vcpu, val))
+   if (!nested_guest_cr0_valid(vcpu, val))
return 1;
 
 

[PATCH v3 0/5] VMX Capability MSRs

2016-11-29 Thread David Matlack
This patchset adds support setting the VMX capability MSRs from userspace.
This is required for migration of nested-capable VMs to different CPUs and
KVM versions.

Patch 1 generates the non-true VMX MSRs using the true MSRs, which allows
userspace to skip restoring them.

Patch 2 adds support for restoring the VMX capability MSRs.

Patches 3 and 4 make KVM's emulation of MSR_IA32_VMX_CR{0,4}_FIXED1 more
accurate.

Patch 5 fixes a bug in emulated VM-entry that came up when testing patches
3 and 4.

Changes since v2:
  * Generate CR0_FIXED1 in addition to CR4_FIXED1
  * Generate "non-true" capability MSRs from the "true" versions and remove
"non-true" MSRs from struct nested_vmx.
  * Disallow restore of CR{0,4}_FIXED1 and "non-true" MSRs since they are
generated.

Changes since v1:
  * Support restoring less-capable versions of MSR_IA32_VMX_BASIC,
MSR_IA32_VMX_CR{0,4}_FIXED{0,1}.
  * Include VMX_INS_OUTS in MSR_IA32_VMX_BASIC initial value.

David Matlack (5):
  KVM: nVMX: generate non-true VMX MSRs based on true versions
  KVM: nVMX: support restore of VMX capability MSRs
  KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
  KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
  KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry

 arch/x86/include/asm/vmx.h |  31 +++
 arch/x86/kvm/vmx.c | 479 +
 2 files changed, 427 insertions(+), 83 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



[PATCH v3 1/5] KVM: nVMX: generate non-true VMX MSRs based on true versions

2016-11-29 Thread David Matlack
The "non-true" VMX capability MSRs can be generated from their "true"
counterparts, by OR-ing the default1 bits. The default1 bits are fixed
and defined in the SDM.

Since we can generate the non-true VMX MSRs from the true versions,
there's no need to store both in struct nested_vmx. This also lets
userspace avoid having to restore the non-true MSRs.

Note this does not preclude emulating MSR_IA32_VMX_BASIC[55]=0. To do so,
we simply need to set all the default1 bits in the true MSRs (such that
the true MSRs and the generated non-true MSRs are equal).
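
A quick standalone check of that note, with a stand-in default1 mask: once the
default1 bits are already set in the true MSR's must-be-1 (low) half, OR-ing
them in again changes nothing, so the generated non-true MSR equals the true
one:

    #include <assert.h>
    #include <stdint.h>

    #define DEFAULT1_BITS 0x00000016u   /* stand-in for the SDM default1 class */

    int main(void)
    {
        /* A "true" low (must-be-1) word with every default1 bit already set. */
        uint32_t true_low = 0x00000001u | DEFAULT1_BITS;

        /* Generating the non-true low word is then a no-op... */
        assert((true_low | DEFAULT1_BITS) == true_low);
        /* ...which is how MSR_IA32_VMX_BASIC[55]=0 could be emulated. */
        return 0;
    }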

Signed-off-by: David Matlack 
Suggested-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx.c | 45 +++--
 1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5382b82..0beb56a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -446,19 +446,21 @@ struct nested_vmx {
u16 vpid02;
u16 last_vpid;
 
+   /*
+* We only store the "true" versions of the VMX capability MSRs. We
+* generate the "non-true" versions by setting the must-be-1 bits
+* according to the SDM.
+*/
u32 nested_vmx_procbased_ctls_low;
u32 nested_vmx_procbased_ctls_high;
-   u32 nested_vmx_true_procbased_ctls_low;
u32 nested_vmx_secondary_ctls_low;
u32 nested_vmx_secondary_ctls_high;
u32 nested_vmx_pinbased_ctls_low;
u32 nested_vmx_pinbased_ctls_high;
u32 nested_vmx_exit_ctls_low;
u32 nested_vmx_exit_ctls_high;
-   u32 nested_vmx_true_exit_ctls_low;
u32 nested_vmx_entry_ctls_low;
u32 nested_vmx_entry_ctls_high;
-   u32 nested_vmx_true_entry_ctls_low;
u32 nested_vmx_misc_low;
u32 nested_vmx_misc_high;
u32 nested_vmx_ept_caps;
@@ -2712,9 +2714,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_exit_ctls_high |= VM_EXIT_CLEAR_BNDCFGS;
 
/* We support free control of debug control saving. */
-   vmx->nested.nested_vmx_true_exit_ctls_low =
-   vmx->nested.nested_vmx_exit_ctls_low &
-   ~VM_EXIT_SAVE_DEBUG_CONTROLS;
+   vmx->nested.nested_vmx_exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;
 
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2733,9 +2733,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_entry_ctls_high |= VM_ENTRY_LOAD_BNDCFGS;
 
/* We support free control of debug control loading. */
-   vmx->nested.nested_vmx_true_entry_ctls_low =
-   vmx->nested.nested_vmx_entry_ctls_low &
-   ~VM_ENTRY_LOAD_DEBUG_CONTROLS;
+   vmx->nested.nested_vmx_entry_ctls_low &= ~VM_ENTRY_LOAD_DEBUG_CONTROLS;
 
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
@@ -2768,8 +2766,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
CPU_BASED_USE_MSR_BITMAPS;
 
/* We support free control of CR3 access interception. */
-   vmx->nested.nested_vmx_true_procbased_ctls_low =
-   vmx->nested.nested_vmx_procbased_ctls_low &
+   vmx->nested.nested_vmx_procbased_ctls_low &=
~(CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING);
 
/* secondary cpu-based controls */
@@ -2868,36 +2865,32 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
*pdata = vmx_control_msr(
vmx->nested.nested_vmx_pinbased_ctls_low,
vmx->nested.nested_vmx_pinbased_ctls_high);
+   if (msr_index == MSR_IA32_VMX_PINBASED_CTLS)
+   *pdata |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
break;
case MSR_IA32_VMX_TRUE_PROCBASED_CTLS:
-   *pdata = vmx_control_msr(
-   vmx->nested.nested_vmx_true_procbased_ctls_low,
-   vmx->nested.nested_vmx_procbased_ctls_high);
-   break;
case MSR_IA32_VMX_PROCBASED_CTLS:
*pdata = vmx_control_msr(
vmx->nested.nested_vmx_procbased_ctls_low,
vmx->nested.nested_vmx_procbased_ctls_high);
+   if (msr_index == MSR_IA32_VMX_PROCBASED_CTLS)
+   *pdata |= CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
break;
case MSR_IA32_VMX_TRUE_EXIT_CTLS:
-   *pdata = vmx_control_msr(
-   vmx->nested.nested_vmx_true_exit_ctls_low,
-   vmx->nested.nested_vmx_exit_ctls_high);
-   break;
case MSR_IA32_VMX_EXIT_CTLS:
*pdata = vmx_control_msr(
vmx->nested.nested_vmx_exit_ctls_low,
  

Re: [PATCH 1/4] KVM: nVMX: support restore of VMX capability MSRs

2016-11-29 Thread David Matlack
On Tue, Nov 29, 2016 at 12:01 AM, Paolo Bonzini  wrote:
>> On Mon, Nov 28, 2016 at 2:48 PM, Paolo Bonzini  wrote:
>> > On 28/11/2016 22:11, David Matlack wrote:
>> >> > PINBASED_CTLS, PROCBASED_CTLS, EXIT_CTLS and ENTRY_CTLS can be derived
>> >> > from their "true" counterparts, so I think it's better to remove the
>> >> > "non-true" ones from struct nested_vmx (and/or add the "true" ones when
>> >> > missing) and make them entirely computed.  But it can be done on top.
>> >>
>> >> Good point. And that would mean userspace does not need to restore the
>> >> non-true MSRs, right?
>> >
>> > Yes, sorry for being a bit too concise. :)
>>
>> I'll include this cleanup in the next version of the patchset since it
>> affects which MSRs userspace will restore. It looks like a pretty
>> simple patch.
>
> Don't bother removing the "non-true" registers from nested_vmx; you only
> need to adjust the userspace API.

I already wrote the patch, so unless there's an argument against
removing them I'll include it in the next patchset. Thanks!

>
>> >
>> >> KVM does not emulate MSR_IA32_VMX_BASIC[55]=0,
>> >> and will probably never want to.
>> >
>> > That's a separate question, MSR_IA32_VMX_BASIC[55]=0 basically means
>> > that the "true" capabilities are the same as the "default" capabilities.
>> >  If userspace wanted to set it that way, KVM right now would not hide
>> > the "true" capability MSR, but on the other hand the nested hypervisor
>> > should not even notice the difference.
>>
>> KVM would also need to use the non-true MSR in place of the true MSRs
>> when checking VMCS12 during VM-entry.
>
> It's not necessary, userspace would set the relevant bits to 1 in the true
> MSRs, for both the low and high parts.  If it doesn't, it's garbage in
> garbage out.
>
> Paolo


Re: [PATCH 1/4] KVM: nVMX: support restore of VMX capability MSRs

2016-11-28 Thread David Matlack
On Mon, Nov 28, 2016 at 2:48 PM, Paolo Bonzini  wrote:
> On 28/11/2016 22:11, David Matlack wrote:
>> > PINBASED_CTLS, PROCBASED_CTLS, EXIT_CTLS and ENTRY_CTLS can be derived
>> > from their "true" counterparts, so I think it's better to remove the
>> > "non-true" ones from struct nested_vmx (and/or add the "true" ones when
>> > missing) and make them entirely computed.  But it can be done on top.
>>
>> Good point. And that would mean userspace does not need to restore the
>> non-true MSRs, right?
>
> Yes, sorry for being a bit too concise. :)

I'll include this cleanup in the next version of the patchset since it
affects which MSRs userspace will restore. It looks like a pretty
simple patch.

>
>> KVM does not emulate MSR_IA32_VMX_BASIC[55]=0,
>> and will probably never want to.
>
> That's a separate question, MSR_IA32_VMX_BASIC[55]=0 basically means
> that the "true" capabilities are the same as the "default" capabilities.
>  If userspace wanted to set it that way, KVM right now would not hide
> the "true" capability MSR, but on the other hand the nested hypervisor
> should not even notice the difference.

KVM would also need to use the non-true MSR in place of the true MSRs
when checking VMCS12 during VM-entry.

>
> Paolo


Re: [PATCH 3/4] KVM: nVMX: accurate emulation of MSR_IA32_CR{0,4}_FIXED1

2016-11-28 Thread David Matlack
On Wed, Nov 23, 2016 at 3:28 PM, David Matlack  wrote:
> On Wed, Nov 23, 2016 at 2:11 PM, Paolo Bonzini  wrote:
>> On 23/11/2016 23:07, David Matlack wrote:
>>> A downside of this scheme is we'd have to remember to update
>>> nested_vmx_cr4_fixed1_update() before giving VMs new CPUID bits. If we
>>> forget, a VM could end up with different values for CR{0,4}_FIXED0 for
>>> the same CPUID depending on which version of KVM you're running on.

I've realized my concern here doesn't make sense. Such a VM would
likely fail to enter VMX operation, or #GP (unexpectedly) at some
point later. Linux, for example, does not appear to consult
MSR_IA32_VMX_CR4_FIXED1 when determining which bits of CR4 it can use
(regardless of whether it is in VMX operation or not).

>>
>> If userspace doesn't obey KVM_GET_SUPPORTED_CPUID, all bets are off
>> anyway, so I don't think it's a big deal.  However, if you want to make
>> it generated by userspace, that would be fine as well!
>
> Ok let's generate them in userspace.

I'm more inclined to generate them in the kernel, given the above.


Re: [PATCH 1/4] KVM: nVMX: support restore of VMX capability MSRs

2016-11-28 Thread David Matlack
On Wed, Nov 23, 2016 at 3:44 AM, Paolo Bonzini  wrote:
> On 23/11/2016 02:14, David Matlack wrote:
>>   switch (msr_index) {
>>   case MSR_IA32_VMX_BASIC:
>> + return vmx_restore_vmx_basic(vmx, data);
>> + case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
>> + case MSR_IA32_VMX_PINBASED_CTLS:
>> + case MSR_IA32_VMX_TRUE_PROCBASED_CTLS:
>> + case MSR_IA32_VMX_PROCBASED_CTLS:
>> + case MSR_IA32_VMX_TRUE_EXIT_CTLS:
>> + case MSR_IA32_VMX_EXIT_CTLS:
>> + case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
>> + case MSR_IA32_VMX_ENTRY_CTLS:
>
> PINBASED_CTLS, PROCBASED_CTLS, EXIT_CTLS and ENTRY_CTLS can be derived
> from their "true" counterparts, so I think it's better to remove the
> "non-true" ones from struct nested_vmx (and/or add the "true" ones when
> missing) and make them entirely computed.  But it can be done on top.

Good point. And that would mean userspace does not need to restore the
non-true MSRs, right? KVM does not emulate MSR_IA32_VMX_BASIC[55]=0,
and will probably never want to.


Re: [PATCH 2/4] KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation

2016-11-28 Thread David Matlack
On Wed, Nov 23, 2016 at 3:31 AM, Paolo Bonzini  wrote:
> On 23/11/2016 02:14, David Matlack wrote:
>> +static bool fixed_bits_valid(u64 val, u64 fixed0, u64 fixed1)
>> +{
>> + return ((val & fixed0) == fixed0) && ((~val & ~fixed1) == ~fixed1);
>> +}
>> +
>
> This is the same as vmx_control_verify (except with u64 arguments
> instead of u32).

Good point. I'll remove this duplication in v3.
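
For what it's worth, a quick standalone check that the two formulations are
the same predicate whenever fixed0 is a subset of fixed1, which always holds
for the hardware-defined FIXED0/FIXED1 (and low/high) pairs, so folding the
check into vmx_control_verify()'s form loses nothing:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* The form in this (v2) patch. */
    static bool valid_v2(uint64_t val, uint64_t fixed0, uint64_t fixed1)
    {
        return ((val & fixed0) == fixed0) && ((~val & ~fixed1) == ~fixed1);
    }

    /* The form shared with vmx_control_verify(). */
    static bool valid_v3(uint64_t val, uint64_t fixed0, uint64_t fixed1)
    {
        return ((val & fixed1) | fixed0) == val;
    }

    int main(void)
    {
        /*
         * Exhaustive over 6-bit values. The forms agree whenever every
         * must-be-1 bit is also allowed-1 (fixed0 subset of fixed1).
         */
        for (uint64_t val = 0; val < 64; val++)
            for (uint64_t f0 = 0; f0 < 64; f0++)
                for (uint64_t f1 = 0; f1 < 64; f1++) {
                    if (f0 & ~f1)
                        continue;
                    assert(valid_v2(val, f0, f1) == valid_v3(val, f0, f1));
                }
        return 0;
    }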


Re: [PATCH 0/4] VMX Capability MSRs

2016-11-28 Thread David Matlack
On Wed, Nov 23, 2016 at 3:45 AM, Paolo Bonzini  wrote:
>
> On 23/11/2016 02:14, David Matlack wrote:
>> This patchset includes v2 of "KVM: nVMX: support restore of VMX capability
>> MSRs" (patch 1) as well as some additional related patches that came up
>> while preparing v2.
>>
>> Patches 2 and 3 make KVM's emulation of MSR_IA32_VMX_CR{0,4}_FIXED1 more
>> accurate. Patch 4 fixes a bug in emulated VM-entry that came up when
>> testing patches 2 and 3.
>>
>> Changes since v1:
>>   * Support restoring less-capable versions of MSR_IA32_VMX_BASIC,
>> MSR_IA32_VMX_CR{0,4}_FIXED{0,1}.
>>   * Include VMX_INS_OUTS in MSR_IA32_VMX_BASIC initial value.
>>
>> David Matlack (4):
>>   KVM: nVMX: support restore of VMX capability MSRs
>>   KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
>>   KVM: nVMX: accurate emulation of MSR_IA32_CR{0,4}_FIXED1
>>   KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
>>
>>  arch/x86/include/asm/vmx.h |  31 
>>  arch/x86/kvm/vmx.c | 443 
>> +++--
>>  2 files changed, 421 insertions(+), 53 deletions(-)
>>
>
> The main question is whether patches 2-3 actually make
> vmx_restore_fixed0/1_msr unnecessary, otherwise looks great.
>
> It would be nice to have a testcase for patch 4, since it could go in
> independently.

I've got a kvm-unit-test testcase for patches 2-4 but unfortunately it
depends on changes we've made internally to the kvm-unit-tests, and
we're a bit behind on getting those upstreamed.

>
> Paolo


Re: [PATCH 3/4] KVM: nVMX: accurate emulation of MSR_IA32_CR{0,4}_FIXED1

2016-11-23 Thread David Matlack
On Wed, Nov 23, 2016 at 2:11 PM, Paolo Bonzini  wrote:
> On 23/11/2016 23:07, David Matlack wrote:
>> A downside of this scheme is we'd have to remember to update
>> nested_vmx_cr4_fixed1_update() before giving VMs new CPUID bits. If we
>> forget, a VM could end up with different values for CR{0,4}_FIXED0 for
>> the same CPUID depending on which version of KVM you're running on.
>
> If userspace doesn't obey KVM_GET_SUPPORTED_CPUID, all bets are off
> anyway, so I don't think it's a big deal.  However, if you want to make
> it generated by userspace, that would be fine as well!

Ok let's generate them in userspace.

> That would simply entail removing this patch, wouldn't it?

Mostly. The first half of the patch (initialize from host MSRs) should stay.


Re: [PATCH 3/4] KVM: nVMX: accurate emulation of MSR_IA32_CR{0,4}_FIXED1

2016-11-23 Thread David Matlack
On Wed, Nov 23, 2016 at 11:24 AM, Paolo Bonzini  wrote:
>
>
> On 23/11/2016 20:16, David Matlack wrote:
>> > Oh, I thought userspace would do that!  Doing it in KVM is fine as well,
>> > but then do we need to give userspace access to CR{0,4}_FIXED{0,1} at all?
>>
>> I think it should be safe for userspace to skip restoring CR4_FIXED1,
>> since it is 100% generated based on CPUID. But I'd prefer to keep it
>> accessible from userspace, for consistency with the other VMX MSRs and
>> for flexibility. The auditing should ensure userspace doesn't restore
>> a CR4_FIXED1 that is inconsistent with CPUID.
>
> Or would it just allow userspace to put anything into it, even if it's
> inconsistent with CPUID, as long as it's consistent with the host?

It would not allow anything inconsistent with guest CPUID. The
auditing on restore of CR4_FIXED1 compares the new value with
vmx->nested.nested_vmx_cr4_fixed1, which is updated as part of setting
the guest's CPUID.
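
In code, that audit might look roughly like the sketch below. The helper name
and its exact shape are hypothetical, not the code that was merged; the point
is only that the comparison is against the CPUID-derived value already cached
in nested_vmx_cr4_fixed1:

    /*
     * Hypothetical sketch of the audit (not merged code): reject a restored
     * CR4_FIXED1 whose allowed-1 bits go beyond what the CPUID-derived value
     * permits. nested_vmx_cr4_fixed1 is regenerated on KVM_SET_CPUID.
     */
    static int vmx_restore_cr4_fixed1(struct vcpu_vmx *vmx, u64 data)
    {
        if (!is_bitwise_subset(vmx->nested.nested_vmx_cr4_fixed1, data, -1ULL))
            return -EINVAL;

        vmx->nested.nested_vmx_cr4_fixed1 = data;
        return 0;
    }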

>
>> Userspace should restore CR0_FIXED1 in case future CPUs change which
>> bits of CR0 are valid in VMX operation. Userspace should also restore
>> CR{0,4}_FIXED0 so we have the flexibility to change the defaults in
>> KVM. Both of these situations seem unlikely but we might as well play
>> it safe, the cost is small.
>
> I disagree, there is always a cost.  Besides the fact that it's
> unlikely that there'll be any future CR0 bits at all, any changes would
> most likely be keyed by a new CPUID bit (the same as CR4) or execution
> control (the same as unrestricted guest).

That's true. So CR0_FIXED1 would not need to be accessible from
userspace either. This patch would need to be a little different then:
vmx_cpuid_update should also update vmx->nested.nested_vmx_cr0_fixed1
to 0xffffffff.

A downside of this scheme is we'd have to remember to update
nested_vmx_cr4_fixed1_update() before giving VMs new CPUID bits. If we
forget, a VM could end up with different values for CR{0,4}_FIXED0 for
the same CPUID depending on which version of KVM you're running on.

Hm, now I'm thinking you were right in the beginning. Userspace should
generate CR{0,4}_FIXED1, not the kernel. And KVM should allow
userspace to save/restore them.

>
> In the end, since we assume that userspace (any) has no idea of what to
> do with it, I see no good reason to make the MSRs available.
>
> Paolo


Re: [PATCH 3/4] KVM: nVMX: accurate emulation of MSR_IA32_CR{0,4}_FIXED1

2016-11-23 Thread David Matlack
On Wed, Nov 23, 2016 at 1:06 AM, Paolo Bonzini  wrote:
>
>> Set MSR_IA32_CR{0,4}_FIXED1 to match the CPU's MSRs.
>>
>> In addition, MSR_IA32_CR4_FIXED1 should reflect the available CR4 bits
>> according to CPUID. Whenever guest CPUID is updated by userspace,
>> regenerate MSR_IA32_CR4_FIXED1 to match it.
>>
>> Signed-off-by: David Matlack 
>
> Oh, I thought userspace would do that!  Doing it in KVM is fine as well,
> but then do we need to give userspace access to CR{0,4}_FIXED{0,1} at all?

I think it should be safe for userspace to skip restoring CR4_FIXED1,
since it is 100% generated based on CPUID. But I'd prefer to keep it
accessible from userspace, for consistency with the other VMX MSRs and
for flexibility. The auditing should ensure userspace doesn't restore
a CR4_FIXED1 that is inconsistent with CPUID.

Userspace should restore CR0_FIXED1 in case future CPUs change which
bits of CR0 are valid in VMX operation. Userspace should also restore
CR{0,4}_FIXED0 so we have the flexibility to change the defaults in
KVM. Both of these situations seem unlikely but we might as well play
it safe, the cost is small.

>
> Paolo
>
>> ---
>> Note: "x86/cpufeature: Add User-Mode Instruction Prevention definitions" has
>> not hit kvm/master yet so the macros for X86_CR4_UMIP and X86_FEATURE_UMIP
>> are not available.
>>
>>  arch/x86/kvm/vmx.c | 54
>>  +++---
>>  1 file changed, 51 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index a2a5ad8..ac5d9c0 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2852,16 +2852,18 @@ static void nested_vmx_setup_ctls_msrs(struct
>> vcpu_vmx *vmx)
>>   vmx->nested.nested_vmx_basic |= VMX_BASIC_INOUT;
>>
>>   /*
>> -  * These MSRs specify bits which the guest must keep fixed (on or off)
>> +  * These MSRs specify bits which the guest must keep fixed on
>>* while L1 is in VMXON mode (in L1's root mode, or running an L2).
>>* We picked the standard core2 setting.
>>*/
>>  #define VMXON_CR0_ALWAYSON (X86_CR0_PE | X86_CR0_PG | X86_CR0_NE)
>>  #define VMXON_CR4_ALWAYSON X86_CR4_VMXE
>>   vmx->nested.nested_vmx_cr0_fixed0 = VMXON_CR0_ALWAYSON;
>> - vmx->nested.nested_vmx_cr0_fixed1 = -1ULL;
>>   vmx->nested.nested_vmx_cr4_fixed0 = VMXON_CR4_ALWAYSON;
>> - vmx->nested.nested_vmx_cr4_fixed1 = -1ULL;
>> +
>> + /* These MSRs specify bits which the guest must keep fixed off. */
>> + rdmsrl(MSR_IA32_VMX_CR0_FIXED1, vmx->nested.nested_vmx_cr0_fixed1);
>> + rdmsrl(MSR_IA32_VMX_CR4_FIXED1, vmx->nested.nested_vmx_cr4_fixed1);
>>
>>   /* highest index: VMX_PREEMPTION_TIMER_VALUE */
>>   vmx->nested.nested_vmx_vmcs_enum = 0x2e;
>> @@ -9580,6 +9582,49 @@ static void vmcs_set_secondary_exec_control(u32
>> new_ctl)
>>(new_ctl & ~mask) | (cur_ctl & mask));
>>  }
>>
>> +/*
>> + * Generate MSR_IA32_VMX_CR4_FIXED1 according to CPUID. Only set bits
>> + * (indicating "allowed-1") if they are supported in the guest's CPUID.
>> + */
>> +static void nested_vmx_cr4_fixed1_update(struct kvm_vcpu *vcpu)
>> +{
>> + struct vcpu_vmx *vmx = to_vmx(vcpu);
>> + struct kvm_cpuid_entry2 *entry;
>> +
>> +#define update(_cr4_mask, _reg, _cpuid_mask) do {\
>> + if (entry && (entry->_reg & (_cpuid_mask))) \
>> + vmx->nested.nested_vmx_cr4_fixed1 |= (_cr4_mask);   \
>> +} while (0)
>> +
>> + vmx->nested.nested_vmx_cr4_fixed1 = X86_CR4_PCE;
>> +
>> + entry = kvm_find_cpuid_entry(vcpu, 0x1, 0);
>> + update(X86_CR4_VME,edx, bit(X86_FEATURE_VME));
>> + update(X86_CR4_PVI,edx, bit(X86_FEATURE_VME));
>> + update(X86_CR4_TSD,edx, bit(X86_FEATURE_TSC));
>> + update(X86_CR4_DE, edx, bit(X86_FEATURE_DE));
>> + update(X86_CR4_PSE,edx, bit(X86_FEATURE_PSE));
>> + update(X86_CR4_PAE,edx, bit(X86_FEATURE_PAE));
>> + update(X86_CR4_MCE,edx, bit(X86_FEATURE_MCE));
>> + update(X86_CR4_PGE,edx, bit(X86_FEATURE_PGE));
>> + update(X86_CR4_OSFXSR, edx, bit(X86_FEATURE_FXSR));
>> + update(X86_CR4_OSXMMEXCPT, edx, bit(X86_FEATURE_XMM));
>> + update(X86_CR4_VMXE,   ecx, bit(X86_FEATURE_VMX));
>> + update(X86_CR4_SMXE,   ecx, bit(X86_FEATURE_SMX));
>> 

[PATCH 4/4] KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry

2016-11-22 Thread David Matlack
vmx_set_cr0() modifies GUEST_EFER and "IA-32e mode guest" in the current
VMCS. Call vmx_set_efer() after vmx_set_cr0() so that emulated VM-entry
is more faithful to VMCS12.

This patch correctly causes VM-entry to fail when "IA-32e mode guest" is
1 and GUEST_CR0.PG is 0. Previously this configuration would succeed and
"IA-32e mode guest" would silently be disabled by KVM.

Signed-off-by: David Matlack 
---
 arch/x86/kvm/vmx.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ac5d9c0..86235fc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10418,15 +10418,6 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
nested_ept_init_mmu_context(vcpu);
}
 
-   if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
-   vcpu->arch.efer = vmcs12->guest_ia32_efer;
-   else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
-   vcpu->arch.efer |= (EFER_LMA | EFER_LME);
-   else
-   vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
-   /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
-   vmx_set_efer(vcpu, vcpu->arch.efer);
-
/*
 * This sets GUEST_CR0 to vmcs12->guest_cr0, with possibly a modified
 * TS bit (for lazy fpu) and bits which we consider mandatory enabled.
@@ -10441,6 +10432,15 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
vmx_set_cr4(vcpu, vmcs12->guest_cr4);
vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
 
+   if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
+   vcpu->arch.efer = vmcs12->guest_ia32_efer;
+   else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
+   vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+   else
+   vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+   /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
+   vmx_set_efer(vcpu, vcpu->arch.efer);
+
/* shadow page tables on either EPT or shadow page tables */
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
-- 
2.8.0.rc3.226.g39d4020



[PATCH 3/4] KVM: nVMX: accurate emulation of MSR_IA32_CR{0,4}_FIXED1

2016-11-22 Thread David Matlack
Set MSR_IA32_CR{0,4}_FIXED1 to match the CPU's MSRs.

In addition, MSR_IA32_CR4_FIXED1 should reflect the available CR4 bits
according to CPUID. Whenever guest CPUID is updated by userspace,
regenerate MSR_IA32_CR4_FIXED1 to match it.

Signed-off-by: David Matlack 
---
Note: "x86/cpufeature: Add User-Mode Instruction Prevention definitions" has
not hit kvm/master yet so the macros for X86_CR4_UMIP and X86_FEATURE_UMIP
are not available.

 arch/x86/kvm/vmx.c | 54 +++---
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a2a5ad8..ac5d9c0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2852,16 +2852,18 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_basic |= VMX_BASIC_INOUT;
 
/*
-* These MSRs specify bits which the guest must keep fixed (on or off)
+* These MSRs specify bits which the guest must keep fixed on
 * while L1 is in VMXON mode (in L1's root mode, or running an L2).
 * We picked the standard core2 setting.
 */
 #define VMXON_CR0_ALWAYSON (X86_CR0_PE | X86_CR0_PG | X86_CR0_NE)
 #define VMXON_CR4_ALWAYSON X86_CR4_VMXE
vmx->nested.nested_vmx_cr0_fixed0 = VMXON_CR0_ALWAYSON;
-   vmx->nested.nested_vmx_cr0_fixed1 = -1ULL;
vmx->nested.nested_vmx_cr4_fixed0 = VMXON_CR4_ALWAYSON;
-   vmx->nested.nested_vmx_cr4_fixed1 = -1ULL;
+
+   /* These MSRs specify bits which the guest must keep fixed off. */
+   rdmsrl(MSR_IA32_VMX_CR0_FIXED1, vmx->nested.nested_vmx_cr0_fixed1);
+   rdmsrl(MSR_IA32_VMX_CR4_FIXED1, vmx->nested.nested_vmx_cr4_fixed1);
 
/* highest index: VMX_PREEMPTION_TIMER_VALUE */
vmx->nested.nested_vmx_vmcs_enum = 0x2e;
@@ -9580,6 +9582,49 @@ static void vmcs_set_secondary_exec_control(u32 new_ctl)
 (new_ctl & ~mask) | (cur_ctl & mask));
 }
 
+/*
+ * Generate MSR_IA32_VMX_CR4_FIXED1 according to CPUID. Only set bits
+ * (indicating "allowed-1") if they are supported in the guest's CPUID.
+ */
+static void nested_vmx_cr4_fixed1_update(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct kvm_cpuid_entry2 *entry;
+
+#define update(_cr4_mask, _reg, _cpuid_mask) do {  \
+   if (entry && (entry->_reg & (_cpuid_mask))) \
+   vmx->nested.nested_vmx_cr4_fixed1 |= (_cr4_mask);   \
+} while (0)
+
+   vmx->nested.nested_vmx_cr4_fixed1 = X86_CR4_PCE;
+
+   entry = kvm_find_cpuid_entry(vcpu, 0x1, 0);
+   update(X86_CR4_VME,edx, bit(X86_FEATURE_VME));
+   update(X86_CR4_PVI,edx, bit(X86_FEATURE_VME));
+   update(X86_CR4_TSD,edx, bit(X86_FEATURE_TSC));
+   update(X86_CR4_DE, edx, bit(X86_FEATURE_DE));
+   update(X86_CR4_PSE,edx, bit(X86_FEATURE_PSE));
+   update(X86_CR4_PAE,edx, bit(X86_FEATURE_PAE));
+   update(X86_CR4_MCE,edx, bit(X86_FEATURE_MCE));
+   update(X86_CR4_PGE,edx, bit(X86_FEATURE_PGE));
+   update(X86_CR4_OSFXSR, edx, bit(X86_FEATURE_FXSR));
+   update(X86_CR4_OSXMMEXCPT, edx, bit(X86_FEATURE_XMM));
+   update(X86_CR4_VMXE,   ecx, bit(X86_FEATURE_VMX));
+   update(X86_CR4_SMXE,   ecx, bit(X86_FEATURE_SMX));
+   update(X86_CR4_PCIDE,  ecx, bit(X86_FEATURE_PCID));
+   update(X86_CR4_OSXSAVE,ecx, bit(X86_FEATURE_XSAVE));
+
+   entry = kvm_find_cpuid_entry(vcpu, 0x7, 0);
+   update(X86_CR4_FSGSBASE,   ebx, bit(X86_FEATURE_FSGSBASE));
+   update(X86_CR4_SMEP,   ebx, bit(X86_FEATURE_SMEP));
+   update(X86_CR4_SMAP,   ebx, bit(X86_FEATURE_SMAP));
+   update(X86_CR4_PKE,ecx, bit(X86_FEATURE_PKU));
+   /* TODO: Use X86_CR4_UMIP and X86_FEATURE_UMIP macros */
+   update(bit(11),ecx, bit(2));
+
+#undef update
+}
+
 static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
@@ -9621,6 +9666,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
else
to_vmx(vcpu)->msr_ia32_feature_control_valid_bits &=
~FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
+
+   if (nested_vmx_allowed(vcpu))
+   nested_vmx_cr4_fixed1_update(vcpu);
 }
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
-- 
2.8.0.rc3.226.g39d4020



[PATCH 2/4] KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation

2016-11-22 Thread David Matlack
KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
all CR0 and CR4 bits are allowed to be 1 during VMX operation.

This does not match real hardware, which disallows the high 32 bits of
CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
which are defined in the SDM but missing according to CPUID). A guest
can induce a VM-entry failure by setting these bits in GUEST_CR0 and
GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
valid.

Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
checks on these registers do not verify must-be-0 bits. Fix these checks
to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.

This patch should introduce no change in behavior in KVM, since these
MSRs are still -1ULL.

Signed-off-by: David Matlack 
---
 arch/x86/kvm/vmx.c | 68 ++
 1 file changed, 48 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6ec3832..a2a5ad8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4138,6 +4138,45 @@ static void ept_save_pdptrs(struct kvm_vcpu *vcpu)
  (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
+static bool fixed_bits_valid(u64 val, u64 fixed0, u64 fixed1)
+{
+   return ((val & fixed0) == fixed0) && ((~val & ~fixed1) == ~fixed1);
+}
+
+static bool nested_guest_cr0_valid(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   u64 fixed0 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed0;
+   u64 fixed1 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed1;
+   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+   if (to_vmx(vcpu)->nested.nested_vmx_secondary_ctls_high &
+   SECONDARY_EXEC_UNRESTRICTED_GUEST &&
+   nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST))
+   fixed0 &= ~(X86_CR0_PE | X86_CR0_PG);
+
+   return fixed_bits_valid(val, fixed0, fixed1);
+}
+
+static bool nested_host_cr0_valid(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   u64 fixed0 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed0;
+   u64 fixed1 = to_vmx(vcpu)->nested.nested_vmx_cr0_fixed1;
+
+   return fixed_bits_valid(val, fixed0, fixed1);
+}
+
+static bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   u64 fixed0 = to_vmx(vcpu)->nested.nested_vmx_cr4_fixed0;
+   u64 fixed1 = to_vmx(vcpu)->nested.nested_vmx_cr4_fixed1;
+
+   return fixed_bits_valid(val, fixed0, fixed1);
+}
+
+/* No difference in the restrictions on guest and host CR4 in VMX operation. */
+#define nested_guest_cr4_valid nested_cr4_valid
+#define nested_host_cr4_valid  nested_cr4_valid
+
 static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
@@ -4266,8 +4305,8 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
if (!nested_vmx_allowed(vcpu))
return 1;
}
-   if (to_vmx(vcpu)->nested.vmxon &&
-   ((cr4 & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON))
+
+   if (to_vmx(vcpu)->nested.vmxon && !nested_cr4_valid(vcpu, cr4))
return 1;
 
vcpu->arch.cr4 = cr4;
@@ -5886,18 +5925,6 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char 
*hypercall)
hypercall[2] = 0xc1;
 }
 
-static bool nested_cr0_valid(struct kvm_vcpu *vcpu, unsigned long val)
-{
-   unsigned long always_on = VMXON_CR0_ALWAYSON;
-   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-
-   if (to_vmx(vcpu)->nested.nested_vmx_secondary_ctls_high &
-   SECONDARY_EXEC_UNRESTRICTED_GUEST &&
-   nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST))
-   always_on &= ~(X86_CR0_PE | X86_CR0_PG);
-   return (val & always_on) == always_on;
-}
-
 /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
 static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
 {
@@ -5916,7 +5943,7 @@ static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned 
long val)
val = (val & ~vmcs12->cr0_guest_host_mask) |
(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask);
 
-   if (!nested_cr0_valid(vcpu, val))
+   if (!nested_guest_cr0_valid(vcpu, val))
return 1;
 
if (kvm_set_cr0(vcpu, val))
@@ -5925,8 +5952,9 @@ static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned 
long val)
return 0;
} else {
if (to_vmx(vcpu)->nested.vmxon &&
-   ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
+   !nested_host_cr0_valid(vcpu, val))
return 1;
+
return kvm_set_cr0(vcpu, val);
}
 }
@@ -10472,15 +10500,15 @@ static int nested_vmx_run(struc

[PATCH 1/4] KVM: nVMX: support restore of VMX capability MSRs

2016-11-22 Thread David Matlack
The VMX capability MSRs advertise the set of features the KVM virtual
CPU can support. This set of features varies across different host CPUs
and KVM versions. This patch aims to address both sources of
differences, allowing VMs to be migrated across CPUs and KVM versions
without guest-visible changes to these MSRs. Note that cross-KVM-
version migration is only supported from this point forward.

When the VMX capability MSRs are restored, they are audited to check
that the set of features advertised are a subset of what KVM and the
CPU support.

Since the VMX capability MSRs are read-only, they do not need to be on
the default MSR save/restore lists. The userspace hypervisor can set
the values of these MSRs or read them from KVM at VCPU creation time,
and restore the same value after every save/restore.

Signed-off-by: David Matlack 
---
 arch/x86/include/asm/vmx.h |  31 +
 arch/x86/kvm/vmx.c | 317 +
 2 files changed, 324 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index a002b07..a4ca897 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -25,6 +25,7 @@
 #define VMX_H
 
 
+#include 
 #include 
 #include 
 
@@ -110,6 +111,36 @@
 #define VMX_MISC_SAVE_EFER_LMA 0x0020
 #define VMX_MISC_ACTIVITY_HLT  0x0040
 
+static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
+{
+   return vmx_basic & GENMASK_ULL(30, 0);
+}
+
+static inline u32 vmx_basic_vmcs_size(u64 vmx_basic)
+{
+   return (vmx_basic & GENMASK_ULL(44, 32)) >> 32;
+}
+
+static inline int vmx_misc_preemption_timer_rate(u64 vmx_misc)
+{
+   return vmx_misc & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;
+}
+
+static inline int vmx_misc_cr3_count(u64 vmx_misc)
+{
+   return (vmx_misc & GENMASK_ULL(24, 16)) >> 16;
+}
+
+static inline int vmx_misc_max_msr(u64 vmx_misc)
+{
+   return (vmx_misc & GENMASK_ULL(27, 25)) >> 25;
+}
+
+static inline int vmx_misc_mseg_revid(u64 vmx_misc)
+{
+   return (vmx_misc & GENMASK_ULL(63, 32)) >> 32;
+}
+
 /* VMCS Encodings */
 enum vmcs_field {
VIRTUAL_PROCESSOR_ID= 0x,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5382b82..6ec3832 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -463,6 +463,12 @@ struct nested_vmx {
u32 nested_vmx_misc_high;
u32 nested_vmx_ept_caps;
u32 nested_vmx_vpid_caps;
+   u64 nested_vmx_basic;
+   u64 nested_vmx_cr0_fixed0;
+   u64 nested_vmx_cr0_fixed1;
+   u64 nested_vmx_cr4_fixed0;
+   u64 nested_vmx_cr4_fixed1;
+   u64 nested_vmx_vmcs_enum;
 };
 
 #define POSTED_INTR_ON  0
@@ -2829,6 +2835,36 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE |
VMX_MISC_ACTIVITY_HLT;
vmx->nested.nested_vmx_misc_high = 0;
+
+   /*
+* This MSR reports some information about VMX support. We
+* should return information about the VMX we emulate for the
+* guest, and the VMCS structure we give it - not about the
+* VMX support of the underlying hardware.
+*/
+   vmx->nested.nested_vmx_basic =
+   VMCS12_REVISION |
+   VMX_BASIC_TRUE_CTLS |
+   ((u64)VMCS12_SIZE << VMX_BASIC_VMCS_SIZE_SHIFT) |
+   (VMX_BASIC_MEM_TYPE_WB << VMX_BASIC_MEM_TYPE_SHIFT);
+
+   if (cpu_has_vmx_basic_inout())
+   vmx->nested.nested_vmx_basic |= VMX_BASIC_INOUT;
+
+   /*
+* These MSRs specify bits which the guest must keep fixed (on or off)
+* while L1 is in VMXON mode (in L1's root mode, or running an L2).
+* We picked the standard core2 setting.
+*/
+#define VMXON_CR0_ALWAYSON (X86_CR0_PE | X86_CR0_PG | X86_CR0_NE)
+#define VMXON_CR4_ALWAYSON X86_CR4_VMXE
+   vmx->nested.nested_vmx_cr0_fixed0 = VMXON_CR0_ALWAYSON;
+   vmx->nested.nested_vmx_cr0_fixed1 = -1ULL;
+   vmx->nested.nested_vmx_cr4_fixed0 = VMXON_CR4_ALWAYSON;
+   vmx->nested.nested_vmx_cr4_fixed1 = -1ULL;
+
+   /* highest index: VMX_PREEMPTION_TIMER_VALUE */
+   vmx->nested.nested_vmx_vmcs_enum = 0x2e;
 }
 
 static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
@@ -2844,24 +2880,260 @@ static inline u64 vmx_control_msr(u32 low, u32 high)
return low | ((u64)high << 32);
 }
 
-/* Returns 0 on success, non-0 otherwise. */
-static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)
+{
+   superset &= mask;
+   subset &= mask;
+
+   return (superset | subset) == superset;
+}
+
+static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
+{
+   const u64 feature_and_reserved =
+   

[PATCH 0/4] VMX Capability MSRs

2016-11-22 Thread David Matlack
This patchset includes v2 of "KVM: nVMX: support restore of VMX capability
MSRs" (patch 1) as well as some additional related patches that came up
while preparing v2.

Patches 2 and 3 make KVM's emulation of MSR_IA32_VMX_CR{0,4}_FIXED1 more
accurate. Patch 4 fixes a bug in emulated VM-entry that came up when
testing patches 2 and 3.

Changes since v1:
  * Support restoring less-capable versions of MSR_IA32_VMX_BASIC,
MSR_IA32_VMX_CR{0,4}_FIXED{0,1}.
  * Include VMX_INS_OUTS in MSR_IA32_VMX_BASIC initial value.

David Matlack (4):
  KVM: nVMX: support restore of VMX capability MSRs
  KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
  KVM: nVMX: accurate emulation of MSR_IA32_CR{0,4}_FIXED1
  KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry

 arch/x86/include/asm/vmx.h |  31 
 arch/x86/kvm/vmx.c | 443 +++--
 2 files changed, 421 insertions(+), 53 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v10 7/7] KVM: x86: virtualize cpuid faulting

2016-11-08 Thread David Matlack
On Tue, Nov 8, 2016 at 10:39 AM, Kyle Huey  wrote:
> Hardware support for faulting on the cpuid instruction is not required to
> emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
> MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
> cpuid-induced VM exit checks the cpuid faulting state and the CPL.
> kvm_require_cpl is even kind enough to inject the GP fault for us.
>
> Signed-off-by: Kyle Huey 

Reviewed-by: David Matlack 

(v10)


Re: [PATCH v9 7/7] KVM: x86: virtualize cpuid faulting

2016-11-07 Thread David Matlack
On Sun, Nov 6, 2016 at 12:57 PM, Kyle Huey  wrote:
> Hardware support for faulting on the cpuid instruction is not required to
> emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
> MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
> cpuid-induced VM exit checks the cpuid faulting state and the CPL.
> kvm_require_cpl is even kind enough to inject the GP fault for us.
>
> Signed-off-by: Kyle Huey 
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/cpuid.c|  3 +++
>  arch/x86/kvm/x86.c  | 28 
>  3 files changed, 33 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index bdde807..5edef7b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -592,16 +592,18 @@ struct kvm_vcpu_arch {
> u64 pat;
>
> unsigned switch_db_regs;
> unsigned long db[KVM_NR_DB_REGS];
> unsigned long dr6;
> unsigned long dr7;
> unsigned long eff_db[KVM_NR_DB_REGS];
> unsigned long guest_debug_dr7;
> +   bool cpuid_fault_supported;
> +   bool cpuid_fault;

Suggest storing these in MSR form:

u64 msr_platform_info;
u64 msr_misc_features_enables;

It will simplify the MSR get/set code, and make it easier to plumb
support for new bits in these MSRs.
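
Roughly what that shape looks like in the get path, as a sketch only; the
msr_platform_info/msr_misc_features_enables fields are the suggested
(hypothetical) u64 members of struct kvm_vcpu_arch:

    /* Sketch only: with the state kept in MSR form, the get path is a copy. */
    static int get_feature_msrs_sketch(struct kvm_vcpu *vcpu,
                                       struct msr_data *msr_info)
    {
        switch (msr_info->index) {
        case MSR_PLATFORM_INFO:
            msr_info->data = vcpu->arch.msr_platform_info;
            return 0;
        case MSR_MISC_FEATURES_ENABLES:
            msr_info->data = vcpu->arch.msr_misc_features_enables;
            return 0;
        default:
            return 1;
        }
    }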

>
> u64 mcg_cap;
> u64 mcg_status;
> u64 mcg_ctl;
> u64 mcg_ext_ctl;
> u64 *mce_banks;
>
> /* Cache MMIO info */
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index afa7bbb..ed8436a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -862,16 +862,19 @@ void kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 
> *ebx, u32 *ecx, u32 *edx)
> trace_kvm_cpuid(function, *eax, *ebx, *ecx, *edx);
>  }
>  EXPORT_SYMBOL_GPL(kvm_cpuid);
>
>  void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
>  {
> u32 function, eax, ebx, ecx, edx;
>
> +   if (vcpu->arch.cpuid_fault && !kvm_require_cpl(vcpu, 0))
> +   return;
> +
> function = eax = kvm_register_read(vcpu, VCPU_REGS_RAX);
> ecx = kvm_register_read(vcpu, VCPU_REGS_RCX);
> kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx);
> kvm_register_write(vcpu, VCPU_REGS_RAX, eax);
> kvm_register_write(vcpu, VCPU_REGS_RBX, ebx);
> kvm_register_write(vcpu, VCPU_REGS_RCX, ecx);
> kvm_register_write(vcpu, VCPU_REGS_RDX, edx);
> kvm_x86_ops->skip_emulated_instruction(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3017de0..9cd6462 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -986,16 +986,18 @@ static u32 emulated_msrs[] = {
>
> MSR_IA32_TSC_ADJUST,
> MSR_IA32_TSCDEADLINE,
> MSR_IA32_MISC_ENABLE,
> MSR_IA32_MCG_STATUS,
> MSR_IA32_MCG_CTL,
> MSR_IA32_MCG_EXT_CTL,
> MSR_IA32_SMBASE,
> +   MSR_PLATFORM_INFO,
> +   MSR_MISC_FEATURES_ENABLES,
>  };
>
>  static unsigned num_emulated_msrs;
>
>  bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer)
>  {
> if (efer & efer_reserved_bits)
> return false;
> @@ -2269,16 +2271,29 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> return 1;
> vcpu->arch.osvw.length = data;
> break;
> case MSR_AMD64_OSVW_STATUS:
> if (!guest_cpuid_has_osvw(vcpu))
> return 1;
> vcpu->arch.osvw.status = data;
> break;
> +   case MSR_PLATFORM_INFO:
> +   if (!msr_info->host_initiated ||
> +   data & ~PLATINFO_CPUID_FAULT ||
> +   (!!(data & PLATINFO_CPUID_FAULT) && 
> vcpu->arch.cpuid_fault))

Should that be a single exclamation point?

> +   return 1;
> +   vcpu->arch.cpuid_fault_supported = !!(data & 
> PLATINFO_CPUID_FAULT);

No need for "!!".

> +   break;
> +   case MSR_MISC_FEATURES_ENABLES:
> +   if (data & ~CPUID_FAULT_ENABLE ||
> +   !vcpu->arch.cpuid_fault_supported)
> +   return 1;
> +   vcpu->arch.cpuid_fault = !!(data & CPUID_FAULT_ENABLE);

No need for "!!".

> +   break;
> default:
> if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
> return xen_hvm_config(vcpu, data);
> if (kvm_pmu_is_valid_msr(vcpu, msr))
> return kvm_pmu_set_msr(vcpu, msr_info);
> if (!ignore_msrs) {
> vcpu_unimpl(vcpu, "unhandled wrmsr: 0x%x data 
> 0x%llx\n",
> msr, data);
> @@ -2483,16 +2498,26 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> return 1;
> ms

Re: [PATCH v8 7/7] KVM: x86: virtualize cpuid faulting

2016-11-04 Thread David Matlack
On Fri, Nov 4, 2016 at 2:57 PM, Paolo Bonzini  wrote:
>
> On 04/11/2016 21:34, David Matlack wrote:
>> On Mon, Oct 31, 2016 at 6:37 PM, Kyle Huey  wrote:
>>> +   case MSR_PLATFORM_INFO:
>>> +   /* cpuid faulting is supported */
>>> +   msr_info->data = PLATINFO_CPUID_FAULT;
>>> +   break;
>>
>> This could break save/restore, if for example, a VM is migrated to a
>> version of KVM without MSR_PLATFORM_INFO support. I think the way to
>> handle this is to make MSR_PLATFORM_INFO writeable (but only from
>> userspace) so that hypervisors can defend themselves (by setting this
>> MSR to 0).
>
> Right---and with my QEMU hat on, this feature will have to be enabled
> manually on the command line because of the way QEMU supports running
> with old kernels. :(  This however does not impact the KVM patch.
>
> We may decide that, because CPUID faulting doesn't have a CPUID bit and
> is relatively a "fringe" feature, we are okay if the kernel enables this
> unconditionally and then userspace can arrange to block migration (in
> QEMU this would use a subsection).  David, Eduardo, opinions?

Sounds reasonable. Accurate CPU virtualization might be another reason
to disable this feature from userspace.

My worry is a kernel rollback, where migrating to an older kernel
version is unavoidable.

>
>>
>>> +   case MSR_MISC_FEATURES_ENABLES:
>>> +   msr_info->data = 0;
>>> +   if (vcpu->arch.cpuid_fault)
>>> +   msr_info->data |= CPUID_FAULT_ENABLE;
>>> +   break;
>>
>> MSR_MISC_FEATURES_ENABLES should be added to emulated_msrs[] so that
>> the hypervisor will maintain the value of CPUID_FAULT_ENABLE across a
>> save/restore.
>
> This is definitely necessary.  Thanks David.
>
> Paolo
>
>>> default:
>>> if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
>>> return kvm_pmu_get_msr(vcpu, msr_info->index, 
>>> &msr_info->data);
>>> if (!ignore_msrs) {
>>> vcpu_unimpl(vcpu, "unhandled rdmsr: 0x%x\n", 
>>> msr_info->index);
>>> return 1;
>>> } else {
>>> vcpu_unimpl(vcpu, "ignored rdmsr: 0x%x\n", 
>>> msr_info->index);
>>> @@ -7493,16 +7507,18 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool 
>>> init_event)
>>> kvm_update_dr0123(vcpu);
>>> vcpu->arch.dr6 = DR6_INIT;
>>> kvm_update_dr6(vcpu);
>>> vcpu->arch.dr7 = DR7_FIXED_1;
>>> kvm_update_dr7(vcpu);
>>>
>>> vcpu->arch.cr2 = 0;
>>>
>>> +   vcpu->arch.cpuid_fault = false;
>>> +
>>> kvm_make_request(KVM_REQ_EVENT, vcpu);
>>> vcpu->arch.apf.msr_val = 0;
>>> vcpu->arch.st.msr_val = 0;
>>>
>>> kvmclock_reset(vcpu);
>>>
>>> kvm_clear_async_pf_completion_queue(vcpu);
>>> kvm_async_pf_hash_reset(vcpu);
>>> --
>>> 2.10.2
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>


Re: [PATCH v8 7/7] KVM: x86: virtualize cpuid faulting

2016-11-04 Thread David Matlack
On Mon, Oct 31, 2016 at 6:37 PM, Kyle Huey  wrote:
> Hardware support for faulting on the cpuid instruction is not required to
> emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
> MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
> cpuid-induced VM exit checks the cpuid faulting state and the CPL.
> kvm_require_cpl is even kind enough to inject the GP fault for us.
>
> Signed-off-by: Kyle Huey 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/cpuid.c|  3 +++
>  arch/x86/kvm/x86.c  | 16 
>  3 files changed, 20 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4b20f73..4a6e62b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -592,16 +592,17 @@ struct kvm_vcpu_arch {
> u64 pat;
>
> unsigned switch_db_regs;
> unsigned long db[KVM_NR_DB_REGS];
> unsigned long dr6;
> unsigned long dr7;
> unsigned long eff_db[KVM_NR_DB_REGS];
> unsigned long guest_debug_dr7;
> +   bool cpuid_fault;
>
> u64 mcg_cap;
> u64 mcg_status;
> u64 mcg_ctl;
> u64 mcg_ext_ctl;
> u64 *mce_banks;
>
> /* Cache MMIO info */
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index afa7bbb..ed8436a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -862,16 +862,19 @@ void kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 
> *ebx, u32 *ecx, u32 *edx)
> trace_kvm_cpuid(function, *eax, *ebx, *ecx, *edx);
>  }
>  EXPORT_SYMBOL_GPL(kvm_cpuid);
>
>  void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
>  {
> u32 function, eax, ebx, ecx, edx;
>
> +   if (vcpu->arch.cpuid_fault && !kvm_require_cpl(vcpu, 0))
> +   return;
> +
> function = eax = kvm_register_read(vcpu, VCPU_REGS_RAX);
> ecx = kvm_register_read(vcpu, VCPU_REGS_RCX);
> kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx);
> kvm_register_write(vcpu, VCPU_REGS_RAX, eax);
> kvm_register_write(vcpu, VCPU_REGS_RBX, ebx);
> kvm_register_write(vcpu, VCPU_REGS_RCX, ecx);
> kvm_register_write(vcpu, VCPU_REGS_RDX, edx);
> kvm_x86_ops->skip_emulated_instruction(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e375235..470c553 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2269,16 +2269,21 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> return 1;
> vcpu->arch.osvw.length = data;
> break;
> case MSR_AMD64_OSVW_STATUS:
> if (!guest_cpuid_has_osvw(vcpu))
> return 1;
> vcpu->arch.osvw.status = data;
> break;
> +   case MSR_MISC_FEATURES_ENABLES:
> +   if (data & ~CPUID_FAULT_ENABLE)
> +   return 1;

(Due to my comments below, PLATINFO_CPUID_FAULT will not necessarily
be enabled for guests. So this code will need to check if the virtual CPU
supports PLATINFO_CPUID_FAULT before enabling CPUID faulting.)
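
(E.g., something like the sketch below; cpuid_fault_supported is the
field name a later revision of this series uses, so treat it as
illustrative here:)

        case MSR_MISC_FEATURES_ENABLES:
                if (data & ~CPUID_FAULT_ENABLE ||
                    ((data & CPUID_FAULT_ENABLE) &&
                     !vcpu->arch.cpuid_fault_supported))
                        return 1;
                vcpu->arch.cpuid_fault = !!(data & CPUID_FAULT_ENABLE);
                break;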

> +   vcpu->arch.cpuid_fault = !!(data & CPUID_FAULT_ENABLE);
> +   break;
> default:
> if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
> return xen_hvm_config(vcpu, data);
> if (kvm_pmu_is_valid_msr(vcpu, msr))
> return kvm_pmu_set_msr(vcpu, msr_info);
> if (!ignore_msrs) {
> vcpu_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
> msr, data);
> @@ -2483,16 +2488,25 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> return 1;
> msr_info->data = vcpu->arch.osvw.length;
> break;
> case MSR_AMD64_OSVW_STATUS:
> if (!guest_cpuid_has_osvw(vcpu))
> return 1;
> msr_info->data = vcpu->arch.osvw.status;
> break;
> +   case MSR_PLATFORM_INFO:
> +   /* cpuid faulting is supported */
> +   msr_info->data = PLATINFO_CPUID_FAULT;
> +   break;

This could break save/restore, if for example, a VM is migrated to a
version of KVM without MSR_PLATFORM_INFO support. I think the way to
handle this is to make MSR_PLATFORM_INFO writeable (but only from
userspace) so that hypervisors can defend themselves (by setting this
MSR to 0).
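
For example, userspace could defend itself before migration with
something along these lines (a sketch only; it assumes KVM accepts
host-initiated writes to MSR_PLATFORM_INFO, whose index is 0xce):

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        /* Clear MSR_PLATFORM_INFO so the guest never sees CPUID faulting.
         * vcpu_fd is the KVM vCPU file descriptor. */
        static int hide_cpuid_faulting(int vcpu_fd)
        {
                struct {
                        struct kvm_msrs hdr;
                        struct kvm_msr_entry entry;
                } msrs = {
                        .hdr.nmsrs   = 1,
                        .entry.index = 0xce,    /* MSR_PLATFORM_INFO */
                        .entry.data  = 0,
                };

                return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
        }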

> +   case MSR_MISC_FEATURES_ENABLES:
> +   msr_info->data = 0;
> +   if (vcpu->arch.cpuid_fault)
> +   msr_info->data |= CPUID_FAULT_ENABLE;
> +   break;

MSR_MISC_FEATURES_ENABLES should be added to emulated_msrs[] so that
the hypervisor will maintain the value of CPUID_FAULT_ENABLE across a
save/restore.

Re: [PATCH 2/2] x86, kvm: use kvmclock to compute TSC deadline value

2016-09-09 Thread David Matlack
On Fri, Sep 9, 2016 at 9:38 AM, Paolo Bonzini  wrote:
>
> On 09/09/2016 00:13, David Matlack wrote:
>> Hi Paolo,
>>
>> On Tue, Sep 6, 2016 at 3:29 PM, Paolo Bonzini  wrote:
>>> Bad things happen if a guest using the TSC deadline timer is migrated.
>>> The guest doesn't re-calibrate the TSC after migration, and the
>>> TSC frequency can and will change unless your processor supports TSC
>>> scaling (on Intel this is only Skylake) or your data center is perfectly
>>> homogeneous.
>>
>> Sorry, I forgot to follow up on our discussion in v1. One thing we
>> discussed there was using the APIC Timer to workaround a changing TSC
>> rate. You pointed out KVM's TSC deadline timer got a nice performance
>> boost recently, which makes it preferable. Does it make sense to
>> apply the same optimization (using the VMX preemption timer) to the
>> APIC Timer, instead of creating a new dependency on kvmclock?
>
> Hi, yes it does.  If we go that way kvmclock.c should be patched to
> blacklist the TSC deadline timer.  However, I won't have time to work on
> it anytime soon, so _if I get reviews_ I'll take this patch first.

Got it, thanks for the context.


Re: [PATCH 2/2] x86, kvm: use kvmclock to compute TSC deadline value

2016-09-08 Thread David Matlack
Hi Paolo,

On Tue, Sep 6, 2016 at 3:29 PM, Paolo Bonzini  wrote:
> Bad things happen if a guest using the TSC deadline timer is migrated.
> The guest doesn't re-calibrate the TSC after migration, and the
> TSC frequency can and will change unless your processor supports TSC
> scaling (on Intel this is only Skylake) or your data center is perfectly
> homogeneous.

Sorry, I forgot to follow up on our discussion in v1. One thing we
discussed there was using the APIC Timer to workaround a changing TSC
rate. You pointed out KVM's TSC deadline timer got a nice performance
boost recently, which makes it preferable. Does it make sense to
apply the same optimization (using the VMX preemption timer) to the
APIC Timer, instead of creating a new dependency on kvmclock?

>
> The solution in this patch is to skip tsc_khz, and instead derive the
> frequency from kvmclock's (mult, shift) pair.  Because kvmclock
> parameters convert from tsc to nanoseconds, this needs a division
> but that's the least of our problems when the TSC_DEADLINE_MSR write
> costs 2000 clock cycles.  Luckily tsc_khz is really used by very little
> outside the tsc clocksource (which kvmclock replaces) and the TSC
> deadline timer.  Because KVM's local APIC doesn't need quirks, we
> provide a paravirt clockevent that still uses the deadline timer
> under the hood (as suggested by Andy Lutomirski).
>
> This patch does not handle the very first deadline, hoping that it
> is masked by the migration downtime (i.e. that the timer fires late
> anyway).
>
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/include/asm/apic.h |   1 +
>  arch/x86/kernel/apic/apic.c |   2 +-
>  arch/x86/kernel/kvmclock.c  | 156 
> 
>  3 files changed, 158 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> index f6e0bad1cde2..c88b0dcfdf3a 100644
> --- a/arch/x86/include/asm/apic.h
> +++ b/arch/x86/include/asm/apic.h
> @@ -53,6 +53,7 @@ extern unsigned int apic_verbosity;
>  extern int local_apic_timer_c2_ok;
>
>  extern int disable_apic;
> +extern int disable_apic_timer;
>  extern unsigned int lapic_timer_frequency;
>
>  #ifdef CONFIG_SMP
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 5b63bec7d0af..d0c6d1e3d627 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -169,7 +169,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
>  unsigned long mp_lapic_addr;
>  int disable_apic;
>  /* Disable local APIC timer from the kernel commandline or via dmi quirk */
> -static int disable_apic_timer __initdata;
> +int disable_apic_timer __initdata;
>  /* Local APIC timer works in C2 */
>  int local_apic_timer_c2_ok;
>  EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok);
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 1d39bfbd26bb..365fa6494dd3 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -17,6 +17,7 @@
>  */
>
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -245,6 +246,155 @@ static void kvm_shutdown(void)
> native_machine_shutdown();
>  }
>
> +#ifdef CONFIG_X86_LOCAL_APIC
> +/*
> + * kvmclock-based clock event implementation, used only together with the
> + * TSC deadline timer.  A subset of the normal LAPIC clockevent, but it
> + * uses kvmclock to convert nanoseconds to TSC.  This is necessary to
> + * handle changes to the TSC frequency, e.g. from live migration.
> + */
> +
> +static void kvmclock_lapic_timer_setup(unsigned lvtt_value)
> +{
> +   lvtt_value |= LOCAL_TIMER_VECTOR | APIC_LVT_TIMER_TSCDEADLINE;
> +   apic_write(APIC_LVTT, lvtt_value);
> +}
> +
> +static int kvmclock_lapic_timer_set_oneshot(struct clock_event_device *evt)
> +{
> +   kvmclock_lapic_timer_setup(0);
> +   printk_once(KERN_DEBUG "kvmclock: TSC deadline timer enabled\n");
> +
> +   /*
> +* See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> +* writing to the APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> +* According to Intel, MFENCE can do the serialization here.
> +*/
> +   asm volatile("mfence" : : : "memory");
> +   return 0;
> +}
> +
> +static int kvmclock_lapic_timer_stop(struct clock_event_device *evt)
> +{
> +   kvmclock_lapic_timer_setup(APIC_LVT_MASKED);
> +   wrmsrl(MSR_IA32_TSC_DEADLINE, -1);
> +   return 0;
> +}
> +
> +/*
> + * We already have the inverse of the (mult,shift) pair, though this means
> + * we need a division.  To avoid it we could compute a multiplicative inverse
> + * every time src->version changes.
> + */
> +#define KVMCLOCK_TSC_DEADLINE_MAX_BITS 38
> +#define KVMCLOCK_TSC_DEADLINE_MAX  ((1ull << 
> KVMCLOCK_TSC_DEADLINE_MAX_BITS) - 1)
> +
> +static int kvmclock_lapic_next_ktime(ktime_t expires,
> +struct clock_event_device *evt)
> +{
> +   u64 ns, tsc;
> +   u32 version;
> +   int cpu;
> +   struct 

Re: [PATCH] kvm: x86: nVMX: maintain internal copy of current VMCS

2016-07-15 Thread David Matlack
On Thu, Jul 14, 2016 at 1:33 AM, Paolo Bonzini  wrote:
>
>
> On 14/07/2016 02:16, David Matlack wrote:
>> KVM maintains L1's current VMCS in guest memory, at the guest physical
>> page identified by the argument to VMPTRLD. This makes hairy
>> time-of-check to time-of-use bugs possible, as VCPUs can be writing
>> the VMCS page in memory while KVM is emulating VMLAUNCH and
>> VMRESUME.
>>
>> The spec documents that writing to the VMCS page while it is loaded is
>> "undefined". Therefore it is reasonable to load the entire VMCS into
>> an internal cache during VMPTRLD and ignore writes to the VMCS page
>> -- the guest should be using VMREAD and VMWRITE to access the current
>> VMCS.
>>
>> To adhere to the spec, KVM should flush the current VMCS during VMPTRLD,
>> and the target VMCS during VMCLEAR (as given by the operand to VMCLEAR).
>> Since this implementation of VMCS caching only maintains the current
>> VMCS, VMCLEAR will only do a flush if the operand to VMCLEAR is the
>> current VMCS pointer.
>>
>> KVM will also flush during VMXOFF, which is not mandated by the spec,
>> but also not in conflict with the spec.
>>
>> Signed-off-by: David Matlack 
>
> This is a good change.  There is another change that is possible on top:
> with this change you don't need current_vmcs12/current_vmcs12_page at
> all, I think.  You can just use current_vmptr and kvm_read/write_guest
> to write back the VMCS12, possibly the cached variants.

Good catch, I agree they can be removed.
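
Something like this should cover the flush path (sketch, untested):

        /* Flush the cached VMCS12 back to guest memory at the current VMPTR. */
        kvm_write_guest(vmx->vcpu.kvm, vmx->nested.current_vmptr,
                        vmx->nested.cached_vmcs12, VMCS12_SIZE);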

>
> Of course this would just be a small simplification, so I'm applying the
> patch as is to kvm/next.

SGTM. Thanks for the review.

>
> Thanks,
>
> Paolo


[PATCH] kvm: x86: nVMX: maintain internal copy of current VMCS

2016-07-13 Thread David Matlack
KVM maintains L1's current VMCS in guest memory, at the guest physical
page identified by the argument to VMPTRLD. This makes hairy
time-of-check to time-of-use bugs possible, as VCPUs can be writing
the VMCS page in memory while KVM is emulating VMLAUNCH and
VMRESUME.

The spec documents that writing to the VMCS page while it is loaded is
"undefined". Therefore it is reasonable to load the entire VMCS into
an internal cache during VMPTRLD and ignore writes to the VMCS page
-- the guest should be using VMREAD and VMWRITE to access the current
VMCS.

To adhere to the spec, KVM should flush the current VMCS during VMPTRLD,
and the target VMCS during VMCLEAR (as given by the operand to VMCLEAR).
Since this implementation of VMCS caching only maintains the current
VMCS, VMCLEAR will only do a flush if the operand to VMCLEAR is the
current VMCS pointer.

KVM will also flush during VMXOFF, which is not mandated by the spec,
but also not in conflict with the spec.

Signed-off-by: David Matlack 
---
 arch/x86/kvm/vmx.c | 31 ---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 64a79f2..640ad91 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -398,6 +398,12 @@ struct nested_vmx {
/* The host-usable pointer to the above */
struct page *current_vmcs12_page;
struct vmcs12 *current_vmcs12;
+   /*
+* Cache of the guest's VMCS, existing outside of guest memory.
+* Loaded from guest memory during VMPTRLD. Flushed to guest
+* memory during VMXOFF, VMCLEAR, VMPTRLD.
+*/
+   struct vmcs12 *cached_vmcs12;
struct vmcs *current_shadow_vmcs;
/*
 * Indicates if the shadow vmcs must be updated with the
@@ -841,7 +847,7 @@ static inline short vmcs_field_to_offset(unsigned long 
field)
 
 static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
 {
-   return to_vmx(vcpu)->nested.current_vmcs12;
+   return to_vmx(vcpu)->nested.cached_vmcs12;
 }
 
 static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
@@ -6866,10 +6872,16 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
return 1;
}
 
+   vmx->nested.cached_vmcs12 = kmalloc(VMCS12_SIZE, GFP_KERNEL);
+   if (!vmx->nested.cached_vmcs12)
+   return -ENOMEM;
+
if (enable_shadow_vmcs) {
shadow_vmcs = alloc_vmcs();
-   if (!shadow_vmcs)
+   if (!shadow_vmcs) {
+   kfree(vmx->nested.cached_vmcs12);
return -ENOMEM;
+   }
/* mark vmcs as shadow */
shadow_vmcs->revision_id |= (1u << 31);
/* init shadow vmcs */
@@ -6940,6 +6952,11 @@ static inline void nested_release_vmcs12(struct vcpu_vmx 
*vmx)
vmcs_write64(VMCS_LINK_POINTER, -1ull);
}
vmx->nested.posted_intr_nv = -1;
+
+   /* Flush VMCS12 to guest memory */
+   memcpy(vmx->nested.current_vmcs12, vmx->nested.cached_vmcs12,
+  VMCS12_SIZE);
+
kunmap(vmx->nested.current_vmcs12_page);
nested_release_page(vmx->nested.current_vmcs12_page);
vmx->nested.current_vmptr = -1ull;
@@ -6960,6 +6977,7 @@ static void free_nested(struct vcpu_vmx *vmx)
nested_release_vmcs12(vmx);
if (enable_shadow_vmcs)
free_vmcs(vmx->nested.current_shadow_vmcs);
+   kfree(vmx->nested.cached_vmcs12);
/* Unpin physical memory we referred to in current vmcs02 */
if (vmx->nested.apic_access_page) {
nested_release_page(vmx->nested.apic_access_page);
@@ -7363,6 +7381,13 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
vmx->nested.current_vmptr = vmptr;
vmx->nested.current_vmcs12 = new_vmcs12;
vmx->nested.current_vmcs12_page = page;
+   /*
+* Load VMCS12 from guest memory since it is not already
+* cached.
+*/
+   memcpy(vmx->nested.cached_vmcs12,
+  vmx->nested.current_vmcs12, VMCS12_SIZE);
+
if (enable_shadow_vmcs) {
vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL,
  SECONDARY_EXEC_SHADOW_VMCS);
@@ -8326,7 +8351,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu 
*vcpu, hpa_t hpa)
 * the next L2->L1 exit.
 */
if (!is_guest_mode(vcpu) ||
-   !nested_cpu_has2(vmx->nested.current_vmcs12,
+   !nested_cpu_has2(get_vmcs12(&vmx->vcpu),
 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
vmcs_write64(APIC_ACCESS_ADDR, hpa);
 }
-- 
2.8.0.rc3.226.g39d4020



Re: [RFC PATCH] x86, kvm: use kvmclock to compute TSC deadline value

2016-07-06 Thread David Matlack
On Tue, Jul 5, 2016 at 10:36 AM, Paolo Bonzini  wrote:
> Bad things happen if a guest using the TSC deadline timer is migrated.
> The guest doesn't re-calibrate the TSC after migration, and the
> TSC frequency can and will change unless your processor supports TSC
> scaling (on Intel this is only Skylake) or your data center is perfectly
> homogeneous.
>
> The solution in this patch is to skip tsc_khz, and instead derive the
> frequency from kvmclock's (mult, shift) pair.  Because kvmclock
> parameters convert from tsc to nanoseconds, this needs a division
> but that's the least of our problems when the TSC_DEADLINE_MSR write
> costs 2000 clock cycles.  Luckily tsc_khz is really used by very little
> outside the tsc clocksource (which kvmclock replaces) and the TSC
> deadline timer.

Two other ways to solve the problem, in case you haven't considered them:
* Constrain the set of hosts a given VM can run on based on the TSC
rate. (So you don't need a perfectly homogeneous fleet, just each VM
constrained to a homogeneous subset.)
* Disable the TSC deadline timer from QEMU by assigning a CPUID with
the TSC-deadline capability zeroed (at least for VMs which could migrate
to hosts with different TSC rates). These VMs will use the APIC timer,
which runs at a nice fixed rate; a rough sketch follows below.
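
A sketch of the second option on the userspace side (illustrative only;
it strips CPUID.01H:ECX[24], the TSC-deadline bit, from the table passed
to KVM_SET_CPUID2):

        #include <linux/kvm.h>

        static void mask_tsc_deadline(struct kvm_cpuid2 *cpuid)
        {
                unsigned int i;

                for (i = 0; i < cpuid->nent; i++) {
                        struct kvm_cpuid_entry2 *e = &cpuid->entries[i];

                        /* CPUID.01H:ECX bit 24 advertises the TSC-deadline timer. */
                        if (e->function == 1)
                                e->ecx &= ~(1u << 24);
                }
        }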

>
> This patch does not handle the very first deadline, hoping that it
> is masked by the migration downtime (i.e. that the timer fires late
> anyway).  I'd like a remark on the approach in general and ideas on how
> to handle the first deadline.  It's also possible to just blacklist the
> TSC deadline timer of course, and it's probably the best thing to do for
> stable kernels.

> It would require extending to other modes the
> implementation of preemption-timer based APIC timer.

This would be nice to have.

> It'd be a pity to
> lose the nice latency boost that the preemption timer offers.
>
> The patch is also quite ugly in the way it arranges for kvmclock to
> replace only a small part of setup_apic_timer; better ideas are welcome.
>
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/include/asm/apic.h |  2 ++
>  arch/x86/include/asm/x86_init.h |  5 +++
>  arch/x86/kernel/apic/apic.c | 15 +---
>  arch/x86/kernel/kvmclock.c  | 78 
> +
>  arch/x86/kernel/x86_init.c  |  1 +
>  5 files changed, 96 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> index bc27611fa58f..cc4f45f66218 100644
> --- a/arch/x86/include/asm/apic.h
> +++ b/arch/x86/include/asm/apic.h
> @@ -135,6 +135,7 @@ extern void init_apic_mappings(void);
>  void register_lapic_address(unsigned long address);
>  extern void setup_boot_APIC_clock(void);
>  extern void setup_secondary_APIC_clock(void);
> +extern void setup_APIC_clockev(struct clock_event_device *levt);
>  extern int APIC_init_uniprocessor(void);
>
>  #ifdef CONFIG_X86_64
> @@ -170,6 +171,7 @@ static inline void init_apic_mappings(void) { }
>  static inline void disable_local_APIC(void) { }
>  # define setup_boot_APIC_clock x86_init_noop
>  # define setup_secondary_APIC_clock x86_init_noop
> +# define setup_APIC_clockev NULL
>  #endif /* !CONFIG_X86_LOCAL_APIC */
>
>  #ifdef CONFIG_X86_X2APIC
> diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
> index 4dcdf74dfed8..d0f099ab4dba 100644
> --- a/arch/x86/include/asm/x86_init.h
> +++ b/arch/x86/include/asm/x86_init.h
> @@ -7,6 +7,7 @@ struct mpc_bus;
>  struct mpc_cpu;
>  struct mpc_table;
>  struct cpuinfo_x86;
> +struct clock_event_device;
>
>  /**
>   * struct x86_init_mpparse - platform specific mpparse ops
> @@ -84,11 +85,15 @@ struct x86_init_paging {
>   * boot cpu
>   * @timer_init:initialize the platform timer 
> (default PIT/HPET)
>   * @wallclock_init:init the wallclock device
> + * @setup_APIC_clockev: tweak the clock_event_device for the LAPIC 
> timer,
> + *  if setup_boot_APIC_clock and/or
> + *  setup_secondary_APIC_clock are in use
>   */
>  struct x86_init_timers {
> void (*setup_percpu_clockev)(void);
> void (*timer_init)(void);
> void (*wallclock_init)(void);
> +   void (*setup_APIC_clockev)(struct clock_event_device *levt);
>  };
>
>  /**
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 60078a67d7e3..b7a331f329d0 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -558,15 +558,20 @@ static void setup_APIC_timer(void)
> memcpy(levt, &lapic_clockevent, sizeof(*levt));
> levt->cpumask = cpumask_of(smp_processor_id());
>
> +   x86_init.timers.setup_APIC_clockev(levt);
> +   clockevents_register_device(levt);
> +}
> +
> +void setup_APIC_clockev(struct clock_event_device *levt)
> +{
> if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
> levt->features &=

Re: [PATCH v1 10/11] KVM: x86: add KVM_CAP_X2APIC_API

2016-07-01 Thread David Matlack
On Thu, Jun 30, 2016 at 1:54 PM, Radim Krčmář  wrote:
> KVM_CAP_X2APIC_API can be enabled to extend APIC ID in get/set ioctl and MSI
> addresses to 32 bits.  Both are needed to support x2APIC.
>
> The capability has to be toggleable and disabled by default, because get/set
> ioctl shifted and truncated APIC ID to 8 bits by using a non-standard protocol
> inspired by xAPIC and the change is not backward-compatible.
>
> Changes to MSI addresses follow the format used by interrupt remapping unit.
> The upper address word, that used to be 0, contains upper 24 bits of the LAPIC
> address in its upper 24 bits.  Lower 8 bits are reserved as 0.
> Using the upper address word is not backward-compatible either as we didn't
> check that userspace zeroed the word.  Reserved bits are still not explicitly
> checked, but non-zero data will affect LAPIC addresses, which will cause a 
> bug.
>
> Signed-off-by: Radim Krčmář 
> ---
>  v1:
>  * rewritten with a toggleable capability [Paolo]
>  * dropped MSI_ADDR_EXT_DEST_ID to enforce reserved bits
>
>  Documentation/virtual/kvm/api.txt | 26 ++
>  arch/x86/include/asm/kvm_host.h   |  4 +++-
>  arch/x86/kvm/irq_comm.c   | 14 ++
>  arch/x86/kvm/lapic.c  |  2 +-
>  arch/x86/kvm/vmx.c|  2 +-
>  arch/x86/kvm/x86.c| 12 
>  include/uapi/linux/kvm.h  |  1 +
>  7 files changed, 54 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 09efa9eb3926..0f978089a0f6 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1482,6 +1482,9 @@ struct kvm_irq_routing_msi {
> __u32 pad;
>  };
>
> +If KVM_CAP_X2APIC_API is enabled, then address_hi bits 31-8 contain bits 31-8
> +of destination id and address_hi bits 7-0 is must be 0.
> +
>  struct kvm_irq_routing_s390_adapter {
> __u64 ind_addr;
> __u64 summary_addr;
> @@ -1583,6 +1586,13 @@ struct kvm_lapic_state {
>  Reads the Local APIC registers and copies them into the input argument.  The
>  data format and layout are the same as documented in the architecture manual.
>
> +If KVM_CAP_X2APIC_API is enabled, then the format of APIC_ID register depends
> +on APIC mode (reported by MSR_IA32_APICBASE) of its VCPU.  The format follows
> +xAPIC otherwise.
> +
> +x2APIC stores APIC ID as little endian in bits 31-0 of APIC_ID register.
> +xAPIC stores bits 7-0 of APIC ID in register bits 31-24.
> +
>
>  4.58 KVM_SET_LAPIC
>
> @@ -1600,6 +1610,8 @@ struct kvm_lapic_state {
>  Copies the input argument into the Local APIC registers.  The data format
>  and layout are the same as documented in the architecture manual.
>
> +See the note about APIC_ID register in KVM_GET_LAPIC.
> +
>
>  4.59 KVM_IOEVENTFD
>
> @@ -2180,6 +2192,9 @@ struct kvm_msi {
>
>  No flags are defined so far. The corresponding field must be 0.
>
> +If KVM_CAP_X2APIC_API is enabled, then address_hi bits 31-8 contain bits 31-8
> +of destination id and address_hi bits 7-0 is must be 0.
> +
>
>  4.71 KVM_CREATE_PIT2
>
> @@ -3811,6 +3826,17 @@ Allows use of runtime-instrumentation introduced with 
> zEC12 processor.
>  Will return -EINVAL if the machine does not support runtime-instrumentation.
>  Will return -EBUSY if a VCPU has already been created.
>
> +7.7 KVM_CAP_X2APIC_API
> +
> +Architectures: x86
> +Parameters: none
> +Returns: 0 on success, -EINVAL if reserved parameters are not 0
> +
> +Enabling this capability changes the behavior of KVM_SET_GSI_ROUTING,
> +KVM_SIGNAL_MSI, KVM_SET_LAPIC, and KVM_GET_LAPIC.  See KVM_CAP_X2APIC_API
> +in their respective sections.
> +
> +
>  8. Other capabilities.
>  --
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 459a789cb3da..48b0ca18066c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -782,6 +782,8 @@ struct kvm_arch {
> u32 ldr_mode;
> struct page *avic_logical_id_table_page;
> struct page *avic_physical_id_table_page;
> +
> +   bool x2apic_api;
>  };
>
>  struct kvm_vm_stat {
> @@ -1365,7 +1367,7 @@ bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct 
> kvm_lapic_irq *irq,
>  struct kvm_vcpu **dest_vcpu);
>
>  void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
> -struct kvm_lapic_irq *irq);
> +struct kvm_lapic_irq *irq, bool x2apic_api);
>
>  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index 47ad681a33fd..4594644ab090 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -111,12 +111,17 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct 
> kvm_lapic *src,
>  }
>
>  void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
> -struct kvm_lapic_irq *i

Re: [RFC PATCH 2/2] KVM: x86: use __kvm_guest_exit

2016-06-16 Thread David Matlack
On Thu, Jun 16, 2016 at 9:47 AM, Paolo Bonzini  wrote:
> On 16/06/2016 18:43, David Matlack wrote:
>> On Thu, Jun 16, 2016 at 1:21 AM, Paolo Bonzini  wrote:
>>> This gains ~20 clock cycles per vmexit.  On Intel there is no need
>>> anymore to enable the interrupts in vmx_handle_external_intr, since we
>>> are using the "acknowledge interrupt on exit" feature.  AMD needs to do
>>> that temporarily, and must be careful to avoid the interrupt shadow.
>>>
>>> Signed-off-by: Paolo Bonzini 
>>> ---
>>>  arch/x86/kvm/svm.c |  6 ++
>>>  arch/x86/kvm/vmx.c |  4 +---
>>>  arch/x86/kvm/x86.c | 11 ++-
>>>  3 files changed, 9 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index 5ff292778110..5bfdbbf1ce79 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -4935,6 +4935,12 @@ out:
>>>  static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
>>>  {
>>> local_irq_enable();
>>> +   /*
>>> +* We must execute an instruction with interrupts enabled, so
>>> +* the "cli" doesn't fall right on the interrupt shadow.
>>> +*/
>>> +   asm("nop");
>>> +   local_irq_disable();
>>>  }
>>>
>>>  static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 4e9657730bf6..a46bce9e3683 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -8544,7 +8544,6 @@ static void vmx_handle_external_intr(struct kvm_vcpu 
>>> *vcpu)
>>> "push %[sp]\n\t"
>>>  #endif
>>> "pushf\n\t"
>>> -   "orl $0x200, (%%" _ASM_SP ")\n\t"
>>> __ASM_SIZE(push) " $%c[cs]\n\t"
>>> "call *%[entry]\n\t"
>>> :
>>> @@ -8557,8 +8556,7 @@ static void vmx_handle_external_intr(struct kvm_vcpu 
>>> *vcpu)
>>> [ss]"i"(__KERNEL_DS),
>>> [cs]"i"(__KERNEL_CS)
>>> );
>>> -   } else
>>> -   local_irq_enable();
>>> +   }
>>
>> If you make the else case the same as svm_handle_external_intr, can we
>> avoid requiring ack-intr-on-exit?
>
> Yes, but the sti/nop/cli would be useless if ack-intr-on-exit is
> available.  It's a bit ugly, so I RFCed the bold thing instead.

Ahh, and handle_external_intr is called on every VM-exit, not just
VM-exits caused by external interrupts. So we'd be doing the
sti/nop/cli quite often. I was thinking we never hit the else case
when the CPU supports ack-intr-on-exit.

>
> Are you thinking of some distros in particular that lack nested
> ack-intr-on-exit?  All processors have it as far as I know.

Nope, I just thought it was possible to avoid the requirement.

>
> Paolo
>
>
>>>  }
>>>
>>>  static bool vmx_has_high_real_mode_segbase(void)
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 7e3041ef050f..cc741b68139c 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -6706,21 +6706,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>>
>>> kvm_put_guest_xcr0(vcpu);
>>>
>>> -   /* Interrupt is enabled by handle_external_intr() */
>>> kvm_x86_ops->handle_external_intr(vcpu);
>>>
>>> ++vcpu->stat.exits;
>>>
>>> -   /*
>>> -* We must have an instruction between local_irq_enable() and
>>> -* kvm_guest_exit(), so the timer interrupt isn't delayed by
>>> -* the interrupt shadow.  The stat.exits increment will do nicely.
>>> -* But we need to prevent reordering, hence this barrier():
>>> -*/
>>> -   barrier();
>>> -
>>> -   kvm_guest_exit();
>>> +   __kvm_guest_exit();
>>>
>>> +   local_irq_enable();
>>> preempt_enable();
>>>
>>> vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
>>> --
>>> 1.8.3.1
>>>


Re: [RFC PATCH 2/2] KVM: x86: use __kvm_guest_exit

2016-06-16 Thread David Matlack
On Thu, Jun 16, 2016 at 1:21 AM, Paolo Bonzini  wrote:
> This gains ~20 clock cycles per vmexit.  On Intel there is no need
> anymore to enable the interrupts in vmx_handle_external_intr, since we
> are using the "acknowledge interrupt on exit" feature.  AMD needs to do
> that temporarily, and must be careful to avoid the interrupt shadow.
>
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/svm.c |  6 ++
>  arch/x86/kvm/vmx.c |  4 +---
>  arch/x86/kvm/x86.c | 11 ++-
>  3 files changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 5ff292778110..5bfdbbf1ce79 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -4935,6 +4935,12 @@ out:
>  static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
>  {
> local_irq_enable();
> +   /*
> +* We must execute an instruction with interrupts enabled, so
> +* the "cli" doesn't fall right on the interrupt shadow.
> +*/
> +   asm("nop");
> +   local_irq_disable();
>  }
>
>  static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4e9657730bf6..a46bce9e3683 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -8544,7 +8544,6 @@ static void vmx_handle_external_intr(struct kvm_vcpu 
> *vcpu)
> "push %[sp]\n\t"
>  #endif
> "pushf\n\t"
> -   "orl $0x200, (%%" _ASM_SP ")\n\t"
> __ASM_SIZE(push) " $%c[cs]\n\t"
> "call *%[entry]\n\t"
> :
> @@ -8557,8 +8556,7 @@ static void vmx_handle_external_intr(struct kvm_vcpu 
> *vcpu)
> [ss]"i"(__KERNEL_DS),
> [cs]"i"(__KERNEL_CS)
> );
> -   } else
> -   local_irq_enable();
> +   }

If you make the else case the same as svm_handle_external_intr, can we
avoid requiring ack-intr-on-exit?
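
I.e., roughly this, mirroring the SVM path above (sketch, untested):

        } else {
                local_irq_enable();
                /* One instruction so "cli" doesn't land in the interrupt shadow. */
                asm("nop");
                local_irq_disable();
        }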

>  }
>
>  static bool vmx_has_high_real_mode_segbase(void)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7e3041ef050f..cc741b68139c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6706,21 +6706,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
> kvm_put_guest_xcr0(vcpu);
>
> -   /* Interrupt is enabled by handle_external_intr() */
> kvm_x86_ops->handle_external_intr(vcpu);
>
> ++vcpu->stat.exits;
>
> -   /*
> -* We must have an instruction between local_irq_enable() and
> -* kvm_guest_exit(), so the timer interrupt isn't delayed by
> -* the interrupt shadow.  The stat.exits increment will do nicely.
> -* But we need to prevent reordering, hence this barrier():
> -*/
> -   barrier();
> -
> -   kvm_guest_exit();
> +   __kvm_guest_exit();
>
> +   local_irq_enable();
> preempt_enable();
>
> vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> --
> 1.8.3.1
>


Re: [PATCH v4] KVM: halt-polling: poll for the upcoming fire timers

2016-05-24 Thread David Matlack
On Tue, May 24, 2016 at 4:11 PM, Wanpeng Li  wrote:
> 2016-05-25 6:38 GMT+08:00 David Matlack :
>> On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li  wrote:
>>> From: Wanpeng Li 
>>>
>>> If an emulated lapic timer will fire soon(in the scope of 10us the
>>> base of dynamic halt-polling, lower-end of message passing workload
>>> latency TCP_RR's poll time < 10us) we can treat it as a short halt,
>>> and poll to wait it fire, the fire callback apic_timer_fn() will set
>>> KVM_REQ_PENDING_TIMER, and this flag will be check during busy poll.
>>> This can avoid context switch overhead and the latency which we wake
>>> up vCPU.
>>>
>>> This feature is slightly different from current advance expiration
>>> way. Advance expiration rely on the vCPU is running(do polling before
>>> vmentry). But in some cases, the timer interrupt may be blocked by
>>> other thread(i.e., IF bit is clear) and vCPU cannot be scheduled to
>>> run immediately. So even advance the timer early, vCPU may still see
>>> the latency. But polling is different, it ensures the vCPU to aware
>>> the timer expiration before schedule out.
>>>
>>> echo HRTICK > /sys/kernel/debug/sched_features in dynticks guests.
>>>
>>> Context switching - times in microseconds - smaller is better
>>> -
>>> Host OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>>>  ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
>>> - - -- -- -- -- -- --- ---
>>> kernel Linux 4.6.0+ 7.9800   11.0   10.8   14.6 9.430013.010.2 
>>> vanilla
>>> kernel Linux 4.6.0+   15.3   13.6   10.7   12.5 9.12.8 7.38000 
>>> poll
>>
>> These results aren't very compelling. Sometimes polling is faster,
>> sometimes vanilla is faster, sometimes they are about the same.
>
> More processes and bigger cache footprints can get benefit from the
> result since I open the hrtimer for the precision preemption.

The VCPU is halted (idle), so the timer interrupt is not preempting
anything. Also I would not expect any preemption in a context
switching benchmark; the threads should be handing off execution to
one another.

I'm confused why timers would play any role in the performance of this
benchmark. Any idea why there's a speedup in the 8p/16K and 16p/64K
runs?

> Actually
> I try to emulate Yang's workload, https://lkml.org/lkml/2016/5/22/162.
> And his real workload can get more benefit as he mentioned,
> https://lkml.org/lkml/2016/5/19/667.
>
>> I imagine there are hyper sensitive workloads which cannot tolerate a
>> long tail in timer latency (e.g. realtime workloads). I would expect a
>> patch like this to provide a "smoothing effect", reducing that tail.
>> But for cloud/server workloads, I would not expect any sensitivity to
>> jitter in timer latency (especially while the VCPU is halted).
>
> Yang's is real cloud workload.

I have 2 issues with optimizing for Yang's workload. Yang, please
correct me if I am mis-characterizing it.
1. The delay in timer interrupts is caused by something disabling the
interrupts on the CPU for more than a millisecond. It seems that is
the real issue. I'm wary of using polling as a workaround.
2. The delay is caused by a separate task. Halt-polling would not help
in that scenario, it would yield the CPU to that task.

>
>>
>> Note that while halt-polling happens when the CPU is idle, it's still
>> not free. It constricts the scheduler's cpu load balancer, because the
>> CPU appears to be busy. In KVM's default configuration, I'd prefer to
>> only add more polling when the gain is clear. If there are guest
>> workloads that want this patch, I'd suggest polling for timers be
>> default-off. At minimum, there should be a module parameter to control
>> it (like Christian Borntraeger suggested).
>
> Yeah, I will add the module parameter in order to enable/disable.
>
> Regards,
> Wanpeng Li


Re: [PATCH v4] KVM: halt-polling: poll for the upcoming fire timers

2016-05-24 Thread David Matlack
On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li  wrote:
> From: Wanpeng Li 
>
> If an emulated lapic timer will fire soon(in the scope of 10us the
> base of dynamic halt-polling, lower-end of message passing workload
> latency TCP_RR's poll time < 10us) we can treat it as a short halt,
> and poll to wait it fire, the fire callback apic_timer_fn() will set
> KVM_REQ_PENDING_TIMER, and this flag will be check during busy poll.
> This can avoid context switch overhead and the latency which we wake
> up vCPU.
>
> This feature is slightly different from current advance expiration
> way. Advance expiration rely on the vCPU is running(do polling before
> vmentry). But in some cases, the timer interrupt may be blocked by
> other thread(i.e., IF bit is clear) and vCPU cannot be scheduled to
> run immediately. So even advance the timer early, vCPU may still see
> the latency. But polling is different, it ensures the vCPU to aware
> the timer expiration before schedule out.
>
> echo HRTICK > /sys/kernel/debug/sched_features in dynticks guests.
>
> Context switching - times in microseconds - smaller is better
> -
> Host OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>  ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
> - - -- -- -- -- -- --- ---
> kernel Linux 4.6.0+ 7.9800   11.0   10.8   14.6 9.430013.010.2 
> vanilla
> kernel Linux 4.6.0+   15.3   13.6   10.7   12.5 9.12.8 7.38000 
> poll

These results aren't very compelling. Sometimes polling is faster,
sometimes vanilla is faster, sometimes they are about the same.

I imagine there are hyper sensitive workloads which cannot tolerate a
long tail in timer latency (e.g. realtime workloads). I would expect a
patch like this to provide a "smoothing effect", reducing that tail.
But for cloud/server workloads, I would not expect any sensitivity to
jitter in timer latency (especially while the VCPU is halted).

Note that while halt-polling happens when the CPU is idle, it's still
not free. It constricts the scheduler's cpu load balancer, because the
CPU appears to be busy. In KVM's default configuration, I'd prefer to
only add more polling when the gain is clear. If there are guest
workloads that want this patch, I'd suggest polling for timers be
default-off. At minimum, there should be a module parameter to control
it (like Christian Borntraeger suggested).
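
For instance (a sketch; the parameter name here is illustrative):

        #include <linux/moduleparam.h>

        /* Off by default: only poll for upcoming timer fires when enabled. */
        static bool halt_poll_timers;
        module_param(halt_poll_timers, bool, 0644);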

>
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: David Matlack 
> Cc: Christian Borntraeger 
> Cc: Yang Zhang 
> Signed-off-by: Wanpeng Li 
> ---
> v3 -> v4:
>  * add module parameter halt_poll_ns_timer
>  * rename patch subject since lapic maybe just for x86.
> v2 -> v3:
>  * add Yang's statement to patch description
> v1 -> v2:
>  * add return statement to non-x86 archs
>  * capture never expire case for x86 (hrtimer is not started)
>
>  arch/arm/include/asm/kvm_host.h |  4 
>  arch/arm64/include/asm/kvm_host.h   |  4 
>  arch/mips/include/asm/kvm_host.h|  4 
>  arch/powerpc/include/asm/kvm_host.h |  4 
>  arch/s390/include/asm/kvm_host.h|  4 
>  arch/x86/kvm/lapic.c| 11 +++
>  arch/x86/kvm/lapic.h|  1 +
>  arch/x86/kvm/x86.c  |  5 +
>  include/linux/kvm_host.h|  1 +
>  virt/kvm/kvm_main.c | 15 +++
>  10 files changed, 49 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 0df6b1f..fdfbed9 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -292,6 +292,10 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) 
> {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
> +static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
> +{
> +   return -1ULL;
> +}
>
>  static inline void kvm_arm_init_debug(void) {}
>  static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {}
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index e63d23b..f510d71 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -371,6 +371,10 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) 
> {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>

Re: [PATCH v2] KVM: halt-polling: poll if emulated lapic timer will fire soon

2016-05-23 Thread David Matlack
On Mon, May 23, 2016 at 6:13 PM, Yang Zhang  wrote:
> On 2016/5/24 2:04, David Matlack wrote:
>>
>> On Sun, May 22, 2016 at 6:26 PM, Yang Zhang 
>> wrote:
>>>
>>> On 2016/5/21 2:37, David Matlack wrote:
>>>>
>>>>
>>>> It's not obvious to me why polling for a timer interrupt would improve
>>>> context switch latency. Can you explain a bit more?
>>>
>>>
>>>
>>> We have a workload which using high resolution timer(less than 1ms)
>>> inside
>>> guest. It rely on the timer to wakeup itself. Sometimes the timer is
>>> expected to fired just after the VCPU is blocked due to execute halt
>>> instruction. But the thread who is running in the CPU will turn off the
>>> hardware interrupt for long time due to disk access. This will cause the
>>> timer interrupt been blocked until the interrupt is re-open.
>>
>>
>> Does this happen on the idle thread (swapper)? If not, halt-polling
>> may not help; it only polls if there are no other runnable threads.
>
>
> Yes, there is no runnable task inside guest.

Sorry for the confusion, my question was about the host, not the
guest. Halt-polling only polls if there are no other runnable threads
on the host CPU (see the check for single_task_running() in
kvm_vcpu_block()).
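
(For reference, the polling loop in kvm_vcpu_block() is shaped roughly
like this; paraphrased, not the literal code:)

        ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);

        do {
                if (kvm_vcpu_check_block(vcpu) < 0)
                        goto out;       /* the vCPU became runnable while polling */
        } while (single_task_running() && ktime_before(ktime_get(), stop));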

>
>
>>
>>> For optimization, we let VCPU to poll for a while if the next timer will
>>> arrive soon before schedule out. And the result shows good when running
>>> several workloads inside guest.
>>
>>
>> Thanks for the explanation, I appreciate it.
>>
>>>
>>> --
>>> best regards
>>> yang
>
>
>
> --
> best regards
> yang


Re: [PATCH v2] KVM: halt-polling: poll if emulated lapic timer will fire soon

2016-05-23 Thread David Matlack
On Sun, May 22, 2016 at 6:26 PM, Yang Zhang  wrote:
> On 2016/5/21 2:37, David Matlack wrote:
>>
>> It's not obvious to me why polling for a timer interrupt would improve
>> context switch latency. Can you explain a bit more?
>
>
> We have a workload which using high resolution timer(less than 1ms) inside
> guest. It rely on the timer to wakeup itself. Sometimes the timer is
> expected to fired just after the VCPU is blocked due to execute halt
> instruction. But the thread who is running in the CPU will turn off the
> hardware interrupt for long time due to disk access. This will cause the
> timer interrupt been blocked until the interrupt is re-open.

Does this happen on the idle thread (swapper)? If not, halt-polling
may not help; it only polls if there are no other runnable threads.

> For optimization, we let VCPU to poll for a while if the next timer will
> arrive soon before schedule out. And the result shows good when running
> several workloads inside guest.

Thanks for the explanation, I appreciate it.

>
> --
> best regards
> yang


Re: [PATCH v3] KVM: halt-polling: poll if emulated lapic timer will fire soon

2016-05-23 Thread David Matlack
On Sun, May 22, 2016 at 5:42 PM, Wanpeng Li  wrote:
> From: Wanpeng Li 

I'm ok with this patch, but I'd like to better understand the target
workloads. What type of workloads do you expect to benefit from this?

>
> If an emulated lapic timer will fire soon(in the scope of 10us the
> base of dynamic halt-polling, lower-end of message passing workload
> latency TCP_RR's poll time < 10us) we can treat it as a short halt,
> and poll to wait it fire, the fire callback apic_timer_fn() will set
> KVM_REQ_PENDING_TIMER, and this flag will be check during busy poll.
> This can avoid context switch overhead and the latency which we wake
> up vCPU.
>
> This feature is slightly different from current advance expiration
> way. Advance expiration rely on the vCPU is running(do polling before
> vmentry). But in some cases, the timer interrupt may be blocked by
> other thread(i.e., IF bit is clear) and vCPU cannot be scheduled to
> run immediately. So even advance the timer early, vCPU may still see
> the latency. But polling is different, it ensures the vCPU to aware
> the timer expiration before schedule out.
>
> iperf TCP get ~6% bandwidth improvement.

I think my question got lost in the previous thread :). Can you
explain why TCP bandwidth improves with this patch?

>
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: David Matlack 
> Cc: Christian Borntraeger 
> Cc: Yang Zhang 
> Signed-off-by: Wanpeng Li 
> ---
> v2 -> v3:
>  * add Yang's statement to patch description
> v1 -> v2:
>  * add return statement to non-x86 archs
>  * capture never expire case for x86 (hrtimer is not started)
>
>  arch/arm/include/asm/kvm_host.h |  4 
>  arch/arm64/include/asm/kvm_host.h   |  4 
>  arch/mips/include/asm/kvm_host.h|  4 
>  arch/powerpc/include/asm/kvm_host.h |  4 
>  arch/s390/include/asm/kvm_host.h|  4 
>  arch/x86/kvm/lapic.c| 11 +++
>  arch/x86/kvm/lapic.h|  1 +
>  arch/x86/kvm/x86.c  |  5 +
>  include/linux/kvm_host.h|  1 +
>  virt/kvm/kvm_main.c | 14 ++
>  10 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 4cd8732..a5fd858 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -284,6 +284,10 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) 
> {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
> +static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
> +{
> +   return -1ULL;
> +}
>
>  static inline void kvm_arm_init_debug(void) {}
>  static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {}
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index d49399d..94e227a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -359,6 +359,10 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) 
> {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
> +static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
> +{
> +   return -1ULL;
> +}
>
>  void kvm_arm_init_debug(void);
>  void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
> diff --git a/arch/mips/include/asm/kvm_host.h 
> b/arch/mips/include/asm/kvm_host.h
> index 9a37a10..456bc42 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -813,6 +813,10 @@ static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu 
> *vcpu) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
>  static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
> +static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
> +{
> +   return -1ULL;
> +}
>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>
>  #endif /* __MIPS_KVM_HOST_H__ */
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index ec35af3..5986c79 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -729,5 +729,9 @@ static inline void kvm_arch_exit(void) {}
>  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
>

Re: [PATCH v2] KVM: halt-polling: poll if emulated lapic timer will fire soon

2016-05-20 Thread David Matlack
On Thu, May 19, 2016 at 7:04 PM, Yang Zhang  wrote:
> On 2016/5/20 2:36, David Matlack wrote:
>>
>> On Thu, May 19, 2016 at 11:01 AM, David Matlack 
>> wrote:
>>>
>>> On Thu, May 19, 2016 at 6:27 AM, Wanpeng Li  wrote:
>>>>
>>>> From: Wanpeng Li 
>>>>
>>>> If an emulated lapic timer will fire soon(in the scope of 10us the
>>>> base of dynamic halt-polling, lower-end of message passing workload
>>>> latency TCP_RR's poll time < 10us) we can treat it as a short halt,
>>>> and poll to wait it fire, the fire callback apic_timer_fn() will set
>>>> KVM_REQ_PENDING_TIMER, and this flag will be check during busy poll.
>>>> This can avoid context switch overhead and the latency which we wake
>>>> up vCPU.
>>>
>>>
>>> If I understand correctly, your patch aims to reduce the latency of
>>> (APIC Timer expires) -> (Guest resumes execution) using halt-polling.
>>> Let me know if I'm misunderstanding.
>>>
>>> In general, I don't think it makes sense to poll for timer interrupts.
>>> We know when the timer interrupt is going to arrive. If we care about
>>> the latency of delivering that interrupt to the guest, we should
>>> program the hrtimer to wake us up slightly early, and then deliver the
>>> virtual timer interrupt right on time (I think KVM's TSC Deadline
>>> Timer emulation already does this).
>>
>>
>> (It looks like the way to enable this feature is to set the module
>> parameter lapic_timer_advance_ns and make sure your guest is using the
>> TSC Deadline timer instead of the APIC Timer.)
>
>
> This feature is slightly different from current advance expiration way.
> Advance expiration rely on the VCPU is running(do polling before vmentry).
> But in some cases, the timer interrupt may be blocked by other thread(i.e.,
> IF bit is clear) and VCPU cannot be scheduled to run immediately. So even
> advance the timer early, VCPU may still see the latency. But polling is
> different, it ensures the VCPU to aware the timer expiration before schedule
> out.
>
>>
>>> I'm curious to know if this scheme
>>> would give the same performance improvement to iperf as your patch.
>>>
>>> We discussed this a bit before on the mailing list before
>>> (https://lkml.org/lkml/2016/3/29/680). I'd like to see halt-polling
>>> and timer interrupts go in the opposite direction: if the next timer
>>> event (from any timer) is less than vcpu->halt_poll_ns, don't poll at
>>> all.
>>>
>>>>
>>>> iperf TCP get ~6% bandwidth improvement.
>>>
>>>
>>> Can you explain why your patch results in this bandwidth improvement?
>
>
> It should be reasonable. I have seen the same improvement with ctx switch
> benchmark: The latency is reduce from ~2600ns to ~2300ns with the similar
> mechanism.(The same idea but different implementation)

It's not obvious to me why polling for a timer interrupt would improve
context switch latency. Can you explain a bit more?

>
>>>
>>>>
>>>> Cc: Paolo Bonzini 
>>>> Cc: Radim Krčmář 
>>>> Cc: David Matlack 
>>>> Cc: Christian Borntraeger 
>>>> Signed-off-by: Wanpeng Li 
>>>> ---
>>>> v1 -> v2:
>>>>  * add return statement to non-x86 archs
>>>>  * capture never expire case for x86 (hrtimer is not started)
>>>>
>>>>  arch/arm/include/asm/kvm_host.h |  4 
>>>>  arch/arm64/include/asm/kvm_host.h   |  4 
>>>>  arch/mips/include/asm/kvm_host.h|  4 
>>>>  arch/powerpc/include/asm/kvm_host.h |  4 
>>>>  arch/s390/include/asm/kvm_host.h|  4 
>>>>  arch/x86/kvm/lapic.c| 11 +++
>>>>  arch/x86/kvm/lapic.h|  1 +
>>>>  arch/x86/kvm/x86.c  |  5 +
>>>>  include/linux/kvm_host.h|  1 +
>>>>  virt/kvm/kvm_main.c | 14 ++
>>>>  10 files changed, 48 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/arm/include/asm/kvm_host.h
>>>> b/arch/arm/include/asm/kvm_host.h
>>>> index 4cd8732..a5fd858 100644
>>>> --- a/arch/arm/include/asm/kvm_host.h
>>>> +++ b/arch/arm/include/asm/kvm_host.h
>>>> @@ -284,6 +284,10 @@ static inline void kvm_arch_sync_events(struct kvm
>>>> *kvm) {}
>>>>  static inline void kvm_arch_

Re: [PATCH v2] KVM: halt-polling: poll if emulated lapic timer will fire soon

2016-05-19 Thread David Matlack
On Thu, May 19, 2016 at 11:01 AM, David Matlack  wrote:
> On Thu, May 19, 2016 at 6:27 AM, Wanpeng Li  wrote:
>> From: Wanpeng Li 
>>
>> If an emulated lapic timer will fire soon(in the scope of 10us the
>> base of dynamic halt-polling, lower-end of message passing workload
>> latency TCP_RR's poll time < 10us) we can treat it as a short halt,
>> and poll to wait it fire, the fire callback apic_timer_fn() will set
>> KVM_REQ_PENDING_TIMER, and this flag will be check during busy poll.
>> This can avoid context switch overhead and the latency which we wake
>> up vCPU.
>
> If I understand correctly, your patch aims to reduce the latency of
> (APIC Timer expires) -> (Guest resumes execution) using halt-polling.
> Let me know if I'm misunderstanding.
>
> In general, I don't think it makes sense to poll for timer interrupts.
> We know when the timer interrupt is going to arrive. If we care about
> the latency of delivering that interrupt to the guest, we should
> program the hrtimer to wake us up slightly early, and then deliver the
> virtual timer interrupt right on time (I think KVM's TSC Deadline
> Timer emulation already does this).

(It looks like the way to enable this feature is to set the module
parameter lapic_timer_advance_ns and make sure your guest is using the
TSC Deadline timer instead of the APIC Timer.)
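
(For reference, the advance mechanism works roughly like this,
paraphrasing wait_lapic_expire() in arch/x86/kvm/lapic.c rather than
quoting it: the timer is armed lapic_timer_advance_ns early, and just
before VM-entry KVM burns the remaining cycles so the interrupt lands on
the guest's programmed TSC deadline.)

        u64 guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());

        if (guest_tsc < tsc_deadline)   /* tsc_deadline: guest's programmed deadline */
                __delay(tsc_deadline - guest_tsc);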

> I'm curious to know if this scheme
> would give the same performance improvement to iperf as your patch.
>
> We discussed this a bit before on the mailing list before
> (https://lkml.org/lkml/2016/3/29/680). I'd like to see halt-polling
> and timer interrupts go in the opposite direction: if the next timer
> event (from any timer) is less than vcpu->halt_poll_ns, don't poll at
> all.
>
>>
>> iperf TCP get ~6% bandwidth improvement.
>
> Can you explain why your patch results in this bandwidth improvement?
>
>>
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Cc: David Matlack 
>> Cc: Christian Borntraeger 
>> Signed-off-by: Wanpeng Li 
>> ---
>> v1 -> v2:
>>  * add return statement to non-x86 archs
>>  * capture never expire case for x86 (hrtimer is not started)
>>
>>  arch/arm/include/asm/kvm_host.h |  4 
>>  arch/arm64/include/asm/kvm_host.h   |  4 
>>  arch/mips/include/asm/kvm_host.h|  4 
>>  arch/powerpc/include/asm/kvm_host.h |  4 
>>  arch/s390/include/asm/kvm_host.h|  4 
>>  arch/x86/kvm/lapic.c| 11 +++
>>  arch/x86/kvm/lapic.h|  1 +
>>  arch/x86/kvm/x86.c  |  5 +
>>  include/linux/kvm_host.h|  1 +
>>  virt/kvm/kvm_main.c | 14 ++
>>  10 files changed, 48 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h 
>> b/arch/arm/include/asm/kvm_host.h
>> index 4cd8732..a5fd858 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -284,6 +284,10 @@ static inline void kvm_arch_sync_events(struct kvm 
>> *kvm) {}
>>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>> +static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
>> +{
>> +   return -1ULL;
>> +}
>>
>>  static inline void kvm_arm_init_debug(void) {}
>>  static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {}
>> diff --git a/arch/arm64/include/asm/kvm_host.h 
>> b/arch/arm64/include/asm/kvm_host.h
>> index d49399d..94e227a 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -359,6 +359,10 @@ static inline void kvm_arch_sync_events(struct kvm 
>> *kvm) {}
>>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>> +static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
>> +{
>> +   return -1ULL;
>> +}
>>
>>  void kvm_arm_init_debug(void);
>>  void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
>> diff --git a/arch/mips/include/asm/kvm_host.h 
>> b/arch/mips/include/asm/kvm_host.h
>> index 9a37a10..456bc42 100644
>> --- a/arch/mips/include/asm/kvm_host.h
>> +++ b/arch/mips/include/asm/kvm_host.h
>> @@ -813,6 +813,10 @@ static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu 
>&

Re: [PATCH v2] KVM: halt-polling: poll if emulated lapic timer will fire soon

2016-05-19 Thread David Matlack
On Thu, May 19, 2016 at 6:27 AM, Wanpeng Li  wrote:
> From: Wanpeng Li 
>
> If an emulated lapic timer will fire soon(in the scope of 10us the
> base of dynamic halt-polling, lower-end of message passing workload
> latency TCP_RR's poll time < 10us) we can treat it as a short halt,
> and poll to wait it fire, the fire callback apic_timer_fn() will set
> KVM_REQ_PENDING_TIMER, and this flag will be check during busy poll.
> This can avoid context switch overhead and the latency which we wake
> up vCPU.

If I understand correctly, your patch aims to reduce the latency of
(APIC Timer expires) -> (Guest resumes execution) using halt-polling.
Let me know if I'm misunderstanding.

In general, I don't think it makes sense to poll for timer interrupts.
We know when the timer interrupt is going to arrive. If we care about
the latency of delivering that interrupt to the guest, we should
program the hrtimer to wake us up slightly early, and then deliver the
virtual timer interrupt right on time (I think KVM's TSC Deadline
Timer emulation already does this). I'm curious to know if this scheme
would give the same performance improvement to iperf as your patch.

We discussed this a bit on the mailing list before
(https://lkml.org/lkml/2016/3/29/680). I'd like to see halt-polling
and timer interrupts go in the opposite direction: if the next timer
event (from any timer) is less than vcpu->halt_poll_ns, don't poll at
all.
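
Something like this, roughly (sketch only; it reuses the
kvm_arch_timer_remaining() helper this patch introduces, and "poll_ns"
is just a stand-in for however kvm_vcpu_block() decides how long to
poll):

        u64 poll_ns = vcpu->halt_poll_ns;

        /* A timer is about to fire anyway; don't burn cycles polling. */
        if (kvm_arch_timer_remaining(vcpu) < poll_ns)
                poll_ns = 0;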

>
> iperf TCP get ~6% bandwidth improvement.

Can you explain why your patch results in this bandwidth improvement?

>
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: David Matlack 
> Cc: Christian Borntraeger 
> Signed-off-by: Wanpeng Li 
> ---
> v1 -> v2:
>  * add return statement to non-x86 archs
>  * capture never expire case for x86 (hrtimer is not started)
>
>  arch/arm/include/asm/kvm_host.h |  4 
>  arch/arm64/include/asm/kvm_host.h   |  4 
>  arch/mips/include/asm/kvm_host.h|  4 
>  arch/powerpc/include/asm/kvm_host.h |  4 
>  arch/s390/include/asm/kvm_host.h|  4 
>  arch/x86/kvm/lapic.c| 11 +++
>  arch/x86/kvm/lapic.h|  1 +
>  arch/x86/kvm/x86.c  |  5 +
>  include/linux/kvm_host.h|  1 +
>  virt/kvm/kvm_main.c | 14 ++
>  10 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 4cd8732..a5fd858 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -284,6 +284,10 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) 
> {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
> +static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
> +{
> +   return -1ULL;
> +}
>
>  static inline void kvm_arm_init_debug(void) {}
>  static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {}
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index d49399d..94e227a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -359,6 +359,10 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) 
> {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
> +static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
> +{
> +   return -1ULL;
> +}
>
>  void kvm_arm_init_debug(void);
>  void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
> diff --git a/arch/mips/include/asm/kvm_host.h 
> b/arch/mips/include/asm/kvm_host.h
> index 9a37a10..456bc42 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -813,6 +813,10 @@ static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu 
> *vcpu) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
>  static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
> +static inline u64 kvm_arch_timer_remaining(struct kvm_vcpu *vcpu)
> +{
> +   return -1ULL;
> +}
>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>
>  #endif /* __MIPS_KVM_HOST_H__ */
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index ec35af3..5986c79 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> 

Re: [PATCH] kvm: x86: do not leak guest xcr0 into host interrupt handlers

2016-04-22 Thread David Matlack
On Fri, Apr 22, 2016 at 12:30 AM, Wanpeng Li  wrote:
> Hi Paolo and David,
> 2016-03-31 3:24 GMT+08:00 David Matlack :
>>
>> kernel_fpu_begin() saves the current fpu context. If this uses
>> XSAVE[OPT], it may leave the xsave area in an undesirable state.
>> According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
>> if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
>> xcr0[i] == 0 following an XSAVE.
>
> How XSAVE save bit i since SDM mentioned that "XSAVE saves state
> component i if and only if RFBM[i] = 1. "?  RFBM[i] will be 0 if
> XSTATE_BV[i] == 1 && guest xcr0[i] == 0.

You are correct, RFBM[i] will be 0 and XSAVE does not save state
component i in this case. However, XSTATE_BV[i] is left untouched by
XSAVE (left as 1). On XRSTOR, the CPU checks if XSTATE_BV[i] == 1 &&
xcr0[i] == 0, and if so delivers a #GP.

If you are wondering how XSTATE_BV[i] could be 1 in the first place, I
suspect it is left over from a previous XSAVE (which sets XSTATE_BV[i]
to the value in XINUSE[i]).
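
To make that sequence concrete, here is a toy user-space model of it
(not kernel code; the names and the simplified #GP rule are only for
illustration):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct xsave_hdr { uint64_t xstate_bv; };

/* XSAVE: RFBM = xcr0 & instruction mask; XSTATE_BV[i] is only written
 * for bits set in RFBM, everything else is left untouched. */
static void model_xsave(struct xsave_hdr *hdr, uint64_t xcr0,
                        uint64_t insn_mask, uint64_t xinuse)
{
        uint64_t rfbm = xcr0 & insn_mask;

        hdr->xstate_bv = (hdr->xstate_bv & ~rfbm) | (xinuse & rfbm);
}

/* XRSTOR: faults if XSTATE_BV claims state that xcr0 does not enable. */
static bool model_xrstor_faults(const struct xsave_hdr *hdr, uint64_t xcr0)
{
        return (hdr->xstate_bv & ~xcr0) != 0;
}

int main(void)
{
        struct xsave_hdr hdr = { .xstate_bv = 0x7 }; /* stale from a prior XSAVE */
        uint64_t guest_xcr0 = 0x1;                   /* guest enables x87 only */

        /* kernel_fpu_begin(): XSAVE with the guest's xcr0 live leaves
         * bits 1-2 of XSTATE_BV set, because RFBM masks them out. */
        model_xsave(&hdr, guest_xcr0, ~0ULL, 0x1);

        /* kernel_fpu_end(): XRSTOR with the guest's xcr0 still live. */
        printf("XRSTOR faults: %d\n", model_xrstor_faults(&hdr, guest_xcr0));
        return 0;
}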

>
> Regards,
> Wanpeng Li
>
>>
>> kernel_fpu_end() restores the fpu context. Now if any bit i in
>> XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
>> fault is trapped and SIGSEGV is delivered to the current process.


Re: [PATCH] kvm: x86: do not leak guest xcr0 into host interrupt handlers

2016-04-08 Thread David Matlack
On Fri, Apr 8, 2016 at 9:50 AM, Paolo Bonzini  wrote:
>
>
> On 08/04/2016 18:25, David Matlack wrote:
>> On Thu, Apr 7, 2016 at 12:03 PM, Paolo Bonzini  wrote:
>>>>
>>>> Thank you :). Let me know how testing goes.
>>>
>>> It went well.
>>
>> Great! How should we proceed?
>
> It will appear very soon on kvm/next and Radim will send the pull
> request to Linus next week (I'm having him practice before I go on
> vacation ;)).

Makes sense. Thanks for taking care of it!

>
> Paolo


Re: [PATCH] kvm: x86: do not leak guest xcr0 into host interrupt handlers

2016-04-08 Thread David Matlack
On Thu, Apr 7, 2016 at 12:03 PM, Paolo Bonzini  wrote:
>>
>> Thank you :). Let me know how testing goes.
>
> It went well.

Great! How should we proceed?


Re: [PATCH] kvm: x86: do not leak guest xcr0 into host interrupt handlers

2016-04-07 Thread David Matlack
On Thu, Apr 7, 2016 at 2:08 AM, Paolo Bonzini  wrote:
>
>
> On 05/04/2016 17:56, David Matlack wrote:
>> On Tue, Apr 5, 2016 at 4:28 AM, Paolo Bonzini  wrote:
>>>
>> ...
>>>
>>> While running my acceptance tests, in one case I got one CPU whose xcr0
>>> had leaked into the host.  This showed up as a SIGILL in strncasecmp's
>>> AVX code, and a simple program confirmed it:
>>>
>>> $ cat xgetbv.c
>>> #include 
>>> int main(void)
>>> {
>>> unsigned xcr0_h, xcr0_l;
>>> asm("xgetbv" : "=d"(xcr0_h), "=a"(xcr0_l) : "c"(0));
>>> printf("%08x:%08x\n", xcr0_h, xcr0_l);
>>> }
>>> $ gcc xgetbv.c -O2
>>> $ for i in `seq 0 55`; do echo $i `taskset -c $i ./a.out`; done|grep -v 
>>> 007
>>> 19 :0003
>>>
>>> I'm going to rerun the tests without this patch, as it seems the most
>>> likely culprit, and leave it out of the pull request if they pass.
>>
>> Agreed this is a very likely culprit. I think I see one way the
>> guest's xcr0 can leak into the host.
>
> That's cancel_injection, right?  If it's just about moving the load call
> below, I can do that.  Hmm, I will even test that today. :)

Yes that's what I was thinking, move kvm_load_guest_xcr0 below that if.
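
Something like this, I mean (rough sketch against the patched
vcpu_enter_guest(), untested):

        local_irq_disable();

        if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
            || need_resched() || signal_pending(current)) {
                vcpu->mode = OUTSIDE_GUEST_MODE;
                /* ... bail out with the host's xcr0 still loaded ... */
        }

        /* Only switch to the guest's xcr0 once we know we will enter. */
        kvm_load_guest_xcr0(vcpu);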

Thank you :). Let me know how testing goes.

>
> Paolo
>


Re: [PATCH] kvm: x86: do not leak guest xcr0 into host interrupt handlers

2016-04-05 Thread David Matlack
On Tue, Apr 5, 2016 at 4:28 AM, Paolo Bonzini  wrote:
>
...
>
> While running my acceptance tests, in one case I got one CPU whose xcr0
> had leaked into the host.  This showed up as a SIGILL in strncasecmp's
> AVX code, and a simple program confirmed it:
>
> $ cat xgetbv.c
> #include 
> int main(void)
> {
> unsigned xcr0_h, xcr0_l;
> asm("xgetbv" : "=d"(xcr0_h), "=a"(xcr0_l) : "c"(0));
> printf("%08x:%08x\n", xcr0_h, xcr0_l);
> }
> $ gcc xgetbv.c -O2
> $ for i in `seq 0 55`; do echo $i `taskset -c $i ./a.out`; done|grep -v 
> 007
> 19 :0003
>
> I'm going to rerun the tests without this patch, as it seems the most
> likely culprit, and leave it out of the pull request if they pass.

Agreed this is a very likely culprit. I think I see one way the
guest's xcr0 can leak into the host. I will do some testing and send
another version. Thanks.

>
> Paolo
>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index e260ccb..8df1167 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -700,7 +700,6 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 
>> index, u64 xcr)
>>   if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
>>   return 1;
>>   }
>> - kvm_put_guest_xcr0(vcpu);
>>   vcpu->arch.xcr0 = xcr0;
>>
>>   if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
>> @@ -6590,8 +6589,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>   kvm_x86_ops->prepare_guest_switch(vcpu);
>>   if (vcpu->fpu_active)
>>   kvm_load_guest_fpu(vcpu);
>> - kvm_load_guest_xcr0(vcpu);
>> -
>>   vcpu->mode = IN_GUEST_MODE;
>>
>>   srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>> @@ -6607,6 +6604,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>
>>   local_irq_disable();
>>
>> + kvm_load_guest_xcr0(vcpu);
>> +
>>   if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
>>   || need_resched() || signal_pending(current)) {

Here, after we've loaded the guest xcr0, if we enter this if
statement, we return from vcpu_enter_guest with the guest's xcr0 still
loaded.

>>   vcpu->mode = OUTSIDE_GUEST_MODE;
>> @@ -6667,6 +,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>   vcpu->mode = OUTSIDE_GUEST_MODE;
>>   smp_wmb();
>>
>> + kvm_put_guest_xcr0(vcpu);
>> +
>>   /* Interrupt is enabled by handle_external_intr() */
>>   kvm_x86_ops->handle_external_intr(vcpu);
>>
>> @@ -7314,7 +7315,6 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
>>* and assume host would use all available bits.
>>* Guest xcr0 would be loaded later.
>>*/
>> - kvm_put_guest_xcr0(vcpu);
>>   vcpu->guest_fpu_loaded = 1;
>>   __kernel_fpu_begin();
>>   __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state);
>> @@ -7323,8 +7323,6 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
>>
>>  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
>>  {
>> - kvm_put_guest_xcr0(vcpu);
>> -
>>   if (!vcpu->guest_fpu_loaded) {
>>   vcpu->fpu_counter = 0;
>>   return;
>>


[PATCH] kvm: x86: do not leak guest xcr0 into host interrupt handlers

2016-03-30 Thread David Matlack
An interrupt handler that uses the fpu can kill a KVM VM, if it runs
under the following conditions:
 - the guest's xcr0 register is loaded on the cpu
 - the guest's fpu context is not loaded
 - the host is using eagerfpu

Note that the guest's xcr0 register and fpu context are not loaded as
part of the atomic world switch into "guest mode". They are loaded by
KVM while the cpu is still in "host mode".

Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
interrupt handler will look something like this:

if (irq_fpu_usable()) {
kernel_fpu_begin();

[... code that uses the fpu ...]

kernel_fpu_end();
}

As long as the guest's fpu is not loaded and the host is using eager
fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
returns true). The interrupt handler proceeds to use the fpu with
the guest's xcr0 live.

kernel_fpu_begin() saves the current fpu context. If this uses
XSAVE[OPT], it may leave the xsave area in an undesirable state.
According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
xcr0[i] == 0 following an XSAVE.

kernel_fpu_end() restores the fpu context. Now if any bit i in
XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
fault is trapped and SIGSEGV is delivered to the current process.

Only pre-4.2 kernels appear to be vulnerable to this sequence of
events. Commit 653f52c ("kvm,x86: load guest FPU context more eagerly")
from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.

This patch fixes the bug by keeping the host's xcr0 loaded outside
of the interrupts-disabled region where KVM switches into guest mode.

Cc: sta...@vger.kernel.org
Suggested-by: Andy Lutomirski 
Signed-off-by: David Matlack 
---
 arch/x86/kvm/x86.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e260ccb..8df1167 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -700,7 +700,6 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, 
u64 xcr)
if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
return 1;
}
-   kvm_put_guest_xcr0(vcpu);
vcpu->arch.xcr0 = xcr0;
 
if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
@@ -6590,8 +6589,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_x86_ops->prepare_guest_switch(vcpu);
if (vcpu->fpu_active)
kvm_load_guest_fpu(vcpu);
-   kvm_load_guest_xcr0(vcpu);
-
vcpu->mode = IN_GUEST_MODE;
 
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
@@ -6607,6 +6604,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
local_irq_disable();
 
+   kvm_load_guest_xcr0(vcpu);
+
if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
|| need_resched() || signal_pending(current)) {
vcpu->mode = OUTSIDE_GUEST_MODE;
@@ -6667,6 +,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
 
+   kvm_put_guest_xcr0(vcpu);
+
/* Interrupt is enabled by handle_external_intr() */
kvm_x86_ops->handle_external_intr(vcpu);
 
@@ -7314,7 +7315,6 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 * and assume host would use all available bits.
 * Guest xcr0 would be loaded later.
 */
-   kvm_put_guest_xcr0(vcpu);
vcpu->guest_fpu_loaded = 1;
__kernel_fpu_begin();
__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state);
@@ -7323,8 +7323,6 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
-   kvm_put_guest_xcr0(vcpu);
-
if (!vcpu->guest_fpu_loaded) {
vcpu->fpu_counter = 0;
return;
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH] KVM: x86: reduce default value of halt_poll_ns parameter

2016-03-29 Thread David Matlack
On Tue, Mar 29, 2016 at 8:57 AM, Paolo Bonzini  wrote:
>
> Windows lets applications choose the frequency of the timer tick,
> and in Windows 10 the maximum rate was changed from 1024 Hz to
> 2048 Hz.  Unfortunately, because of the way the Windows API
> works, most applications who need a higher rate than the default
> 64 Hz will just do
>
>timeGetDevCaps(&tc, sizeof(tc));
>timeBeginPeriod(tc.wPeriodMin);
>
> and pick the maximum rate.  This causes very high CPU usage when
> playing media or games on Windows 10, even if the guest does not
> actually use the CPU very much, because the frequent timer tick
> causes halt_poll_ns to kick in.
>
> There is no really good solution, especially because Microsoft
> could sooner or later bump the limit to 4096 Hz, but for now
> the best we can do is lower a bit the upper limit for
> halt_poll_ns. :-(

This is a good solution for now. I don't think we lose noticeable
performance by lowering the max to 400 us.

Do you think it's ever useful to poll for a timer interrupt? It seems
like it wouldn't be. We don't need polling to deliver accurate timer
interrupts, KVM already delivers the TSC deadline timer slightly early
to account for injection delay. Maybe we can shrink polling anytime a
timer interrupt wakes up a VCPU.
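
For example, something along these lines (sketch only; the
vcpu_woken_by_timer() helper is made up and would need to be plumbed
through from the lapic code, and the grow/shrink conditions are
simplified compared to the real ones):

        /* in kvm_vcpu_block(), where vcpu->halt_poll_ns is adjusted */
        if (vcpu_woken_by_timer(vcpu))
                shrink_halt_poll_ns(vcpu);
        else if (block_ns < halt_poll_ns)
                grow_halt_poll_ns(vcpu);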

>
> Reported-by: Jon Panozzo 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/include/asm/kvm_host.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f62a9f37f79f..b7e394485a5f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -43,7 +43,7 @@
>
>  #define KVM_PIO_PAGE_OFFSET 1
>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
> -#define KVM_HALT_POLL_NS_DEFAULT 50
> +#define KVM_HALT_POLL_NS_DEFAULT 40
>
>  #define KVM_IRQCHIP_NUM_PINS  KVM_IOAPIC_NUM_PINS
>
> --
> 1.8.3.1
>


Re: [PATCH 0/1] KVM: x86: using the fpu in interrupt context with a guest's xcr0

2016-03-19 Thread David Matlack
On Tue, Mar 15, 2016 at 8:48 PM, Andy Lutomirski  wrote:
>
> Why is it safe to rely on interrupted_kernel_fpu_idle?  That function
> is for interrupts, but is there any reason that KVM can't be preempted
> (or explicitly schedule) with XCR0 having some funny value?

KVM restores the host's xcr0 in the sched-out preempt notifier and
prior to returning to userspace.


Re: [PATCH 0/1] KVM: x86: using the fpu in interrupt context with a guest's xcr0

2016-03-19 Thread David Matlack
On Tue, Mar 15, 2016 at 8:43 PM, Xiao Guangrong
 wrote:
>
>
> On 03/16/2016 03:01 AM, David Matlack wrote:
>>
>> On Mon, Mar 14, 2016 at 12:46 AM, Xiao Guangrong
>>  wrote:
>>>
>>> On 03/12/2016 04:47 AM, David Matlack wrote:
>>>
>>>> I have not been able to trigger this bug on Linux 4.3, and suspect
>>>> it is due to this commit from Linux 4.2:
>>>>
>>>> 653f52c kvm,x86: load guest FPU context more eagerly
>>>>
>>>> With this commit, as long as the host is using eagerfpu, the guest's
>>>> fpu is always loaded just before the guest's xcr0 (vcpu->fpu_active
>>>> is always 1 in the following snippet):
>>>>
>>>> 6569 if (vcpu->fpu_active)
>>>> 6570 kvm_load_guest_fpu(vcpu);
>>>> 6571 kvm_load_guest_xcr0(vcpu);
>>>>
>>>> When the guest's fpu is loaded, irq_fpu_usable() returns false.
>>>
>>> Er, i did not see that commit introduced this change.
>>>
>>>>
>>>> We've included our workaround for this bug, which applies to Linux 3.11.
>>>> It does not apply cleanly to HEAD since the fpu subsystem was refactored
>>>> in Linux 4.2. While the latest kernel does not look vulnerable, we may
>>>> want to apply a fix to the vulnerable stable kernels.
>>>
>>> Is the latest kvm safe if we use !eager fpu?
>>
>> Yes I believe so. When !eagerfpu, interrupted_kernel_fpu_idle()
>> returns "!current->thread.fpu.fpregs_active && (read_cr0() &
>> X86_CR0_TS)". This should ensure the interrupt handler never does
>> XSAVE/XRSTOR with the guest's xcr0.
>
> interrupted_kernel_fpu_idle() returns true if KVM-based hypervisor (e.g.
> QEMU)
> is not using fpu. That can not stop handler using fpu.

You are correct, the interrupt handler can still use the fpu. But
kernel_fpu_{begin,end} will not execute XSAVE / XRSTOR.


Re: [PATCH 0/3] KVM: VMX: fix handling inv{ept,vpid} and nested RHEL6 KVM

2016-03-18 Thread David Matlack
On Fri, Mar 18, 2016 at 9:09 AM, Paolo Bonzini  wrote:
> Patches 1 and 2 fix two cases where a guest could hang at 100% CPU
> due to mis-emulation of a failing invept or invvpid.

Will you be sending out kvm-unit-test test cases for these?

>
> Patch 3 works around a bug in RHEL6 KVM, which is exposed by nested
> VPID support; RHEL6 KVM uses single-context invvpid unconditionally,
> but until now KVM did not provide it.
>
> Paolo
>

For the series,

Reviewed-by: David Matlack 

> Paolo Bonzini (3):
>   KVM: VMX: avoid guest hang on invalid invept instruction
>   KVM: VMX: avoid guest hang on invalid invvpid instruction
>   KVM: VMX: fix nested vpid for old KVM guests
>
>  arch/x86/kvm/vmx.c | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> --
> 1.8.3.1
>


Re: [PATCH 0/3] KVM: VMX: fix handling inv{ept,vpid} and nested RHEL6 KVM

2016-03-18 Thread David Matlack
On Fri, Mar 18, 2016 at 10:58 AM, Paolo Bonzini  wrote:
>
>
> On 18/03/2016 18:42, David Matlack wrote:
>> On Fri, Mar 18, 2016 at 9:09 AM, Paolo Bonzini  wrote:
>>> Patches 1 and 2 fix two cases where a guest could hang at 100% CPU
>>> due to mis-emulation of a failing invept or invvpid.
>>
>> Will you be sending out kvm-unit-test test cases for these?
>
> Yes, of course, especially for patches 1 and 2.

Thanks!

> However I first want to
> add a --enable-unsafe option for stuff that breaks particularly badly
> when the test fails.  We don't do nested virt CVEs (yet), but all of
> these would be treated as vulnerabilities if we did---the tests would
> effectively DoS the host.

How does this DoS the host? The guest is stuck executing the same
instruction over and over, but it's exiting to KVM every time,
allowing KVM to reschedule the VCPU. I would agree it DoSes the guest.

>
> The infamous #AC failure could also be under a flag like that, and I
> remember a similar topic popping up with a LAPIC fix from Google.
>
> Paolo
>
>>>
>>> Patch 3 works around a bug in RHEL6 KVM, which is exposed by nested
>>> VPID support; RHEL6 KVM uses single-context invvpid unconditionally,
>>> but until now KVM did not provide it.
>>>
>>> Paolo
>>>
>>
>> For the series,
>>
>> Reviewed-by: David Matlack 
>>
>>> Paolo Bonzini (3):
>>>   KVM: VMX: avoid guest hang on invalid invept instruction
>>>   KVM: VMX: avoid guest hang on invalid invvpid instruction
>>>   KVM: VMX: fix nested vpid for old KVM guests
>>>
>>>  arch/x86/kvm/vmx.c | 16 +++-
>>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> --
>>> 1.8.3.1
>>>


Re: [PATCH 0/1] KVM: x86: using the fpu in interrupt context with a guest's xcr0

2016-03-15 Thread David Matlack
On Mon, Mar 14, 2016 at 12:46 AM, Xiao Guangrong
 wrote:
>
>
> On 03/12/2016 04:47 AM, David Matlack wrote:
>
>> I have not been able to trigger this bug on Linux 4.3, and suspect
>> it is due to this commit from Linux 4.2:
>>
>> 653f52c kvm,x86: load guest FPU context more eagerly
>>
>> With this commit, as long as the host is using eagerfpu, the guest's
>> fpu is always loaded just before the guest's xcr0 (vcpu->fpu_active
>> is always 1 in the following snippet):
>>
>> 6569 if (vcpu->fpu_active)
>> 6570 kvm_load_guest_fpu(vcpu);
>> 6571 kvm_load_guest_xcr0(vcpu);
>>
>> When the guest's fpu is loaded, irq_fpu_usable() returns false.
>
>
> Er, i did not see that commit introduced this change.
>
>>
>> We've included our workaround for this bug, which applies to Linux 3.11.
>> It does not apply cleanly to HEAD since the fpu subsystem was refactored
>> in Linux 4.2. While the latest kernel does not look vulnerable, we may
>> want to apply a fix to the vulnerable stable kernels.
>
>
> Is the latest kvm safe if we use !eager fpu?

Yes I believe so. When !eagerfpu, interrupted_kernel_fpu_idle()
returns "!current->thread.fpu.fpregs_active && (read_cr0() &
X86_CR0_TS)". This should ensure the interrupt handler never does
XSAVE/XRSTOR with the guest's xcr0.

> Under this case,
> kvm_load_guest_fpu()
> is not called for every single VM-enter, that means kernel will use guest's
> xcr0 to
> save/restore XSAVE area.
>
> Maybe a simpler fix is just calling __kernel_fpu_begin() when the CPU
> switches
> to vCPU and reverts it when the vCPU is scheduled out or returns to
> userspace.
>


Re: [PATCH 1/1] KVM: don't allow irq_fpu_usable when the VCPU's XCR0 is loaded

2016-03-11 Thread David Matlack
On Fri, Mar 11, 2016 at 1:14 PM, Andy Lutomirski  wrote:
>
> On Fri, Mar 11, 2016 at 12:47 PM, David Matlack  wrote:
> > From: Eric Northup 
> >
> > Add a percpu boolean, tracking whether a KVM vCPU is running on the
> > host CPU.  KVM will set and clear it as it loads/unloads guest XCR0.
> > (Note that the rest of the guest FPU load/restore is safe, because
> > kvm_load_guest_fpu and kvm_put_guest_fpu call __kernel_fpu_begin()
> > and __kernel_fpu_end(), respectively.)  irq_fpu_usable() will then
> > also check for this percpu boolean.
>
> Is this better than just always keeping the host's XCR0 loaded outside
> if the KVM interrupts-disabled region?

Probably not. AFAICT KVM does not rely on it being loaded outside that
region. xsetbv isn't insanely expensive, is it? Maybe it was put
outside to minimize the time spent with interrupts disabled.

I do like that your solution would be contained to KVM.

>
> --Andy


[PATCH 0/1] KVM: x86: using the fpu in interrupt context with a guest's xcr0

2016-03-11 Thread David Matlack
We've found that an interrupt handler that uses the fpu can kill a KVM
VM, if it runs under the following conditions:
 - the guest's xcr0 register is loaded on the cpu
 - the guest's fpu context is not loaded
 - the host is using eagerfpu

Note that the guest's xcr0 register and fpu context are not loaded as
part of the atomic world switch into "guest mode". They are loaded by
KVM while the cpu is still in "host mode".

Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
interrupt handler will look something like this:

if (irq_fpu_usable()) {
kernel_fpu_begin();

[... code that uses the fpu ...]

kernel_fpu_end();
}

As long as the guest's fpu is not loaded and the host is using eager
fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
returns true). The interrupt handler proceeds to use the fpu with
the guest's xcr0 live.

kernel_fpu_begin() saves the current fpu context. If this uses
XSAVE[OPT], it may leave the xsave area in an undesirable state.
According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
xcr0[i] == 0 following an XSAVE.

kernel_fpu_end() restores the fpu context. Now if any bit i in
XSTATE_BV is 1 while xcr0[i] is 0, XRSTOR generates a #GP fault.
(This #GP gets trapped and turned into a SIGSEGV, which kills the
VM.)

In guests that have access to the same CPU features as the host, this
bug is more likely to reproduce during VM boot, while the guest xcr0
is 1. Once the guest's xcr0 is indistinguishable from the host's,
there is no issue.

I have not been able to trigger this bug on Linux 4.3, and suspect
it is due to this commit from Linux 4.2:

653f52c kvm,x86: load guest FPU context more eagerly

With this commit, as long as the host is using eagerfpu, the guest's
fpu is always loaded just before the guest's xcr0 (vcpu->fpu_active
is always 1 in the following snippet):

6569 if (vcpu->fpu_active)
6570 kvm_load_guest_fpu(vcpu);
6571 kvm_load_guest_xcr0(vcpu);

When the guest's fpu is loaded, irq_fpu_usable() returns false.

We've included our workaround for this bug, which applies to Linux 3.11.
It does not apply cleanly to HEAD since the fpu subsystem was refactored
in Linux 4.2. While the latest kernel does not look vulnerable, we may
want to apply a fix to the vulnerable stable kernels.

An equally effective solution may be to just backport 653f52c to stable.

Attached here is a stress module we used to reproduce the bug. It
fires IPIs at all online CPUs and uses the fpu in the IPI handler. We
found that running this module while booting a VM was an extremely
effective way to kill said VM :). For the kernel developers who are
working to make eagerfpu the global default, this module might be a
useful stress test, especially when run in the background during
other tests.

--- 8< ---
 irq_fpu_stress.c | 95 
 1 file changed, 95 insertions(+)
 create mode 100644 irq_fpu_stress.c

diff --git a/irq_fpu_stress.c b/irq_fpu_stress.c
new file mode 100644
index 000..faa6ba3
--- /dev/null
+++ b/irq_fpu_stress.c
@@ -0,0 +1,95 @@
+/*
+ * For the duration of time this module is loaded, this module fires
+ * IPIs at all CPUs and tries to use the FPU on that CPU in irq
+ * context.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+MODULE_LICENSE("GPL");
+
+#define MODNAME "irq_fpu_stress"
+#undef pr_fmt
+#define pr_fmt(fmt) MODNAME": "fmt
+
+struct workqueue_struct *work_queue;
+struct work_struct work;
+
+struct {
+   atomic_t irq_fpu_usable;
+   atomic_t irq_fpu_unusable;
+   unsigned long num_tests;
+} stats;
+
+bool done;
+
+static void test_irq_fpu(void *info)
+{
+   BUG_ON(!in_interrupt());
+
+   if (irq_fpu_usable()) {
+   atomic_inc(&stats.irq_fpu_usable);
+
+   kernel_fpu_begin();
+   kernel_fpu_end();
+   } else {
+   atomic_inc(&stats.irq_fpu_unusable);
+   }
+}
+
+static void do_work(struct work_struct *w)
+{
+   pr_info("starting test\n");
+
+   stats.num_tests = 0;
+   atomic_set(&stats.irq_fpu_usable, 0);
+   atomic_set(&stats.irq_fpu_unusable, 0);
+
+   while (!ACCESS_ONCE(done)) {
+   preempt_disable();
+   smp_call_function_many(
+   cpu_online_mask, test_irq_fpu, NULL, 1 /* wait */);
+   preempt_enable();
+
+   stats.num_tests++;
+
+   if (need_resched())
+   schedule();
+   }
+
+   pr_info("finished test\n");
+}
+
+int init_module(void)
+{
+   work_queue = create_singlethread_workqueue(MODNAME);
+
+   INIT_WORK(&work, do_work);
+   queue_work(work_queue, &work);
+
+   return 0;
+}
+
+void cleanup_module(void)
+{
+   ACCESS_ONCE(done) = true;
+
+  

[PATCH 1/1] KVM: don't allow irq_fpu_usable when the VCPU's XCR0 is loaded

2016-03-11 Thread David Matlack
From: Eric Northup 

Add a percpu boolean, tracking whether a KVM vCPU is running on the
host CPU.  KVM will set and clear it as it loads/unloads guest XCR0.
(Note that the rest of the guest FPU load/restore is safe, because
kvm_load_guest_fpu and kvm_put_guest_fpu call __kernel_fpu_begin()
and __kernel_fpu_end(), respectively.)  irq_fpu_usable() will then
also check for this percpu boolean.
---
 arch/x86/include/asm/i387.h |  3 +++
 arch/x86/kernel/i387.c  | 10 --
 arch/x86/kvm/x86.c  |  4 
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index ed8089d..ca2c173 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -14,6 +14,7 @@
 
 #include 
 #include 
+#include 
 
 struct pt_regs;
 struct user_i387_struct;
@@ -25,6 +26,8 @@ extern void math_state_restore(void);
 
 extern bool irq_fpu_usable(void);
 
+DECLARE_PER_CPU(bool, kvm_xcr0_loaded);
+
 /*
  * Careful: __kernel_fpu_begin/end() must be called with preempt disabled
  * and they don't touch the preempt state on their own.
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index b627746..9015828 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -19,6 +19,9 @@
 #include 
 #include 
 
+DEFINE_PER_CPU(bool, kvm_xcr0_loaded);
+EXPORT_PER_CPU_SYMBOL(kvm_xcr0_loaded);
+
 /*
  * Were we in an interrupt that interrupted kernel mode?
  *
@@ -33,8 +36,11 @@
  */
 static inline bool interrupted_kernel_fpu_idle(void)
 {
-   if (use_eager_fpu())
-   return __thread_has_fpu(current);
+   if (use_eager_fpu()) {
+   /* Preempt already disabled, safe to read percpu. */
+   return __thread_has_fpu(current) &&
+   !__this_cpu_read(kvm_xcr0_loaded);
+   }
 
return !__thread_has_fpu(current) &&
(read_cr0() & X86_CR0_TS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d21bce5..f0ba7a1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -557,8 +557,10 @@ EXPORT_SYMBOL_GPL(kvm_lmsw);
 
 static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
 {
+   BUG_ON(this_cpu_read(kvm_xcr0_loaded) != vcpu->guest_xcr0_loaded);
if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
!vcpu->guest_xcr0_loaded) {
+   this_cpu_write(kvm_xcr0_loaded, 1);
/* kvm_set_xcr() also depends on this */
xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
vcpu->guest_xcr0_loaded = 1;
@@ -571,7 +573,9 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
if (vcpu->arch.xcr0 != host_xcr0)
xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
vcpu->guest_xcr0_loaded = 0;
+   this_cpu_write(kvm_xcr0_loaded, 0);
}
+   BUG_ON(this_cpu_read(kvm_xcr0_loaded) != vcpu->guest_xcr0_loaded);
 }
 
 int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
-- 
2.7.0.rc3.207.g0ac5344



[PATCH] kvm: cap halt polling at exactly halt_poll_ns

2016-03-08 Thread David Matlack
When growing halt-polling, there is no check that the poll time exceeds
the limit. It's possible for vcpu->halt_poll_ns to grow once past
halt_poll_ns, and stay there until a halt which takes longer than
vcpu->halt_poll_ns. For example, booting a Linux guest with
halt_poll_ns=11000:

 ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 1)
 ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 1 (grow 0)
 ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 2 (grow 1)

Signed-off-by: David Matlack 
---
 virt/kvm/kvm_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a11cfd2..9102ae1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1952,6 +1952,9 @@ static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
else
val *= halt_poll_ns_grow;
 
+   if (val > halt_poll_ns)
+   val = halt_poll_ns;
+
vcpu->halt_poll_ns = val;
trace_kvm_halt_poll_ns_grow(vcpu->vcpu_id, val, old);
 }
-- 
2.7.0.rc3.207.g0ac5344



Re: [PATCH v2] KVM: VMX: disable PEBS before a guest entry

2016-03-04 Thread David Matlack
On Fri, Mar 4, 2016 at 6:08 AM, Radim Krčmář  wrote:
> Linux guests on Haswell (and also SandyBridge and Broadwell, at least)
> would crash if you decided to run a host command that uses PEBS, like
>   perf record -e 'cpu/mem-stores/pp' -a
>
> This happens because KVM is using VMX MSR switching to disable PEBS, but
> SDM [2015-12] 18.4.4.4 Re-configuring PEBS Facilities explains why it
> isn't safe:
>   When software needs to reconfigure PEBS facilities, it should allow a
>   quiescent period between stopping the prior event counting and setting
>   up a new PEBS event. The quiescent period is to allow any latent
>   residual PEBS records to complete its capture at their previously
>   specified buffer address (provided by IA32_DS_AREA).
>
> There might not be a quiescent period after the MSR switch, so a CPU
> ends up using host's MSR_IA32_DS_AREA to access an area in guest's
> memory.  (Or MSR switching is just buggy on some models.)
>
> The guest can learn something about the host this way:
> If the guest doesn't map address pointed by MSR_IA32_DS_AREA, it results
> in #PF where we leak host's MSR_IA32_DS_AREA through CR2.
>
> After that, a malicious guest can map and configure memory where
> MSR_IA32_DS_AREA is pointing and can therefore get an output from
> host's tracing.
>
> This is not a critical leak as the host must initiate with PEBS tracing
> and I have not been able to get a record from more than one instruction
> before vmentry in vmx_vcpu_run() (that place has most registers already
> overwritten with guest's).
>
> We could disable PEBS just few instructions before vmentry, but
> disabling it earlier shouldn't affect host tracing too much.
> We also don't need to switch MSR_IA32_PEBS_ENABLE on VMENTRY, but that
> optimization isn't worth its code, IMO.
>
> (If you are implementing PEBS for guests, be sure to handle the case
>  where both host and guest enable PEBS, because this patch doesn't.)
>
> Fixes: 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry.")
> Cc: 
> Reported-by: Jiří Olša 
> Signed-off-by: Radim Krčmář 

Reviewed-by: David Matlack 

BTW the commit message is great. Thanks for including so much detail.

> ---
>  v2
>   - moved code to add_atomic_switch_msr, so the patch will work [David]
>   - more appropriate "KVM: VMX:" in subject
>  v1: http://www.spinics.net/lists/kvm/msg128808.html
>
>  arch/x86/kvm/vmx.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 46154dac71e6..e5572696c9e3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1822,6 +1822,13 @@ static void add_atomic_switch_msr(struct vcpu_vmx 
> *vmx, unsigned msr,
> return;
> }
> break;
> +   case MSR_IA32_PEBS_ENABLE:
> +   /* PEBS needs a quiescent period after being disabled (to 
> write
> +* a record).  Disabling PEBS through VMX MSR swapping doesn't
> +* provide that period, so a CPU could write host's record 
> into
> +* guest's memory.
> +*/
> +   wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
> }
>
> for (i = 0; i < m->nr; ++i)
> --
> 2.7.2
>


Re: [PATCH] KVM: x86: disable PEBS before a guest entry

2016-03-03 Thread David Matlack
On Thu, Mar 3, 2016 at 10:53 AM, Radim Krčmář  wrote:
> Linux guests on Haswell (and also SandyBridge and Broadwell, at least)
> would crash if you decided to run a host command that uses PEBS, like
>   perf record -e 'cpu/mem-stores/pp' -a
>
> This happens because KVM is using VMX MSR switching to disable PEBS, but
> SDM [2015-12] 18.4.4.4 Re-configuring PEBS Facilities explains why it
> isn't safe:
>   When software needs to reconfigure PEBS facilities, it should allow a
>   quiescent period between stopping the prior event counting and setting
>   up a new PEBS event. The quiescent period is to allow any latent
>   residual PEBS records to complete its capture at their previously
>   specified buffer address (provided by IA32_DS_AREA).
>
> There might not be a quiescent period after the MSR switch, so a CPU
> ends up using host's MSR_IA32_DS_AREA to access an area in guest's
> memory.  (Or MSR switching is just buggy on some models.)
>
> The guest can learn something about the host this way:
> If the guest doesn't map address pointed by MSR_IA32_DS_AREA, it results
> in #PF where we leak host's MSR_IA32_DS_AREA through CR2.
>
> After that, a malicious guest can map and configure memory where
> MSR_IA32_DS_AREA is pointing and can therefore get an output from
> host's tracing.
>
> This is not a critical leak as the host must initiate with PEBS tracing
> and I have not been able to get a record from more than one instruction
> before vmentry in vmx_vcpu_run() (that place has most registers already
> overwritten with guest's).
>
> We could disable PEBS just few instructions before vmentry, but
> disabling it earlier shouldn't affect host tracing too much.
> We also don't need to switch MSR_IA32_PEBS_ENABLE on VMENTRY, but that
> optimization isn't worth its code, IMO.
>
> (If you are implementing PEBS for guests, be sure to handle the case
>  where both host and guest enable PEBS, because this patch doesn't.)
>
> Fixes: 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry.")
> Cc: 
> Reported-by: Jiří Olša 
> Signed-off-by: Radim Krčmář 
> ---
>  arch/x86/kvm/vmx.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 46154dac71e6..946582f4f105 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1767,6 +1767,13 @@ static void clear_atomic_switch_msr(struct vcpu_vmx 
> *vmx, unsigned msr)
> return;
> }
> break;
> +   case MSR_IA32_PEBS_ENABLE:
> +   /* PEBS needs a quiescent period after being disabled (to 
> write
> +* a record).  Disabling PEBS through VMX MSR swapping doesn't
> +* provide that period, so a CPU could write host's record 
> into
> +* guest's memory.
> +*/
> +   wrmsrl(MSR_IA32_PEBS_ENABLE, 0);

Should this go in add_atomic_switch_msr instead of clear_atomic_switch_msr?

> }
>
> for (i = 0; i < m->nr; ++i)
> --
> 2.7.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-10-15 Thread David Matlack
On Wed, Oct 14, 2015 at 6:33 PM, Wu, Feng  wrote:
>
>> -Original Message-
>> From: David Matlack [mailto:dmatl...@google.com]
>> Sent: Thursday, October 15, 2015 7:41 AM
>> To: Wu, Feng 
>> Cc: Paolo Bonzini ; alex.william...@redhat.com; Joerg
>> Roedel ; Marcelo Tosatti ;
>> eric.au...@linaro.org; kvm list ; iommu@lists.linux-
>> foundation.org; linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when
>> vCPU is blocked
>>
>> Hi Feng.
>>
>> On Fri, Sep 18, 2015 at 7:29 AM, Feng Wu  wrote:
>> > This patch updates the Posted-Interrupts Descriptor when vCPU
>> > is blocked.
>> >
>> > pre-block:
>> > - Add the vCPU to the blocked per-CPU list
>> > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>> >
>> > post-block:
>> > - Remove the vCPU from the per-CPU list
>>
>> I'm wondering what happens if a posted interrupt arrives at the IOMMU
>> after pre-block and before post-block.
>>
>> In pre_block, NV is set to POSTED_INTR_WAKEUP_VECTOR. IIUC, this means
>> future posted interrupts will not trigger "Posted-Interrupt Processing"
>> (PIR will not get copied to VIRR). Instead, the IOMMU will do ON := 1,
>> PIR |= (1 << vector), and send POSTED_INTR_WAKEUP_VECTOR. PIWV calls
>> wakeup_handler which does kvm_vcpu_kick. kvm_vcpu_kick does a wait-queue
>> wakeup and possibly a scheduler ipi.
>>
>> But the VCPU is sitting in kvm_vcpu_block. It spins and/or schedules
>> (wait queue) until it has a reason to wake up. I couldn't find a code
>> path from kvm_vcpu_block that lead to checking ON or PIR. How does the
>> blocked VCPU "receive" the posted interrupt? (And when does Posted-
>> Interrupt Processing get triggered?)
>
> In the pre_block, it also change the 'NDST' filed to the pCPU, on which the 
> vCPU
> is put to the per-CPU list 'blocked_vcpu_on_cpu', so when posted-interrupts
> come it, it will sent the wakeup notification event to the pCPU above, then in
> the wakeup_handler, it can find the vCPU from the per-CPU list, hence
> kvm_vcpu_kick can wake up it.

Thank you for your response. I was actually confused about something
else. After wakeup_handler->kvm_vcpu_kick causes the vcpu to wake up,
that vcpu calls kvm_vcpu_check_block() to check if there are pending
events, otherwise the vcpu goes back to sleep. I had trouble yesterday
finding the code path from kvm_vcpu_check_block() which checks PIR/ON.

But after spending more time reading the source code this morning I
found that kvm_vcpu_check_block() eventually calls into
vmx_sync_pir_to_irr(), which copies PIR to IRR and clears ON. And then
apic_find_highest_irr() detects the pending posted interrupt.
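
For anyone else tracing this later, the path is roughly the following
(intermediate calls elided):

    kvm_vcpu_check_block()
      kvm_arch_vcpu_runnable()
        ...
          vmx_sync_pir_to_irr()    /* copies PIR into IRR, clears ON */
          apic_find_highest_irr()  /* now sees the posted vector */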

>
> Thanks,
> Feng
>
>>
>> Thanks!
>>
>> >
>> > Signed-off-by: Feng Wu 
>> > ---
>> > v9:
>> > - Add description for blocked_vcpu_on_cpu_lock in
>> Documentation/virtual/kvm/locking.txt
>> > - Check !kvm_arch_has_assigned_device(vcpu->kvm) first, then
>> >   !irq_remapping_cap(IRQ_POSTING_CAP)
>> >
>> > v8:
>> > - Rename 'pi_pre_block' to 'pre_block'
>> > - Rename 'pi_post_block' to 'post_block'
>> > - Change some comments
>> > - Only add the vCPU to the blocking list when the VM has assigned devices.
>> >
>> >  Documentation/virtual/kvm/locking.txt |  12 +++
>> >  arch/x86/include/asm/kvm_host.h   |  13 +++
>> >  arch/x86/kvm/vmx.c| 153
>> ++
>> >  arch/x86/kvm/x86.c|  53 +---
>> >  include/linux/kvm_host.h  |   3 +
>> >  virt/kvm/kvm_main.c   |   3 +
>> >  6 files changed, 227 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/Documentation/virtual/kvm/locking.txt
>> b/Documentation/virtual/kvm/locking.txt
>> > index d68af4d..19f94a6 100644
>> > --- a/Documentation/virtual/kvm/locking.txt
>> > +++ b/Documentation/virtual/kvm/locking.txt
>> > @@ -166,3 +166,15 @@ Comment:   The srcu read lock must be held while
>> accessing memslots (e.g.
>> > MMIO/PIO address->device structure mapping (kvm->buses).
>> > The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
>> > if it is needed by multiple functions.
>> > +
>> > +Name:  blocked_vcpu_on_cpu_lock
>> > +Type:  spinlock_t
>> > +Arch

Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-10-14 Thread David Matlack
Hi Feng.

On Fri, Sep 18, 2015 at 7:29 AM, Feng Wu  wrote:
> This patch updates the Posted-Interrupts Descriptor when vCPU
> is blocked.
>
> pre-block:
> - Add the vCPU to the blocked per-CPU list
> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>
> post-block:
> - Remove the vCPU from the per-CPU list

I'm wondering what happens if a posted interrupt arrives at the IOMMU
after pre-block and before post-block.

In pre_block, NV is set to POSTED_INTR_WAKEUP_VECTOR. IIUC, this means
future posted interrupts will not trigger "Posted-Interrupt Processing"
(PIR will not get copied to VIRR). Instead, the IOMMU will do ON := 1,
PIR |= (1 << vector), and send POSTED_INTR_WAKEUP_VECTOR. PIWV calls
wakeup_handler which does kvm_vcpu_kick. kvm_vcpu_kick does a wait-queue
wakeup and possibly a scheduler ipi.

But the VCPU is sitting in kvm_vcpu_block. It spins and/or schedules
(wait queue) until it has a reason to wake up. I couldn't find a code
path from kvm_vcpu_block that lead to checking ON or PIR. How does the
blocked VCPU "receive" the posted interrupt? (And when does Posted-
Interrupt Processing get triggered?)

Thanks!

>
> Signed-off-by: Feng Wu 
> ---
> v9:
> - Add description for blocked_vcpu_on_cpu_lock in 
> Documentation/virtual/kvm/locking.txt
> - Check !kvm_arch_has_assigned_device(vcpu->kvm) first, then
>   !irq_remapping_cap(IRQ_POSTING_CAP)
>
> v8:
> - Rename 'pi_pre_block' to 'pre_block'
> - Rename 'pi_post_block' to 'post_block'
> - Change some comments
> - Only add the vCPU to the blocking list when the VM has assigned devices.
>
>  Documentation/virtual/kvm/locking.txt |  12 +++
>  arch/x86/include/asm/kvm_host.h   |  13 +++
>  arch/x86/kvm/vmx.c| 153 
> ++
>  arch/x86/kvm/x86.c|  53 +---
>  include/linux/kvm_host.h  |   3 +
>  virt/kvm/kvm_main.c   |   3 +
>  6 files changed, 227 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/locking.txt 
> b/Documentation/virtual/kvm/locking.txt
> index d68af4d..19f94a6 100644
> --- a/Documentation/virtual/kvm/locking.txt
> +++ b/Documentation/virtual/kvm/locking.txt
> @@ -166,3 +166,15 @@ Comment:   The srcu read lock must be held while 
> accessing memslots (e.g.
> MMIO/PIO address->device structure mapping (kvm->buses).
> The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
> if it is needed by multiple functions.
> +
> +Name:  blocked_vcpu_on_cpu_lock
> +Type:  spinlock_t
> +Arch:  x86
> +Protects:  blocked_vcpu_on_cpu
> +Comment:   This is a per-CPU lock and it is used for VT-d 
> posted-interrupts.
> +   When VT-d posted-interrupts is supported and the VM has 
> assigned
> +   devices, we put the blocked vCPU on the list 
> blocked_vcpu_on_cpu
> +   protected by blocked_vcpu_on_cpu_lock, when VT-d hardware 
> issues
> +   wakeup notification event since external interrupts from the
> +   assigned devices happens, we will find the vCPU on the list to
> +   wakeup.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0ddd353..304fbb5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -552,6 +552,8 @@ struct kvm_vcpu_arch {
>  */
> bool write_fault_to_shadow_pgtable;
>
> +   bool halted;
> +
> /* set at EPT violation at this point */
> unsigned long exit_qualification;
>
> @@ -864,6 +866,17 @@ struct kvm_x86_ops {
> /* pmu operations of sub-arch */
> const struct kvm_pmu_ops *pmu_ops;
>
> +   /*
> +* Architecture specific hooks for vCPU blocking due to
> +* HLT instruction.
> +* Returns for .pre_block():
> +*- 0 means continue to block the vCPU.
> +*- 1 means we cannot block the vCPU since some event
> +*happens during this period, such as, 'ON' bit in
> +*posted-interrupts descriptor is set.
> +*/
> +   int (*pre_block)(struct kvm_vcpu *vcpu);
> +   void (*post_block)(struct kvm_vcpu *vcpu);
> int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
>   uint32_t guest_irq, bool set);
>  };
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 902a67d..9968896 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -879,6 +879,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
>  static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
>  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
>
> +/*
> + * We maintian a per-CPU linked-list of vCPU, so in wakeup_handler() we
> + * can find which vCPU should be waken up.
> + */
> +static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> +static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> +
>  static unsigned long *

Re: [PATCH 04/12] KVM: x86: Replace call-back set_tsc_khz() with a common function

2015-10-05 Thread David Matlack
On Mon, Oct 5, 2015 at 12:53 PM, Radim Krčmář  wrote:
> 2015-09-28 13:38+0800, Haozhong Zhang:
>> Both VMX and SVM propagate virtual_tsc_khz in the same way, so this
>> patch removes the call-back set_tsc_khz() and replaces it with a common
>> function.
>>
>> Signed-off-by: Haozhong Zhang 
>> ---
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> +static void set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
>> +{
>> + u64 ratio, khz;
> | [...]
>> + khz = user_tsc_khz;
>
> I'd use "user_tsc_khz" directly.
>
>> + /* TSC scaling required  - calculate ratio */
>> + shift = (kvm_tsc_scaling_ratio_frac_bits <= 32) ?
>> + kvm_tsc_scaling_ratio_frac_bits : 32;
>> + ratio = khz << shift;
>> + do_div(ratio, tsc_khz);
>> + ratio <<= (kvm_tsc_scaling_ratio_frac_bits - shift);
>
> VMX is losing 16 bits by this operation;  normal fixed point division
> could get us a smaller drift (and an one-liner here) ...
> at 4.3 GHz, 32 instead of 48 bits after decimal point translate to one
> "lost" TSC tick per second, in the worst case.

We can easily avoid losing precision on x86_64 (divq allows a 128-bit
dividend). 32-bit can just lose the 16 bits of precision (TSC scaling
is only available on SkyLake, and I'd be surprised if there were
many hosts running KVM in protected mode on SkyLake :)).
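
For illustration, the full-precision version is just this (user-space
style sketch, leaning on gcc's unsigned __int128 instead of whatever
helper the kernel code would actually use):

#include <stdint.h>

static inline uint64_t tsc_ratio(uint32_t user_tsc_khz, uint32_t host_tsc_khz,
                                 unsigned int frac_bits)
{
        unsigned __int128 dividend =
                (unsigned __int128)user_tsc_khz << frac_bits;

        /* Keep all frac_bits of precision; the quotient is assumed to
         * fit in 64 bits for sane guest/host frequency ratios. */
        return (uint64_t)(dividend / host_tsc_khz);
}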

>
> Please mention that we are truncating on purpose :)
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] KVM: add halt_attempted_poll to VCPU stats

2015-09-15 Thread David Matlack
On Tue, Sep 15, 2015 at 9:27 AM, Paolo Bonzini  wrote:
> This new statistic can help diagnosing VCPUs that, for any reason,
> trigger bad behavior of halt_poll_ns autotuning.
>
> For example, say halt_poll_ns = 48, and wakeups are spaced exactly
> like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
> 10+20+40+80+160+320+480 = 1110 microseconds out of every
> 479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
> is consuming about 30% more CPU than it would use without
> polling.  This would show as an abnormally high number of
> attempted polling compared to the successful polls.

Reviewed-by: David Matlack 

>
> Cc: Christian Borntraeger  Cc: David Matlack 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/arm/include/asm/kvm_host.h | 1 +
>  arch/arm64/include/asm/kvm_host.h   | 1 +
>  arch/mips/include/asm/kvm_host.h| 1 +
>  arch/mips/kvm/mips.c| 1 +
>  arch/powerpc/include/asm/kvm_host.h | 1 +
>  arch/powerpc/kvm/book3s.c   | 1 +
>  arch/powerpc/kvm/booke.c| 1 +
>  arch/s390/include/asm/kvm_host.h| 1 +
>  arch/s390/kvm/kvm-s390.c| 1 +
>  arch/x86/include/asm/kvm_host.h | 1 +
>  arch/x86/kvm/x86.c  | 1 +
>  virt/kvm/kvm_main.c | 1 +
>  12 files changed, 12 insertions(+)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index dcba0fa5176e..687ddeba3611 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -148,6 +148,7 @@ struct kvm_vm_stat {
>
>  struct kvm_vcpu_stat {
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
>  };
>
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 415938dc45cf..486594583cc6 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -195,6 +195,7 @@ struct kvm_vm_stat {
>
>  struct kvm_vcpu_stat {
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
>  };
>
> diff --git a/arch/mips/include/asm/kvm_host.h 
> b/arch/mips/include/asm/kvm_host.h
> index e8c8d9d0c45f..3a54dbca9f7e 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -128,6 +128,7 @@ struct kvm_vcpu_stat {
> u32 msa_disabled_exits;
> u32 flush_dcache_exits;
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
>  };
>
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index cd4c129ce743..49ff3bfc007e 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -55,6 +55,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "msa_disabled", VCPU_STAT(msa_disabled_exits), KVM_STAT_VCPU },
> { "flush_dcache", VCPU_STAT(flush_dcache_exits), KVM_STAT_VCPU },
> { "halt_successful_poll", VCPU_STAT(halt_successful_poll), 
> KVM_STAT_VCPU },
> +   { "halt_attempted_poll", VCPU_STAT(halt_attempted_poll), 
> KVM_STAT_VCPU },
> { "halt_wakeup",  VCPU_STAT(halt_wakeup),KVM_STAT_VCPU },
> {NULL}
>  };
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index 98eebbf66340..195886a583ba 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -108,6 +108,7 @@ struct kvm_vcpu_stat {
> u32 dec_exits;
> u32 ext_intr_exits;
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
> u32 dbell_exits;
> u32 gdbell_exits;
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index d75bf325f54a..cf009167d208 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -53,6 +53,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "ext_intr",VCPU_STAT(ext_intr_exits) },
> { "queue_intr",  VCPU_STAT(queue_intr) },
> { "halt_successful_poll", VCPU_STAT(halt_successful_poll), },
> +   { "halt_attempted_poll", VCPU_STAT(halt_attempted_poll), },
> { "halt_wakeup", VCPU_STAT(halt_wakeup) },
> { "pf_storage",  VCPU_STAT(pf_storage) },
> { "sp_storage",  VCPU_STAT(sp_storage) },
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index ae458f0fd061..fd5875179e5c 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -63,6 +63,7 @@ struct kvm_stats_debugfs_item debugfs_e

Re: [PATCH] staging: slicoss: remove unused variables

2015-09-04 Thread David Matlack
On Fri, Sep 4, 2015 at 6:23 AM, Sudip Mukherjee
 wrote:
> These variables were only assigned some values but they were never used.
>
> Signed-off-by: Sudip Mukherjee 
> ---
>  drivers/staging/slicoss/slicoss.c | 27 ++-
>  1 file changed, 6 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/staging/slicoss/slicoss.c 
> b/drivers/staging/slicoss/slicoss.c
> index 8585970..1536ca0 100644
> --- a/drivers/staging/slicoss/slicoss.c
> +++ b/drivers/staging/slicoss/slicoss.c
> @@ -199,10 +199,8 @@ static void slic_mcast_set_mask(struct adapter *adapter)
>  static void slic_timer_ping(ulong dev)
>  {
> struct adapter *adapter;
> -   struct sliccard *card;
>
> adapter = netdev_priv((struct net_device *)dev);
> -   card = adapter->card;
>
> adapter->pingtimer.expires = jiffies + (PING_TIMER_INTERVAL * HZ);
> add_timer(&adapter->pingtimer);
> @@ -1719,7 +1717,6 @@ static u32 slic_rcvqueue_reinsert(struct adapter 
> *adapter, struct sk_buff *skb)
>   */
>  static void slic_link_event_handler(struct adapter *adapter)
>  {
> -   int status;
> struct slic_shmem *pshmem;
>
> if (adapter->state != ADAPT_UP) {
> @@ -1730,15 +1727,13 @@ static void slic_link_event_handler(struct adapter 
> *adapter)
> pshmem = (struct slic_shmem *)(unsigned long)adapter->phys_shmem;
>
>  #if BITS_PER_LONG == 64
> -   status = slic_upr_request(adapter,
> - SLIC_UPR_RLSR,
> - SLIC_GET_ADDR_LOW(&pshmem->linkstatus),
> - SLIC_GET_ADDR_HIGH(&pshmem->linkstatus),
> - 0, 0);
> +   slic_upr_request(adapter, SLIC_UPR_RLSR,
> +SLIC_GET_ADDR_LOW(&pshmem->linkstatus),
> +SLIC_GET_ADDR_HIGH(&pshmem->linkstatus), 0, 0);
>  #else
> -   status = slic_upr_request(adapter, SLIC_UPR_RLSR,
> -   (u32) &pshmem->linkstatus,  /* no 4GB wrap guaranteed */
> - 0, 0, 0);
> +   slic_upr_request(adapter, SLIC_UPR_RLSR,
> +(u32)&pshmem->linkstatus, /* no 4GB wrap guaranteed 
> */
> +0, 0, 0);

Is status safe to ignore?
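If not, a sketch of the alternative (assuming the adapter keeps its
pci_dev in adapter->pcidev, as elsewhere in the driver; the message text
is illustrative):

	status = slic_upr_request(adapter, SLIC_UPR_RLSR,
				  SLIC_GET_ADDR_LOW(&pshmem->linkstatus),
				  SLIC_GET_ADDR_HIGH(&pshmem->linkstatus),
				  0, 0);
	if (status)
		dev_err(&adapter->pcidev->dev,
			"link status request failed: %d\n", status);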

>  #endif
>  }
>
> @@ -2078,8 +2073,6 @@ static void slic_interrupt_card_up(u32 isr, struct 
> adapter *adapter,
> adapter->error_interrupts++;
> if (isr & ISR_RMISS) {
> int count;
> -   int pre_count;
> -   int errors;
>
> struct slic_rcvqueue *rcvq =
> &adapter->rcvqueue;
> @@ -2088,8 +2081,6 @@ static void slic_interrupt_card_up(u32 isr, struct 
> adapter *adapter,
>
> if (!rcvq->errors)
> rcv_count = rcvq->count;
> -   pre_count = rcvq->count;
> -   errors = rcvq->errors;
>
> while (rcvq->count < SLIC_RCVQ_FILLTHRESH) {
> count = slic_rcvqueue_fill(adapter);
> @@ -2650,9 +2641,7 @@ static int slic_card_init(struct sliccard *card, struct 
> adapter *adapter)
> ushort calc_chksum;
> struct slic_config_mac *pmac;
> unsigned char fruformat;
> -   unsigned char oemfruformat;
> struct atk_fru *patkfru;
> -   union oemfru *poemfru;
> unsigned long flags;
>
> /* Reset everything except PCI configuration space */
> @@ -2742,8 +2731,6 @@ static int slic_card_init(struct sliccard *card, struct 
> adapter *adapter)
> pmac = pOeeprom->MacInfo;
> fruformat = pOeeprom->FruFormat;
> patkfru = &pOeeprom->AtkFru;
> -   oemfruformat = pOeeprom->OemFruFormat;
> -   poemfru = &pOeeprom->OemFru;
> macaddrs = 2;
> /* Minor kludge for Oasis card
>  get 2 MAC addresses from the
> @@ -2757,8 +2744,6 @@ static int slic_card_init(struct sliccard *card, struct 
> adapter *adapter)
> pmac = peeprom->u2.mac.MacInfo;
> fruformat = peeprom->FruFormat;
> patkfru = &peeprom->AtkFru;
> -   oemfruformat = peeprom->OemFruFormat;
> -   poemfru = &peeprom->OemFru;
> break;
> }
>
> --
> 1.9.1
>

Re: [PATCH v6 2/3] KVM: dynamic halt_poll_ns adjustment

2015-09-03 Thread David Matlack
On Thu, Sep 3, 2015 at 2:23 AM, Wanpeng Li  wrote:
>
> How about something like:
>
> @@ -1941,10 +1976,14 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  */
> if (kvm_vcpu_check_block(vcpu) < 0) {
> ++vcpu->stat.halt_successful_poll;
> -   goto out;
> +   break;
> }
> cur = ktime_get();
> } while (single_task_running() && ktime_before(cur, stop));
> +
> +   poll_ns = ktime_to_ns(cur) - ktime_to_ns(start);
> +   if (ktime_before(cur, stop) && single_task_running())
> +   goto out;

I would prefer an explicit signal (e.g. set a bool to true before breaking out
of the loop, and check it here) to avoid duplicating the loop exit condition.
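Something like this, just as a sketch (the flag name is illustrative):

	bool polled = false;

	do {
		/* This sets KVM_REQ_UNHALT if an interrupt arrives. */
		if (kvm_vcpu_check_block(vcpu) < 0) {
			++vcpu->stat.halt_successful_poll;
			polled = true;
			break;
		}
		cur = ktime_get();
	} while (single_task_running() && ktime_before(cur, stop));

	poll_ns = ktime_to_ns(cur) - ktime_to_ns(start);
	if (polled)
		goto out;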


Re: [PATCH v6 2/3] KVM: dynamic halt_poll_ns adjustment

2015-09-02 Thread David Matlack
On Wed, Sep 2, 2015 at 12:12 PM, Paolo Bonzini  wrote:
>
>
> On 02/09/2015 20:09, David Matlack wrote:
>> On Wed, Sep 2, 2015 at 12:29 AM, Wanpeng Li  wrote:
>>> There is a downside of always-poll since poll is still happened for idle
>>> vCPUs which can waste cpu usage. This patch adds the ability to adjust
>>> halt_poll_ns dynamically, to grow halt_poll_ns when shot halt is detected,
>>> and to shrink halt_poll_ns when long halt is detected.
>>>
>>> There are two new kernel parameters for changing the halt_poll_ns:
>>> halt_poll_ns_grow and halt_poll_ns_shrink.
>>>
>>>                         no-poll   always-poll   dynamic-poll
>>> -------------------------------------------------------------
>>> Idle (nohz) vCPU %c0    0.15%     0.3%          0.2%
>>> Idle (250HZ) vCPU %c0   1.1%      4.6%~14%      1.2%
>>> TCP_RR latency          34us      27us          26.7us
>>>
>>> "Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
>>> c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
>>> guest was tickless. (250HZ) means the guest was ticking at 250HZ.
>>>
>>> The big win is with ticking operating systems. Running the linux guest
>>> with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
>>> to no-polling overhead levels by using the dynamic-poll. The savings
>>> should be even higher for higher frequency ticks.
>>>
>>> Suggested-by: David Matlack 
>>> Signed-off-by: Wanpeng Li 
>>> ---
>>>  virt/kvm/kvm_main.c | 61 
>>> ++---
>>>  1 file changed, 58 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index c06e57c..3cff02f 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -66,9 +66,18 @@
>>>  MODULE_AUTHOR("Qumranet");
>>>  MODULE_LICENSE("GPL");
>>>
>>> -static unsigned int halt_poll_ns;
>>> +/* halt polling only reduces halt latency by 5-7 us, 500us is enough */
>>> +static unsigned int halt_poll_ns = 50;
>>>  module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
>>>
>>> +/* Default doubles per-vcpu halt_poll_ns. */
>>> +static unsigned int halt_poll_ns_grow = 2;
>>> +module_param(halt_poll_ns_grow, int, S_IRUGO);
>>> +
>>> +/* Default resets per-vcpu halt_poll_ns . */
>>> +static unsigned int halt_poll_ns_shrink;
>>> +module_param(halt_poll_ns_shrink, int, S_IRUGO);
>>> +
>>>  /*
>>>   * Ordering of locks:
>>>   *
>>> @@ -1907,6 +1916,31 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
>>> gfn_t gfn)
>>>  }
>>>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>>>
>>> +static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
>>> +{
>>> +   int val = vcpu->halt_poll_ns;
>>> +
>>> +   /* 10us base */
>>> +   if (val == 0 && halt_poll_ns_grow)
>>> +   val = 1;
>>> +   else
>>> +   val *= halt_poll_ns_grow;
>>> +
>>> +   vcpu->halt_poll_ns = val;
>>> +}
>>> +
>>> +static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
>>> +{
>>> +   int val = vcpu->halt_poll_ns;
>>> +
>>> +   if (halt_poll_ns_shrink == 0)
>>> +   val = 0;
>>> +   else
>>> +   val /= halt_poll_ns_shrink;
>>> +
>>> +   vcpu->halt_poll_ns = val;
>>> +}
>>> +
>>>  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>>>  {
>>> if (kvm_arch_vcpu_runnable(vcpu)) {
>>> @@ -1929,6 +1963,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>> ktime_t start, cur;
>>> DEFINE_WAIT(wait);
>>> bool waited = false;
>>> +   u64 poll_ns = 0, wait_ns = 0, block_ns = 0;
>>>
>>> start = cur = ktime_get();
>>> if (vcpu->halt_poll_ns) {
>>> @@ -1941,10 +1976,15 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>  */
>>> if (kvm_vcpu_check_block(vcpu) < 0) {
>>> ++vcpu->stat.halt_successful_poll;
>>> -   goto out;
>>> +   break;
>

Re: [PATCH v6 0/3] KVM: Dynamic Halt-Polling

2015-09-02 Thread David Matlack
On Wed, Sep 2, 2015 at 12:29 AM, Wanpeng Li  wrote:
> v5 -> v6:
>  * fix wait_ns and poll_ns

Thanks for bearing with me through all the reviews. I think it's on the
verge of being done :). There are just a few small things to fix.

>
> v4 -> v5:
>  * set base case 10us and max poll time 500us
>  * handle short/long halt, idea from David, many thanks David ;-)
>
> v3 -> v4:
>  * bring back grow vcpu->halt_poll_ns when interrupt arrives and shrinks
>when idle VCPU is detected
>
> v2 -> v3:
>  * grow/shrink vcpu->halt_poll_ns by *halt_poll_ns_grow or 
> /halt_poll_ns_shrink
>  * drop the macros and hard coding the numbers in the param definitions
>  * update the comments "5-7 us"
>  * remove halt_poll_ns_max and use halt_poll_ns as the max halt_poll_ns time,
>vcpu->halt_poll_ns start at zero
>  * drop the wrappers
>  * move the grow/shrink logic before "out:" w/ "if (waited)"
>
> v1 -> v2:
>  * change kvm_vcpu_block to read halt_poll_ns from the vcpu instead of
>the module parameter
>  * use the shrink/grow matrix which is suggested by David
>  * set halt_poll_ns_max to 2ms
>
> There is a downside of always-poll since poll is still happened for idle
> vCPUs which can waste cpu usage. This patchset add the ability to adjust
> halt_poll_ns dynamically, to grow halt_poll_ns when shot halt is detected,
> and to shrink halt_poll_ns when long halt is detected.
>
> There are two new kernel parameters for changing the halt_poll_ns:
> halt_poll_ns_grow and halt_poll_ns_shrink.
>
>                         no-poll   always-poll   dynamic-poll
> -------------------------------------------------------------
> Idle (nohz) vCPU %c0    0.15%     0.3%          0.2%
> Idle (250HZ) vCPU %c0   1.1%      4.6%~14%      1.2%
> TCP_RR latency          34us      27us          26.7us
>
> "Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
> c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
> guest was tickless. (250HZ) means the guest was ticking at 250HZ.
>
> The big win is with ticking operating systems. Running the linux guest
> with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
> to no-polling overhead levels by using the dynamic-poll. The savings
> should be even higher for higher frequency ticks.
>
>
> Wanpeng Li (3):
>   KVM: make halt_poll_ns per-vCPU
>   KVM: dynamic halt_poll_ns adjustment
>   KVM: trace kvm_halt_poll_ns grow/shrink
>
>  include/linux/kvm_host.h   |  1 +
>  include/trace/events/kvm.h | 30 
>  virt/kvm/kvm_main.c| 70 
> ++
>  3 files changed, 96 insertions(+), 5 deletions(-)
>
> --
> 1.9.1
>


Re: [PATCH v6 3/3] KVM: trace kvm_halt_poll_ns grow/shrink

2015-09-02 Thread David Matlack
On Wed, Sep 2, 2015 at 12:42 AM, Wanpeng Li  wrote:
> Tracepoint for dynamic halt_pool_ns, fired on every potential change.
>
> Signed-off-by: Wanpeng Li 
> ---
>  include/trace/events/kvm.h | 30 ++
>  virt/kvm/kvm_main.c|  8 ++--
>  2 files changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index a44062d..75ddf80 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -356,6 +356,36 @@ TRACE_EVENT(
>   __entry->address)
>  );
>
> +TRACE_EVENT(kvm_halt_poll_ns,
> +   TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old),
> +   TP_ARGS(grow, vcpu_id, new, old),
> +
> +   TP_STRUCT__entry(
> +   __field(bool, grow)
> +   __field(unsigned int, vcpu_id)
> +   __field(int, new)
> +   __field(int, old)
> +   ),
> +
> +   TP_fast_assign(
> +   __entry->grow   = grow;
> +   __entry->vcpu_id= vcpu_id;
> +   __entry->new= new;
> +   __entry->old= old;
> +   ),
> +
> +   TP_printk("vcpu %u: halt_pool_ns %d (%s %d)",

s/pool/poll/

> +   __entry->vcpu_id,
> +   __entry->new,
> +   __entry->grow ? "grow" : "shrink",
> +   __entry->old)
> +);
> +
> +#define trace_kvm_halt_poll_ns_grow(vcpu_id, new, old) \
> +   trace_kvm_halt_poll_ns(true, vcpu_id, new, old)
> +#define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
> +   trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
> +
>  #endif
>
>  #endif /* _TRACE_KVM_MAIN_H */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3cff02f..fee339e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1918,8 +1918,9 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>
>  static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
>  {
> -   int val = vcpu->halt_poll_ns;
> +   int old, val;
>
> +   old = val = vcpu->halt_poll_ns;
> /* 10us base */
> if (val == 0 && halt_poll_ns_grow)
> val = 1;
> @@ -1927,18 +1928,21 @@ static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
> val *= halt_poll_ns_grow;
>
> vcpu->halt_poll_ns = val;
> +   trace_kvm_halt_poll_ns_grow(vcpu->vcpu_id, val, old);
>  }
>
>  static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
>  {
> -   int val = vcpu->halt_poll_ns;
> +   int old, val;
>
> +   old = val = vcpu->halt_poll_ns;
> if (halt_poll_ns_shrink == 0)
> val = 0;
> else
> val /= halt_poll_ns_shrink;
>
> vcpu->halt_poll_ns = val;
> +   trace_kvm_halt_poll_ns_shrink(vcpu->vcpu_id, val, old);
>  }
>
>  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> --
> 1.9.1
>


Re: [PATCH v6 2/3] KVM: dynamic halt_poll_ns adjustment

2015-09-02 Thread David Matlack
On Wed, Sep 2, 2015 at 12:29 AM, Wanpeng Li  wrote:
> There is a downside of always-poll since poll is still happened for idle
> vCPUs which can waste cpu usage. This patch adds the ability to adjust
> halt_poll_ns dynamically, to grow halt_poll_ns when shot halt is detected,
> and to shrink halt_poll_ns when long halt is detected.
>
> There are two new kernel parameters for changing the halt_poll_ns:
> halt_poll_ns_grow and halt_poll_ns_shrink.
>
>                         no-poll   always-poll   dynamic-poll
> -------------------------------------------------------------
> Idle (nohz) vCPU %c0    0.15%     0.3%          0.2%
> Idle (250HZ) vCPU %c0   1.1%      4.6%~14%      1.2%
> TCP_RR latency          34us      27us          26.7us
>
> "Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
> c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
> guest was tickless. (250HZ) means the guest was ticking at 250HZ.
>
> The big win is with ticking operating systems. Running the linux guest
> with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
> to no-polling overhead levels by using the dynamic-poll. The savings
> should be even higher for higher frequency ticks.
>
> Suggested-by: David Matlack 
> Signed-off-by: Wanpeng Li 
> ---
>  virt/kvm/kvm_main.c | 61 
> ++---
>  1 file changed, 58 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c06e57c..3cff02f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -66,9 +66,18 @@
>  MODULE_AUTHOR("Qumranet");
>  MODULE_LICENSE("GPL");
>
> -static unsigned int halt_poll_ns;
> +/* halt polling only reduces halt latency by 5-7 us, 500us is enough */
> +static unsigned int halt_poll_ns = 50;
>  module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
>
> +/* Default doubles per-vcpu halt_poll_ns. */
> +static unsigned int halt_poll_ns_grow = 2;
> +module_param(halt_poll_ns_grow, int, S_IRUGO);
> +
> +/* Default resets per-vcpu halt_poll_ns . */
> +static unsigned int halt_poll_ns_shrink;
> +module_param(halt_poll_ns_shrink, int, S_IRUGO);
> +
>  /*
>   * Ordering of locks:
>   *
> @@ -1907,6 +1916,31 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
> gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>
> +static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
> +{
> +   int val = vcpu->halt_poll_ns;
> +
> +   /* 10us base */
> +   if (val == 0 && halt_poll_ns_grow)
> +   val = 1;
> +   else
> +   val *= halt_poll_ns_grow;
> +
> +   vcpu->halt_poll_ns = val;
> +}
> +
> +static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
> +{
> +   int val = vcpu->halt_poll_ns;
> +
> +   if (halt_poll_ns_shrink == 0)
> +   val = 0;
> +   else
> +   val /= halt_poll_ns_shrink;
> +
> +   vcpu->halt_poll_ns = val;
> +}
> +
>  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>  {
> if (kvm_arch_vcpu_runnable(vcpu)) {
> @@ -1929,6 +1963,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> ktime_t start, cur;
> DEFINE_WAIT(wait);
> bool waited = false;
> +   u64 poll_ns = 0, wait_ns = 0, block_ns = 0;
>
> start = cur = ktime_get();
> if (vcpu->halt_poll_ns) {
> @@ -1941,10 +1976,15 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  */
> if (kvm_vcpu_check_block(vcpu) < 0) {
> ++vcpu->stat.halt_successful_poll;
> -   goto out;
> +   break;
> }
> cur = ktime_get();
> } while (single_task_running() && ktime_before(cur, stop));
> +
> +   if (ktime_before(cur, stop)) {

You can't use 'cur' to tell if the interrupt arrived. single_task_running()
can break us out of the loop before 'stop'.

> +   poll_ns = ktime_to_ns(cur) - ktime_to_ns(start);

Put this line before the if(). block_ns should always include the time
spent polling, even if polling does not succeed.

> +   goto out;
> +   }
> }
>
> for (;;) {
> @@ -1959,9 +1999,24 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>
> finish_wait(&vcpu->wq, &wait);
> cur = ktime_get();
> +   wait_ns = ktime_to_ns(cur) - ktime_to_ns(st

Re: [PATCH v4 0/3] KVM: Dynamic Halt-Polling

2015-09-01 Thread David Matlack
On Tue, Sep 1, 2015 at 5:29 PM, Wanpeng Li  wrote:
> On 9/2/15 7:24 AM, David Matlack wrote:
>>
>> On Tue, Sep 1, 2015 at 3:58 PM, Wanpeng Li  wrote:

>>>
>>> Why this can happen?
>>
>> Ah, probably because I'm missing 9c8fd1ba220 (KVM: x86: optimize delivery
>> of TSC deadline timer interrupt). I don't think the edge case exists in
>> the latest kernel.
>
>
> Yeah, hope we both(include Peter Kieser) can test against latest kvm tree to
> avoid confusing. The reason to introduce the adaptive halt-polling toggle is
> to handle the "edge case" as you mentioned above. So I think we can make
> more efforts improve v4 instead. I will improve v4 to handle short halt
> today. ;-)

That's fine. It's just easier to convey my ideas with a patch. FYI the
other reason for the toggle patch was to add the timer for kvm_vcpu_block,
which I think is the only way to get dynamic halt-polling right. Feel free
to work on top of v4!

>

>>>
>>> Did you test your patch against a windows guest?
>>
>> I have not. I tested against a 250HZ linux guest to check how it performs
>> against a ticking guest. Presumably, windows should be the same, but at a
>> higher tick rate. Do you have a test for Windows?
>
>
> I just test the idle vCPUs usage.
>
>
> V4 for windows 10:
>
> +------------------+----------------+------------------------+
> |  w/o halt-poll   |  w/ halt-poll  | dynamic(v4) halt-poll  |
> +------------------+----------------+------------------------+
> |      ~2.1%       |     ~3.0%      |         ~2.4%          |
> +------------------+----------------+------------------------+

I'm not seeing the same results with v4. With a 250HZ ticking guest
I see 15% c0 with halt_poll_ns=200 and 1.27% with halt_poll_ns=0.
Are you running one vcpu per pcpu?

(The reason for the overhead: the new tracepoint shows each vcpu is
alternating between 0 and 500 us.)

>
> V4  for linux guest:
>
> +------------------+----------------+-------------------+
> |  w/o halt-poll   |  w/ halt-poll  | dynamic halt-poll |
> +------------------+----------------+-------------------+
> |      ~0.9%       |     ~1.8%      |       ~1.2%       |
> +------------------+----------------+-------------------+
>
>
> Regards,
> Wanpeng Li


Re: [PATCH v4 0/3] KVM: Dynamic Halt-Polling

2015-09-01 Thread David Matlack
On Tue, Sep 1, 2015 at 3:58 PM, Wanpeng Li  wrote:
> On 9/2/15 6:34 AM, David Matlack wrote:
>>
>> On Tue, Sep 1, 2015 at 3:30 PM, Wanpeng Li  wrote:
>>>
>>> On 9/2/15 5:45 AM, David Matlack wrote:
>>>>
>>>> On Thu, Aug 27, 2015 at 2:47 AM, Wanpeng Li 
>>>> wrote:
>>>>>
>>>>> v3 -> v4:
>>>>>* bring back grow vcpu->halt_poll_ns when interrupt arrives and
>>>>> shrinks
>>>>>  when idle VCPU is detected
>>>>>
>>>>> v2 -> v3:
>>>>>* grow/shrink vcpu->halt_poll_ns by *halt_poll_ns_grow or
>>>>> /halt_poll_ns_shrink
>>>>>* drop the macros and hard coding the numbers in the param
>>>>> definitions
>>>>>* update the comments "5-7 us"
>>>>>* remove halt_poll_ns_max and use halt_poll_ns as the max
>>>>> halt_poll_ns
>>>>> time,
>>>>>  vcpu->halt_poll_ns start at zero
>>>>>* drop the wrappers
>>>>>* move the grow/shrink logic before "out:" w/ "if (waited)"
>>>>
>>>> I posted a patchset which adds dynamic poll toggling (on/off switch). I
>>>> think
>>>> this gives you a good place to build your dynamic growth patch on top.
>>>> The
>>>> toggling patch has close to zero overhead for idle VMs and equivalent
>>>> performance VMs doing message passing as always-poll. It's a patch
>>>> that's
>>>> been
>>>> in my queue for a few weeks but just haven't had the time to send out.
>>>> We
>>>> can
>>>> win even more with your patchset by only polling as much as we need (via
>>>> dynamic growth/shrink). It also gives us a better place to stand for
>>>> choosing
>>>> a default for halt_poll_ns. (We can run experiments and see how high
>>>> vcpu->halt_poll_ns tends to grow.)
>>>>
>>>> The reason I posted a separate patch for toggling is because it adds
>>>> timers
>>>> to kvm_vcpu_block and deals with a weird edge case (kvm_vcpu_block can
>>>> get
>>>> called multiple times for one halt). To do dynamic poll adjustment
>
>
> Why this can happen?

Ah, probably because I'm missing 9c8fd1ba220 (KVM: x86: optimize delivery
of TSC deadline timer interrupt). I don't think the edge case exists in
the latest kernel.

>
>
>>>> correctly,
>>>> we have to time the length of each halt. Otherwise we hit some bad edge
>>>> cases:
>>>>
>>>> v3: v3 had lots of idle overhead. It's because vcpu->halt_poll_ns
>>>> grew
>>>> every
>>>> time we had a long halt. So idle VMs looked like: 0 us -> 500 us ->
>>>> 1
>>>> ms ->
>>>> 2 ms -> 4 ms -> 0 us. Ideally vcpu->halt_poll_ns should just stay at
>>>> 0
>>>> when
>>>> the halts are long.
>>>>
>>>> v4: v4 fixed the idle overhead problem but broke dynamic growth for
>>>> message
>>>> passing VMs. Every time a VM did a short halt, vcpu->halt_poll_ns
>>>> would
>>>> grow.
>>>> That means vcpu->halt_poll_ns will always be maxed out, even when
>>>> the
>>>> halt
>>>> time is much less than the max.
>>>>
>>>> I think we can fix both edge cases if we make grow/shrink decisions
>>>> based
>>>> on
>>>> the length of kvm_vcpu_block rather than the arrival of a guest
>>>> interrupt
>>>> during polling.
>>>>
>>>> Some thoughts for dynamic growth:
>>>> * Given Windows 10 timer tick (1 ms), let's set the maximum poll
>>>> time
>>>> to
>>>>   less than 1ms. 200 us has been a good value for always-poll. We
>>>> can
>>>>   probably go a bit higher once we have your patch. Maybe 500 us?
>
>
> Did you test your patch against a windows guest?

I have not. I tested against a 250HZ linux guest to check how it performs
against a ticking guest. Presumably, windows should be the same, but at a
higher tick rate. Do you have a test for Windows?

>
>>>>
>>>> * The base case of dynamic growth (the first grow() after being at
>>>> 0)
>>>> should
>>>>   be small. 500 us is too big. When I run TCP_RR in my guest I see
>>>> poll
>>>> times
>>>>   of < 10 us. TCP_RR is on the lower-end of message passing workload
>>>> latency,
>>>>   so 10 us would be a good base case.
>>>
>>>
>>> How to get your TCP_RR benchmark?
>>>
>>> Regards,
>>> Wanpeng Li
>>
>> Install the netperf package, or build from here:
>> http://www.netperf.org/netperf/DownloadNetperf.html
>>
>> In the vm:
>>
>> # ./netserver
>> # ./netperf -t TCP_RR
>>
>> Be sure to use an SMP guest (we want TCP_RR to be a cross-core message
>> passing workload in order to test halt-polling).
>
>
> Ah, ok, I use the same benchmark as yours.
>
> Regards,
> Wanpeng Li
>
>


Re: [PATCH v4 0/3] KVM: Dynamic Halt-Polling

2015-09-01 Thread David Matlack
On Tue, Sep 1, 2015 at 3:30 PM, Wanpeng Li  wrote:
> On 9/2/15 5:45 AM, David Matlack wrote:
>>
>> On Thu, Aug 27, 2015 at 2:47 AM, Wanpeng Li 
>> wrote:
>>>
>>> v3 -> v4:
>>>   * bring back grow vcpu->halt_poll_ns when interrupt arrives and shrinks
>>> when idle VCPU is detected
>>>
>>> v2 -> v3:
>>>   * grow/shrink vcpu->halt_poll_ns by *halt_poll_ns_grow or
>>> /halt_poll_ns_shrink
>>>   * drop the macros and hard coding the numbers in the param definitions
>>>   * update the comments "5-7 us"
>>>   * remove halt_poll_ns_max and use halt_poll_ns as the max halt_poll_ns
>>> time,
>>> vcpu->halt_poll_ns start at zero
>>>   * drop the wrappers
>>>   * move the grow/shrink logic before "out:" w/ "if (waited)"
>>
>> I posted a patchset which adds dynamic poll toggling (on/off switch). I
>> think
>> this gives you a good place to build your dynamic growth patch on top. The
>> toggling patch has close to zero overhead for idle VMs and equivalent
>> performance VMs doing message passing as always-poll. It's a patch that's
>> been
>> in my queue for a few weeks but just haven't had the time to send out. We
>> can
>> win even more with your patchset by only polling as much as we need (via
>> dynamic growth/shrink). It also gives us a better place to stand for
>> choosing
>> a default for halt_poll_ns. (We can run experiments and see how high
>> vcpu->halt_poll_ns tends to grow.)
>>
>> The reason I posted a separate patch for toggling is because it adds
>> timers
>> to kvm_vcpu_block and deals with a weird edge case (kvm_vcpu_block can get
>> called multiple times for one halt). To do dynamic poll adjustment
>> correctly,
>> we have to time the length of each halt. Otherwise we hit some bad edge
>> cases:
>>
>>v3: v3 had lots of idle overhead. It's because vcpu->halt_poll_ns grew
>> every
>>time we had a long halt. So idle VMs looked like: 0 us -> 500 us -> 1
>> ms ->
>>2 ms -> 4 ms -> 0 us. Ideally vcpu->halt_poll_ns should just stay at 0
>> when
>>the halts are long.
>>
>>v4: v4 fixed the idle overhead problem but broke dynamic growth for
>> message
>>passing VMs. Every time a VM did a short halt, vcpu->halt_poll_ns would
>> grow.
>>That means vcpu->halt_poll_ns will always be maxed out, even when the
>> halt
>>time is much less than the max.
>>
>> I think we can fix both edge cases if we make grow/shrink decisions based
>> on
>> the length of kvm_vcpu_block rather than the arrival of a guest interrupt
>> during polling.
>>
>> Some thoughts for dynamic growth:
>>* Given Windows 10 timer tick (1 ms), let's set the maximum poll time
>> to
>>  less than 1ms. 200 us has been a good value for always-poll. We can
>>  probably go a bit higher once we have your patch. Maybe 500 us?
>>
>>* The base case of dynamic growth (the first grow() after being at 0)
>> should
>>  be small. 500 us is too big. When I run TCP_RR in my guest I see poll
>> times
>>  of < 10 us. TCP_RR is on the lower-end of message passing workload
>> latency,
>>  so 10 us would be a good base case.
>
>
> How to get your TCP_RR benchmark?
>
> Regards,
> Wanpeng Li

Install the netperf package, or build from here:
http://www.netperf.org/netperf/DownloadNetperf.html

In the vm:

# ./netserver
# ./netperf -t TCP_RR

Be sure to use an SMP guest (we want TCP_RR to be a cross-core message
passing workload in order to test halt-polling).

>
>
>>> v1 -> v2:
>>>   * change kvm_vcpu_block to read halt_poll_ns from the vcpu instead of
>>> the module parameter
>>>   * use the shrink/grow matrix which is suggested by David
>>>   * set halt_poll_ns_max to 2ms
>>>
>>> There is a downside of halt_poll_ns since poll is still happen for idle
>>> VCPU which can waste cpu usage. This patchset add the ability to adjust
>>> halt_poll_ns dynamically, grows halt_poll_ns if an interrupt arrives and
>>> shrinks halt_poll_ns when idle VCPU is detected.
>>>
>>> There are two new kernel parameters for changing the halt_poll_ns:
>>> halt_poll_ns_grow and halt_poll_ns_shrink.
>>>
>>>
>>> Test w/ high cpu overcommit ratio, pin vCPUs, and the halt_poll_ns of
>>> halt-poll is the default 50ns, the max halt_poll_ns of dynamic
>>> halt-

Re: [PATCH v4 0/3] KVM: Dynamic Halt-Polling

2015-09-01 Thread David Matlack
On Thu, Aug 27, 2015 at 2:47 AM, Wanpeng Li  wrote:
> v3 -> v4:
>  * bring back grow vcpu->halt_poll_ns when interrupt arrives and shrinks
>when idle VCPU is detected
>
> v2 -> v3:
>  * grow/shrink vcpu->halt_poll_ns by *halt_poll_ns_grow or 
> /halt_poll_ns_shrink
>  * drop the macros and hard coding the numbers in the param definitions
>  * update the comments "5-7 us"
>  * remove halt_poll_ns_max and use halt_poll_ns as the max halt_poll_ns time,
>vcpu->halt_poll_ns start at zero
>  * drop the wrappers
>  * move the grow/shrink logic before "out:" w/ "if (waited)"

I posted a patchset which adds dynamic poll toggling (on/off switch). I think
this gives you a good place to build your dynamic growth patch on top. The
toggling patch has close to zero overhead for idle VMs and equivalent
performance VMs doing message passing as always-poll. It's a patch that's been
in my queue for a few weeks but just haven't had the time to send out. We can
win even more with your patchset by only polling as much as we need (via
dynamic growth/shrink). It also gives us a better place to stand for choosing
a default for halt_poll_ns. (We can run experiments and see how high
vcpu->halt_poll_ns tends to grow.)

The reason I posted a separate patch for toggling is because it adds timers
to kvm_vcpu_block and deals with a weird edge case (kvm_vcpu_block can get
called multiple times for one halt). To do dynamic poll adjustment correctly,
we have to time the length of each halt. Otherwise we hit some bad edge cases:

  v3: v3 had lots of idle overhead. It's because vcpu->halt_poll_ns grew every
  time we had a long halt. So idle VMs looked like: 0 us -> 500 us -> 1 ms ->
  2 ms -> 4 ms -> 0 us. Ideally vcpu->halt_poll_ns should just stay at 0 when
  the halts are long.

  v4: v4 fixed the idle overhead problem but broke dynamic growth for message
  passing VMs. Every time a VM did a short halt, vcpu->halt_poll_ns would grow.
  That means vcpu->halt_poll_ns will always be maxed out, even when the halt
  time is much less than the max.

I think we can fix both edge cases if we make grow/shrink decisions based on
the length of kvm_vcpu_block rather than the arrival of a guest interrupt
during polling.
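Concretely, that means timestamping the whole function, something like
(sketch only):

	start = ktime_get();
	/* ... poll and/or schedule out, as today ... */
	block_ns = ktime_to_ns(ktime_get()) - ktime_to_ns(start);

	/*
	 * Then grow/shrink vcpu->halt_poll_ns based on block_ns, rather
	 * than on whether an interrupt happened to arrive while polling.
	 */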

Some thoughts for dynamic growth:
  * Given Windows 10 timer tick (1 ms), let's set the maximum poll time to
less than 1ms. 200 us has been a good value for always-poll. We can
probably go a bit higher once we have your patch. Maybe 500 us?

  * The base case of dynamic growth (the first grow() after being at 0) should
be small. 500 us is too big. When I run TCP_RR in my guest I see poll times
of < 10 us. TCP_RR is on the lower-end of message passing workload latency,
so 10 us would be a good base case.

>
> v1 -> v2:
>  * change kvm_vcpu_block to read halt_poll_ns from the vcpu instead of
>the module parameter
>  * use the shrink/grow matrix which is suggested by David
>  * set halt_poll_ns_max to 2ms
>
> There is a downside of halt_poll_ns since poll is still happen for idle
> VCPU which can waste cpu usage. This patchset add the ability to adjust
> halt_poll_ns dynamically, grows halt_poll_ns if an interrupt arrives and
> shrinks halt_poll_ns when idle VCPU is detected.
>
> There are two new kernel parameters for changing the halt_poll_ns:
> halt_poll_ns_grow and halt_poll_ns_shrink.
>
>
> Test w/ high cpu overcommit ratio, pin vCPUs, and the halt_poll_ns of
> halt-poll is the default 50ns, the max halt_poll_ns of dynamic
> halt-poll is 2ms. Then watch the %C0 in the dump of Powertop tool.
> The test method is almost from David.
>
> +------------------+----------------+-------------------+
> |  w/o halt-poll   |  w/ halt-poll  | dynamic halt-poll |
> +------------------+----------------+-------------------+
> |      ~0.9%       |     ~1.8%      |       ~1.2%       |
> +------------------+----------------+-------------------+
>
> The always halt-poll will increase ~0.9% cpu usage for idle vCPUs and the
> dynamic halt-poll drop it to ~0.3% which means that reduce the 67% overhead
> introduced by always halt-poll.
>
> Wanpeng Li (3):
>   KVM: make halt_poll_ns per-VCPU
>   KVM: dynamic halt_poll_ns adjustment
>   KVM: trace kvm_halt_poll_ns grow/shrink
>
>  include/linux/kvm_host.h   |  1 +
>  include/trace/events/kvm.h | 30 
>  virt/kvm/kvm_main.c| 50 
> +++---
>  3 files changed, 78 insertions(+), 3 deletions(-)
> --
> 1.9.1
>


[PATCH 0/2] Adaptive halt-polling toggle

2015-09-01 Thread David Matlack
This patchset adds a dynamic on/off switch for polling. This patchset
gets good performance on its own for both idle and Message Passing
workloads.

                        no-poll   always-poll   adaptive-toggle
----------------------------------------------------------------
Idle (nohz) VCPU %c0    0.12      0.32          0.15
Idle (250HZ) VCPU %c0   1.22      6.35          1.27
TCP_RR latency          39 us     25 us         25 us

(3.16 Linux guest, halt_poll_ns=20)

"Idle (X) VCPU %c0" is the percent of time the physical cpu spent in
c0 over 60 seconds (each VCPU is pinned to a PCPU). (nohz) means the
guest was tickless. (250HZ) means the guest was ticking at 250HZ.

The big win is with ticking operating systems. Running the linux guest
with nohz=off (and HZ=250), we save 5% CPUs/second and get close to
no-polling overhead levels by using the adaptive toggle. The savings
should be even higher for higher frequency ticks.

Since we get low idle overhead with polling now, halt_poll_ns defaults
to 20, instead of 0. We can increase halt_poll_ns a bit more once
we have dynamic halt-polling length adjustments (Wanpeng's patch). We
should, however, keep halt_poll_ns below 1 ms since that is the tick
period used by Windows.

David Matlack (1):
  kvm: adaptive halt-polling toggle

Wanpeng Li (1):
  KVM: make halt_poll_ns per-VCPU

 include/linux/kvm_host.h   |   1 +
 include/trace/events/kvm.h |  23 ++
 virt/kvm/kvm_main.c| 111 ++---
 3 files changed, 99 insertions(+), 36 deletions(-)

-- 
2.5.0.457.gab17608



[PATCH 2/2] kvm: adaptive halt-polling toggle

2015-09-01 Thread David Matlack
This patch removes almost all of the overhead of polling for idle VCPUs
by disabling polling for long halts. The length of the previous halt
is used as a predictor for the current halt:

  if (length of previous halt < halt_poll_ns): poll for halt_poll_ns
  else: don't poll
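In C the predictor boils down to roughly this (sketch; the field holding
the previous halt length is illustrative):

	/* vcpu->last_block_ns: how long the previous halt blocked for */
	if (vcpu->last_block_ns < vcpu->halt_poll_ns)
		poll_budget_ns = vcpu->halt_poll_ns;	/* short halts: poll */
	else
		poll_budget_ns = 0;			/* long halts: skip polling */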

This tends to work well in practice. For VMs running Message Passing
workloads, all halts are short and so the VCPU should always poll. When
a VCPU is idle, all halts are long and so the VCPU should never poll.
Experimental results on an IvyBridge host show adaptive toggling gets
close to the best of both worlds.

                        no-poll   always-poll   adaptive-toggle
----------------------------------------------------------------
Idle (nohz) VCPU %c0    0.12      0.32          0.15
Idle (250HZ) VCPU %c0   1.22      6.35          1.27
TCP_RR latency          39 us     25 us         25 us

(3.16 Linux guest, halt_poll_ns=20)

The big win is with ticking operating systems. Running the linux guest
with nohz=off (and HZ=250), we save 5% CPU/second and get close to
no-polling overhead levels by using the adaptive toggle. The savings
should be even higher for higher frequency ticks.

Signed-off-by: David Matlack 
---
 include/trace/events/kvm.h |  23 ++
 virt/kvm/kvm_main.c| 110 ++---
 2 files changed, 97 insertions(+), 36 deletions(-)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index a44062d..34e0b11 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -38,22 +38,27 @@ TRACE_EVENT(kvm_userspace_exit,
 );
 
 TRACE_EVENT(kvm_vcpu_wakeup,
-   TP_PROTO(__u64 ns, bool waited),
-   TP_ARGS(ns, waited),
+   TP_PROTO(bool poll, bool success, __u64 poll_ns, __u64 wait_ns),
+   TP_ARGS(poll, success, poll_ns, wait_ns),
 
TP_STRUCT__entry(
-   __field(__u64,  ns  )
-   __field(bool,   waited  )
+   __field( bool,  poll)
+   __field( bool,  success )
+   __field(__u64,  poll_ns )
+   __field(__u64,  wait_ns )
),
 
TP_fast_assign(
-   __entry->ns = ns;
-   __entry->waited = waited;
+   __entry->poll   = poll;
+   __entry->success= success;
+   __entry->poll_ns= poll_ns;
+   __entry->wait_ns= wait_ns;
),
 
-   TP_printk("%s time %lld ns",
- __entry->waited ? "wait" : "poll",
- __entry->ns)
+   TP_printk("%s %s, poll ns %lld, wait ns %lld",
+ __entry->poll ? "poll" : "wait",
+ __entry->success ? "success" : "fail",
+ __entry->poll_ns, __entry->wait_ns)
 );
 
 #if defined(CONFIG_HAVE_KVM_IRQFD)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 977ffb1..3a66694 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -66,7 +66,8 @@
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
 
-static unsigned int halt_poll_ns;
+/* The maximum amount of time a vcpu will poll for interrupts while halted. */
+static unsigned int halt_poll_ns = 20;
 module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
 
 /*
@@ -1907,6 +1908,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
+/* This sets KVM_REQ_UNHALT if an interrupt arrives. */
 static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
 {
if (kvm_arch_vcpu_runnable(vcpu)) {
@@ -1921,47 +1923,101 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
return 0;
 }
 
-/*
- * The vCPU has executed a HLT instruction with in-kernel mode enabled.
- */
-void kvm_vcpu_block(struct kvm_vcpu *vcpu)
+static void
+update_vcpu_block_predictor(struct kvm_vcpu *vcpu, u64 poll_ns, u64 wait_ns)
 {
-   ktime_t start, cur;
-   DEFINE_WAIT(wait);
-   bool waited = false;
-
-   start = cur = ktime_get();
-   if (vcpu->halt_poll_ns) {
-   ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
-
-   do {
-   /*
-* This sets KVM_REQ_UNHALT if an interrupt
-* arrives.
-*/
-   if (kvm_vcpu_check_block(vcpu) < 0) {
-   ++vcpu->stat.halt_successful_poll;
-   goto out;
-   }
-   cur = ktime_get();
-   } while (single_task_running() && ktime_before(cur, stop));
+   u64 block_ns = pol

[PATCH 1/2] KVM: make halt_poll_ns per-VCPU

2015-09-01 Thread David Matlack
From: Wanpeng Li 

Change halt_poll_ns into per-VCPU variable, seeded from module parameter,
to allow greater flexibility.

Signed-off-by: Wanpeng Li 
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c  | 5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 05e99b8..382cbef 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -241,6 +241,7 @@ struct kvm_vcpu {
int sigset_active;
sigset_t sigset;
struct kvm_vcpu_stat stat;
+   unsigned int halt_poll_ns;
 
 #ifdef CONFIG_HAS_IOMEM
int mmio_needed;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b8a444..977ffb1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -217,6 +217,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, 
unsigned id)
vcpu->kvm = kvm;
vcpu->vcpu_id = id;
vcpu->pid = NULL;
+   vcpu->halt_poll_ns = 0;
init_waitqueue_head(&vcpu->wq);
kvm_async_pf_vcpu_init(vcpu);
 
@@ -1930,8 +1931,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
bool waited = false;
 
start = cur = ktime_get();
-   if (halt_poll_ns) {
-   ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);
+   if (vcpu->halt_poll_ns) {
+   ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
 
do {
/*
-- 
2.5.0.457.gab17608



Re: [PATCH v2 2/3] KVM: dynamic halt_poll_ns adjustment

2015-08-27 Thread David Matlack
On Thu, Aug 27, 2015 at 2:59 AM, Wanpeng Li  wrote:
> Hi David,
> On 8/26/15 1:19 AM, David Matlack wrote:
>>
>> Thanks for writing v2, Wanpeng.
>>
>> On Mon, Aug 24, 2015 at 11:35 PM, Wanpeng Li 
>> wrote:
>>>
>>> There is a downside of halt_poll_ns since poll is still happen for idle
>>> VCPU which can waste cpu usage. This patch adds the ability to adjust
>>> halt_poll_ns dynamically.
>>
>> What testing have you done with these patches? Do you know if this removes
>> the overhead of polling in idle VCPUs? Do we lose any of the performance
>> from always polling?
>>
>>> There are two new kernel parameters for changing the halt_poll_ns:
>>> halt_poll_ns_grow and halt_poll_ns_shrink. A third new parameter,
>>> halt_poll_ns_max, controls the maximal halt_poll_ns; it is internally
>>> rounded down to a closest multiple of halt_poll_ns_grow. The shrink/grow
>>> matrix is suggested by David:
>>>
>>> if (poll successfully for interrupt): stay the same
>>>else if (length of kvm_vcpu_block is longer than halt_poll_ns_max):
>>> shrink
>>>else if (length of kvm_vcpu_block is less than halt_poll_ns_max): grow
>>
>> The way you implemented this wasn't what I expected. I thought you would
>> time
>> the whole function (kvm_vcpu_block). But I like your approach better. It's
>> simpler and [by inspection] does what we want.
>
>
> I see there is more idle vCPUs overhead w/ this method even more than always
> halt-poll, so I bring back grow vcpu->halt_poll_ns when interrupt arrives
> and shrinks when idle VCPU is detected. The perfomance looks good in v4.

Why did this patch have a worse idle overhead than always poll?

>
> Regards,
> Wanpeng Li
>
>
>>
>>>halt_poll_ns_shrink/ |
>>>halt_poll_ns_grow| grow halt_poll_ns| shrink halt_poll_ns
>>>-+--+---
>>>< 1  |  = halt_poll_ns  |  = 0
>>>< halt_poll_ns   | *= halt_poll_ns_grow | /= halt_poll_ns_shrink
>>>otherwise| += halt_poll_ns_grow | -= halt_poll_ns_shrink
>>
>> I was curious why you went with this approach rather than just the
>> middle row, or just the last row. Do you think we'll want the extra
>> flexibility?
>>
>>> Signed-off-by: Wanpeng Li 
>>> ---
>>>   virt/kvm/kvm_main.c | 65
>>> -
>>>   1 file changed, 64 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index 93db833..2a4962b 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -66,9 +66,26 @@
>>>   MODULE_AUTHOR("Qumranet");
>>>   MODULE_LICENSE("GPL");
>>>
>>> -static unsigned int halt_poll_ns;
>>> +#define KVM_HALT_POLL_NS  50
>>> +#define KVM_HALT_POLL_NS_GROW   2
>>> +#define KVM_HALT_POLL_NS_SHRINK 0
>>> +#define KVM_HALT_POLL_NS_MAX 200
>>
>> The macros are not necessary. Also, hard coding the numbers in the param
>> definitions will make reading the comments above them easier.
>>
>>> +
>>> +static unsigned int halt_poll_ns = KVM_HALT_POLL_NS;
>>>   module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
>>>
>>> +/* Default doubles per-vcpu halt_poll_ns. */
>>> +static unsigned int halt_poll_ns_grow = KVM_HALT_POLL_NS_GROW;
>>> +module_param(halt_poll_ns_grow, int, S_IRUGO);
>>> +
>>> +/* Default resets per-vcpu halt_poll_ns . */
>>> +static unsigned int halt_poll_ns_shrink = KVM_HALT_POLL_NS_SHRINK;
>>> +module_param(halt_poll_ns_shrink, int, S_IRUGO);
>>> +
>>> +/* halt polling only reduces halt latency by 10-15 us, 2ms is enough */
>>
>> Ah, I misspoke before. I was thinking about round-trip latency. The
>> latency
>> of a single halt is reduced by about 5-7 us.
>>
>>> +static unsigned int halt_poll_ns_max = KVM_HALT_POLL_NS_MAX;
>>> +module_param(halt_poll_ns_max, int, S_IRUGO);
>>
>> We can remove halt_poll_ns_max. vcpu->halt_poll_ns can always start at
>> zero
>> and grow from there. Then we just need one module param to keep
>> vcpu->halt_poll_ns from growing too large.
>>
>> [ It would make more sense to remove halt_poll_ns and keep
>> halt_poll_ns_max,
>>but since halt_poll_ns already exists in upstream kernels, we probably
>> can'

Re: [PATCH v2 2/3] KVM: dynamic halt_poll_ns adjustment

2015-08-25 Thread David Matlack
Thanks for writing v2, Wanpeng.

On Mon, Aug 24, 2015 at 11:35 PM, Wanpeng Li  wrote:
> There is a downside of halt_poll_ns since poll is still happen for idle
> VCPU which can waste cpu usage. This patch adds the ability to adjust
> halt_poll_ns dynamically.

What testing have you done with these patches? Do you know if this removes
the overhead of polling in idle VCPUs? Do we lose any of the performance
from always polling?

>
> There are two new kernel parameters for changing the halt_poll_ns:
> halt_poll_ns_grow and halt_poll_ns_shrink. A third new parameter,
> halt_poll_ns_max, controls the maximal halt_poll_ns; it is internally
> rounded down to a closest multiple of halt_poll_ns_grow. The shrink/grow
> matrix is suggested by David:
>
> if (poll successfully for interrupt): stay the same
>   else if (length of kvm_vcpu_block is longer than halt_poll_ns_max): shrink
>   else if (length of kvm_vcpu_block is less than halt_poll_ns_max): grow

The way you implemented this wasn't what I expected. I thought you would time
the whole function (kvm_vcpu_block). But I like your approach better. It's
simpler and [by inspection] does what we want.

>
>   halt_poll_ns_shrink/ |
>   halt_poll_ns_grow| grow halt_poll_ns| shrink halt_poll_ns
>   -+--+---
>   < 1  |  = halt_poll_ns  |  = 0
>   < halt_poll_ns   | *= halt_poll_ns_grow | /= halt_poll_ns_shrink
>   otherwise| += halt_poll_ns_grow | -= halt_poll_ns_shrink

I was curious why you went with this approach rather than just the
middle row, or just the last row. Do you think we'll want the extra
flexibility?

>
> Signed-off-by: Wanpeng Li 
> ---
>  virt/kvm/kvm_main.c | 65 
> -
>  1 file changed, 64 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 93db833..2a4962b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -66,9 +66,26 @@
>  MODULE_AUTHOR("Qumranet");
>  MODULE_LICENSE("GPL");
>
> -static unsigned int halt_poll_ns;
> +#define KVM_HALT_POLL_NS  50
> +#define KVM_HALT_POLL_NS_GROW   2
> +#define KVM_HALT_POLL_NS_SHRINK 0
> +#define KVM_HALT_POLL_NS_MAX 200

The macros are not necessary. Also, hard coding the numbers in the param
definitions will make reading the comments above them easier.

> +
> +static unsigned int halt_poll_ns = KVM_HALT_POLL_NS;
>  module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
>
> +/* Default doubles per-vcpu halt_poll_ns. */
> +static unsigned int halt_poll_ns_grow = KVM_HALT_POLL_NS_GROW;
> +module_param(halt_poll_ns_grow, int, S_IRUGO);
> +
> +/* Default resets per-vcpu halt_poll_ns . */
> +static unsigned int halt_poll_ns_shrink = KVM_HALT_POLL_NS_SHRINK;
> +module_param(halt_poll_ns_shrink, int, S_IRUGO);
> +
> +/* halt polling only reduces halt latency by 10-15 us, 2ms is enough */

Ah, I misspoke before. I was thinking about round-trip latency. The latency
of a single halt is reduced by about 5-7 us.

> +static unsigned int halt_poll_ns_max = KVM_HALT_POLL_NS_MAX;
> +module_param(halt_poll_ns_max, int, S_IRUGO);

We can remove halt_poll_ns_max. vcpu->halt_poll_ns can always start at zero
and grow from there. Then we just need one module param to keep
vcpu->halt_poll_ns from growing too large.

[ It would make more sense to remove halt_poll_ns and keep halt_poll_ns_max,
  but since halt_poll_ns already exists in upstream kernels, we probably can't
  remove it. ]
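i.e. something like (sketch; the 10 us base is only an example):

	unsigned int val = vcpu->halt_poll_ns;

	if (val == 0)
		val = 10000;		/* 10 us base */
	else
		val *= halt_poll_ns_grow;

	/* the existing halt_poll_ns module param doubles as the cap */
	vcpu->halt_poll_ns = min(val, halt_poll_ns);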

> +
>  /*
>   * Ordering of locks:
>   *
> @@ -1907,6 +1924,48 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
> gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>
> +static unsigned int __grow_halt_poll_ns(unsigned int val)
> +{
> +   if (halt_poll_ns_grow < 1)
> +   return halt_poll_ns;
> +
> +   val = min(val, halt_poll_ns_max);
> +
> +   if (val == 0)
> +   return halt_poll_ns;
> +
> +   if (halt_poll_ns_grow < halt_poll_ns)
> +   val *= halt_poll_ns_grow;
> +   else
> +   val += halt_poll_ns_grow;
> +
> +   return val;
> +}
> +
> +static unsigned int __shrink_halt_poll_ns(int val, int modifier, int minimum)

minimum never gets used.

> +{
> +   if (modifier < 1)
> +   return 0;
> +
> +   if (modifier < halt_poll_ns)
> +   val /= modifier;
> +   else
> +   val -= modifier;
> +
> +   return val;
> +}
> +
> +static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)

These wrappers aren't necessary.

> +{
> +   vcpu->halt_poll_ns = __grow_halt_poll_ns(vcpu->halt_poll_ns);
> +}
> +
> +static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
> +{
> +   vcpu->halt_poll_ns = __shrink_halt_poll_ns(vcpu->halt_poll_ns,
> +   halt_poll_ns_shrink, halt_poll_ns);
> +}
> +
>  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>  {
> if (kvm_arch_vcpu_runnable(vcpu)) {
> @@ -19

Re: [PATCH 2/3] KVM: dynamise halt_poll_ns adjustment

2015-08-24 Thread David Matlack
On Mon, Aug 24, 2015 at 5:53 AM, Wanpeng Li  wrote:
> There are two new kernel parameters for changing the halt_poll_ns:
> halt_poll_ns_grow and halt_poll_ns_shrink. halt_poll_ns_grow affects
> halt_poll_ns when an interrupt arrives and halt_poll_ns_shrink
> does it when idle VCPU is detected.
>
>   halt_poll_ns_shrink/ |
>   halt_poll_ns_grow| interrupt arrives| idle VCPU is detected
>   -+--+---
>   < 1  |  = halt_poll_ns  |  = 0
>   < halt_poll_ns   | *= halt_poll_ns_grow | /= halt_poll_ns_shrink
>   otherwise| += halt_poll_ns_grow | -= halt_poll_ns_shrink
>
> A third new parameter, halt_poll_ns_max, controls the maximal halt_poll_ns;
> it is internally rounded down to a closest multiple of halt_poll_ns_grow.

I like the idea of growing and shrinking halt_poll_ns, but I'm not sure
we grow and shrink in the right places here. For example, if vcpu->halt_poll_ns
gets down to 0, I don't see how it can then grow back up.

This might work better:
  if (poll successfully for interrupt): stay the same
  else if (length of kvm_vcpu_block is longer than halt_poll_ns_max): shrink
  else if (length of kvm_vcpu_block is less than halt_poll_ns_max): grow

where halt_poll_ns_max is something reasonable, like 2 millisecond.

You get diminishing returns from halt polling as the length of the
halt gets longer (halt polling only reduces halt latency by 10-15 us).
So there's little benefit to polling longer than a few milliseconds.
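A sketch of that policy, with block_ns being the measured length of
kvm_vcpu_block (helper names as in this thread):

	if (block_ns <= vcpu->halt_poll_ns)
		;				/* poll caught the wakeup: stay */
	else if (block_ns > halt_poll_ns_max)
		shrink_halt_poll_ns(vcpu);	/* long halt, e.g. > ~2 ms */
	else
		grow_halt_poll_ns(vcpu);	/* short halt that polling missed */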

>
> Signed-off-by: Wanpeng Li 
> ---
>  virt/kvm/kvm_main.c | 81 
> -
>  1 file changed, 80 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index a122b52..bcfbd35 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -66,9 +66,28 @@
>  MODULE_AUTHOR("Qumranet");
>  MODULE_LICENSE("GPL");
>
> -static unsigned int halt_poll_ns;
> +#define KVM_HALT_POLL_NS  50
> +#define KVM_HALT_POLL_NS_GROW   2
> +#define KVM_HALT_POLL_NS_SHRINK 0
> +#define KVM_HALT_POLL_NS_MAX \
> +   INT_MAX / KVM_HALT_POLL_NS_GROW
> +
> +static unsigned int halt_poll_ns = KVM_HALT_POLL_NS;
>  module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
>
> +/* Default doubles per-vcpu halt_poll_ns. */
> +static int halt_poll_ns_grow = KVM_HALT_POLL_NS_GROW;
> +module_param(halt_poll_ns_grow, int, S_IRUGO);
> +
> +/* Default resets per-vcpu halt_poll_ns . */
> +int halt_poll_ns_shrink = KVM_HALT_POLL_NS_SHRINK;
> +module_param(halt_poll_ns_shrink, int, S_IRUGO);
> +
> +/* Default is to compute the maximum so we can never overflow. */
> +unsigned int halt_poll_ns_actual_max = KVM_HALT_POLL_NS_MAX;
> +unsigned int halt_poll_ns_max = KVM_HALT_POLL_NS_MAX;
> +module_param(halt_poll_ns_max, int, S_IRUGO);
> +
>  /*
>   * Ordering of locks:
>   *
> @@ -1907,6 +1926,62 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
> gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>
> +static unsigned int __grow_halt_poll_ns(unsigned int val)
> +{
> +   if (halt_poll_ns_grow < 1)
> +   return halt_poll_ns;
> +
> +   val = min(val, halt_poll_ns_actual_max);
> +
> +   if (val == 0)
> +   return halt_poll_ns;
> +
> +   if (halt_poll_ns_grow < halt_poll_ns)
> +   val *= halt_poll_ns_grow;
> +   else
> +   val += halt_poll_ns_grow;
> +
> +   return val;
> +}
> +
> +static unsigned int __shrink_halt_poll_ns(int val, int modifier, int minimum)
> +{
> +   if (modifier < 1)
> +   return 0;
> +
> +   if (modifier < halt_poll_ns)
> +   val /= modifier;
> +   else
> +   val -= modifier;
> +
> +   return val;
> +}
> +
> +static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
> +{
> +   vcpu->halt_poll_ns = __grow_halt_poll_ns(vcpu->halt_poll_ns);
> +}
> +
> +static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
> +{
> +   vcpu->halt_poll_ns = __shrink_halt_poll_ns(vcpu->halt_poll_ns,
> +   halt_poll_ns_shrink, halt_poll_ns);
> +}
> +
> +/*
> + * halt_poll_ns_actual_max is computed to be one grow_halt_poll_ns() below
> + * halt_poll_ns_max. (See __grow_halt_poll_ns for the reason.)
> + * This prevents overflows, because ple_halt_poll_ns is int.
> + * halt_poll_ns_max effectively rounded down to a multiple of 
> halt_poll_ns_grow in
> + * this process.
> + */
> +static void update_halt_poll_ns_actual_max(void)
> +{
> +   halt_poll_ns_actual_max =
> +   __shrink_halt_poll_ns(max(halt_poll_ns_max, halt_poll_ns),
> +   halt_poll_ns_grow, INT_MIN);
> +}
> +
>  static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>  {
> if (kvm_arch_vcpu_runnable(vcpu)) {
> @@ -1941,6 +2016,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  */
> if (kvm_vcpu_check_block(vcpu) < 0) {
>

Re: [PATCH 1/3] KVM: make halt_poll_ns per-VCPU

2015-08-24 Thread David Matlack
On Mon, Aug 24, 2015 at 5:53 AM, Wanpeng Li  wrote:
> Change halt_poll_ns into per-VCPU variable, seeded from module parameter,
> to allow greater flexibility.

You should also change kvm_vcpu_block to read halt_poll_ns from
the vcpu instead of the module parameter.
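i.e. roughly:

	-	if (halt_poll_ns) {
	-		ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);
	+	if (vcpu->halt_poll_ns) {
	+		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);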

>
> Signed-off-by: Wanpeng Li 
> ---
>  include/linux/kvm_host.h | 1 +
>  virt/kvm/kvm_main.c  | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 81089cf..1bef9e2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -242,6 +242,7 @@ struct kvm_vcpu {
> int sigset_active;
> sigset_t sigset;
> struct kvm_vcpu_stat stat;
> +   unsigned int halt_poll_ns;
>
>  #ifdef CONFIG_HAS_IOMEM
> int mmio_needed;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d8db2f8f..a122b52 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -217,6 +217,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, 
> unsigned id)
> vcpu->kvm = kvm;
> vcpu->vcpu_id = id;
> vcpu->pid = NULL;
> +   vcpu->halt_poll_ns = halt_poll_ns;
> init_waitqueue_head(&vcpu->wq);
> kvm_async_pf_vcpu_init(vcpu);
>
> --
> 1.9.1
>


Re: [PATCH] staging:slicoss:slicoss.h remove volatile variables

2015-06-26 Thread David Matlack
Hi Vikul, welcome! See my comment below...

On Fri, Jun 26, 2015 at 12:57 PM, Vikul Gupta  wrote:
> I am a high school student trying to become familiar with the opensource
> process and linux kernel. This is my first submission to the mailing list.
>
> I fixed the slicoss sub-system. The TODO file asks to remove volatile
> variables - also, checkpatch.pl warnings included volatile variables.
>
> I removed "volatile" from the variables /isr /and /linkstatus/ in the header
> file, because they are not needed. The two variables are used in the
> slicoss.c file, where /isr/ is used as function parameters, string outputs,
> pointers, logic, and one assignment, while /linkstatus /is used as pointers,
> logic, and one assignment. All but the assignments will not change these
> variables, and the assignment does not warrant a volatile qualifier.

It is not safe to simply drop volatile from these fields. For example,
slic_card_init polls on isr waiting for the device to write to it. If you
drop volatile, the compiler is within its rights to pull the read out of
the loop.
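For illustration only (not the actual slicoss code; the bit name is made
up):

	/* with a plain u32, the compiler may hoist this load out of the loop */
	while (!(pshmem->isr & ISR_SOME_DONE_BIT))
		mdelay(1);

	/* if volatile goes away, the access needs to be annotated instead */
	while (!(READ_ONCE(pshmem->isr) & ISR_SOME_DONE_BIT))
		mdelay(1);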

>
> To make sure the changes were correct, I ran the files with checkpatch.pl
> again, test built it, and rebooted it.
>
> Signed-off-by: Vikul Gupta 
>
> diff --git a/drivers/staging/slicoss/slic.h b/drivers/staging/slicoss/slic.h
> index 3a5aa88..f19f86a 100644
> --- a/drivers/staging/slicoss/slic.h
> +++ b/drivers/staging/slicoss/slic.h
> @@ -357,8 +357,8 @@ struct base_driver {
>  };
>
>  struct slic_shmem {
> -volatile u32  isr;
> -volatile u32  linkstatus;
> +u32  isr;
> +u32  linkstatus;
>  volatile struct slic_stats inicstats;
>  };
>


Re: [PATCH 12/15] KVM: MTRR: introduce mtrr_for_each_mem_type

2015-06-08 Thread David Matlack
On Sat, May 30, 2015 at 3:59 AM, Xiao Guangrong  wrote:
> It walks all MTRRs and gets all the memory cache type setting for the
> specified range also it checks if the range is fully covered by MTRRs
>
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mtrr.c | 183 
> 
>  1 file changed, 183 insertions(+)
>
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index e59d138..35f86303 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -395,6 +395,189 @@ void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
> INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
>  }
>
> +struct mtrr_looker {
> +   /* input fields. */
> +   struct kvm_mtrr *mtrr_state;
> +   u64 start;
> +   u64 end;
> +
> +   /* output fields. */
> +   int mem_type;
> +   /* [start, end) is fully covered in MTRRs? */

s/fully/not fully/ ?

> +   bool partial_map;
> +
> +   /* private fields. */
> +   union {
> +   /* used for fixed MTRRs. */
> +   struct {
> +   int index;
> +   int seg;
> +   };
> +
> +   /* used for var MTRRs. */
> +   struct {
> +   struct kvm_mtrr_range *range;
> +   /* max address has been covered in var MTRRs. */
> +   u64 start_max;
> +   };
> +   };
> +
> +   bool fixed;
> +};
> +
> +static void mtrr_lookup_init(struct mtrr_looker *looker,
> +struct kvm_mtrr *mtrr_state, u64 start, u64 end)
> +{
> +   looker->mtrr_state = mtrr_state;
> +   looker->start = start;
> +   looker->end = end;
> +}
> +
> +static u64 fixed_mtrr_range_end_addr(int seg, int index)
> +{
> +   struct fixed_mtrr_segment *mtrr_seg = &fixed_seg_table[seg];
> +
> +return mtrr_seg->start + mtrr_seg->range_size * index;

Should be (index + 1)?
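
In other words, I'd expect the end (exclusive) of fixed range 'index' to be
computed along these lines (just a sketch of the suggestion, reusing the
patch's own fixed_seg_table; not a drop-in replacement):

static u64 fixed_mtrr_range_end_addr(int seg, int index)
{
	struct fixed_mtrr_segment *mtrr_seg = &fixed_seg_table[seg];

	/* each range in this segment covers range_size bytes */
	return mtrr_seg->start + mtrr_seg->range_size * (index + 1);
}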

> +}
> +
> +static bool mtrr_lookup_fixed_start(struct mtrr_looker *looker)
> +{
> +   int seg, index;
> +
> +   if (!looker->mtrr_state->fixed_mtrr_enabled)
> +   return false;
> +
> +   seg = fixed_mtrr_addr_to_seg(looker->start);
> +   if (seg < 0)
> +   return false;
> +
> +   looker->fixed = true;
> +   index = fixed_mtrr_addr_seg_to_range_index(looker->start, seg);
> +   looker->index = index;
> +   looker->seg = seg;
> +   looker->mem_type = looker->mtrr_state->fixed_ranges[index];
> +   looker->start = fixed_mtrr_range_end_addr(seg, index);
> +   return true;
> +}
> +
> +static bool match_var_range(struct mtrr_looker *looker,
> +   struct kvm_mtrr_range *range)
> +{
> +   u64 start, end;
> +
> +   var_mtrr_range(range, &start, &end);
> +   if (!(start >= looker->end || end <= looker->start)) {
> +   looker->range = range;
> +   looker->mem_type = range->base & 0xff;
> +
> +   /*
> +* the function is called when we do kvm_mtrr.head walking
> +* that means range has the minimum base address interleaves
> +* with [looker->start_max, looker->end).
> +*/

I'm having trouble understanding this comment. I think this is what you
are trying to say:

  this function is called when we do kvm_mtrr.head walking. range has the
  minimum base address which interleaves [looker->start_max, looker->end).

Let me know if I parsed it wrong.

> +   looker->partial_map |= looker->start_max < start;
> +
> +   /* update the max address has been covered. */
> +   looker->start_max = max(looker->start_max, end);
> +   return true;
> +   }
> +
> +   return false;
> +}
> +
> +static void mtrr_lookup_var_start(struct mtrr_looker *looker)
> +{
> +   struct kvm_mtrr *mtrr_state = looker->mtrr_state;
> +   struct kvm_mtrr_range *range;
> +
> +   looker->fixed = false;
> +   looker->partial_map = false;
> +   looker->start_max = looker->start;
> +   looker->mem_type = -1;
> +
> +   list_for_each_entry(range, &mtrr_state->head, node)
> +   if (match_var_range(looker, range))
> +   return;
> +
> +   looker->partial_map = true;
> +}
> +
> +static void mtrr_lookup_fixed_next(struct mtrr_looker *looker)
> +{
> +   struct fixed_mtrr_segment *eseg = &fixed_seg_table[looker->seg];
> +   struct kvm_mtrr *mtrr_state = looker->mtrr_state;
> +   u64 end;
> +
> +   if (looker->start >= looker->end) {
> +   looker->mem_type = -1;
> +   looker->partial_map = false;
> +   return;
> +   }
> +
> +   WARN_ON(!looker->fixed);
> +
> +   looker->index++;
> +   end = fixed_mtrr_range_end_addr(looker->seg, looker->index);
> +
> +   /* switch to next segment. */
> +   if (end >= eseg->end) {
> +   looker->seg++;
> +

Re: [PATCH 09/15] KVM: MTRR: introduce var_mtrr_range

2015-06-08 Thread David Matlack
On Sat, May 30, 2015 at 3:59 AM, Xiao Guangrong  wrote:
> It gets the range for the specified variable MTRR
>
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mtrr.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index 888441e..aeb9767 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -217,10 +217,21 @@ static int fixed_msr_to_range_index(u32 msr)
> return 0;
>  }
>
> +static void var_mtrr_range(struct kvm_mtrr_range *range, u64 *start, u64 *end)
> +{
> +   u64 mask;
> +
> +   *start = range->base & PAGE_MASK;
> +
> +   mask = range->mask & PAGE_MASK;
> +   mask |= ~0ULL << boot_cpu_data.x86_phys_bits;
> +   *end = ((*start & mask) | ~mask) + 1;
> +}
> +
>  static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>  {
> struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
> -   gfn_t start, end, mask;
> +   gfn_t start, end;
> int index;
>
> if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
> @@ -244,11 +255,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> default:
> /* variable range MTRRs. */
> index = (msr - 0x200) / 2;
> -   start = mtrr_state->var_ranges[index].base & PAGE_MASK;
> -   mask = mtrr_state->var_ranges[index].mask & PAGE_MASK;
> -   mask |= ~0ULL << cpuid_maxphyaddr(vcpu);

Why did you drop this in favor of boot_cpu_data.x86_phys_bits?

> -
> -   end = ((start & mask) | ~mask) + 1;
> +   var_mtrr_range(&mtrr_state->var_ranges[index], &start, &end);
> }
>
>  do_zap:
> --
> 2.1.0
>