Re: [PATCH 1/5] KVM: vmx: fix ept reserved bits for 1-GByte page

2014-08-18 Thread Paolo Bonzini
On 19/08/2014 04:17, Wanpeng Li wrote:
>>> >> -if (level == 1 || (level == 2 && (spte & (1ULL << 7)))) {
>>> >> +if (level == 1 || ((level == 3 || level == 2)
>>> >> +&& (spte & (1ULL << 7)))) {
>> >
>> >This condition can be simplified by checking the return value of 
>> >ept_rsvd_mask.
>> >If it includes 0x38, this is a large page.

Oops, a "not" was missing. If it includes 0x38, this is _not_ a large
page (it is a page directory / page directory pointer / PML4).

>> Otherwise it is a leaf page and
>> you can go down the "if".
> As you know, bits 5:3, which are used for the EPT memory type, are not
> reserved bits, so I fail to understand why we check the return value of
> ept_rsvd_mask and why it is a large page if it includes 0x38. Could you
> explain in more detail? ;-)

A non-leaf page will always have 0x38 in the ept_rsvd_mask.  A leaf page
will never have 0x38 in the ept_rsvd_mask.

Paolo


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread Xiao Guangrong
On 08/19/2014 01:40 PM, David Matlack wrote:
> On Mon, Aug 18, 2014 at 10:19 PM, Xiao Guangrong
>  wrote:
>> On 08/19/2014 01:00 PM, David Matlack wrote:
>>> On Mon, Aug 18, 2014 at 9:41 PM, Xiao Guangrong
>>>  wrote:
 On 08/19/2014 12:31 PM, David Matlack wrote:
> The single line patch I suggested was only intended to fix the "forever
> incorrectly exit mmio".

 My patch also fixes this case and does not doubly increase the
 number. I think this is the better one.
>>>
>>> I prefer doubly increasing the generation for this reason: the updated 
>>> boolean
>>> requires extra code on the "client-side" to check if there's an update in
>>> progress. And that makes it easy to get wrong. In fact, your patch
>>> forgot to check the updated bit in mark_mmio_spte(). Doubly increasing the
>>> generation requires no "client-side" code to work.
>>
>> No, the updated patch is used to fix case 2, for which I drew the scenario in
>> the last mail. I mean the original patch in this patchset, which just
>> increases the number after srcu-sync.
>>
>> Then could you tell me what your approach can do that my original patch
>> cannot?
> 
> It avoids publishing new memslots with an old generation number attached to
> them (even if it only lasts for a short period of time). 

I cannot see the problem if that happens; could you please draw the scenario?

> Do you have a reason
> why you don't want to doubly increase the generation?

That makes the generation number wrap around more easily.
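
Only the low MMIO_GEN_MASK bits of the generation survive in the spte
(see kvm_current_mmio_generation()), so every extra increment per update
brings the wrap-around closer.  A toy userspace illustration, with a
made-up 4-bit mask (not kernel code):

#include <stdio.h>

#define TOY_GEN_MASK 0xf	/* stand-in for MMIO_GEN_MASK */

int main(void)
{
	unsigned int cached = 0;	/* generation cached in an spte */
	unsigned int gen = 0;
	int updates = 0;

	/*
	 * +2 per memslot update: the truncated generation matches the
	 * stale cached value after 8 updates instead of 16 with +1.
	 */
	do {
		gen += 2;
		updates++;
	} while ((gen & TOY_GEN_MASK) != cached);

	printf("stale spte looks fresh again after %d updates\n", updates);
	return 0;
}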



Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 10:19 PM, Xiao Guangrong
 wrote:
> On 08/19/2014 01:00 PM, David Matlack wrote:
>> On Mon, Aug 18, 2014 at 9:41 PM, Xiao Guangrong
>>  wrote:
>>> On 08/19/2014 12:31 PM, David Matlack wrote:
 The single line patch I suggested was only intended to fix the "forever
 incorrectly exit mmio".
>>>
>>> My patch also fixes this case and does not doubly increase the
>>> number. I think this is the better one.
>>
>> I prefer doubly increasing the generation for this reason: the updated 
>> boolean
>> requires extra code on the "client-side" to check if there's an update in
>> progress. And that makes it easy to get wrong. In fact, your patch
>> forgot to check the updated bit in mark_mmio_spte(). Doubly increasing the
>> generation requires no "client-side" code to work.
>
> No, the updated patch is used to fix case 2, for which I drew the scenario in
> the last mail. I mean the original patch in this patchset, which just
> increases the number after srcu-sync.
>
> Then could you tell me what your approach can do that my original patch
> cannot?

It avoids publishing new memslots with an old generation number attached to
them (even if it only lasts for a short period of time). Do you have a reason
why you don't want to doubly increase the generation?


Re: [PATCH 1/2] KVM: PPC: Book3S HV: Cleanup kvmppc_load/save_fp

2014-08-18 Thread Michael Neuling
On Tue, 2014-08-19 at 15:24 +1000, Paul Mackerras wrote:
> On Tue, Aug 19, 2014 at 02:59:29PM +1000, Michael Neuling wrote:
> > This cleans up kvmppc_load/save_fp.  It removes unnecessary isyncs.
> 
> NAK - they are necessary on PPC970, which we (still) support.  You
> could put them in a feature section if they are really annoying you.

I'm not fussed, but we should at least have a comment there for why we
need them.

> >  It also
> > removes the unnecessary resetting of the MSR bits on exit of kvmppc_save_fp.
> 
> ... except it doesn't. :)  That got folded into e4e38121507a ("KVM:
> PPC: Book3S HV: Add transactional memory support").

Arrh, thanks.  This patch was cleaning up stuff from an old local tree
and I couldn't see where it had been upstreamed.  I missed this.

Mikey


Re: [PATCH 1/2] KVM: PPC: Book3S HV: Cleanup kvmppc_load/save_fp

2014-08-18 Thread Paul Mackerras
On Tue, Aug 19, 2014 at 02:59:29PM +1000, Michael Neuling wrote:
> This cleans up kvmppc_load/save_fp.  It removes unnecessary isyncs.

NAK - they are necessary on PPC970, which we (still) support.  You
could put them in a feature section if they are really annoying you.

>  It also
> removes the unnecessary resetting of the MSR bits on exit of kvmppc_save_fp.

... except it doesn't. :)  That got folded into e4e38121507a ("KVM:
PPC: Book3S HV: Add transactional memory support").

Paul.


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread Xiao Guangrong
On 08/19/2014 01:00 PM, David Matlack wrote:
> On Mon, Aug 18, 2014 at 9:41 PM, Xiao Guangrong
>  wrote:
>> On 08/19/2014 12:31 PM, David Matlack wrote:
>>> But it looks like you basically said the same thing earlier, so I think
>>> we're on the same page.
>>>
>>
>> Yes, that is what I tried to explain in previous mails. :(
> 
> I'm glad we understand each other now! Sorry again for my confusion.

Yup, me too. :)

> 
>>> The single line patch I suggested was only intended to fix the "forever
>>> incorrectly exit mmio".
>>
>> My patch also fixes this case and does not doubly increase the
>> number. I think this is the better one.
> 
> I prefer doubly increasing the generation for this reason: the updated boolean
> requires extra code on the "client-side" to check if there's an update in
> progress. And that makes it easy to get wrong. In fact, your patch
> forgot to check the updated bit in mark_mmio_spte(). Doubly increasing the
> generation requires no "client-side" code to work.

No, the updated patch is used to fix case 2, for which I drew the scenario in
the last mail. I mean the original patch in this patchset, which just
increases the number after srcu-sync.

Then could you tell me what your approach can do that my original patch cannot?



Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 9:41 PM, Xiao Guangrong
 wrote:
> On 08/19/2014 12:31 PM, David Matlack wrote:
>> But it looks like you basically said the same thing earlier, so I think
>> we're on the same page.
>>
>
> Yes, that is what I tried to explain in previous mails. :(

I'm glad we understand each other now! Sorry again for my confusion.

>> The single line patch I suggested was only intended to fix the "forever
>> incorrectly exit mmio".
>
> My patch also fixes this case and does not doubly increase the
> number. I think this is the better one.

I prefer doubly increasing the generation for this reason: the updated boolean
requires extra code on the "client-side" to check if there's an update in
progress. And that makes it easy to get wrong. In fact, your patch
forgot to check the updated bit in mark_mmio_spte(). Doubly increasing the
generation requires no "client-side" code to work.


[PATCH 1/2] KVM: PPC: Book3S HV: Cleanup kvmppc_load/save_fp

2014-08-18 Thread Michael Neuling
This cleans up kvmppc_load/save_fp.  It removes unnecessary isyncs.  It also
removes the unnecessary resetting of the MSR bits on exit of kvmppc_save_fp.

Signed-off-by: Michael Neuling 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index f0c4db7..c4bd2d7 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2434,7 +2434,6 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 #endif
mtmsrd  r8
-   isync
addir3,r3,VCPU_FPRS
bl  store_fp_state
 #ifdef CONFIG_ALTIVEC
@@ -2470,7 +2469,6 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 #endif
mtmsrd  r8
-   isync
addir3,r4,VCPU_FPRS
bl  load_fp_state
 #ifdef CONFIG_ALTIVEC
-- 
1.9.1



[PATCH 2/2] KVM: PPC: Book3S HV: Add register name when loading toc

2014-08-18 Thread Michael Neuling
Add 'r' to register name r2 in kvmppc_hv_enter.

Also update comment at the top of kvmppc_hv_enter to indicate that R2/TOC is
non-volatile.

Signed-off-by: Michael Neuling 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c4bd2d7..1e8c480 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -355,6 +355,7 @@ kvmppc_hv_entry:
 * MSR = ~IR|DR
 * R13 = PACA
 * R1 = host R1
+* R2 = TOC
 * all other volatile GPRS = free
 */
mflrr0
@@ -503,7 +504,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 toc_tlbie_lock:
.tc native_tlbie_lock[TC],native_tlbie_lock
.previous
-   ld  r3,toc_tlbie_lock@toc(2)
+   ld  r3,toc_tlbie_lock@toc(r2)
 #ifdef __BIG_ENDIAN__
lwz r8,PACA_LOCK_TOKEN(r13)
 #else
-- 
1.9.1



Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread Xiao Guangrong
On 08/19/2014 12:31 PM, David Matlack wrote:
> On Mon, Aug 18, 2014 at 8:50 PM, Xiao Guangrong
>  wrote:
>> On 08/19/2014 05:15 AM, David Matlack wrote:
>>> On Mon, Aug 18, 2014 at 12:56 PM, Xiao Guangrong
>>>  wrote:
 @@ -287,9 +293,15 @@ static bool set_mmio_spte(struct kvm *kvm, u64 
 *sptep, gfn_t gfn,

  static bool check_mmio_spte(struct kvm *kvm, u64 spte)
  {
 +   struct kvm_memslots *slots = kvm_memslots(kvm);
 unsigned int kvm_gen, spte_gen;

 -   kvm_gen = kvm_current_mmio_generation(kvm);
 +   if (slots->updated)
 +   return false;
 +
 +   smp_rmb();
 +
 +   kvm_gen = __kvm_current_mmio_generation(slots);
 spte_gen = get_mmio_spte_generation(spte);

>>>
>>> What does this fix? Case 2 can still happen. (Case 2 is unavoidable unless 
>>> we
>>> block during memslot updates, which I don't think we should :).
>>
>> This exactly fixes case 2; slots->updated just acts as the "low bit"
>> but avoids generation-number wrap-around and tricky handling of the number.
>> More details please see below.
>>
>>>
 trace_check_mmio_spte(spte, kvm_gen, spte_gen);
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 4b6c01b..1d4e78f 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -96,7 +96,7 @@ static void hardware_disable_all(void);

  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
  static void update_memslots(struct kvm_memslots *slots,
 -   struct kvm_memory_slot *new, u64 
 last_generation);
 +   struct kvm_memory_slot *new);

  static void kvm_release_pfn_dirty(pfn_t pfn);
  static void mark_page_dirty_in_slot(struct kvm *kvm,
 @@ -685,8 +685,7 @@ static void sort_memslots(struct kvm_memslots *slots)
  }

  static void update_memslots(struct kvm_memslots *slots,
 -   struct kvm_memory_slot *new,
 -   u64 last_generation)
 +   struct kvm_memory_slot *new)
  {
 if (new) {
 int id = new->id;
 @@ -697,8 +696,6 @@ static void update_memslots(struct kvm_memslots *slots,
 if (new->npages != npages)
 sort_memslots(slots);
 }
 -
 -   slots->generation = last_generation + 1;
  }

  static int check_memory_region_flags(struct kvm_userspace_memory_region 
 *mem)
 @@ -720,10 +717,17 @@ static struct kvm_memslots 
 *install_new_memslots(struct kvm *kvm,
  {
 struct kvm_memslots *old_memslots = kvm->memslots;

 -   update_memslots(slots, new, kvm->memslots->generation);
 +   /* ensure generation number is always increased. */
 +   slots->updated = true;
 +   slots->generation = old_memslots->generation;
 +   update_memslots(slots, new);
 rcu_assign_pointer(kvm->memslots, slots);
 synchronize_srcu_expedited(&kvm->srcu);

 +   slots->generation++;
 +   smp_wmb();
 +   slots->updated = false;
 +
 kvm_arch_memslots_updated(kvm);

 return old_memslots;

>>>
>>> This is effectively the same as the first approach.
>>>
>>> I just realized how simple Paolo's idea is. I think it can be a one line
>>> patch (without comments):
>>>
>>> [...]
>>> update_memslots(slots, new, kvm->memslots->generation);
>>> rcu_assign_pointer(kvm->memslots, slots);
>>> synchronize_srcu_expedited(&kvm->srcu);
>>> +   slots->generation++;
>>>
>>> kvm_arch_memslots_updated(kvm);
>>> [...]
>>
>> Really? Unfortunately no. :)
>>
>> See this scenario:
>>
>> CPU 0  CPU 1
>> ioctl registering a new memslot which
>> contains GPA:
>>page-fault handler:
>>  see it's a mmio access on GPA;
>>
>>  assign the new memslots with generation number increased
>>  cache the generation-number into spte;
>>  fix the access and comeback to guest;
>> SRCU-sync
>>  page-fault again and check the spte is a valid 
>> mmio-spte(*)
>> generation-number++;
>> return to userspace;
>>  do mmio-emulation and inject mmio-exit;
>>
>> !!! userspace receives an unexpected mmio-exit; that is exactly case 2 I
>> said in the last mail.
>>
>>
>> Note that in step *, my approach detects the invalid generation-number, which
>> will invalidate the mmio spte properly.
> 
> Sorry I didn't explain myself very well: Since we can get a single wrong
> mmio exit no matter what, it has to be handled in userspace. So my point
> was, it doesn't really help to fix that one very specific way that it can
> happen, because it can just happen in other ways. (E.g. update memslots
> occurs after is_noslot_pfn() and before mmio exit).

Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 8:50 PM, Xiao Guangrong
 wrote:
> On 08/19/2014 05:15 AM, David Matlack wrote:
>> On Mon, Aug 18, 2014 at 12:56 PM, Xiao Guangrong
>>  wrote:
>>> @@ -287,9 +293,15 @@ static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, 
>>> gfn_t gfn,
>>>
>>>  static bool check_mmio_spte(struct kvm *kvm, u64 spte)
>>>  {
>>> +   struct kvm_memslots *slots = kvm_memslots(kvm);
>>> unsigned int kvm_gen, spte_gen;
>>>
>>> -   kvm_gen = kvm_current_mmio_generation(kvm);
>>> +   if (slots->updated)
>>> +   return false;
>>> +
>>> +   smp_rmb();
>>> +
>>> +   kvm_gen = __kvm_current_mmio_generation(slots);
>>> spte_gen = get_mmio_spte_generation(spte);
>>>
>>
>> What does this fix? Case 2 can still happen. (Case 2 is unavoidable unless we
>> block during memslot updates, which I don't think we should :).
>
> This exactly fixes case 2; slots->updated just acts as the "low bit"
> but avoids generation-number wrap-around and tricky handling of the number.
> More details please see below.
>
>>
>>> trace_check_mmio_spte(spte, kvm_gen, spte_gen);
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index 4b6c01b..1d4e78f 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -96,7 +96,7 @@ static void hardware_disable_all(void);
>>>
>>>  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
>>>  static void update_memslots(struct kvm_memslots *slots,
>>> -   struct kvm_memory_slot *new, u64 
>>> last_generation);
>>> +   struct kvm_memory_slot *new);
>>>
>>>  static void kvm_release_pfn_dirty(pfn_t pfn);
>>>  static void mark_page_dirty_in_slot(struct kvm *kvm,
>>> @@ -685,8 +685,7 @@ static void sort_memslots(struct kvm_memslots *slots)
>>>  }
>>>
>>>  static void update_memslots(struct kvm_memslots *slots,
>>> -   struct kvm_memory_slot *new,
>>> -   u64 last_generation)
>>> +   struct kvm_memory_slot *new)
>>>  {
>>> if (new) {
>>> int id = new->id;
>>> @@ -697,8 +696,6 @@ static void update_memslots(struct kvm_memslots *slots,
>>> if (new->npages != npages)
>>> sort_memslots(slots);
>>> }
>>> -
>>> -   slots->generation = last_generation + 1;
>>>  }
>>>
>>>  static int check_memory_region_flags(struct kvm_userspace_memory_region 
>>> *mem)
>>> @@ -720,10 +717,17 @@ static struct kvm_memslots 
>>> *install_new_memslots(struct kvm *kvm,
>>>  {
>>> struct kvm_memslots *old_memslots = kvm->memslots;
>>>
>>> -   update_memslots(slots, new, kvm->memslots->generation);
>>> +   /* ensure generation number is always increased. */
>>> +   slots->updated = true;
>>> +   slots->generation = old_memslots->generation;
>>> +   update_memslots(slots, new);
>>> rcu_assign_pointer(kvm->memslots, slots);
>>> synchronize_srcu_expedited(&kvm->srcu);
>>>
>>> +   slots->generation++;
>>> +   smp_wmb();
>>> +   slots->updated = false;
>>> +
>>> kvm_arch_memslots_updated(kvm);
>>>
>>> return old_memslots;
>>>
>>
>> This is effectively the same as the first approach.
>>
>> I just realized how simple Paolo's idea is. I think it can be a one line
>> patch (without comments):
>>
>> [...]
>> update_memslots(slots, new, kvm->memslots->generation);
>> rcu_assign_pointer(kvm->memslots, slots);
>> synchronize_srcu_expedited(&kvm->srcu);
>> +   slots->generation++;
>>
>> kvm_arch_memslots_updated(kvm);
>> [...]
>
> Really? Unfortunately no. :)
>
> See this scenario:
>
> CPU 0  CPU 1
> ioctl registering a new memslot which
> contains GPA:
>page-fault handler:
>  see it's a mmio access on GPA;
>
>  assign the new memslots with generation number increased
>  cache the generation-number into spte;
>  fix the access and comeback to guest;
> SRCU-sync
>  page-fault again and check the spte is a valid 
> mmio-spte(*)
> generation-number++;
> return to userspace;
>  do mmio-emulation and inject mmio-exit;
>
> !!! userspace receives an unexpected mmio-exit; that is exactly case 2 I
> said in the last mail.
>
>
> Note that in step *, my approach detects the invalid generation-number, which
> will invalidate the mmio spte properly.

Sorry I didn't explain myself very well: Since we can get a single wrong
mmio exit no matter what, it has to be handled in userspace. So my point
was, it doesn't really help to fix that one very specific way that it can
happen, because it can just happen in other ways. (E.g. update memslots
occurs after is_noslot_pfn() and before mmio exit).
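
Concretely, I am imagining an interleaving like this (my assumption, in
the same style as your scenario):

CPU 0                                      CPU 1
page fault; is_noslot_pfn() returns true
                                           ioctl registers the memslot;
                                           generation fully updated
inject mmio-exit;
return to userspace;

No generation check can catch that one; only userspace can.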

But it looks like you basically said the same thing earlier, so I think
we're on the same page.

The single line patch I suggested was only intended to fix the "forever
incorrectly exit mmio".

Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread Xiao Guangrong
On 08/19/2014 05:15 AM, David Matlack wrote:
> On Mon, Aug 18, 2014 at 12:56 PM, Xiao Guangrong
>  wrote:
>> @@ -287,9 +293,15 @@ static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, 
>> gfn_t gfn,
>>
>>  static bool check_mmio_spte(struct kvm *kvm, u64 spte)
>>  {
>> +   struct kvm_memslots *slots = kvm_memslots(kvm);
>> unsigned int kvm_gen, spte_gen;
>>
>> -   kvm_gen = kvm_current_mmio_generation(kvm);
>> +   if (slots->updated)
>> +   return false;
>> +
>> +   smp_rmb();
>> +
>> +   kvm_gen = __kvm_current_mmio_generation(slots);
>> spte_gen = get_mmio_spte_generation(spte);
>>
> 
> What does this fix? Case 2 can still happen. (Case 2 is unavoidable unless we
> block during memslot updates, which I don't think we should :).

This exactly fixes case 2; slots->updated just acts as the "low bit"
but avoids generation-number wrap-around and tricky handling of the number.
More details please see below.

> 
>> trace_check_mmio_spte(spte, kvm_gen, spte_gen);
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 4b6c01b..1d4e78f 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -96,7 +96,7 @@ static void hardware_disable_all(void);
>>
>>  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
>>  static void update_memslots(struct kvm_memslots *slots,
>> -   struct kvm_memory_slot *new, u64 
>> last_generation);
>> +   struct kvm_memory_slot *new);
>>
>>  static void kvm_release_pfn_dirty(pfn_t pfn);
>>  static void mark_page_dirty_in_slot(struct kvm *kvm,
>> @@ -685,8 +685,7 @@ static void sort_memslots(struct kvm_memslots *slots)
>>  }
>>
>>  static void update_memslots(struct kvm_memslots *slots,
>> -   struct kvm_memory_slot *new,
>> -   u64 last_generation)
>> +   struct kvm_memory_slot *new)
>>  {
>> if (new) {
>> int id = new->id;
>> @@ -697,8 +696,6 @@ static void update_memslots(struct kvm_memslots *slots,
>> if (new->npages != npages)
>> sort_memslots(slots);
>> }
>> -
>> -   slots->generation = last_generation + 1;
>>  }
>>
>>  static int check_memory_region_flags(struct kvm_userspace_memory_region 
>> *mem)
>> @@ -720,10 +717,17 @@ static struct kvm_memslots 
>> *install_new_memslots(struct kvm *kvm,
>>  {
>> struct kvm_memslots *old_memslots = kvm->memslots;
>>
>> -   update_memslots(slots, new, kvm->memslots->generation);
>> +   /* ensure generation number is always increased. */
>> +   slots->updated = true;
>> +   slots->generation = old_memslots->generation;
>> +   update_memslots(slots, new);
>> rcu_assign_pointer(kvm->memslots, slots);
>> synchronize_srcu_expedited(&kvm->srcu);
>>
>> +   slots->generation++;
>> +   smp_wmb();
>> +   slots->updated = false;
>> +
>> kvm_arch_memslots_updated(kvm);
>>
>> return old_memslots;
>>
> 
> This is effectively the same as the first approach.
> 
> I just realized how simple Paolo's idea is. I think it can be a one line
> patch (without comments):
> 
> [...]
> update_memslots(slots, new, kvm->memslots->generation);
> rcu_assign_pointer(kvm->memslots, slots);
> synchronize_srcu_expedited(&kvm->srcu);
> +   slots->generation++;
> 
> kvm_arch_memslots_updated(kvm);
> [...]

Really? Unfortunately no. :)

See this scenario:

CPU 0  CPU 1
ioctl registering a new memslot which
contains GPA:
   page-fault handler:
 see it's a mmio access on GPA;

 assign the new memslots with generation number increased
 cache the generation-number into spte;
 fix the access and comeback to guest;
SRCU-sync
 page-fault again and check the spte is a valid 
mmio-spte(*)
generation-number++;
return to userspace;
 do mmio-emulation and inject mmio-exit;

!!! userspace receives an unexpected mmio-exit; that is exactly case 2 I
said in the last mail.


Note that in step *, my approach detects the invalid generation-number, which
will invalidate the mmio spte properly.



Re: [PATCH 1/5] KVM: vmx: fix ept reserved bits for 1-GByte page

2014-08-18 Thread Wanpeng Li
Hi Paolo,
On Mon, Aug 18, 2014 at 12:18:59PM +0200, Paolo Bonzini wrote:
>On 18/08/2014 11:50, Wanpeng Li wrote:
>> The EPT misconfig handler in kvm checks which reason led to the EPT 
>> misconfiguration after a vmexit. One of the reasons is that an EPT 
>> paging-structure entry is configured with settings reserved for 
>> future functionality. However, the handler can't identify whether 
>> the reserved bits of a paging-structure entry for a 1-GByte page are 
>> configured, since a PDPTE which points to a 1-GByte page reserves 
>> bits 29:12 instead of bits 7:3, which are reserved for a PDPTE that 
>> references an EPT Page Directory. This patch fixes it by reserving 
>> bits 29:12 for 1-GByte pages. 
>> 
>> Signed-off-by: Wanpeng Li 
>> ---
>>  arch/x86/kvm/vmx.c | 14 ++
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index bfe11cf..71cbee5 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -5521,9 +5521,14 @@ static u64 ept_rsvd_mask(u64 spte, int level)
>>  for (i = 51; i > boot_cpu_data.x86_phys_bits; i--)
>>  mask |= (1ULL << i);
>>  
>> -if (level > 2)
>> -/* bits 7:3 reserved */
>> -mask |= 0xf8;
>> +if (level > 2) {
>
>level can be 4 here.  You have to return 0xf8 for level == 4.
>
>The same "if" statement then can cover both 2MB and 1GB pages, like
>
>if (spte & (1ULL << 7))
>/* 1GB/2MB page, bits 29:12 or 20:12 reserved respectively */
>mask |= (PAGE_SIZE << ((level - 1) * 9)) - PAGE_SIZE;
>else
>/* bits 6:3 reserved */
>mask |= 0x78;
>
>> -if (level == 1 || (level == 2 && (spte & (1ULL << 7)))) {
>> +if (level == 1 || ((level == 3 || level == 2)
>> +&& (spte & (1ULL << 7)))) {
>
>This condition can be simplified by checking the return value of ept_rsvd_mask.
>If it includes 0x38, this is a large page.  Otherwise it is a leaf page and
>you can go down the "if".

As you know, bits 5:3, which are used for the EPT memory type, are not
reserved bits, so I fail to understand why we check the return value of
ept_rsvd_mask and why it is a large page if it includes 0x38. Could you
explain in more detail? ;-)

Regards,
Wanpeng Li 

>
>Paolo
>
>>  u64 ept_mem_type = (spte & 0x38) >> 3;
>>  
>>  if (ept_mem_type == 2 || ept_mem_type == 3 ||
>>  ept_mem_type == 7) {


Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu

2014-08-18 Thread Chai Wen
On 08/19/2014 04:38 AM, Don Zickus wrote:

> On Mon, Aug 18, 2014 at 09:02:00PM +0200, Ingo Molnar wrote:
>>
>> * Don Zickus  wrote:
>>
>> So I agree with the motivation of this improvement, but 
>> is this implementation namespace-safe?
>
> What namespace are you worried about colliding with?  I 
> thought softlockup_ would provide the safety??  Maybe I 
> am missing something obvious. :-(

 I meant PID namespaces - a PID in itself isn't guaranteed 
 to be unique across the system.
>>>
>>> Ah, I don't think we thought about that.  Is there a better 
>>> way to do this?  Is there a domain id or something that can 
>>> be OR'd with the pid?
>>
>> What is always unique is the task pointer itself. We use pids 
>> when we interface with user-space - but we don't really do that 
>> here, right?
> 
> No, I don't believe so.  Ok, so saving 'current' and comparing that should
> be enough, correct?
> 


I am not sure about the safety of using a pid here with namespaces.
But as to the task pointer: is there a chance that we get a 'historical'
address saved in 'softlockup_warn_pid(or address)_saved' while the current
hogging process happens to have been given the same task pointer address?
If that never happens, I think comparing addresses is ok.
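
For reference, this minimal sketch is what I understand the proposal to
be (softlockup_warn_task_saved is a made-up name, not the actual
watchdog code):

static struct task_struct *softlockup_warn_task_saved;

static bool softlockup_same_task_hogging(void)
{
	/* same task still hogging: the warning was already issued */
	if (softlockup_warn_task_saved == current)
		return true;

	/* task switch: remember the new hog so we can warn again */
	softlockup_warn_task_saved = current;
	return false;
}

And my worry above is exactly that a stale softlockup_warn_task_saved
could compare equal to a recycled task_struct address.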

thanks
chai wen

> Cheers,
> Don
> .
> 



-- 
Regards

Chai Wen


[PATCH 2/2] kvm: x86: fix stale mmio cache bug

2014-08-18 Thread David Matlack
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.

This patch fixes the issue by using the memslot generation number
to validate the mmio cache.

[ xiaoguangrong: adjust the code to make it simpler for stable-tree fix. ]

Cc: sta...@vger.kernel.org
Signed-off-by: David Matlack 
Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c  |  4 ++--
 arch/x86/kvm/mmu.h  |  2 ++
 arch/x86/kvm/x86.h  | 21 -
 4 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 49205d0..f518d14 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -479,6 +479,7 @@ struct kvm_vcpu_arch {
u64 mmio_gva;
unsigned access;
gfn_t mmio_gfn;
+   unsigned int mmio_gen;
 
struct kvm_pmu pmu;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..e00fbfe 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -234,7 +234,7 @@ static unsigned int get_mmio_spte_generation(u64 spte)
return gen;
 }
 
-static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
+unsigned int kvm_current_mmio_generation(struct kvm *kvm)
 {
/*
 * Init kvm generation close to MMIO_MAX_GEN to easily test the
@@ -3163,7 +3163,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
return;
 
-   vcpu_clear_mmio_info(vcpu, ~0ul);
+   vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
hpa_t root = vcpu->arch.mmu.root_hpa;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b982112..e2d902a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -76,6 +76,8 @@ enum {
 };
 
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool 
direct);
+unsigned int kvm_current_mmio_generation(struct kvm *kvm);
+
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
bool execonly);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 8c97bac..ae7006d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -3,6 +3,7 @@
 
 #include 
 #include "kvm_cache_regs.h"
+#include "mmu.h"
 
 static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
 {
@@ -78,15 +79,23 @@ static inline void vcpu_cache_mmio_info(struct kvm_vcpu 
*vcpu,
vcpu->arch.mmio_gva = gva & PAGE_MASK;
vcpu->arch.access = access;
vcpu->arch.mmio_gfn = gfn;
+   vcpu->arch.mmio_gen = kvm_current_mmio_generation(vcpu->kvm);
+}
+
+static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.mmio_gen == kvm_current_mmio_generation(vcpu->kvm);
 }
 
 /*
- * Clear the mmio cache info for the given gva,
- * specially, if gva is ~0ul, we clear all mmio cache info.
+ * Clear the mmio cache info for the given gva. If gva is MMIO_GVA_ANY, we
+ * clear all mmio cache info.
  */
+#define MMIO_GVA_ANY (~(gva_t)0)
+
 static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
 {
-   if (gva != (~0ul) && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
+   if (gva != MMIO_GVA_ANY && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
return;
 
vcpu->arch.mmio_gva = 0;
@@ -94,7 +103,8 @@ static inline void vcpu_clear_mmio_info(struct kvm_vcpu 
*vcpu, gva_t gva)
 
 static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long 
gva)
 {
-   if (vcpu->arch.mmio_gva && vcpu->arch.mmio_gva == (gva & PAGE_MASK))
+   if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gva &&
+ vcpu->arch.mmio_gva == (gva & PAGE_MASK))
return true;
 
return false;
@@ -102,7 +112,8 @@ static inline bool vcpu_match_mmio_gva(struct kvm_vcpu 
*vcpu, unsigned long gva)
 
 static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
-   if (vcpu->arch.mmio_gfn && vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
+   if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gfn &&
+ vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
return true;

return false;

[PATCH 1/2] kvm: fix potentially corrupt mmio cache

2014-08-18 Thread David Matlack
vcpu exits and memslot mutations can run concurrently as long as the
vcpu does not acquire the slots mutex. Thus it is theoretically possible
for memslots to change underneath a vcpu that is handling an exit.

If we increment the memslot generation number again after
synchronize_srcu_expedited(), vcpus can safely cache memslot generation
without maintaining a single rcu_dereference through an entire vm exit.
And much of the x86/kvm code does not maintain a single rcu_dereference
of the current memslots during each exit.

We can prevent the following case:

    vcpu (CPU 0)                             | thread (CPU 1)
---------------------------------------------+--------------------------
1   vm exit                                  |
2   decide to cache something based on       |
      old memslots                           |
3                                            | change memslots
4                                            | increment generation
5   tag cache with new memslot generation    |
...                                          |
    <action based on the cache occurs even   |
     though the caching decision was based   |
     on the old memslots>                    |
...                                          |
    <action *continues* to occur until the   |
     next memslot generation change, which   |
     may be never>                           |

By incrementing the generation again after synchronizing kvm->srcu
readers, we guarantee the generation cached in (5) will very soon
become invalid.

Cc: sta...@vger.kernel.org
Cc: Xiao Guangrong 
Signed-off-by: David Matlack 
---
 virt/kvm/kvm_main.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4b6c01b..86d3697 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -95,8 +95,6 @@ static int hardware_enable_all(void);
 static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
-static void update_memslots(struct kvm_memslots *slots,
-   struct kvm_memory_slot *new, u64 last_generation);
 
 static void kvm_release_pfn_dirty(pfn_t pfn);
 static void mark_page_dirty_in_slot(struct kvm *kvm,
@@ -685,8 +683,7 @@ static void sort_memslots(struct kvm_memslots *slots)
 }
 
 static void update_memslots(struct kvm_memslots *slots,
-   struct kvm_memory_slot *new,
-   u64 last_generation)
+   struct kvm_memory_slot *new)
 {
if (new) {
int id = new->id;
@@ -697,8 +694,6 @@ static void update_memslots(struct kvm_memslots *slots,
if (new->npages != npages)
sort_memslots(slots);
}
-
-   slots->generation = last_generation + 1;
 }
 
 static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
@@ -720,10 +715,19 @@ static struct kvm_memslots *install_new_memslots(struct 
kvm *kvm,
 {
struct kvm_memslots *old_memslots = kvm->memslots;
 
-   update_memslots(slots, new, kvm->memslots->generation);
+   slots->generation = old_memslots->generation + 1;
+
+   update_memslots(slots, new);
rcu_assign_pointer(kvm->memslots, slots);
synchronize_srcu_expedited(&kvm->srcu);
 
+   /*
+* Increment the new memslot generation a second time. This prevents
+* vm exits that race with memslot updates from caching a memslot
+* generation that will (potentially) be valid forever.
+*/
+   slots->generation++;
+
kvm_arch_memslots_updated(kvm);
 
return old_memslots;
-- 
2.1.0.rc2.206.gedb03e5



A question about Irq injection

2014-08-18 Thread dFm
Hi,

I am new to the KVM code, and have a question regarding Irq injection in
vcpu_enter_guest in x86.c. I don't quite understand why inject_pending_event
is called before making sure there is an Irq open window. And it seems to be
executed once again after an Irq-window exit, because of the KVM_REQ_EVENT
request made in handle_interrupt_window.
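
My rough reading of the ordering is something like the following
(heavily simplified, so I may well be misreading the real code):

	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
		/*
		 * inject_pending_event() only injects an external
		 * interrupt if interrupt_allowed(vcpu) says it can.
		 */
		inject_pending_event(vcpu);

		/*
		 * If an injectable interrupt is still pending, request
		 * an Irq-window exit for another chance to inject it.
		 */
		if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
			kvm_x86_ops->enable_irq_window(vcpu);
	}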

Any help would be much appreciated, though the question may seem silly to you
guys. Thanks!


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 2:24 PM, Paolo Bonzini  wrote:
> On 18/08/2014 23:15, David Matlack wrote:
>> I just realized how simple Paolo's idea is. I think it can be a one line
>> patch (without comments):
>>
>> [...]
>> update_memslots(slots, new, kvm->memslots->generation);
>> rcu_assign_pointer(kvm->memslots, slots);
>> synchronize_srcu_expedited(&kvm->srcu);
>> +   slots->generation++;
>>
>> kvm_arch_memslots_updated(kvm);
>> [...]
>
> Yeah, you're right.  I think at this point it makes sense to put all
> generation handling in install_new_memslots, but with proper comments
> the above can do as well.
>
> Would you like to send it?  Patch 2 still applies on top.

Sure, I like doing everything in install_new_memslots() as well so I'll fix
that.


Re: [PATCH 2/2] kvm: x86: fix stale mmio cache bug

2014-08-18 Thread Paolo Bonzini
On 14/08/2014 09:01, Xiao Guangrong wrote:
>   * Clear the mmio cache info for the given gva,
> - * specially, if gva is ~0ul, we clear all mmio cache info.
> + * specially, if gva is ~MMIO_GVA_ANY, we clear all mmio cache info.

Extra ~.

>   */
> +#define MMIO_GVA_ANY ~((gva_t)0)
> +

Better: (~(gva_t)0).

Paolo


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread Paolo Bonzini
On 18/08/2014 23:15, David Matlack wrote:
> I just realized how simple Paolo's idea is. I think it can be a one line
> patch (without comments):
> 
> [...]
> update_memslots(slots, new, kvm->memslots->generation);
> rcu_assign_pointer(kvm->memslots, slots);
> synchronize_srcu_expedited(&kvm->srcu);
> +   slots->generation++;
> 
> kvm_arch_memslots_updated(kvm);
> [...]

Yeah, you're right.  I think at this point it makes sense to put all
generation handling in install_new_memslots, but with proper comments
the above can do as well.

Would you like to send it?  Patch 2 still applies on top.


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 12:56 PM, Xiao Guangrong
 wrote:
> @@ -287,9 +293,15 @@ static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, 
> gfn_t gfn,
>
>  static bool check_mmio_spte(struct kvm *kvm, u64 spte)
>  {
> +   struct kvm_memslots *slots = kvm_memslots(kvm);
> unsigned int kvm_gen, spte_gen;
>
> -   kvm_gen = kvm_current_mmio_generation(kvm);
> +   if (slots->updated)
> +   return false;
> +
> +   smp_rmb();
> +
> +   kvm_gen = __kvm_current_mmio_generation(slots);
> spte_gen = get_mmio_spte_generation(spte);
>

What does this fix? Case 2 can still happen. (Case 2 is unavoidable unless we
block during memslot updates, which I don't think we should :).

> trace_check_mmio_spte(spte, kvm_gen, spte_gen);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4b6c01b..1d4e78f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -96,7 +96,7 @@ static void hardware_disable_all(void);
>
>  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
>  static void update_memslots(struct kvm_memslots *slots,
> -   struct kvm_memory_slot *new, u64 last_generation);
> +   struct kvm_memory_slot *new);
>
>  static void kvm_release_pfn_dirty(pfn_t pfn);
>  static void mark_page_dirty_in_slot(struct kvm *kvm,
> @@ -685,8 +685,7 @@ static void sort_memslots(struct kvm_memslots *slots)
>  }
>
>  static void update_memslots(struct kvm_memslots *slots,
> -   struct kvm_memory_slot *new,
> -   u64 last_generation)
> +   struct kvm_memory_slot *new)
>  {
> if (new) {
> int id = new->id;
> @@ -697,8 +696,6 @@ static void update_memslots(struct kvm_memslots *slots,
> if (new->npages != npages)
> sort_memslots(slots);
> }
> -
> -   slots->generation = last_generation + 1;
>  }
>
>  static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
> @@ -720,10 +717,17 @@ static struct kvm_memslots *install_new_memslots(struct 
> kvm *kvm,
>  {
> struct kvm_memslots *old_memslots = kvm->memslots;
>
> -   update_memslots(slots, new, kvm->memslots->generation);
> +   /* ensure generation number is always increased. */
> +   slots->updated = true;
> +   slots->generation = old_memslots->generation;
> +   update_memslots(slots, new);
> rcu_assign_pointer(kvm->memslots, slots);
> synchronize_srcu_expedited(&kvm->srcu);
>
> +   slots->generation++;
> +   smp_wmb();
> +   slots->updated = false;
> +
> kvm_arch_memslots_updated(kvm);
>
> return old_memslots;
>

This is effectively the same as the first approach.

I just realized how simple Paolo's idea is. I think it can be a one line
patch (without comments):

[...]
update_memslots(slots, new, kvm->memslots->generation);
rcu_assign_pointer(kvm->memslots, slots);
synchronize_srcu_expedited(&kvm->srcu);
+   slots->generation++;

kvm_arch_memslots_updated(kvm);
[...]


[PATCH] KVM: x86: recalculate_apic_map after enabling apic

2014-08-18 Thread Nadav Amit
Currently, recalculate_apic_map ignores vcpus whose lapic is software-disabled
through the spurious interrupt vector. However, once it is re-enabled, the map
is not recalculated. Therefore, if the guest OS configured the DFR while the
lapic was software-disabled, the map may be incorrect. This patch recalculates
the apic map after the lapic is software-enabled.

Signed-off-by: Nadav Amit 
---
 arch/x86/kvm/lapic.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 08e8a89..4a736199 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -112,17 +112,6 @@ static inline int __apic_test_and_clear_vector(int vec, 
void *bitmap)
 struct static_key_deferred apic_hw_disabled __read_mostly;
 struct static_key_deferred apic_sw_disabled __read_mostly;
 
-static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
-{
-   if ((kvm_apic_get_reg(apic, APIC_SPIV) ^ val) & APIC_SPIV_APIC_ENABLED) 
{
-   if (val & APIC_SPIV_APIC_ENABLED)
-   static_key_slow_dec_deferred(&apic_sw_disabled);
-   else
-   static_key_slow_inc(&apic_sw_disabled.key);
-   }
-   apic_set_reg(apic, APIC_SPIV, val);
-}
-
 static inline int apic_enabled(struct kvm_lapic *apic)
 {
return kvm_apic_sw_enabled(apic) && kvm_apic_hw_enabled(apic);
@@ -210,6 +199,20 @@ out:
kvm_vcpu_request_scan_ioapic(kvm);
 }
 
+static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
+{
+   u32 prev = kvm_apic_get_reg(apic, APIC_SPIV);
+
+   apic_set_reg(apic, APIC_SPIV, val);
+   if ((prev ^ val) & APIC_SPIV_APIC_ENABLED) {
+   if (val & APIC_SPIV_APIC_ENABLED) {
+   static_key_slow_dec_deferred(&apic_sw_disabled);
+   recalculate_apic_map(apic->vcpu->kvm);
+   } else
+   static_key_slow_inc(&apic_sw_disabled.key);
+   }
+}
+
 static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
 {
apic_set_reg(apic, APIC_ID, id << 24);
-- 
1.9.1



Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu

2014-08-18 Thread Don Zickus
On Mon, Aug 18, 2014 at 09:02:00PM +0200, Ingo Molnar wrote:
> 
> * Don Zickus  wrote:
> 
> > > > > So I agree with the motivation of this improvement, but 
> > > > > is this implementation namespace-safe?
> > > > 
> > > > What namespace are you worried about colliding with?  I 
> > > > thought softlockup_ would provide the safety??  Maybe I 
> > > > am missing something obvious. :-(
> > > 
> > > I meant PID namespaces - a PID in itself isn't guaranteed 
> > > to be unique across the system.
> > 
> > Ah, I don't think we thought about that.  Is there a better 
> > way to do this?  Is there a domain id or something that can 
> > be OR'd with the pid?
> 
> What is always unique is the task pointer itself. We use pids 
> when we interface with user-space - but we don't really do that 
> here, right?

No, I don't believe so.  Ok, so saving 'current' and comparing that should
be enough, correct?

Cheers,
Don


[PATCH v5] arm64: fix VTTBR_BADDR_MASK

2014-08-18 Thread Joel Schopp
The current VTTBR_BADDR_MASK only masks 39 bits, which is broken on current
systems.  Rather than just adding a bit, it seems like a good time to also set
things at run time instead of compile time to accommodate more hardware.

This patch sets TCR_EL2.PS, VTCR_EL2.T0SZ and vttbr_baddr_mask at run time,
not compile time.

In ARMv8, the EL2 physical address size (TCR_EL2.PS) and stage-2 input address
size (VTCR_EL2.T0SZ) cannot be determined at compile time since they
depend on hardware capability.

According to Table D4-23 and Table D4-25 in ARM DDI 0487A.b document,
vttbr_x is calculated using different fixed values with consideration
of T0SZ, granule size and the level of translation tables. Therefore,
vttbr_baddr_mask should be determined dynamically.
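
For example, with the 64kB-granule configuration already in kvm_arm.h
(T0SZ = 24, i.e. a 40-bit input range, with SL0 = 1), vttbr_x = 38 - T0SZ = 14,
so the pgd programmed into VTTBR must be aligned to 2^14 bytes; a different
T0SZ or granule size yields a different alignment, which is why no single
compile-time mask can be right for all systems.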

Changes since v4:
More minor cleanups from review
Moved some functions into headers

Changes since v3:
Another rebase
Addressed minor comments from v2

Changes since v2:
Rebased on https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
next branch

Changes since v1:
Rebased fix on Jungseok Lee's patch https://lkml.org/lkml/2014/5/12/189 to
provide better long term fix.  Updated that patch to log error instead of
silently fail on unaligned vttbr.

Cc: Christoffer Dall 
Cc: Sungjinn Chung 
Signed-off-by: Jungseok Lee 
Signed-off-by: Joel Schopp 
---
 arch/arm/include/asm/kvm_mmu.h   |   12 ++
 arch/arm/kvm/arm.c   |   17 +++-
 arch/arm64/include/asm/kvm_arm.h |   17 +---
 arch/arm64/include/asm/kvm_mmu.h |   78 ++
 arch/arm64/kvm/hyp-init.S|   20 +++---
 5 files changed, 122 insertions(+), 22 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 5c7aa3c..73f6ff6 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -166,6 +166,18 @@ static inline void coherent_cache_guest_page(struct 
kvm_vcpu *vcpu, hva_t hva,
 
 void stage2_flush_vm(struct kvm *kvm);
 
+static inline int kvm_get_phys_addr_shift(void)
+{
+   return KVM_PHYS_SHIFT;
+}
+
+static inline int set_vttbr_baddr_mask(void)
+{
+   vttbr_baddr_mask = VTTBR_BADDR_MASK;
+   return 0;
+}
+
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ARM_KVM_MMU_H__ */
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 3c82b37..f396eb7 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -466,8 +467,14 @@ static void update_vttbr(struct kvm *kvm)
/* update vttbr to be used with the new vmid */
pgd_phys = virt_to_phys(kvm->arch.pgd);
vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK;
-   kvm->arch.vttbr = pgd_phys & VTTBR_BADDR_MASK;
-   kvm->arch.vttbr |= vmid;
+
+   /*
+* If the VTTBR isn't aligned there is something wrong with the system
+* or kernel.
+*/
+   BUG_ON(pgd_phys & ~vttbr_baddr_mask);
+
+   kvm->arch.vttbr = pgd_phys | vmid;
 
spin_unlock(&kvm_vmid_lock);
 }
@@ -1052,6 +1059,12 @@ int kvm_arch_init(void *opaque)
}
}
 
+   err = set_vttbr_baddr_mask();
+   if (err) {
+   kvm_err("Cannot set vttbr_baddr_mask\n");
+   return -EINVAL;
+   }
+
cpu_notifier_register_begin();
 
err = init_hyp_mode();
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 3d69030..8dbef70 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -94,7 +94,6 @@
 /* TCR_EL2 Registers bits */
 #define TCR_EL2_TBI(1 << 20)
 #define TCR_EL2_PS (7 << 16)
-#define TCR_EL2_PS_40B (2 << 16)
 #define TCR_EL2_TG0(1 << 14)
 #define TCR_EL2_SH0(3 << 12)
 #define TCR_EL2_ORGN0  (3 << 10)
@@ -103,8 +102,6 @@
 #define TCR_EL2_MASK   (TCR_EL2_TG0 | TCR_EL2_SH0 | \
 TCR_EL2_ORGN0 | TCR_EL2_IRGN0 | TCR_EL2_T0SZ)
 
-#define TCR_EL2_FLAGS  (TCR_EL2_PS_40B)
-
 /* VTCR_EL2 Registers bits */
 #define VTCR_EL2_PS_MASK   (7 << 16)
 #define VTCR_EL2_TG0_MASK  (1 << 14)
@@ -119,36 +116,28 @@
 #define VTCR_EL2_SL0_MASK  (3 << 6)
 #define VTCR_EL2_SL0_LVL1  (1 << 6)
 #define VTCR_EL2_T0SZ_MASK 0x3f
-#define VTCR_EL2_T0SZ_40B  24
+#define VTCR_EL2_T0SZ(bits)(64 - (bits))
 
 #ifdef CONFIG_ARM64_64K_PAGES
 /*
  * Stage2 translation configuration:
- * 40bits output (PS = 2)
- * 40bits input  (T0SZ = 24)
  * 64kB pages (TG0 = 1)
  * 2 level page tables (SL = 1)
  */
 #define VTCR_EL2_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SH0_INNER | \
 VTCR_EL2_ORGN0_WBWA | VTCR_EL2_IRGN0_WBWA | \
-VTCR_EL2_SL0_LVL1 | VTCR_EL2_T0SZ_40B)
-#define VTTBR_X(38 - VTCR_EL2_T0SZ_40B)
+VTCR_EL2_SL0_LVL1)
 #else
 /*
  * Stage2 translation configuration:
- * 40bits output (PS = 2)
- * 40bits input  (T0SZ = 24)
  * 4kB

Re: [PATCH v4] arm64: fix VTTBR_BADDR_MASK

2014-08-18 Thread Joel Schopp

 #endif /* __ARM_KVM_MMU_H__ */
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 16e7994..70f0f02 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -521,6 +521,7 @@ int create_hyp_io_mappings(void *from, void *to, 
phys_addr_t phys_addr)
  */
 int kvm_alloc_stage2_pgd(struct kvm *kvm)
 {
+   unsigned int s2_pgds, s2_pgd_order;
pgd_t *pgd;
 
if (kvm->arch.pgd != NULL) {
@@ -528,10 +529,18 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
return -EINVAL;
}
 
-   pgd = (pgd_t *)__get_free_pages(GFP_KERNEL, S2_PGD_ORDER);
+   s2_pgds = (1 << (kvm_get_phys_addr_shift() - PGDIR_SHIFT));
+   s2_pgd_order = get_order(s2_pgds * sizeof(pgd_t));
+
+   pgd = (pgd_t *)__get_free_pages(GFP_KERNEL, s2_pgd_order);
if (!pgd)
return -ENOMEM;
 
+   if ((unsigned long)pgd & ~vttbr_baddr_mask) {
+   kvm_err("Stage-2 pgd not correctly aligned: %p\n", pgd);
+   return -EFAULT;
+   }


There are two problems that I've found here.  The first problem is that
vttbr_baddr_mask isn't allocated yet at this point in the code.  The
second problem is that pgd is a virtual address, ie pgd ==
0xfe03bbb4 while the vttbr masks off the high bits for a
physical address, ie vttbr_baddr_mask=0x7ffe .  Even
correcting for those issues I haven't been able to make this check work
properly.  I'll resend v5 the patch with all the other suggested changes.



Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread Xiao Guangrong

On Aug 19, 2014, at 2:47 AM, Paolo Bonzini  wrote:

> 
>> I think this patch is auditable: page-fault is always called holding the
>> srcu-lock, so a page fault can't go across synchronize_srcu_expedited.
>> Only these cases can happen:
>> 
>> 1)  page fault occurs before synchronize_srcu_expedited.
>>In this case, vcpu will generate mmio-exit for the memslot being
>>registered by the ioctl. That's ok since the ioctl has not finished.
>> 
>> 2) page fault occurs after synchronize_srcu_expedited and during
>>   increasing the generation-number.
>>   In this case, userspace may get a wrong mmio-exit (that happens if handling
>>   the page-fault is slower than the ioctl); that's ok too since userspace
>>   needs to do the check anyway, as I said above.
>> 
>> 3) page fault occurs after the generation-number update.
>>   That's definitely correct. :)
>> 
>>> Another alternative could be to use the low bit to mark an in-progress
>>> change, and skip the caching if the low bit is set.  Similar to a
>>> seqcount (except if read_seqcount_retry fails, we just punt and not
>>> retry anything), you could use it even though the memory barriers
>>> provided by write_seqcount_begin/end are not too useful in this case.
>> 
>> I do not know how the bit works; a page fault could cache the memslot before
>> the bit is set and cache the generation-number after the bit is set.
>> 
>> Maybe I missed your idea, could you please detail it?
> 
> Something like this:
> 
> - update_memslots(slots, new, kvm->memslots->generation);
> + /* ensure generation number is always increased. */
> + slots->generation = old_memslots->generation + 1;
> + update_memslots(slots, new);
>   rcu_assign_pointer(kvm->memslots, slots);
>   synchronize_srcu_expedited(&kvm->srcu);
> + slots->generation++;
> 
> Then case 1 and 2 will just have a cache miss.

So, case 2 is what concerned you? :) I still think it is ok, but I do not have
a strong opinion on that. How about simplifying it like this:

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..9fabf6a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -234,16 +234,22 @@ static unsigned int get_mmio_spte_generation(u64 spte)
return gen;
 }
 
-static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
+static unsigned int __kvm_current_mmio_generation(struct kvm_memslots *slots)
 {
+
/*
 * Init kvm generation close to MMIO_MAX_GEN to easily test the
 * code of handling generation number wrap-around.
 */
-   return (kvm_memslots(kvm)->generation +
+   return (slots->generation +
  MMIO_MAX_GEN - 150) & MMIO_GEN_MASK;
 }
 
+static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
+{
+   return __kvm_current_mmio_generation(kvm_memslots(kvm));
+}
+
 static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn,
   unsigned access)
 {
@@ -287,9 +293,15 @@ static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, 
gfn_t gfn,
 
 static bool check_mmio_spte(struct kvm *kvm, u64 spte)
 {
+   struct kvm_memslots *slots = kvm_memslots(kvm);
unsigned int kvm_gen, spte_gen;
 
-   kvm_gen = kvm_current_mmio_generation(kvm);
+   if (slots->updated)
+   return false;
+
+   smp_rmb();
+   
+   kvm_gen = __kvm_current_mmio_generation(slots);
spte_gen = get_mmio_spte_generation(spte);
 
trace_check_mmio_spte(spte, kvm_gen, spte_gen);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4b6c01b..1d4e78f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -96,7 +96,7 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 static void update_memslots(struct kvm_memslots *slots,
-   struct kvm_memory_slot *new, u64 last_generation);
+   struct kvm_memory_slot *new);
 
 static void kvm_release_pfn_dirty(pfn_t pfn);
 static void mark_page_dirty_in_slot(struct kvm *kvm,
@@ -685,8 +685,7 @@ static void sort_memslots(struct kvm_memslots *slots)
 }
 
 static void update_memslots(struct kvm_memslots *slots,
-   struct kvm_memory_slot *new,
-   u64 last_generation)
+   struct kvm_memory_slot *new)
 {
if (new) {
int id = new->id;
@@ -697,8 +696,6 @@ static void update_memslots(struct kvm_memslots *slots,
if (new->npages != npages)
sort_memslots(slots);
}
-
-   slots->generation = last_generation + 1;
 }
 
 static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
@@ -720,10 +717,17 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
 {
struct kvm_memslots *old_memslots = kvm->memslots;
 
-   update_memslots(slots, new, kvm->memslots->generation);
+   /* ensure generation number is always increased. */
+   slots->updated = true;
+   smp_wmb();
+   slots->generation = old_memslots->generation;
+   update_memslots(slots, new);
	rcu_assign_pointer(kvm->memslots, slots);
	synchronize_srcu_expedited(&kvm->srcu);
+   slots->generation++;
+   smp_wmb();
+   slots->updated = false;

[PATCH kvm-unit-tests 1/2] x86: Use host CPU parameter for apic test

2014-08-18 Thread Nadav Amit
Currently, the TSC deadline test never runs, since TSC deadline is disabled
unless the host cpu parameter is used. This patch changes the apic test to use
the qemu host cpu parameter.

Signed-off-by: Nadav Amit 
---
 x86/unittests.cfg | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/x86/unittests.cfg b/x86/unittests.cfg
index 6d3e23a..f692b2b 100644
--- a/x86/unittests.cfg
+++ b/x86/unittests.cfg
@@ -9,7 +9,7 @@
 [apic]
 file = apic.flat
 smp = 2
-extra_params = -cpu qemu64,+x2apic
+extra_params = -cpu host,+x2apic
 arch = x86_64
 
 [smptest]
-- 
1.9.1



[PATCH kvm-unit-tests 2/2] x86: Check deadline counter is cleared after interrupt

2014-08-18 Thread Nadav Amit
Once the local-apic timer is configured to use TSC deadline, the deadline
should be cleared after the deadline passes.  This patch adds a check of this
behavior.

Signed-off-by: Nadav Amit 
---
 x86/apic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/x86/apic.c b/x86/apic.c
index 487c248..3f463a5 100644
--- a/x86/apic.c
+++ b/x86/apic.c
@@ -35,6 +35,7 @@ static void start_tsc_deadline_timer(void)
 wrmsr(MSR_IA32_TSCDEADLINE, rdmsr(MSR_IA32_TSC));
 asm volatile ("nop");
 report("tsc deadline timer", tdt_count == 1);
+report("tsc deadline timer clearing", rdmsr(MSR_IA32_TSCDEADLINE) == 0);
 }
 
 static int enable_tsc_deadline_timer(void)
-- 
1.9.1



[PATCH] KVM: x86: Clear apic tsc-deadline after deadline

2014-08-18 Thread Nadav Amit
Intel SDM 10.5.4.1 says "When the timer generates an interrupt, it disarms
itself and clears the IA32_TSC_DEADLINE MSR".

This patch clears the MSR upon timer interrupt delivery when the timer is in
deadline mode.  Since the MSR may be reconfigured while an interrupt is
pending, causing the new value to be overridden, pending timer interrupts are
checked before setting a new deadline.

Signed-off-by: Nadav Amit 
---
 arch/x86/kvm/lapic.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 08e8a89..666c086 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1352,6 +1352,9 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
return;
 
hrtimer_cancel(&apic->lapic_timer.timer);
+   /* Inject here so clearing tscdeadline won't override new value */
+   if (apic_has_pending_timer(vcpu))
+   kvm_inject_apic_timer_irqs(vcpu);
apic->lapic_timer.tscdeadline = data;
start_apic_timer(apic);
 }
@@ -1639,6 +1642,8 @@ void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu)
 
if (atomic_read(&apic->lapic_timer.pending) > 0) {
kvm_apic_local_deliver(apic, APIC_LVTT);
+   if (apic_lvtt_tscdeadline(apic))
+   apic->lapic_timer.tscdeadline = 0;
atomic_set(&apic->lapic_timer.pending, 0);
}
 }
-- 
1.9.1



Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu

2014-08-18 Thread Ingo Molnar

* Don Zickus  wrote:

> > > > So I agree with the motivation of this improvement, but 
> > > > is this implementation namespace-safe?
> > > 
> > > What namespace are you worried about colliding with?  I 
> > > thought softlockup_ would provide the safety??  Maybe I 
> > > am missing something obvious. :-(
> > 
> > I meant PID namespaces - a PID in itself isn't guaranteed 
> > to be unique across the system.
> 
> Ah, I don't think we thought about that.  Is there a better 
> way to do this?  Is there a domain id or something that can 
> be OR'd with the pid?

What is always unique is the task pointer itself. We use pids 
when we interface with user-space - but we don't really do that 
here, right?
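
A sketch of what using the task pointer could look like (illustrative
only; the per-cpu variable name is made up):

	static DEFINE_PER_CPU(struct task_struct *, softlockup_task_ptr_saved);
	...
	/* warn once per hogging task; task pointers are unique
	 * system-wide, unlike PIDs under PID namespaces */
	if (__this_cpu_read(softlockup_task_ptr_saved) != current) {
		__this_cpu_write(softlockup_task_ptr_saved, current);
		__this_cpu_write(soft_watchdog_warn, false);
	}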

Thanks,

Ingo


Re: [PATCH 4/5] watchdog: control hard lockup detection default

2014-08-18 Thread Ingo Molnar
* Don Zickus  wrote:

> > 2)
> > 
> > The softlockup and hardlockup detection control variables 
> > should be in separate flags, inside and outside the kernel 
> > - they (should) not relate to each other.
> 
> They did because years ago I thought we wanted to keep them 
> as one entity instead of two.  I would have to re-work the 
> code to do this (and add more knobs).
> 
> I presume you would want those changes done before taking 
> this patchset?

Yeah, fixing/cleaning up things would be nice before spreading 
the pain via new features/control mechanisms.

Thanks,

Ingo


Re: [PATCH 4/5] watchdog: control hard lockup detection default

2014-08-18 Thread Don Zickus
On Mon, Aug 18, 2014 at 08:07:35PM +0200, Ingo Molnar wrote:
> 
> * Don Zickus  wrote:
> 
> > On Mon, Aug 18, 2014 at 11:16:44AM +0200, Ingo Molnar wrote:
> > > 
> > > * Don Zickus  wrote:
> > > 
> > > > The running kernel still has the ability to enable/disable at any
> > > > time with /proc/sys/kernel/nmi_watchdog as usual. However even
> > > > when the default has been overridden /proc/sys/kernel/nmi_watchdog
> > > > will initially show '1'. To truly turn it on one must disable/enable
> > > > it, i.e.
> > > >   echo 0 > /proc/sys/kernel/nmi_watchdog
> > > >   echo 1 > /proc/sys/kernel/nmi_watchdog
> > > 
> > > This looks like a bug, why is this so?
> > 
> > It is, but it always has been there in the case of the PMU 
> > not being able to provide a resource for the hardlockup.  
> > This change just exposes it more.
> 
> There seems to be two issues:
> 
> 1)
> 
> When it's impossible to enable the hardlockup detector, it 
> should default to -1 or so, and attempts to set it should 
> return a -EINVAL or so.

Ok, it didn't because I set the knob to mean both hard and soft lockup.
But the code knows about the failures and can set it to -1 if it has to.

> 
> Bootup messages should also indicate when it's not possible to 
> enable it but a user requests it.

It does today.

> 
> 2)
> 
> The softlockup and hardlockup detection control variables 
> should be in separate flags, inside and outside the kernel - 
> they (should) not relate to each other.

They did because years ago I thought we wanted to keep them as one entity
instead of two.  I would have to re-work the code to do this (and add more
knobs).

I presume you would want those changes done before taking this patchset?

Cheers,
Don


Re: [PATCH] KVM: x86: Avoid emulating instructions on #UD mistakenly

2014-08-18 Thread Nadav Amit

On Aug 18, 2014, at 11:31 AM, Paolo Bonzini  wrote:

> On 13/08/2014 16:21, Nadav Amit wrote:
>> Correction: the word “never” in the message is too harsh.
>> Nonetheless, there is a regression bug. I encountered it with the
>> “wrfsbase” instruction.
> 
> So KVM is emulating wrfsbase even if the host doesn't support it?
KVM doesn’t know wrfsbase - wrfsbase is encoded like clflush with rep-prefix.
Therefore, the emulator thinks it can emulate it as clflush, and eliminates the 
#UD.

> 
> I'm swapping the order of the two operands of &&, since the first one will
> almost always be true and the second one will almost always be false.
> 
> Also, there's now no need to test EmulateOnUD in the condition below.  Does
> the below look good to you?
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 37a83b24e040..ef117b842334 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -4394,11 +4394,11 @@ done_prefixes:
> 
>   ctxt->execute = opcode.u.execute;
> 
> - if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
> + if (unlikely(ctxt->ud) && likely(!(ctxt->d & EmulateOnUD)))
>   return EMULATION_FAILED;
> 
>   if (unlikely(ctxt->d &
> -  
> (NotImpl|EmulateOnUD|Stack|Op3264|Sse|Mmx|Intercept|CheckPerm))) {
> +  (NotImpl|Stack|Op3264|Sse|Mmx|Intercept|CheckPerm))) {
>   /*
>* These are copied unconditionally here, and checked 
> unconditionally
>* in x86_emulate_insn.
> 

Sure. Until I find some bugs in it. ;-)

Thanks,
Nadav




Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread Paolo Bonzini
On 18/08/2014 18:35, Xiao Guangrong wrote:
> 
> Hi Paolo,
> 
> Thank you for reviewing the patch!
> 
> On Aug 18, 2014, at 9:57 PM, Paolo Bonzini  wrote:
> 
>> On 14/08/2014 09:01, Xiao Guangrong wrote:
>>> -   update_memslots(slots, new, kvm->memslots->generation);
>>> +   /* ensure generation number is always increased. */
>>> +   slots->generation = old_memslots->generation;
>>> +   update_memslots(slots, new);
>>> rcu_assign_pointer(kvm->memslots, slots);
>>> synchronize_srcu_expedited(&kvm->srcu);
>>> +   slots->generation++;
>>
>> I don't trust my brain enough to review this patch.
> 
> Sorry for the confusion. I should explain it more clearly.

Don't worry, it's not your fault. :)

>> kvm_current_mmio_generation seems like a very bad (race-prone) API.  One
>> patch I trust myself reviewing would change a bunch of functions in
>> kvm_main.c to take a memslots struct.  This would make it easy to
>> respect the hard and fast rule of not dereferencing the same pointer
>> twice.  But it would be a tedious change.
> 
> kvm_set_memory_region is the only place updating memslots, and
> kvm_current_mmio_generation accesses memslots by rcu-dereference;
> I do not know why other places need to be taken into account.

The race occurs because gfn_to_pfn_many_atomic or some other function
has already used kvm_memslots().  Calling kvm_memslots() twice is the
root cause of the bug.
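
A sketch of the problematic pattern vs. the safe one (the call sites are
illustrative; the __-prefixed helpers are in the style of Xiao's patch):

	/* BAD: two dereferences of kvm->memslots can observe two
	 * different memslot arrays if an update slips in between */
	slot = gfn_to_memslot(kvm, gfn);
	gen  = kvm_current_mmio_generation(kvm);

	/* GOOD: dereference once and use the same snapshot throughout */
	struct kvm_memslots *slots = kvm_memslots(kvm);
	slot = __gfn_to_memslot(slots, gfn);
	gen  = __kvm_current_mmio_generation(slots);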

> I think this patch is auditable: page-fault handling always runs holding the
> srcu-lock, so a page fault can't span synchronize_srcu_expedited.
> Only these cases can happen:
> 
> 1)  page fault occurs before synchronize_srcu_expedited.
>     In this case, vcpu will generate an mmio-exit for the memslot being
>     registered by the ioctl. That's ok since the ioctl has not finished.
> 
> 2) page fault occurs after synchronize_srcu_expedited and during
>    increasing generation-number.
>    In this case, userspace may get a wrong mmio-exit (that can happen if
>    handling the page fault is slower than the ioctl); that's ok too since
>    userspace needs to do the check anyway, like I said above.
> 
> 3) page fault occurs after generation-number update
>    that's definitely correct. :)
> 
>> Another alternative could be to use the low bit to mark an in-progress
>> change, and skip the caching if the low bit is set.  Similar to a
>> seqcount (except if read_seqcount_retry fails, we just punt and not
>> retry anything), you could use it even though the memory barriers
>> provided by write_seqcount_begin/end are not too useful in this case.
> 
> I do not know how the bit works: a page fault could cache the memslot before
> the bit is set and cache the generation-number after the bit is set.
> 
> Maybe I missed your idea, could you please detail it?

Something like this:

-   update_memslots(slots, new, kvm->memslots->generation);
+   /* ensure generation number is always increased. */
+   slots->generation = old_memslots->generation + 1;
+   update_memslots(slots, new);
rcu_assign_pointer(kvm->memslots, slots);
synchronize_srcu_expedited(&kvm->srcu);
+   slots->generation++;

Then cases 1 and 2 will just have a cache miss.

The "low bit" is really just because each slot update does 2 generation
increases.
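
So a reader could simply do something like this (hypothetical helper,
for illustration only):

	static bool kvm_memslots_in_flux(struct kvm *kvm)
	{
		/* every memslot update bumps the generation twice, so
		 * an odd value means an update is still in progress */
		return kvm_memslots(kvm)->generation & 1;
	}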

Paolo


Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu

2014-08-18 Thread Don Zickus
On Mon, Aug 18, 2014 at 08:01:58PM +0200, Ingo Molnar wrote:
> > > > duration = is_softlockup(touch_ts);
> > > > if (unlikely(duration)) {
> > > > +   pid_t pid = task_pid_nr(current);
> > > > +
> > > > /*
> > > >  * If a virtual machine is stopped by the host it can 
> > > > look to
> > > >  * the watchdog like a soft lockup, check to see if the 
> > > > host
> > > > @@ -326,8 +329,20 @@ static enum hrtimer_restart 
> > > > watchdog_timer_fn(struct hrtimer *hrtimer)
> > > > return HRTIMER_RESTART;
> > > >  
> > > > /* only warn once */
> > > > -   if (__this_cpu_read(soft_watchdog_warn) == true)
> > > > +   if (__this_cpu_read(soft_watchdog_warn) == true) {
> > > > +
> > > > +   /*
> > > > +* Handle the case where multiple processes are
> > > > +* causing softlockups but the duration is small
> > > > +* enough, the softlockup detector can not reset
> > > > +* itself in time.  Use pids to detect this.
> > > > +*/
> > > > +   if (__this_cpu_read(softlockup_warn_pid_saved) 
> > > > != pid) {
> > > 
> > > So I agree with the motivation of this improvement, but is this 
> > > implementation namespace-safe?
> > 
> > What namespace are you worried about colliding with?  I thought
> > softlockup_ would provide the safety??  Maybe I am missing something
> > obvious. :-(
> 
> I meant PID namespaces - a PID in itself isn't guaranteed to be 
> unique across the system.

Ah,  I don't think we thought about that.  Is there a better way to do
this?  Is there a domain id or something that can be OR'd with the pid?

Cheers,
Don


Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread David Matlack
On Mon, Aug 18, 2014 at 9:35 AM, Xiao Guangrong
 wrote:
>
> Hi Paolo,
>
> Thank you for reviewing the patch!
>
> On Aug 18, 2014, at 9:57 PM, Paolo Bonzini  wrote:
>
>> On 14/08/2014 09:01, Xiao Guangrong wrote:
>>> -update_memslots(slots, new, kvm->memslots->generation);
>>> +/* ensure generation number is always increased. */
>>> +slots->generation = old_memslots->generation;
>>> +update_memslots(slots, new);
>>>  rcu_assign_pointer(kvm->memslots, slots);
>>>  synchronize_srcu_expedited(&kvm->srcu);
>>> +slots->generation++;
>>
>> I don't trust my brain enough to review this patch.

Xiao, I thought about your approach a lot and I can't think of a bad race
that isn't already possible due to the fact that kvm allows memslot
mutation to race with vm exits. That being said, it's hard to reason about
all the other "clients" of memslots and what weirdness (or badness) will
be caused by updating generation after srcu_synch. I like Paolo's two
approaches because they fix the bug without any side-effects.

> Sorry for the confusion. I should explain it more clearly.
>
> What this patch tries to fix is: kvm will generate a wrong mmio-exit forever
> if no lucky event cleans the mmio spte (e.g. if there is no memory pressure
> or no context-sync on that spte).
>
> Note, it is hard to do a precise sync between kvm_vm_ioctl_set_memory_region
> and mmio-exit - that means userspace is able to get an mmio-exit even after
> kvm_vm_ioctl_set_memory_region has finished; for example, kvm identifies
> a mmio access before the ioctl is issued and injects the mmio-exit to
> userspace after the ioctl returns. So checking in userspace whether an
> mmio-exit is a real mmio access is needed anyway.
>
>> kvm_current_mmio_generation seems like a very bad (race-prone) API.  One
>> patch I trust myself reviewing would change a bunch of functions in
>> kvm_main.c to take a memslots struct.  This would make it easy to
>> respect the hard and fast rule of not dereferencing the same pointer
>> twice.  But it would be a tedious change.
>
> kvm_set_memory_region is the only place updating memslots, and
> kvm_current_mmio_generation accesses memslots by rcu-dereference;
> I do not know why other places need to be taken into account.

If you rcu_dereference() more than once, you can't trust decisions based
on earlier rcu_dereference() calls. If the mmio cache code only did one
rcu_dereference() per vm exit, this bug would be gone.

> I think this patch is auditable: page-fault handling always runs holding the
> srcu-lock, so a page fault can't span synchronize_srcu_expedited.
> Only these cases can happen:
>
> 1)  page fault occurs before synchronize_srcu_expedited.
>     In this case, vcpu will generate an mmio-exit for the memslot being
>     registered by the ioctl. That's ok since the ioctl has not finished.
>
> 2) page fault occurs after synchronize_srcu_expedited and during
>    increasing generation-number.
>    In this case, userspace may get a wrong mmio-exit (that can happen if
>    handling the page fault is slower than the ioctl); that's ok too since
>    userspace needs to do the check anyway, like I said above.
>
> 3) page fault occurs after generation-number update
>    that's definitely correct. :)
>
>> Another alternative could be to use the low bit to mark an in-progress
>> change, and skip the caching if the low bit is set.  Similar to a
>> seqcount (except if read_seqcount_retry fails, we just punt and not
>> retry anything), you could use it even though the memory barriers
>> provided by write_seqcount_begin/end are not too useful in this case.

I like this approach best. It would require the fewest code changes and
provide the same guarantees.

> I do not know how the bit works: a page fault could cache the memslot before
> the bit is set and cache the generation-number after the bit is set.
>
> Maybe I missed your idea, could you please detail it?

In vcpu_cache_mmio_info(), if the generation is odd, just don't do the
caching, because memslots were changed while we were running and we just
assume the worst case.
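
Roughly (a sketch against the current code, not a tested patch):

	static void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
					 gva_t gva, gfn_t gfn, unsigned access)
	{
		/* odd generation: a memslot update is in flight, so skip
		 * caching and take the slow path on the next access */
		if (kvm_memslots(vcpu->kvm)->generation & 1)
			return;

		vcpu->arch.mmio_gva = gva & PAGE_MASK;
		vcpu->arch.access = access;
		vcpu->arch.mmio_gfn = gfn;
	}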


Re: [PATCH 4/5] watchdog: control hard lockup detection default

2014-08-18 Thread Ingo Molnar

* Don Zickus  wrote:

> On Mon, Aug 18, 2014 at 11:16:44AM +0200, Ingo Molnar wrote:
> > 
> > * Don Zickus  wrote:
> > 
> > > The running kernel still has the ability to enable/disable at any
> > time with /proc/sys/kernel/nmi_watchdog as usual. However even
> > > when the default has been overridden /proc/sys/kernel/nmi_watchdog
> > > will initially show '1'. To truly turn it on one must disable/enable
> > > it, i.e.
> > >   echo 0 > /proc/sys/kernel/nmi_watchdog
> > >   echo 1 > /proc/sys/kernel/nmi_watchdog
> > 
> > This looks like a bug, why is this so?
> 
> It is, but it always has been there in the case of the PMU 
> not being able to provide a resource for the hardlockup.  
> This change just exposes it more.

There seems to be two issues:

1)

When it's impossible to enable the hardlockup detector, it
should default to -1 or so, and attempts to set it should
return a -EINVAL or so (see the sketch below).

Bootup messages should also indicate when it's not possible to 
enable it but a user requests it.

2)

The softlockup and hardlockup detection control variables 
should be in separate flags, inside and outside the kernel - 
they (should) not relate to each other.
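
For 1), a hypothetical sketch of what the proc handler could do (the
function and flag names are assumed, not actual code):

	static int proc_nmi_watchdog(struct ctl_table *table, int write,
				     void __user *buffer, size_t *lenp,
				     loff_t *ppos)
	{
		/* hardlockup detector unavailable (e.g. no PMU counter):
		 * the knob reads as -1 and refuses to be written */
		if (write && !watchdog_hardlockup_available)
			return -EINVAL;
		return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
	}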

Thanks,

Ingo


Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu

2014-08-18 Thread Ingo Molnar

* Don Zickus  wrote:

> On Mon, Aug 18, 2014 at 11:03:19AM +0200, Ingo Molnar wrote:
> > * Don Zickus  wrote:
> > 
> > > From: chai wen 
> > > 
> > > For now, soft lockup detector warns once for each case of process 
> > > softlockup.
> > > But the thread 'watchdog/n' may not always get the cpu at the time slot 
> > > between
> > > the task switch of two processes hogging that cpu to reset 
> > > soft_watchdog_warn.
> > > 
> > > An example would be two processes hogging the cpu.  Process A causes the
> > > softlockup warning and is killed manually by a user.  Process B 
> > > immediately
> > > becomes the new process hogging the cpu preventing the softlockup code 
> > > from
> > > resetting the soft_watchdog_warn variable.
> > > 
> > > This case is a false negative of "warn only once for a process", as there 
> > > may
> > > be a different process that is going to hog the cpu.  Resolve this by
> > > saving/checking the pid of the hogging process and use that to reset
> > > soft_watchdog_warn too.
> > > 
> > > Signed-off-by: chai wen 
> > > [modified the comment and changelog to be more specific]
> > > Signed-off-by: Don Zickus 
> > > ---
> > >  kernel/watchdog.c |   20 ++--
> > >  1 files changed, 18 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > > index 4c2e11c..6d0a891 100644
> > > --- a/kernel/watchdog.c
> > > +++ b/kernel/watchdog.c
> > > @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
> > >  static DEFINE_PER_CPU(bool, soft_watchdog_warn);
> > >  static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
> > >  static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
> > > +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved);
> > >  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > >  static DEFINE_PER_CPU(bool, hard_watchdog_warn);
> > >  static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
> > > @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
> > > hrtimer *hrtimer)
> > >*/
> > >   duration = is_softlockup(touch_ts);
> > >   if (unlikely(duration)) {
> > > + pid_t pid = task_pid_nr(current);
> > > +
> > >   /*
> > >* If a virtual machine is stopped by the host it can look to
> > >* the watchdog like a soft lockup, check to see if the host
> > > @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
> > > hrtimer *hrtimer)
> > >   return HRTIMER_RESTART;
> > >  
> > >   /* only warn once */
> > > - if (__this_cpu_read(soft_watchdog_warn) == true)
> > > + if (__this_cpu_read(soft_watchdog_warn) == true) {
> > > +
> > > + /*
> > > +  * Handle the case where multiple processes are
> > > +  * causing softlockups but the duration is small
> > > +  * enough, the softlockup detector can not reset
> > > +  * itself in time.  Use pids to detect this.
> > > +  */
> > > + if (__this_cpu_read(softlockup_warn_pid_saved) != pid) {
> > 
> > So I agree with the motivation of this improvement, but is this 
> > implementation namespace-safe?
> 
> What namespace are you worried about colliding with?  I thought
> softlockup_ would provide the safety??  Maybe I am missing something
> obvious. :-(

I meant PID namespaces - a PID in itself isn't guaranteed to be 
unique across the system.

Thanks,

Ingo


Re: [PATCH v9 4/4] arm: ARMv7 dirty page logging 2nd stage page fault handling support

2014-08-18 Thread Mario Smarduch
On 08/18/2014 05:51 AM, Christoffer Dall wrote:
> On Wed, Aug 13, 2014 at 06:20:19PM -0700, Mario Smarduch wrote:
>> On 08/13/2014 12:30 AM, Christoffer Dall wrote:
>>> On Tue, Aug 12, 2014 at 06:27:11PM -0700, Mario Smarduch wrote:
 On 08/12/2014 02:50 AM, Christoffer Dall wrote:
> On Mon, Aug 11, 2014 at 06:25:05PM -0700, Mario Smarduch wrote:
>> On 08/11/2014 12:13 PM, Christoffer Dall wrote:
>>> On Thu, Jul 24, 2014 at 05:56:08PM -0700, Mario Smarduch wrote:
>>>
>>> [...]
>>>
 @@ -1151,7 +1170,7 @@ static void kvm_set_spte_handler(struct kvm 
 *kvm, gpa_t gpa, void *data)

>>
>> Ok I see, I was thinking you thought it was breaking something.
>> Yeah, I'll add the comment; in reality this is another use case
>> where a PMD may need to be converted to a page table, so it makes
>> sense to contrast use cases.
>>
> 
> the hidden knowledge is that MMU notifiers will ensure a huge PMD gets
> unmapped before trying to change the physical backing of the underlying
> PTEs, so it's a gigantic kernel bug if this gets called on something
> mapped with a huge PMD.
> 

That's a good way of putting it; luckily I was already aware of that from
looking at some other features.

> 
>>>

 Should I add comments on flag use in other places as well?

>>>
> >>> It's always a judgement call.  I didn't find it necessary to put a
> >>> comment elsewhere because I think it's pretty obvious that we would
> >>> never care about logging writes to device regions.
>>>
>>> However, this made me think, are we making sure that we are not marking
>>> device mappings as read-only in the wp_range functions?  I think it's
>>
>> KVM_SET_USER_MEMORY_REGION ioctl doesn't check type of region being
>> installed/operated on (KVM_MEM_LOG_DIRTY_PAGES), in case of QEMU
>> these regions wind up in KVMState->KVMSlot[], when
>> memory_region_add_subregion() is called KVM listener installs it.
>> For migration and dirty page logging QEMU walks the KVMSlot[] array.
>>
>> For QEMU VFIO (PCI) mmap()ing BAR of type IORESOURCE_MEM,
>> causes the memory region to be added to KVMState->KVMSlot[].
>> In that case it's possible to walk KVMState->KVMSlot[], issue
>> the ioctl, and come across a device mapping with normal memory and
>> WP its s2ptes (VFIO sets unmigratable state though).
>>
>> But I'm not sure what's there to stop someone calling the ioctl and
>> installing a region with device memory type. Most likely though, if you
>> installed that kind of region, migration would be disabled.
>>
>> But just for logging use not checking memory type could be an issue.
>>
> I forgot that the current write-protect'ing is limited to the memory
> region boundaries, so everything should be fine.

I looked through this way back, but it was worth to revisit.
> 
> If user-space write-protects device memory regions, the worst
> consequence is that it breaks the guest, but that's its own
> responsibility, so I don't think you need to change anything.

Thanks for the detailed review. I'll go off now, rebase, and make
the needed changes.

> 
> -Christoffer
> 



Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread Xiao Guangrong

Hi Paolo,

Thank you for reviewing the patch!

On Aug 18, 2014, at 9:57 PM, Paolo Bonzini  wrote:

> On 14/08/2014 09:01, Xiao Guangrong wrote:
>> -update_memslots(slots, new, kvm->memslots->generation);
>> +/* ensure generation number is always increased. */
>> +slots->generation = old_memslots->generation;
>> +update_memslots(slots, new);
>>  rcu_assign_pointer(kvm->memslots, slots);
>>  synchronize_srcu_expedited(&kvm->srcu);
>> +slots->generation++;
> 
> I don't trust my brain enough to review this patch.

Sorry for the confusion. I should explain it more clearly.

What this patch tries to fix is: kvm will generate a wrong mmio-exit forever
if no lucky event cleans the mmio spte (e.g. if there is no memory pressure
or no context-sync on that spte).

Note, it is hard to do a precise sync between kvm_vm_ioctl_set_memory_region
and mmio-exit - that means userspace is able to get an mmio-exit even after
kvm_vm_ioctl_set_memory_region has finished; for example, kvm identifies
a mmio access before the ioctl is issued and injects the mmio-exit to
userspace after the ioctl returns. So checking in userspace whether an
mmio-exit is a real mmio access is needed anyway.

> kvm_current_mmio_generation seems like a very bad (race-prone) API.  One
> patch I trust myself reviewing would change a bunch of functions in
> kvm_main.c to take a memslots struct.  This would make it easy to
> respect the hard and fast rule of not dereferencing the same pointer
> twice.  But it would be a tedious change.

kvm_set_memory_region is the only place updating memslots, and
kvm_current_mmio_generation accesses memslots by rcu-dereference;
I do not know why other places need to be taken into account.

I think this patch is auditable: page-fault handling always runs holding the
srcu-lock, so a page fault can't span synchronize_srcu_expedited.
Only these cases can happen:

1)  page fault occurs before synchronize_srcu_expedited.
    In this case, vcpu will generate an mmio-exit for the memslot being
    registered by the ioctl. That's ok since the ioctl has not finished.

2) page fault occurs after synchronize_srcu_expedited and during
   increasing generation-number.
   In this case, userspace may get a wrong mmio-exit (that can happen if
   handling the page fault is slower than the ioctl); that's ok too since
   userspace needs to do the check anyway, like I said above.

3) page fault occurs after generation-number update
   that's definitely correct. :)

> Another alternative could be to use the low bit to mark an in-progress
> change, and skip the caching if the low bit is set.  Similar to a
> seqcount (except if read_seqcount_retry fails, we just punt and not
> retry anything), you could use it even though the memory barriers
> provided by write_seqcount_begin/end are not too useful in this case.

I do not know how the bit works: a page fault could cache the memslot before
the bit is set and cache the generation-number after the bit is set.

Maybe I missed your idea, could you please detail it?



Re: [PATCH 4/5] watchdog: control hard lockup detection default

2014-08-18 Thread Don Zickus
On Mon, Aug 18, 2014 at 11:16:44AM +0200, Ingo Molnar wrote:
> 
> * Don Zickus  wrote:
> 
> > The running kernel still has the ability to enable/disable at any
> > time with /proc/sys/kernel/nmi_watchdog as usual. However even
> > when the default has been overridden /proc/sys/kernel/nmi_watchdog
> > will initially show '1'. To truly turn it on one must disable/enable
> > it, i.e.
> >   echo 0 > /proc/sys/kernel/nmi_watchdog
> >   echo 1 > /proc/sys/kernel/nmi_watchdog
> 
> This looks like a bug, why is this so?

It is, but it always has been there in the case of the PMU not being able
to provide a resource for the hardlockup.  This change just exposes it
more.

Originally I wrote the code to keep the softlockup and hardlockup in sync.
Now this patch attempts to split it up because the guest PMU is still
flushing out bugs.

The above scenario only really applies to developers.  Their guest boots
up with the hardlockup disabled.  If they want to enable it to debug or
develop, they have to go through the above steps.  The idea is once the KVM
PMU is stable enough, the default switches to hardlockup enabled and this
problem goes back to what it is today.

I guess I was feeling lazy about modifying a bunch of code to separate the
hard and soft lockup for a temporarily broken feature.  :-/

I thought it would just be easier to put this code in to quickly stabilize
their PMU and switch the default later.

Thoughts?  I think Uli laid out a more detailed example in his email.

Cheers,
Don



[PATCH] vmx: Replace rcu_assign_pointer() with RCU_INIT_POINTER()

2014-08-18 Thread Andreea-Cristina Bernat
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use RCU_INIT_POINTER() instead of rcu_assign_pointer()
because it has less overhead.
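
As a quick illustration of the difference (a sketch, not part of the patch):

	/* rcu_assign_pointer() orders the initialization of *new before
	 * publishing the pointer; with NULL there is nothing to order,
	 * so RCU_INIT_POINTER() can be a plain store without the barrier */
	rcu_assign_pointer(gp, new);	/* needs release ordering */
	RCU_INIT_POINTER(gp, NULL);	/* no ordering required */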

The following Coccinelle semantic patch was used:
@@
@@

- rcu_assign_pointer
+ RCU_INIT_POINTER
  (..., NULL)

Signed-off-by: Andreea-Cristina Bernat 
---
 arch/x86/kvm/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 801332e..12b840c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9029,7 +9029,7 @@ static void __exit vmx_exit(void)
free_page((unsigned long)vmx_vmread_bitmap);
 
 #ifdef CONFIG_KEXEC
-   rcu_assign_pointer(crash_vmclear_loaded_vmcss, NULL);
+   RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
synchronize_rcu();
 #endif
 
-- 
1.9.1



Re: [PATCH 4/5] watchdog: control hard lockup detection default

2014-08-18 Thread Don Zickus
On Mon, Aug 18, 2014 at 11:12:39AM +0200, Ingo Molnar wrote:
> 
> * Don Zickus  wrote:
> 
> > From: Ulrich Obergfell 
> > 
> > In some cases we don't want hard lockup detection enabled by default.
> > An example is when running as a guest. Introduce
> > 
> >   watchdog_enable_hardlockup_detector(bool)
> 
> So, the name watchdog_enable_hardlockup_detector_enable(false) 
> is both too long and also really confusing (because first it 
> suggests enablement, then disables it), so I renamed it to 
> hardlockup_detector_set(), which allows two natural variants:
> 
>   hardlockup_detector_set(false);
>   ...
>   hardlockup_detector_set(true);

Fair enough.  Thanks!

Cheers,
Don


Re: [PATCH] eventfd: Replace rcu_assign_pointer() with RCU_INIT_POINTER()

2014-08-18 Thread Paolo Bonzini
On 18/08/2014 17:01, Andreea-Cristina Bernat wrote:
> The uses of "rcu_assign_pointer()" are NULLing out the pointers.
> According to RCU_INIT_POINTER()'s block comment:
> "1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
> it is better to use RCU_INIT_POINTER() instead of rcu_assign_pointer()
> because it has less overhead.
> 
> The following Coccinelle semantic patch was used:
> @@
> @@
> 
> - rcu_assign_pointer
> + RCU_INIT_POINTER
>   (..., NULL)
> 
> Signed-off-by: Andreea-Cristina Bernat 
> ---
>  virt/kvm/eventfd.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 20c3af7..a49130f 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -278,7 +278,7 @@ static void irqfd_update(struct kvm *kvm, struct _irqfd 
> *irqfd,
>   struct kvm_kernel_irq_routing_entry *e;
>  
>   if (irqfd->gsi >= irq_rt->nr_rt_entries) {
> - rcu_assign_pointer(irqfd->irq_entry, NULL);
> + RCU_INIT_POINTER(irqfd->irq_entry, NULL);
>   return;
>   }
>  
> @@ -287,7 +287,7 @@ static void irqfd_update(struct kvm *kvm, struct _irqfd 
> *irqfd,
>   if (e->type == KVM_IRQ_ROUTING_MSI)
>   rcu_assign_pointer(irqfd->irq_entry, e);
>   else
> - rcu_assign_pointer(irqfd->irq_entry, NULL);
> + RCU_INIT_POINTER(irqfd->irq_entry, NULL);
>   }
>  }
>  
> @@ -473,7 +473,7 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd 
> *args)
>* It is paired with synchronize_srcu done by caller
>* of that function.
>*/
> - rcu_assign_pointer(irqfd->irq_entry, NULL);
> + RCU_INIT_POINTER(irqfd->irq_entry, NULL);
>   irqfd_deactivate(irqfd);
>   }
>   }
> 

Hi, this patch had actually already been submitted last March.  It
slipped through the cracks.  I'm now applying both of Monam Agarwal's
RCU_INIT_POINTER patches.  Sorry. :)

Thanks,

Paolo


Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu

2014-08-18 Thread Don Zickus
On Mon, Aug 18, 2014 at 11:03:19AM +0200, Ingo Molnar wrote:
> * Don Zickus  wrote:
> 
> > From: chai wen 
> > 
> > For now, soft lockup detector warns once for each case of process 
> > softlockup.
> > But the thread 'watchdog/n' may not always get the cpu at the time slot 
> > between
> > the task switch of two processes hogging that cpu to reset 
> > soft_watchdog_warn.
> > 
> > An example would be two processes hogging the cpu.  Process A causes the
> > softlockup warning and is killed manually by a user.  Process B immediately
> > becomes the new process hogging the cpu preventing the softlockup code from
> > resetting the soft_watchdog_warn variable.
> > 
> > This case is a false negative of "warn only once for a process", as there 
> > may
> > be a different process that is going to hog the cpu.  Resolve this by
> > saving/checking the pid of the hogging process and use that to reset
> > soft_watchdog_warn too.
> > 
> > Signed-off-by: chai wen 
> > [modified the comment and changelog to be more specific]
> > Signed-off-by: Don Zickus 
> > ---
> >  kernel/watchdog.c |   20 ++--
> >  1 files changed, 18 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > index 4c2e11c..6d0a891 100644
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
> >  static DEFINE_PER_CPU(bool, soft_watchdog_warn);
> >  static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
> >  static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
> > +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved);
> >  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> >  static DEFINE_PER_CPU(bool, hard_watchdog_warn);
> >  static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
> > @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
> > hrtimer *hrtimer)
> >  */
> > duration = is_softlockup(touch_ts);
> > if (unlikely(duration)) {
> > +   pid_t pid = task_pid_nr(current);
> > +
> > /*
> >  * If a virtual machine is stopped by the host it can look to
> >  * the watchdog like a soft lockup, check to see if the host
> > @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
> > hrtimer *hrtimer)
> > return HRTIMER_RESTART;
> >  
> > /* only warn once */
> > -   if (__this_cpu_read(soft_watchdog_warn) == true)
> > +   if (__this_cpu_read(soft_watchdog_warn) == true) {
> > +
> > +   /*
> > +* Handle the case where multiple processes are
> > +* causing softlockups but the duration is small
> > +* enough, the softlockup detector can not reset
> > +* itself in time.  Use pids to detect this.
> > +*/
> > +   if (__this_cpu_read(softlockup_warn_pid_saved) != pid) {
> 
> So I agree with the motivation of this improvement, but is this 
> implementation namespace-safe?

What namespace are you worried about colliding with?  I thought
softlockup_ would provide the safety??  Maybe I am missing something
obvious. :-(

Cheers,
Don


Re: [PATCH 3/5] watchdog: fix print-once on enable

2014-08-18 Thread Don Zickus
On Mon, Aug 18, 2014 at 11:07:57AM +0200, Ingo Molnar wrote:
> 
> * Don Zickus  wrote:
> 
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -522,6 +522,9 @@ static void watchdog_nmi_disable(unsigned int cpu)
> > /* should be in cleanup, but blocks oprofile */
> > perf_event_release_kernel(event);
> > }
> > +   if (cpu == 0)
> > +   /* watchdog_nmi_enable() expects this to be zero initially. */
> > +   cpu0_err = 0;
> > return;
> >  }
> 
> While at it I also removed the stray 'return;'.

Doh, sorry about that.  Thanks!

Cheers,
Don


[PATCH] eventfd: Replace rcu_assign_pointer() with RCU_INIT_POINTER()

2014-08-18 Thread Andreea-Cristina Bernat
The uses of "rcu_assign_pointer()" are NULLing out the pointers.
According to RCU_INIT_POINTER()'s block comment:
"1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use RCU_INIT_POINTER() instead of rcu_assign_pointer()
because it has less overhead.

The following Coccinelle semantic patch was used:
@@
@@

- rcu_assign_pointer
+ RCU_INIT_POINTER
  (..., NULL)

Signed-off-by: Andreea-Cristina Bernat 
---
 virt/kvm/eventfd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 20c3af7..a49130f 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -278,7 +278,7 @@ static void irqfd_update(struct kvm *kvm, struct _irqfd *irqfd,
struct kvm_kernel_irq_routing_entry *e;
 
if (irqfd->gsi >= irq_rt->nr_rt_entries) {
-   rcu_assign_pointer(irqfd->irq_entry, NULL);
+   RCU_INIT_POINTER(irqfd->irq_entry, NULL);
return;
}
 
@@ -287,7 +287,7 @@ static void irqfd_update(struct kvm *kvm, struct _irqfd *irqfd,
if (e->type == KVM_IRQ_ROUTING_MSI)
rcu_assign_pointer(irqfd->irq_entry, e);
else
-   rcu_assign_pointer(irqfd->irq_entry, NULL);
+   RCU_INIT_POINTER(irqfd->irq_entry, NULL);
}
 }
 
@@ -473,7 +473,7 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 * It is paired with synchronize_srcu done by caller
 * of that function.
 */
-   rcu_assign_pointer(irqfd->irq_entry, NULL);
+   RCU_INIT_POINTER(irqfd->irq_entry, NULL);
irqfd_deactivate(irqfd);
}
}
-- 
1.9.1



Re: [PATCH] KVM: x86: Increase the number of fixed MTRR regs to 10

2014-08-18 Thread Paolo Bonzini
On 18/08/2014 16:31, Nadav Amit wrote:
> The cause of the blue-screen appears to be seabios, which leaves only 0x20
> slots for “smp_mtrr”s.
> Apparently, the increase in the variable range MTRR count caused it to
> exhaust the available slots.
> As a result, some MSRs are not initialised by the BIOS (specifically, 3.5-4GB
> are not marked as UC), which causes Windows to panic.
> 
> Once we increase the size of the array smp_mtrr in seabios, Windows boots.
> 
> Paolo, you may wish to revert the patch. Please note that it was applied to 
> some stable branches.

Thanks.  I'll post a patch to SeaBIOS too, since this should happen on
bare metal too.

Paolo


Re: [PATCH] KVM: x86: Increase the number of fixed MTRR regs to 10

2014-08-18 Thread Nadav Amit
The cause of the blue-screen appears to be seabios, which leaves only 0x20
slots for “smp_mtrr”s.
Apparently, the increase in the variable range MTRR count caused it to exhaust
the available slots.
As a result, some MSRs are not initialised by the BIOS (specifically, 3.5-4GB
are not marked as UC), which causes Windows to panic.

Once we increase the size of the array smp_mtrr in seabios, Windows boots.
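
(Rough arithmetic, assuming SeaBIOS records one smp_mtrr entry per MTRR MSR
write: 11 fixed-range MSRs + 2 MSRs per variable range + IA32_MTRR_DEF_TYPE
gives 11 + 2*8 + 1 = 28 entries with 8 variable ranges, but 11 + 2*10 + 1 = 32
with 10 ranges, which already fills all 0x20 slots, so any further MSR write
is silently dropped.)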

Paolo, you may wish to revert the patch. Please note that it was applied to 
some stable branches.

Nadav

On Aug 18, 2014, at 12:39 PM, Nadav Amit  wrote:

> I reproduced the blue-screen. Let me try to figure it out.
> 
> Nadav
> 
> On Aug 18, 2014, at 11:11 AM, Wanpeng Li  wrote:
> 
>> On Mon, Aug 18, 2014 at 09:39:39AM +0300, Nadav Amit wrote:
>>> This should have been a benign patch. I'll try to get a windows 7
>>> installation disk and check ASAP.
>>> 
>> 
>> In addition, it can only be reproduced on 32bit win7 w/ MP enabled;
>> with UP it can't be reproduced.
>> 
>> Regards,
>> Wanpeng Li 
>> 
>>> Nadav
>>> 
 On 18 Aug 2014, at 05:17, Wanpeng Li  wrote:
 
 Hi Nadav,
> On Wed, Jun 18, 2014 at 05:21:19PM +0300, Nadav Amit wrote:
> Recent Intel CPUs have 10 variable range MTRRs. Since operating systems
> sometime make assumptions on CPUs while they ignore capability MSRs, it is
> better for KVM to be consistent with recent CPUs. Reporting more MTRRs 
> than
> actually supported has no functional implications.
> 
> Signed-off-by: Nadav Amit 
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h 
> b/arch/x86/include/asm/kvm_host.h
> index 4931415..0bab29d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -95,7 +95,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t 
> base_gfn, int level)
> #define KVM_REFILL_PAGES 25
> #define KVM_MAX_CPUID_ENTRIES 80
> #define KVM_NR_FIXED_MTRR_REGION 88
> -#define KVM_NR_VAR_MTRR 8
> +#define KVM_NR_VAR_MTRR 10
 
 We observed an obvious regression caused by this commit: a 32bit
 win7 guest shows a blue screen during boot.
 
 Regards,
 Wanpeng Li 
 
> #define ASYNC_PF_PER_VCPU 64
> 
> -- 
> 1.9.1
> 
> 





Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

2014-08-18 Thread Paolo Bonzini
On 14/08/2014 09:01, Xiao Guangrong wrote:
> - update_memslots(slots, new, kvm->memslots->generation);
> + /* ensure generation number is always increased. */
> + slots->generation = old_memslots->generation;
> + update_memslots(slots, new);
>   rcu_assign_pointer(kvm->memslots, slots);
>   synchronize_srcu_expedited(&kvm->srcu);
> + slots->generation++;

I don't trust my brain enough to review this patch.

kvm_current_mmio_generation seems like a very bad (race-prone) API.  One
patch I trust myself reviewing would change a bunch of functions in
kvm_main.c to take a memslots struct.  This would make it easy to
respect the hard and fast rule of not dereferencing the same pointer
twice.  But it would be a tedious change.

Another alternative could be to use the low bit to mark an in-progress
change, and skip the caching if the low bit is set.  Similar to a
seqcount (except if read_seqcount_retry fails, we just punt and not
retry anything), you could use it even though the memory barriers
provided by write_seqcount_begin/end are not too useful in this case.

Paolo


Re: [questions] about using vfio to assign sr-iov vf to vm

2014-08-18 Thread Alex Williamson
On Mon, 2014-08-18 at 17:49 +0800, Zhang Haoyu wrote:
> >>> >> >> Hi, all
> >>> >> >> I'm using VFIO to assign intel 82599 VF to VM, now I encounter a 
> >>> >> >> problem,
> >>> >> >> 82599 PF and its VFs belong to the same iommu_group, but I only 
> >>> >> >> want to assign some VFs to one VM, and some other VFs to another VM,
> >>> >> >> so how to only unbind (part of) the VFs but PF?
> >>> >> >> I read the kernel doc vfio.txt; I'm not sure whether I should unbind all
> >>> >> >> of the devices which belong to one iommu_group.
> >>> >> >> If so, because the PF and its VFs belong to the same iommu_group, if I
> >>> >> >> unbind the PF, its VFs also disappear.
> >>> >> >> I think I misunderstand something,
> >>> >> >> any advice?
> >>> >> >
> >>> >> >This occurs when the PF is installed behind components in the system
> >>> >> >that do not support PCIe Access Control Services (ACS).  The IOMMU 
> >>> >> >group
> >>> >> >contains both the PF and the VF because upstream transactions can be
> >>> >> >re-routed downstream by these non-ACS components before being 
> >>> >> >translated
> >>> >> >by the IOMMU.  Please provide 'sudo lspci -vvv', 'lspci -n', and 
> >>> >> >kernel
> >>> >> >version and we might be able to give you some advise on how to work
> >>> >> >around the problem.  Thanks,
> >>> >> >
> >>> >> The intel 82599 (02:00.0 or 02:00.1) is behind the PCI bridge
> >>> >> (00:01.1); does the 00:01.1 PCI bridge support ACS?
> >>> >
> >>> >It does not and that's exactly the problem.  We must assume that the
> >>> >root port can redirect a transaction from a subordinate device back to
> >>> >another subordinate device without IOMMU translation when ACS support is
> >>> >not present.  If you had a device plugged in below 00:01.0, we'd also
> >>> >need to assume that non-IOMMU translated peer-to-peer between devices
> >>> >behind either function, 00:01.0 or 00:01.1, is possible.
> >>> >
> >>> >Intel has indicated that processor root ports for all Xeon class
> >>> >processors should support ACS and have verified isolation for PCH based
> >>> >root ports allowing us to support quirks in place of ACS support.  I'm
> >>> >not aware of any efforts at Intel to verify isolation capabilities of
> >>> >root ports on client processors.  They are however aware that lack of
> >>> >ACS is a limiting factor for usability of VT-d, and I hope that we'll
> >>> >see future products with ACS support.
> >>> >
> >>> >Chances are good that the PCH root port at 00:1c.0 is supported by an
> >>> >ACS quirk, but it seems that your system has a PCIe switch below the
> >>> >root port.  If the PCIe switch downstream ports support ACS, then you
> >>> >may be able to move the 82599 to the empty slot at bus 07 to separate
> >>> >the VFs into different IOMMU groups.  Thanks,
> >>> >
> >>> Thanks, Alex,
> >>> how to tell whether a PCI bridge/device supports the ACS capability?
> >>> 
> >>> I ran "lspci -vvv -s | grep -i ACS", and nothing matched.
> >>> # lspci -vvv -s 00:1c.0
> >>> 00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family 
> >>> PCI Express Root Port 1 (rev b5) (prog-if 00 [Normal decode])
> >>
> >>
> >>Ideally there would be capabilities for it, something like:
> >>
> >>Capabilities [xxx] Access Control Services...
> >>
> >>But, Intel failed to provide this, so we enable "effective" ACS
> >>capabilities via a quirk:
> >>
> >>drivers/pci/quirks.c:
> >>/*
> >> * Many Intel PCH root ports do provide ACS-like features to disable peer
> >> * transactions and validate bus numbers in requests, but do not provide an
> >> * actual PCIe ACS capability.  This is the list of device IDs known to fall
> >> * into that category as provided by Intel in Red Hat bugzilla 1037684.
> >> */
> >>static const u16 pci_quirk_intel_pch_acs_ids[] = {
> >>/* Ibexpeak PCH */
> >>0x3b42, 0x3b43, 0x3b44, 0x3b45, 0x3b46, 0x3b47, 0x3b48, 0x3b49,
> >>0x3b4a, 0x3b4b, 0x3b4c, 0x3b4d, 0x3b4e, 0x3b4f, 0x3b50, 0x3b51,
> >>/* Cougarpoint PCH */
> >>0x1c10, 0x1c11, 0x1c12, 0x1c13, 0x1c14, 0x1c15, 0x1c16, 0x1c17,
> >>0x1c18, 0x1c19, 0x1c1a, 0x1c1b, 0x1c1c, 0x1c1d, 0x1c1e, 0x1c1f,
> >>/* Pantherpoint PCH */
> >>0x1e10, 0x1e11, 0x1e12, 0x1e13, 0x1e14, 0x1e15, 0x1e16, 0x1e17,
> >>0x1e18, 0x1e19, 0x1e1a, 0x1e1b, 0x1e1c, 0x1e1d, 0x1e1e, 0x1e1f,
> >>/* Lynxpoint-H PCH */
> >>0x8c10, 0x8c11, 0x8c12, 0x8c13, 0x8c14, 0x8c15, 0x8c16, 0x8c17,
> >>0x8c18, 0x8c19, 0x8c1a, 0x8c1b, 0x8c1c, 0x8c1d, 0x8c1e, 0x8c1f,
> >>/* Lynxpoint-LP PCH */
> >>0x9c10, 0x9c11, 0x9c12, 0x9c13, 0x9c14, 0x9c15, 0x9c16, 0x9c17,
> >>0x9c18, 0x9c19, 0x9c1a, 0x9c1b,
> >>/* Wildcat PCH */
> >>0x9c90, 0x9c91, 0x9c92, 0x9c93, 0x9c94, 0x9c95, 0x9c96, 0x9c97,
> >>0x9c98, 0x9c99, 0x9c9a, 0x9c9b,
> >>/* Patsburg (X79) PCH */
> >>0x1d10, 0x1d12, 0x1d14, 0x1d16, 0x1d18, 0x1d1a, 0x1d1c, 0x1d1e,
> >>};
> >>
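(The matching side is then just a table walk - a sketch of the idea, not
necessarily the exact kernel code:)

	static bool pci_quirk_intel_pch_acs_match(struct pci_dev *dev)
	{
		int i;

		if (dev->vendor != PCI_VENDOR_ID_INTEL)
			return false;
		for (i = 0; i < ARRAY_SIZE(pci_quirk_intel_pch_acs_ids); i++)
			if (dev->device == pci_quirk_intel_pch_acs_ids[i])
				return true;
		return false;
	}
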
> >>Hopefully if you run 'lspci -n', you

Re: [PATCH v9 4/4] arm: ARMv7 dirty page logging 2nd stage page fault handling support

2014-08-18 Thread Christoffer Dall
On Wed, Aug 13, 2014 at 06:20:19PM -0700, Mario Smarduch wrote:
> On 08/13/2014 12:30 AM, Christoffer Dall wrote:
> > On Tue, Aug 12, 2014 at 06:27:11PM -0700, Mario Smarduch wrote:
> >> On 08/12/2014 02:50 AM, Christoffer Dall wrote:
> >>> On Mon, Aug 11, 2014 at 06:25:05PM -0700, Mario Smarduch wrote:
>  On 08/11/2014 12:13 PM, Christoffer Dall wrote:
> > On Thu, Jul 24, 2014 at 05:56:08PM -0700, Mario Smarduch wrote:
> > 
> > [...]
> > 
> >> @@ -1151,7 +1170,7 @@ static void kvm_set_spte_handler(struct kvm 
> >> *kvm, gpa_t gpa, void *data)
> >>  {
> >>pte_t *pte = (pte_t *)data;
> >>  
> >> -  stage2_set_pte(kvm, NULL, gpa, pte, false);
> >> +  stage2_set_pte(kvm, NULL, gpa, pte, false, false);
> >
> > why is logging never active if we are called from MMU notifiers?
> 
>  mmu notifiers update sptes, but I don't see how these updates
>  can result in guest dirty pages. Also guest pages are marked dirty
>  from 2nd stage page fault handlers (searching through the code).
> 
> >>> Ok, then add:
> >>>
> >>> /*
> >>>  * We can always call stage2_set_pte with logging_active == false,
> >>>  * because MMU notifiers will have unmapped a huge PMD before calling
> >>>  * ->change_pte() (which in turn calls kvm_set_spte_hva()) and therefore
> >>>  * stage2_set_pte() never needs to clear out a huge PMD through this
> >>>  * calling path.
> >>>  */
> >>
> >> So here, on a permission change to primary ptes, the kernel first invalidates
> >> the related s2ptes, followed by ->change_pte calls to synchronize the s2ptes.
> >> As a consequence of the invalidation we unmap huge PMDs, if a page falls in
> >> that range.
> >>
> >> Is the comment to point out use of logging flag under various scenarios?
> > 
> > The comment is because when you look at this function it is not obvious
> > why we pass logging_active=false, despite the fact that logging may
> > actually be active.  This could suggest that the parameter to stage2_set_pte()
> > should be named differently (break_huge_pmds) or something like that,
> > but we can also be satisfied with the comment.
> 
> Ok I see, I was thinking you thought it was breaking something.
> Yeah, I'll add the comment; in reality this is another use case
> where a PMD may need to be converted to a page table, so it makes
> sense to contrast use cases.
> 

the hidden knowledge is that MMU notifiers will ensure a huge PMD gets
unmapped before trying to change the physical backing of the underlying
PTEs, so it's a gigantic kernel bug if this gets called on something
mapped with a huge PMD.


> > 
> >>
> >> Should I add comments on flag use in other places as well?
> >>
> > 
It's always a judgement call.  I didn't find it necessary to put a
comment elsewhere because I think it's pretty obvious that we would
never care about logging writes to device regions.
> > 
> > However, this made me think, are we making sure that we are not marking
> > device mappings as read-only in the wp_range functions?  I think it's
> 
> KVM_SET_USER_MEMORY_REGION ioctl doesn't check type of region being
> installed/operated on (KVM_MEM_LOG_DIRTY_PAGES), in case of QEMU
> these regions wind up in KVMState->KVMSlot[], when
> memory_region_add_subregion() is called KVM listener installs it.
> For migration and dirty page logging QEMU walks the KVMSlot[] array.
> 
> For QEMU VFIO (PCI) mmap()ing BAR of type IORESOURCE_MEM,
> causes the memory region to be added to KVMState->KVMSlot[].
> In that case it's possible to walk KVMState->KVMSlot[], issue
> the ioctl, and come across a device mapping with normal memory and
> WP its s2ptes (VFIO sets unmigratable state though).
> 
> But I'm not sure what's there to stop someone calling the ioctl and
> installing a region with device memory type. Most likely though, if you
> installed that kind of region, migration would be disabled.
> 
> But just for logging use not checking memory type could be an issue.
> 
I forgot that the current write-protect'ing is limited to the memory
region boundaries, so everything should be fine.

If user-space write-protects device memory regions, the worst
consequence is that it breaks the guest, but that's its own
responsibility, so I don't think you need to change anything.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs

2014-08-18 Thread Paolo Bonzini
Il 18/08/2014 14:27, Wanpeng Li ha scritto:
> > Section 11.11.2.3 of the SDM mentions "All other bits in the IA32_MTRR_PHYSBASEn
> > and IA32_MTRR_PHYSMASKn registers are reserved; the processor generates a
> > general-protection exception(#GP) if software attempts to write to them".
> > This patch does it in kvm.
>
> How about this one?

It's okay, but it is going to change after you modify patch 4.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: x86: check ISR and TMR to construct eoi exit bitmap

2014-08-18 Thread Paolo Bonzini
Il 13/08/2014 21:16, Wei Wang ha scritto:
> From: Yang Zhang 
> 
> A guest may mask the IOAPIC entry before issuing EOI. In such a case,
> the EOI will not be intercepted by the hypervisor because the corresponding
> bit in the EOI exit bitmap is not set.
> 
> The solution is to check ISR + TMR to construct the EOI exit bitmap.
> 
> This patch is a better fix for the issue that commit "0f6c0a740b"
> tries to solve.
> 
> Tested-by: Alex Williamson 
> Signed-off-by: Yang Zhang 
> Signed-off-by: Wei Wang 
> ---
>  arch/x86/kvm/lapic.c |   17 +
>  arch/x86/kvm/lapic.h |2 ++
>  arch/x86/kvm/x86.c   |9 +
>  virt/kvm/ioapic.c|7 ---
>  4 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 08e8a89..0ed4bcb 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -515,6 +515,23 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
>   __clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
>  }
>  
> +void kvm_apic_zap_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
> + u32 *tmr)
> +{
> + u32 i, reg_off, intr_in_service;
> + struct kvm_lapic *apic = vcpu->arch.apic;
> +
> + for (i = 0; i < 8; i++) {
> + reg_off = 0x10 * i;
> + intr_in_service = apic_read_reg(apic, APIC_ISR + reg_off) &
> + kvm_apic_get_reg(apic, APIC_TMR + reg_off);
> + if (intr_in_service) {
> + *((u32 *)eoi_exit_bitmap + i) |= intr_in_service;
> + tmr[i] |= intr_in_service;
> + }
> + }
> +}
> +
>  void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr)
>  {
>   struct kvm_lapic *apic = vcpu->arch.apic;
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 6a11845..4ee3d70 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -53,6 +53,8 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
>  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
>  
> +void kvm_apic_zap_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
> + u32 *tmr);
>  void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr);
>  void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
>  int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 204422d..755b556 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6005,6 +6005,15 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
>   memset(tmr, 0, 32);
>  
>   kvm_ioapic_scan_entry(vcpu, eoi_exit_bitmap, tmr);
> + /*
> +  * A guest may mask the IOAPIC entry before issuing EOI. In such a case,
> +  * the EOI will not be intercepted by the hypervisor because the
> +  * corresponding bit in the EOI exit bitmap is not set.
> +  *
> +  * The solution is to check ISR + TMR to construct the EOI exit bitmap.
> +  */
> + kvm_apic_zap_eoi_exitmap(vcpu, eoi_exit_bitmap, tmr);
> +
>   kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>   kvm_apic_update_tmr(vcpu, tmr);
>  }
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index e8ce34c..2458a1d 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -254,9 +254,10 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
>   spin_lock(&ioapic->lock);
>   for (index = 0; index < IOAPIC_NUM_PINS; index++) {
>   e = &ioapic->redirtbl[index];
> - if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
> - kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, 
> index) ||
> - index == RTC_GSI) {
> + if (!e->fields.mask &&
> + (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
> +  kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
> +  index) || index == RTC_GSI)) {
>   if (kvm_apic_match_dest(vcpu, NULL, 0,
>   e->fields.dest_id, e->fields.dest_mode)) {
>   __set_bit(e->fields.vector,
> 

Understanding the patch is complicated because of the subtle difference
between tmr[] and apic_get_reg(..., APIC_TMR).  I'd rather avoid that by
first cleaning up the handling of TMR.

Please split the patch in two:

1) one patch should move kvm_apic_update_tmr before
kvm_x86_ops->load_eoi_exitmap, and it should set TMR to

(~(IRR | ISR) & new_TMR) | ((IRR | ISR) & old_TMR)

I'm using IRR|ISR here because the SDM says the TMR is only modified
upon "acceptance of an interrupt into the IRR".  We deviate from the
spec by setting a value for the TMR even when the corresponding bit in
IRR|ISR is 0; that's mostly invisible to guests, so it doesn't matter,
but still the TMR should not change between acceptance of an interrupt
into the IRR and the corresponding EOI cycle.
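
As a rough sketch, the merge could be computed per 32-bit TMR word like
this (illustrative only -- merge_tmr and its parameter names are
placeholders, not existing KVM symbols):

    /* Keep the old TMR bits for vectors currently in IRR|ISR,
     * take the new TMR bits everywhere else.
     */
    static u32 merge_tmr(u32 irr, u32 isr, u32 old_tmr, u32 new_tmr)
    {
            u32 in_service = irr | isr;

            return (~in_service & new_tmr) | (in_service & old_tmr);
    }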

Re: [PATCH 5/5] KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs

2014-08-18 Thread Wanpeng Li
Hi Paolo,
On Mon, Aug 18, 2014 at 05:50:31PM +0800, Wanpeng Li wrote:
>Section 11.11.2.3 of the SDM mentions "All other bits in the IA32_MTRR_PHYSBASEn
>and IA32_MTRR_PHYSMASKn registers are reserved; the processor generates a
>general-protection exception(#GP) if software attempts to write to them".
>This patch does it in kvm.
>

How about this one?

Regards,
Wanpeng Li 

>Signed-off-by: Wanpeng Li 
>---
> arch/x86/kvm/x86.c | 15 +--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index caaffeb..aa64c70 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -1769,11 +1769,22 @@ static bool mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>   /* variable MTRRs */
>   if (msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR) {
>   int idx, is_mtrr_mask;
>+  u64 mask = 0;
> 
>   idx = (msr - 0x200) / 2;
>   is_mtrr_mask = msr - 0x200 - 2 * idx;
>-  if (!is_mtrr_mask)
>-  return valid_mtrr_type(data & 0xff);
>+  for (i = 63; i > boot_cpu_data.x86_phys_bits; i--)
>+  mask |= (1ULL << i);
>+  if (!is_mtrr_mask) {
>+  if (!valid_mtrr_type(data & 0xff))
>+  return false;
>+  mask |= 0xf00;
>+  } else
>+  mask |= 0x7ff;
>+  if (data & mask) {
>+  kvm_inject_gp(vcpu, 0);
>+  return false;
>+  }
>   }
>   return true;
> }
>-- 
>1.9.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v3 1/2] contrib: add ivshmem client and server

2014-08-18 Thread David Marchand

On 08/10/2014 05:57 AM, Gonglei wrote:

+/* can return a peer or the local client */
+peer = ivshmem_client_search_peer(client, peer_id);
+
+/* delete peer */
+if (fd == -1) {
+

Maybe the above check should be moved before getting the peer.
And then the next check of peer is redundant.


We always need to know the peer, either for a deletion, creation or update.





+if (peer == NULL || peer == &client->local) {
+debug_log(client, "receive delete for invalid peer %ld",


Missing '\n' ?


Ok.




peer_id);
+return -1;
+}
+
+debug_log(client, "delete peer id = %ld\n", peer_id);
+free_peer(client, peer);
+return 0;
+}
+
+/* new peer */
+if (peer == NULL) {
+peer = malloc(sizeof(*peer));


g_malloc0 ?.


Ok, replaced malloc/free with g_malloc/g_free in client and server.



+client->notif_cb = notif_cb;
+client->notif_arg = notif_arg;
+client->verbose = verbose;


Missing client->sock_fd = -1; ?


Ok.



+
+/* first, we expect our index + a fd == -1 */
+if (read_one_msg(client, &client->local.id, &fd) < 0 ||
+client->local.id < 0 || fd != -1) {

Why not fd < 0 ?


Because the server will send us an id and a fd == -1; see
ivshmem-server.c send_initial_info().





+debug_log(client, "cannot read from server\n");
+close(client->sock_fd);
+client->sock_fd = -1;
+return -1;
+}
+debug_log(client, "our_id=%ld\n", client->local.id);
+
+/* now, we expect shared mem fd + a -1 index, note that shm fd
+ * is not used */
+if (read_one_msg(client, &tmp, &fd) < 0 ||
+tmp != -1 || fd < 0) {
+debug_log(client, "cannot read from server (2)\n");
+close(client->sock_fd);
+client->sock_fd = -1;
+return -1;

I think the error handling logic can move to the end of this function,
reducing duplicated code. Something like this:

        goto err;
    }

err:
    debug_log(client, "cannot read from server (2)\n");
    close(client->sock_fd);
    client->sock_fd = -1;
    return -1;


Ok, I also updated the server.


+int fd;
+
+if (vector > peer->vectors_count) {


Maybe if (vector >= peer->vectors_count)? Otherwise the access can go
out of the peer->vectors[] array bounds.


Oh yes, good catch.
It should not happen, at the moment, but it is wrong, indeed.


+/* send a notification to all vectors of a peer */
+int
+ivshmem_client_notify_all_vects(const struct ivshmem_client *client,
+const struct ivshmem_client_peer
*peer)
+{
+unsigned vector;
+int ret = 0;
+
+for (vector = 0; vector < peer->vectors_count; vector++) {
+if (ivshmem_client_notify(client, peer, vector) < 0) {
+ret = -1;

The value of ret will be overwritten when multiple clients fail. Do we need
to store the failed status for the server?


It indicates that we could not notify *all* clients.



Thanks Gonglei.


--
David Marchand
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v3 1/2] contrib: add ivshmem client and server

2014-08-18 Thread David Marchand


On 08/08/2014 04:51 PM, Stefan Hajnoczi wrote:

On Fri, Aug 08, 2014 at 10:55:17AM +0200, David Marchand wrote:

Looks good, a few minor comments:


diff --git a/contrib/ivshmem-client/Makefile b/contrib/ivshmem-client/Makefile
new file mode 100644
index 000..eee97c6
--- /dev/null
+++ b/contrib/ivshmem-client/Makefile
@@ -0,0 +1,29 @@


CCed Peter Maydell for a second opinion, I'd suggest hooking up to
QEMU's top-level ./Makefile.  QEMU does not do recursive make.

The advantages of hooking up QEMU's Makefile are:

1. So that ivshmem client/server code is built by default (on supported
host platforms) and bitrot is avoided.

2. So that you don't have to duplicate rules.mak or any other build
infrastructure.



Ok, done.
But in this case, should we really keep the files in contrib/ ?
I used this directory but I am not too sure about this.

Maybe hw/misc/ivshmem/ would be better ?


The rest of your comments have been integrated in my tree.

Is it preferred to send fixes alongside the original patches? Or should I
send a squashed version?

(I suppose the former is better as it is easier to read).


Thanks Stefan.


--
David Marchand
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] KVM: x86: drop fpu_activate hook

2014-08-18 Thread Paolo Bonzini
Il 18/08/2014 11:50, Wanpeng Li ha scritto:
> The fpu_activate hook was introduced by commit 6b52d186 (KVM: Activate fpu on
> clts); however, there is no user currently, so this patch drops it.
> 
> Reviewed-by: Yang Zhang 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/include/asm/kvm_host.h | 1 -
>  arch/x86/kvm/svm.c  | 1 -
>  arch/x86/kvm/vmx.c  | 1 -
>  3 files changed, 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5724601..b68f3e5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -710,7 +710,6 @@ struct kvm_x86_ops {
>   void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
>   unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
>   void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
> - void (*fpu_activate)(struct kvm_vcpu *vcpu);
>   void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
>  
>   void (*tlb_flush)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index ddf7427..1f49c86 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -4349,7 +4349,6 @@ static struct kvm_x86_ops svm_x86_ops = {
>   .cache_reg = svm_cache_reg,
>   .get_rflags = svm_get_rflags,
>   .set_rflags = svm_set_rflags,
> - .fpu_activate = svm_fpu_activate,
>   .fpu_deactivate = svm_fpu_deactivate,
>  
>   .tlb_flush = svm_flush_tlb,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 71cbee5..2963303 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -8896,7 +8896,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
>   .cache_reg = vmx_cache_reg,
>   .get_rflags = vmx_get_rflags,
>   .set_rflags = vmx_set_rflags,
> - .fpu_activate = vmx_fpu_activate,
>   .fpu_deactivate = vmx_fpu_deactivate,
>  
>   .tlb_flush = vmx_flush_tlb,
> 

Applying shortly to kvm/queue, thanks.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Enlarge the delta of the TSC match window from one second to five seconds

2014-08-18 Thread Marcelo Tosatti
On Mon, Aug 11, 2014 at 03:41:00PM +0800, xiexiangyou wrote:
> hi,
> 
> In the kvm_write_tsc() function of kvm, the TSCs will be synchronized only
> if the time difference between vcpu creations is smaller than one second.
> However, in my environment the stress is high and vcpu creation is delayed,
> so sometimes the time difference between vcpu creations is more than one
> second. In this case, the TSCs in the VM are not the same as each other
> when it boots.
> (1) To solve the issue, should we enlarge the delta of the TSC match window
> from one second to five seconds?
> 
> as follows:
> 
>* it's better to try to match offsets from the beginning.
>   */
> - if (nsdiff < NSEC_PER_SEC &&
> + if (nsdiff < 5 * NSEC_PER_SEC &&
>   vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
>   if (!check_tsc_unstable()) {
> 
> (2) Another way to solve the issue: set all VCPUs' tsc_offset equal to the
> first boot VCPU's. That way, even in the special case of hotplugging a VCPU,
> we can ensure the TSC clocksource is stable.
> 
> Thanks.
> xiexiangyou
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

It's OK to increase the matching window to 5 seconds.

Please send a proper patch.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] KVM: x86: drop fpu_activate hook

2014-08-18 Thread Paolo Bonzini
Il 18/08/2014 12:26, Avi Kivity ha scritto:
> 
> On 08/18/2014 01:20 PM, Paolo Bonzini wrote:
>> Il 18/08/2014 11:50, Wanpeng Li ha scritto:
>>> The fpu_activate hook was introduced by commit 6b52d186 (KVM: Activate fpu on
>>> clts); however, there is no user currently, so this patch drops it.
>>>
>>> Reviewed-by: Yang Zhang 
>>> Signed-off-by: Wanpeng Li 
>>> ---
>>>   arch/x86/include/asm/kvm_host.h | 1 -
>>>   arch/x86/kvm/svm.c  | 1 -
>>>   arch/x86/kvm/vmx.c  | 1 -
>>>   3 files changed, 3 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 5724601..b68f3e5 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -710,7 +710,6 @@ struct kvm_x86_ops {
>>>   void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
>>>   unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
>>>   void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
>>> -void (*fpu_activate)(struct kvm_vcpu *vcpu);
>>>   void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
>>> void (*tlb_flush)(struct kvm_vcpu *vcpu);
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index ddf7427..1f49c86 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -4349,7 +4349,6 @@ static struct kvm_x86_ops svm_x86_ops = {
>>>   .cache_reg = svm_cache_reg,
>>>   .get_rflags = svm_get_rflags,
>>>   .set_rflags = svm_set_rflags,
>>> -.fpu_activate = svm_fpu_activate,
>>>   .fpu_deactivate = svm_fpu_deactivate,
>>> .tlb_flush = svm_flush_tlb,
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 71cbee5..2963303 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -8896,7 +8896,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
>>>   .cache_reg = vmx_cache_reg,
>>>   .get_rflags = vmx_get_rflags,
>>>   .set_rflags = vmx_set_rflags,
>>> -.fpu_activate = vmx_fpu_activate,
>>>   .fpu_deactivate = vmx_fpu_deactivate,
>>> .tlb_flush = vmx_flush_tlb,
>>>
>> Avi/Gleb, do you remember any particular reason for this?
>>
> 
> IIRC (vaguely) if we expect the fpu to be used in the near future, we
> activate it eagerly so that we don't fault when it is used.
> 
> Prevents the sequence:
> 
> guest user: use fpu
> #NM
> host: reflect #NM to guest
> guest kernel: CLTS
> guest kernel: switch fpu state
> #NM
> host: switch fpu
> guest kernel: switch fpu state (restarted)
> guest user: use fpu (restarted)
> 
> Why was the user removed? Full-time eager fpu?

No, I mean any reason to keep the hooks.  In the meanwhile I found it myself:

commit 2d04a05bd7e93c13f13a82ac40de4065a99d069b
Author: Avi Kivity 
Date:   Wed Apr 20 15:32:49 2011 +0300

KVM: x86 emulator: emulate CLTS internally

Avoid using ctxt->vcpu; we can do everything with ->get_cr() and ->set_cr().

A side effect is that we no longer activate the fpu on emulated CLTS; but 
that
should be very rare.

Signed-off-by: Avi Kivity 

vmx_fpu_activate and svm_fpu_activate are still called on #NM and CLTS, but
never from common code after the above patch.

Activation on CLTS is currently VMX only; I guess on AMD we could check the
decode assists' CR_VALID bit and instruction length to detect CLTS.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] KVM: vmx: fix ept reserved bits for 1-GByte page

2014-08-18 Thread Paolo Bonzini
Il 18/08/2014 12:52, Xiao Guangrong ha scritto:
>> > The EPT misconfig handler in kvm checks which reason led to the EPT
>> > misconfiguration after vmexit. One of the reasons is that an EPT
>> > paging-structure entry is configured with settings reserved for
>> > future functionality. However, the handler can't identify whether
>> > the reserved bits of a paging-structure entry for a 1-GByte page are
>> > configured, since a PDPTE which points to a 1-GByte page reserves
>> > bits 29:12 instead of bits 7:3, which are reserved for a PDPTE that
>> > references an EPT Page Directory. This patch fixes it by reserving
>> > bits 29:12 for 1-GByte pages.
> That mask is only set in the lowest pte for a 4K page; I think it
> is not a problem, no?

It will cause KVM to WARN.  The EPT memory type will also be ignored for
gigabyte pages.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] watchdog: control hard lockup detection default

2014-08-18 Thread Ulrich Obergfell
>- Original Message -
>From: "Ingo Molnar" 
>To: "Don Zickus" 
>Cc: a...@linux-foundation.org, kvm@vger.kernel.org, pbonz...@redhat.com, 
>mi...@redhat.com, "LKML" , "Ulrich >Obergfell" 
>, "Andrew Jones" 
>Sent: Monday, August 18, 2014 11:16:44 AM
>Subject: Re: [PATCH 4/5] watchdog: control hard lockup detection default
>
>
> * Don Zickus  wrote:
>
>> The running kernel still has the ability to enable/disable at any
>> time with /proc/sys/kernel/nmi_watchdog as usual. However even
>> when the default has been overridden /proc/sys/kernel/nmi_watchdog
>> will initially show '1'. To truly turn it on one must disable/enable
>> it, i.e.
>>   echo 0 > /proc/sys/kernel/nmi_watchdog
>>   echo 1 > /proc/sys/kernel/nmi_watchdog
>
> This looks like a bug, why is this so?
>
> Thanks,
>
>   Ingo


This is because the hard lockup detector and the soft lockup detector are
enabled and disabled at the same time - there isn't a separate 'knob' for
each of them. Both are controlled via the 'watchdog_user_enabled' variable
which is 1 by default.

  lockup_detector_init
if (watchdog_user_enabled)
watchdog_enable_all_cpus
  smpboot_register_percpu_thread(&watchdog_threads)

At boot time, the above code path launches a 'watchdog/N' thread for each
online CPU. The watchdog_enable() function is executed in the context of
these threads, and this attempts to enable the hard lockup detector and
the soft lockup detector. [Note: Soft lockup detection is implemented in
watchdog_timer_fn().]

  watchdog_enable
hrtimer_init(hrtimer, ...)
hrtimer->function = watchdog_timer_fn

watchdog_nmi_enable
  perf_event_create_kernel_counter(..., watchdog_overflow_callback)

hrtimer_start(hrtimer, ...)

On bare metal systems or in virtual environments where the hypervisor
does not emulate a PMU, watchdog_nmi_enable() can fail to allocate and
enable a PMU counter. This is reported by a console message:

  NMI watchdog: disabled (cpu0): hardware events not enabled

Hence, we can end up with a situation where the soft lockup detector is
enabled and the hard lockup detector is not enabled. However, the output
of 'cat /proc/sys/kernel/nmi_watchdog' is 1 because it merely shows the
state of the 'watchdog_user_enabled' variable.

The above is the behaviour even without the proposed patch. The patch
merely adds the following hunk in watchdog_nmi_enable() to 'fake' a
-ENOENT error return from perf_event_create_kernel_counter().

  +if (!watchdog_hardlockup_detector_is_enabled()) {
  +event = ERR_PTR(-ENOENT);
  +goto handle_err;
  +}

The patch does not break the output of 'cat /proc/sys/kernel/nmi_watchdog'
since the discrepancy between the output and the actual state of the hard
lockup detector is nothing new.


Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] KVM: vmx: fix ept reserved bits for 1-GByte page

2014-08-18 Thread Xiao Guangrong
On 08/18/2014 05:50 PM, Wanpeng Li wrote:
> The EPT misconfig handler in kvm checks which reason led to the EPT
> misconfiguration after vmexit. One of the reasons is that an EPT
> paging-structure entry is configured with settings reserved for
> future functionality. However, the handler can't identify whether
> the reserved bits of a paging-structure entry for a 1-GByte page are
> configured, since a PDPTE which points to a 1-GByte page reserves
> bits 29:12 instead of bits 7:3, which are reserved for a PDPTE that
> references an EPT Page Directory. This patch fixes it by reserving
> bits 29:12 for 1-GByte pages.

That mask is only set in the lowest pte for a 4K page; I think it
is not a problem, no?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] KVM: x86: drop fpu_activate hook

2014-08-18 Thread Avi Kivity


On 08/18/2014 01:51 PM, Paolo Bonzini wrote:

Il 18/08/2014 12:26, Avi Kivity ha scritto:

On 08/18/2014 01:20 PM, Paolo Bonzini wrote:

Il 18/08/2014 11:50, Wanpeng Li ha scritto:

The fpu_activate hook was introduced by commit 6b52d186 (KVM: Activate fpu on
clts); however, there is no user currently, so this patch drops it.

Reviewed-by: Yang Zhang 
Signed-off-by: Wanpeng Li 
---
   arch/x86/include/asm/kvm_host.h | 1 -
   arch/x86/kvm/svm.c  | 1 -
   arch/x86/kvm/vmx.c  | 1 -
   3 files changed, 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5724601..b68f3e5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -710,7 +710,6 @@ struct kvm_x86_ops {
   void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
   unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
   void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
-void (*fpu_activate)(struct kvm_vcpu *vcpu);
   void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
 void (*tlb_flush)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ddf7427..1f49c86 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4349,7 +4349,6 @@ static struct kvm_x86_ops svm_x86_ops = {
   .cache_reg = svm_cache_reg,
   .get_rflags = svm_get_rflags,
   .set_rflags = svm_set_rflags,
-.fpu_activate = svm_fpu_activate,
   .fpu_deactivate = svm_fpu_deactivate,
 .tlb_flush = svm_flush_tlb,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 71cbee5..2963303 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8896,7 +8896,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
   .cache_reg = vmx_cache_reg,
   .get_rflags = vmx_get_rflags,
   .set_rflags = vmx_set_rflags,
-.fpu_activate = vmx_fpu_activate,
   .fpu_deactivate = vmx_fpu_deactivate,
 .tlb_flush = vmx_flush_tlb,


Avi/Gleb, do you remember any particular reason for this?


IIRC (vaguely) if we expect the fpu to be used in the near future, we
activate it eagerly so that we don't fault when it is used.

Prevents the sequence:

guest user: use fpu
#NM
host: reflect #NM to guest
guest kernel: CLTS
guest kernel: switch fpu state
#NM
host: switch fpu
guest kernel: switch fpu state (restarted)
guest user: use fpu (restarted)

Why was the user removed? Full-time eager fpu?

No, I mean any reason to keep the hooks.


If there are no callers, I can't think of any.


In the meanwhile I found it myself:

commit 2d04a05bd7e93c13f13a82ac40de4065a99d069b
Author: Avi Kivity 
Date:   Wed Apr 20 15:32:49 2011 +0300

 KVM: x86 emulator: emulate CLTS internally
 
 Avoid using ctxt->vcpu; we can do everything with ->get_cr() and ->set_cr().
 
 A side effect is that we no longer activate the fpu on emulated CLTS; but that

 should be very rare.
 
 Signed-off-by: Avi Kivity 


vmx_fpu_activate and svm_fpu_activate are still called on #NM and CLTS, but
never from common code after the above patch.

Activation on CLTS is currently VMX only; I guess on AMD we could check the
decode assists' CR_VALID bit and instruction length to detect CLTS.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] KVM: vmx: don't vmx_segment_cache_clear twice in enter_pmode

2014-08-18 Thread Paolo Bonzini
Il 18/08/2014 11:50, Wanpeng Li ha scritto:
> vmx_segment_cache_clear() will be called by vmx_set_segment(),
> which leads to vmx_segment_cache_clear() being called twice in
> enter_pmode(). This patch removes the duplicate call site.
> 
> Reviewed-by: Yang Zhang 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2963303..c97c44c 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3183,8 +3183,6 @@ static void enter_pmode(struct kvm_vcpu *vcpu)
>  
>   vmx->rmode.vm86_active = 0;
>  
> - vmx_segment_cache_clear(vmx);
> -
>   vmx_set_segment(vcpu, &vmx->rmode.segs[VCPU_SREG_TR], VCPU_SREG_TR);
>  
>   flags = vmcs_readl(GUEST_RFLAGS);
> 

I think this is not useful and obscures the code a bit.

It would be a bigger improvement to see when fix_pmode_seg really needs
to call vmx_set_segment, and avoid the call.  Are you seeing enter_pmode
high in any profiles?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] KVM: vmx: fix ept reserved bits for 1-GByte page

2014-08-18 Thread Paolo Bonzini
Il 18/08/2014 11:50, Wanpeng Li ha scritto:
> The EPT misconfig handler in kvm checks which reason led to the EPT
> misconfiguration after vmexit. One of the reasons is that an EPT
> paging-structure entry is configured with settings reserved for
> future functionality. However, the handler can't identify whether
> the reserved bits of a paging-structure entry for a 1-GByte page are
> configured, since a PDPTE which points to a 1-GByte page reserves
> bits 29:12 instead of bits 7:3, which are reserved for a PDPTE that
> references an EPT Page Directory. This patch fixes it by reserving
> bits 29:12 for 1-GByte pages.
> 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index bfe11cf..71cbee5 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5521,9 +5521,14 @@ static u64 ept_rsvd_mask(u64 spte, int level)
>   for (i = 51; i > boot_cpu_data.x86_phys_bits; i--)
>   mask |= (1ULL << i);
>  
> - if (level > 2)
> - /* bits 7:3 reserved */
> - mask |= 0xf8;
> + if (level > 2) {

level can be 4 here.  You have to return 0xf8 for level == 4.

The same "if" statement then can cover both 2MB and 1GB pages, like

if (spte & (1ULL << 7))
/* 1GB/2MB page, bits 29:12 or 20:12 reserved 
respectively */
mask |= (PAGE_SIZE << ((level - 1) * 9)) - PAGE_SIZE;
else
/* bits 6:3 reserved */
mask |= 0x78;
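
(As a quick sanity check of that expression, assuming PAGE_SIZE == 4096:
level == 2 gives (4096 << 9) - 4096 = 0x1ff000, i.e. bits 20:12, and
level == 3 gives (4096 << 18) - 4096 = 0x3ffff000, i.e. bits 29:12.)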

> - if (level == 1 || (level == 2 && (spte & (1ULL << 7)))) {
> + if (level == 1 || ((level == 3 || level == 2)
> + && (spte & (1ULL << 7)))) {

This condition can be simplified by checking the return value of ept_rsvd_mask.
If it includes 0x38, this is a large page.  Otherwise it is a leaf page and
you can go down the "if".

Paolo

>   u64 ept_mem_type = (spte & 0x38) >> 3;
>  
>   if (ept_mem_type == 2 || ept_mem_type == 3 ||
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] KVM: x86: fix check legal type of Variable Range MTRRs

2014-08-18 Thread Paolo Bonzini
Il 18/08/2014 11:50, Wanpeng Li ha scritto:
> + if (msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR) {

This should be a WARN_ON, and the base/mask can be separated just with
an "&".

WARN_ON(!(msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR));

if ((msr & 1) == 0)
/* MTRR base */
return valid_mtrr_type(data & 0xff);

/* MTRR mask */
return true;
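
(For concreteness: MSR 0x200 is IA32_MTRR_PHYSBASE0 -- even, so a base;
0x201 is IA32_MTRR_PHYSMASK0 -- odd, so a mask; 0x202 is PHYSBASE1; and
so on.  The low bit alone distinguishes base from mask.)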

Please provide a unit test for this patch and for patch 1.

Paolo

> + int idx, is_mtrr_mask;
> +
> + idx = (msr - 0x200) / 2;
> + is_mtrr_mask = msr - 0x200 - 2 * idx;
> + if (!is_mtrr_mask)
> + return valid_mtrr_type(data & 0xff);
> + }
> + return true;
>  }

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] KVM: x86: drop fpu_activate hook

2014-08-18 Thread Paolo Bonzini
Il 18/08/2014 11:50, Wanpeng Li ha scritto:
> The fpu_activate hook was introduced by commit 6b52d186 (KVM: Activate fpu on
> clts); however, there is no user currently, so this patch drops it.
> 
> Reviewed-by: Yang Zhang 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/include/asm/kvm_host.h | 1 -
>  arch/x86/kvm/svm.c  | 1 -
>  arch/x86/kvm/vmx.c  | 1 -
>  3 files changed, 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5724601..b68f3e5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -710,7 +710,6 @@ struct kvm_x86_ops {
>   void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
>   unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
>   void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
> - void (*fpu_activate)(struct kvm_vcpu *vcpu);
>   void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
>  
>   void (*tlb_flush)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index ddf7427..1f49c86 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -4349,7 +4349,6 @@ static struct kvm_x86_ops svm_x86_ops = {
>   .cache_reg = svm_cache_reg,
>   .get_rflags = svm_get_rflags,
>   .set_rflags = svm_set_rflags,
> - .fpu_activate = svm_fpu_activate,
>   .fpu_deactivate = svm_fpu_deactivate,
>  
>   .tlb_flush = svm_flush_tlb,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 71cbee5..2963303 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -8896,7 +8896,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
>   .cache_reg = vmx_cache_reg,
>   .get_rflags = vmx_get_rflags,
>   .set_rflags = vmx_set_rflags,
> - .fpu_activate = vmx_fpu_activate,
>   .fpu_deactivate = vmx_fpu_deactivate,
>  
>   .tlb_flush = vmx_flush_tlb,
> 

Avi/Gleb, do you remember any particular reason for this?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] KVM: x86: drop fpu_activate hook

2014-08-18 Thread Avi Kivity


On 08/18/2014 01:20 PM, Paolo Bonzini wrote:

Il 18/08/2014 11:50, Wanpeng Li ha scritto:

The fpu_activate hook was introduced by commit 6b52d186 (KVM: Activate fpu on
clts); however, there is no user currently, so this patch drops it.

Reviewed-by: Yang Zhang 
Signed-off-by: Wanpeng Li 
---
  arch/x86/include/asm/kvm_host.h | 1 -
  arch/x86/kvm/svm.c  | 1 -
  arch/x86/kvm/vmx.c  | 1 -
  3 files changed, 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5724601..b68f3e5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -710,7 +710,6 @@ struct kvm_x86_ops {
void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
-   void (*fpu_activate)(struct kvm_vcpu *vcpu);
void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
  
  	void (*tlb_flush)(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ddf7427..1f49c86 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4349,7 +4349,6 @@ static struct kvm_x86_ops svm_x86_ops = {
.cache_reg = svm_cache_reg,
.get_rflags = svm_get_rflags,
.set_rflags = svm_set_rflags,
-   .fpu_activate = svm_fpu_activate,
.fpu_deactivate = svm_fpu_deactivate,
  
  	.tlb_flush = svm_flush_tlb,

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 71cbee5..2963303 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8896,7 +8896,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
.cache_reg = vmx_cache_reg,
.get_rflags = vmx_get_rflags,
.set_rflags = vmx_set_rflags,
-   .fpu_activate = vmx_fpu_activate,
.fpu_deactivate = vmx_fpu_deactivate,
  
  	.tlb_flush = vmx_flush_tlb,



Avi/Gleb, do you remember any particular reason for this?



IIRC (vaguely) if we expect the fpu to be used in the near future, we 
activate it eagerly so that we don't fault when it is used.


Prevents the sequence:

guest user: use fpu
#NM
host: reflect #NM to guest
guest kernel: CLTS
guest kernel: switch fpu state
#NM
host: switch fpu
guest kernel: switch fpu state (restarted)
guest user: use fpu (restarted)

Why was the user removed? Full-time eager fpu?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [questions] about using vfio to assign sr-iov vf to vm

2014-08-18 Thread Zhang Haoyu
>>> >> >> Hi, all
>>> >> >> I'm using VFIO to assign intel 82599 VF to VM, now I encounter a 
>>> >> >> problem,
>>> >> >> 82599 PF and its VFs belong to the same iommu_group, but I only want 
>>> >> >> to assign some VFs to one VM, and some other VFs to another VM,
>>> >> >> so how to only unbind (part of) the VFs but PF?
>>> >> >> I read the kernel doc vfio.txt, I'm not sure should I unbind all of 
>>> >> >> the devices which belong to one iommu_group?
>>> >> >> If so, because PF and its VFs belong to the same iommu_group, if I 
>>> >> >> unbind the PF, its VFs also diappeared.
>>> >> >> I think I misunderstand something,
>>> >> >> any advises?
>>> >> >
>>> >> >This occurs when the PF is installed behind components in the system
>>> >> >that do not support PCIe Access Control Services (ACS).  The IOMMU group
>>> >> >contains both the PF and the VF because upstream transactions can be
>>> >> >re-routed downstream by these non-ACS components before being translated
>>> >> >by the IOMMU.  Please provide 'sudo lspci -vvv', 'lspci -n', and kernel
>>> >> >version and we might be able to give you some advise on how to work
>>> >> >around the problem.  Thanks,
>>> >> >
>>> >> The intel 82599(02:00.0 or 02:00.1) is behind the pci bridge (00:01.1),  
>>> >> does 00:01.1 PCI bridge support ACS ?
>>> >
>>> >It does not and that's exactly the problem.  We must assume that the
>>> >root port can redirect a transaction from a subordinate device back to
>>> >another subordinate device without IOMMU translation when ACS support is
>>> >not present.  If you had a device plugged in below 00:01.0, we'd also
>>> >need to assume that non-IOMMU translated peer-to-peer between devices
>>> >behind either function, 00:01.0 or 00:01.1, is possible.
>>> >
>>> >Intel has indicated that processor root ports for all Xeon class
>>> >processors should support ACS and have verified isolation for PCH based
>>> >root ports allowing us to support quirks in place of ACS support.  I'm
>>> >not aware of any efforts at Intel to verify isolation capabilities of
>>> >root ports on client processors.  They are however aware that lack of
>>> >ACS is a limiting factor for usability of VT-d, and I hope that we'll
>>> >see future products with ACS support.
>>> >
>>> >Chances are good that the PCH root port at 00:1c.0 is supported by an
>>> >ACS quirk, but it seems that your system has a PCIe switch below the
>>> >root port.  If the PCIe switch downstream ports support ACS, then you
>>> >may be able to move the 82599 to the empty slot at bus 07 to separate
>>> >the VFs into different IOMMU groups.  Thanks,
>>> >
>>> Thanks, Alex,
>>> how to tell whether a PCI bridge/device supports the ACS capability?
>>> 
>>> I ran "lspci -vvv -s | grep -i ACS", and nothing matched.
>>> # lspci -vvv -s 00:1c.0
>>> 00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family 
>>> PCI Express Root Port 1 (rev b5) (prog-if 00 [Normal decode])
>>
>>
>>Ideally there would be capabilities for it, something like:
>>
>>Capabilities [xxx] Access Control Services...
>>
>>But, Intel failed to provide this, so we enable "effective" ACS
>>capabilities via a quirk:
>>
>>drivers/pci/quirks.c:
>>/*
>> * Many Intel PCH root ports do provide ACS-like features to disable peer
>> * transactions and validate bus numbers in requests, but do not provide an
>> * actual PCIe ACS capability.  This is the list of device IDs known to fall
>> * into that category as provided by Intel in Red Hat bugzilla 1037684.
>> */
>>static const u16 pci_quirk_intel_pch_acs_ids[] = {
>>/* Ibexpeak PCH */
>>0x3b42, 0x3b43, 0x3b44, 0x3b45, 0x3b46, 0x3b47, 0x3b48, 0x3b49,
>>0x3b4a, 0x3b4b, 0x3b4c, 0x3b4d, 0x3b4e, 0x3b4f, 0x3b50, 0x3b51,
>>/* Cougarpoint PCH */
>>0x1c10, 0x1c11, 0x1c12, 0x1c13, 0x1c14, 0x1c15, 0x1c16, 0x1c17,
>>0x1c18, 0x1c19, 0x1c1a, 0x1c1b, 0x1c1c, 0x1c1d, 0x1c1e, 0x1c1f,
>>/* Pantherpoint PCH */
>>0x1e10, 0x1e11, 0x1e12, 0x1e13, 0x1e14, 0x1e15, 0x1e16, 0x1e17,
>>0x1e18, 0x1e19, 0x1e1a, 0x1e1b, 0x1e1c, 0x1e1d, 0x1e1e, 0x1e1f,
>>/* Lynxpoint-H PCH */
>>0x8c10, 0x8c11, 0x8c12, 0x8c13, 0x8c14, 0x8c15, 0x8c16, 0x8c17,
>>0x8c18, 0x8c19, 0x8c1a, 0x8c1b, 0x8c1c, 0x8c1d, 0x8c1e, 0x8c1f,
>>/* Lynxpoint-LP PCH */
>>0x9c10, 0x9c11, 0x9c12, 0x9c13, 0x9c14, 0x9c15, 0x9c16, 0x9c17,
>>0x9c18, 0x9c19, 0x9c1a, 0x9c1b,
>>/* Wildcat PCH */
>>0x9c90, 0x9c91, 0x9c92, 0x9c93, 0x9c94, 0x9c95, 0x9c96, 0x9c97,
>>0x9c98, 0x9c99, 0x9c9a, 0x9c9b,
>>/* Patsburg (X79) PCH */
>>0x1d10, 0x1d12, 0x1d14, 0x1d16, 0x1d18, 0x1d1a, 0x1d1c, 0x1d1e,
>>};
>>
>>Hopefully if you run 'lspci -n', you'll see your device ID listed among
>>these.  We don't currently have any quirks for PCIe switches, so if your
>>IOMMU group is still bigger than it should be, that may be the reason.
>>Thanks,
>>
>Using device specific mechanisms to enable and verify ACS-like capability is 
>okay

[PATCH 3/5] KVM: vmx: don't vmx_segment_cache_clear twice in enter_pmode

2014-08-18 Thread Wanpeng Li
vmx_segment_cache_clear() will be called by vmx_set_segment(),
which leads to vmx_segment_cache_clear() being called twice in
enter_pmode(). This patch removes the duplicate call site.

Reviewed-by: Yang Zhang 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2963303..c97c44c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3183,8 +3183,6 @@ static void enter_pmode(struct kvm_vcpu *vcpu)
 
vmx->rmode.vm86_active = 0;
 
-   vmx_segment_cache_clear(vmx);
-
vmx_set_segment(vcpu, &vmx->rmode.segs[VCPU_SREG_TR], VCPU_SREG_TR);
 
flags = vmcs_readl(GUEST_RFLAGS);
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/5] KVM: x86: fix check legal type of Variable Range MTRRs

2014-08-18 Thread Wanpeng Li
The first entry in each pair (IA32_MTRR_PHYSBASEn) defines the base
address and memory type for the range; the second entry (IA32_MTRR_PHYSMASKn)
contains a mask used to determine the address range. The legal values
for the type field of IA32_MTRR_PHYSBASEn are 0, 1, 4, 5, and 6. However,
IA32_MTRR_PHYSMASKn doesn't have a type field. This patch avoids checking
whether the type field is legal for IA32_MTRR_PHYSMASKn.

Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/x86.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 204422d..30f55cc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1767,7 +1767,15 @@ static bool mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
}
 
/* variable MTRRs */
-   return valid_mtrr_type(data & 0xff);
+   if (msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR) {
+   int idx, is_mtrr_mask;
+
+   idx = (msr - 0x200) / 2;
+   is_mtrr_mask = msr - 0x200 - 2 * idx;
+   if (!is_mtrr_mask)
+   return valid_mtrr_type(data & 0xff);
+   }
+   return true;
 }
 
 static int set_msr_mtrr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] KVM: x86: drop fpu_activate hook

2014-08-18 Thread Wanpeng Li
The fpu_activate hook was introduced by commit 6b52d186 (KVM: Activate fpu on
clts); however, there is no user currently, so this patch drops it.

Reviewed-by: Yang Zhang 
Signed-off-by: Wanpeng Li 
---
 arch/x86/include/asm/kvm_host.h | 1 -
 arch/x86/kvm/svm.c  | 1 -
 arch/x86/kvm/vmx.c  | 1 -
 3 files changed, 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5724601..b68f3e5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -710,7 +710,6 @@ struct kvm_x86_ops {
void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
-   void (*fpu_activate)(struct kvm_vcpu *vcpu);
void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
 
void (*tlb_flush)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ddf7427..1f49c86 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4349,7 +4349,6 @@ static struct kvm_x86_ops svm_x86_ops = {
.cache_reg = svm_cache_reg,
.get_rflags = svm_get_rflags,
.set_rflags = svm_set_rflags,
-   .fpu_activate = svm_fpu_activate,
.fpu_deactivate = svm_fpu_deactivate,
 
.tlb_flush = svm_flush_tlb,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 71cbee5..2963303 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8896,7 +8896,6 @@ static struct kvm_x86_ops vmx_x86_ops = {
.cache_reg = vmx_cache_reg,
.get_rflags = vmx_get_rflags,
.set_rflags = vmx_set_rflags,
-   .fpu_activate = vmx_fpu_activate,
.fpu_deactivate = vmx_fpu_deactivate,
 
.tlb_flush = vmx_flush_tlb,
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] KVM: vmx: fix ept reserved bits for 1-GByte page

2014-08-18 Thread Wanpeng Li
The EPT misconfig handler in kvm checks which reason led to the EPT
misconfiguration after vmexit. One of the reasons is that an EPT
paging-structure entry is configured with settings reserved for
future functionality. However, the handler can't identify whether
the reserved bits of a paging-structure entry for a 1-GByte page are
configured, since a PDPTE which points to a 1-GByte page reserves
bits 29:12 instead of bits 7:3, which are reserved for a PDPTE that
references an EPT Page Directory. This patch fixes it by reserving
bits 29:12 for 1-GByte pages.

Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bfe11cf..71cbee5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5521,9 +5521,14 @@ static u64 ept_rsvd_mask(u64 spte, int level)
for (i = 51; i > boot_cpu_data.x86_phys_bits; i--)
mask |= (1ULL << i);
 
-   if (level > 2)
-   /* bits 7:3 reserved */
-   mask |= 0xf8;
+   if (level > 2) {
+   if (spte & (1 << 7))
+   /* 1GB ref, bits 29:12 */
+   mask |= 0x3000;
+   else
+   /* bits 7:3 reserved */
+   mask |= 0xf8;
+   }
else if (level == 2) {
if (spte & (1ULL << 7))
/* 2MB ref, bits 20:12 reserved */
@@ -5561,7 +5566,8 @@ static void ept_misconfig_inspect_spte(struct kvm_vcpu *vcpu, u64 spte,
WARN_ON(1);
}
 
-   if (level == 1 || (level == 2 && (spte & (1ULL << 7)))) {
+   if (level == 1 || ((level == 3 || level == 2)
+   && (spte & (1ULL << 7)))) {
u64 ept_mem_type = (spte & 0x38) >> 3;
 
if (ept_mem_type == 2 || ept_mem_type == 3 ||
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs

2014-08-18 Thread Wanpeng Li
Section 11.11.2.3 of the SDM mentions "All other bits in the IA32_MTRR_PHYSBASEn
and IA32_MTRR_PHYSMASKn registers are reserved; the processor generates a
general-protection exception(#GP) if software attempts to write to them".
This patch does it in kvm.

Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/x86.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index caaffeb..aa64c70 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1769,11 +1769,22 @@ static bool mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
/* variable MTRRs */
if (msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR) {
int idx, is_mtrr_mask;
+   u64 mask = 0;
 
idx = (msr - 0x200) / 2;
is_mtrr_mask = msr - 0x200 - 2 * idx;
-   if (!is_mtrr_mask)
-   return valid_mtrr_type(data & 0xff);
+   for (i = 63; i > boot_cpu_data.x86_phys_bits; i--)
+   mask |= (1ULL << i);
+   if (!is_mtrr_mask) {
+   if (!valid_mtrr_type(data & 0xff))
+   return false;
+   mask |= 0xf00;
+   } else
+   mask |= 0x7ff;
+   if (data & mask) {
+   kvm_inject_gp(vcpu, 0);
+   return false;
+   }
}
return true;
 }
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Increase the number of fixed MTRR regs to 10

2014-08-18 Thread Nadav Amit
I reproduced the blue screen. Let me try to figure it out.

Nadav

On Aug 18, 2014, at 11:11 AM, Wanpeng Li  wrote:

> On Mon, Aug 18, 2014 at 09:39:39AM +0300, Nadav Amit wrote:
>> This should have been a benign patch. I'll try to get windows 7 installation 
>> disk and check ASAP.
>> 
> 
> In addition, it can only be reproduced on 32-bit win7 w/ MP enabled; in
> the UP case it can't be reproduced.
> 
> Regards,
> Wanpeng Li 
> 
>> Nadav
>> 
>>> On 18 Aug 2014, at 05:17, Wanpeng Li  wrote:
>>> 
>>> Hi Nadav,
 On Wed, Jun 18, 2014 at 05:21:19PM +0300, Nadav Amit wrote:
 Recent Intel CPUs have 10 variable range MTRRs. Since operating systems
 sometimes make assumptions about CPUs while ignoring capability MSRs, it is
 better for KVM to be consistent with recent CPUs. Reporting more MTRRs than
 are actually supported has no functional implications.
 
 Signed-off-by: Nadav Amit 
 ---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 4931415..0bab29d 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -95,7 +95,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 #define KVM_REFILL_PAGES 25
 #define KVM_MAX_CPUID_ENTRIES 80
 #define KVM_NR_FIXED_MTRR_REGION 88
 -#define KVM_NR_VAR_MTRR 8
 +#define KVM_NR_VAR_MTRR 10
>>> 
>>> We observed an obvious regression caused by this commit: a 32-bit
>>> win7 guest shows a blue screen during boot.
>>> 
>>> Regards,
>>> Wanpeng Li 
>>> 
 #define ASYNC_PF_PER_VCPU 64
 
 -- 
 1.9.1
 
 --
 To unsubscribe from this list: send the line "unsubscribe kvm" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] watchdog: control hard lockup detection default

2014-08-18 Thread Ingo Molnar

* Don Zickus  wrote:

> From: Ulrich Obergfell 
> 
> In some cases we don't want hard lockup detection enabled by default.
> An example is when running as a guest. Introduce
> 
>   watchdog_enable_hardlockup_detector(bool)

So, the name watchdog_enable_hardlockup_detector(false)
is both too long and also really confusing (because first it
suggests enablement, then disables it), so I renamed it to
hardlockup_detector_set(), which allows two natural variants:

hardlockup_detector_set(false);
...
hardlockup_detector_set(true);
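
A minimal sketch of such a setter (illustrative only; the backing
variable is a placeholder, not the one the actual commit uses):

    static bool hardlockup_detector_enabled = true;

    /* Set the boot-time default of the hard lockup detector. */
    void hardlockup_detector_set(bool enable)
    {
            hardlockup_detector_enabled = enable;
    }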

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] watchdog: fix print-once on enable

2014-08-18 Thread Ingo Molnar

* Don Zickus  wrote:

> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -522,6 +522,9 @@ static void watchdog_nmi_disable(unsigned int cpu)
>   /* should be in cleanup, but blocks oprofile */
>   perf_event_release_kernel(event);
>   }
> + if (cpu == 0)
> + /* watchdog_nmi_enable() expects this to be zero initially. */
> + cpu0_err = 0;
>   return;
>  }

While at it I also removed the stray 'return;'.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] watchdog: control hard lockup detection default

2014-08-18 Thread Ingo Molnar

* Don Zickus  wrote:

> The running kernel still has the ability to enable/disable at any
> time with /proc/sys/kernel/nmi_watchdog as usual. However even
> when the default has been overridden /proc/sys/kernel/nmi_watchdog
> will initially show '1'. To truly turn it on one must disable/enable
> it, i.e.
>   echo 0 > /proc/sys/kernel/nmi_watchdog
>   echo 1 > /proc/sys/kernel/nmi_watchdog

This looks like a bug, why is this so?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] watchdog: fix print-once on enable

2014-08-18 Thread Ingo Molnar

* Don Zickus  wrote:

> From: Ulrich Obergfell 
> 
> This patch avoids printing the message 'enabled on all CPUs, ...'
> multiple times. For example, the issue can occur in the following
> scenario:
> 
> 1) watchdog_nmi_enable() fails to enable PMU counters and sets
>cpu0_err.
> 
> 2) 'echo [0|1] > /proc/sys/kernel/nmi_watchdog' is executed to
>disable and re-enable the watchdog mechanism 'on the fly'.
> 
> 3) If watchdog_nmi_enable() succeeds to enable PMU counters, each
>CPU will print the message because step1 left behind a non-zero
>cpu0_err.
> 
>if (!IS_ERR(event)) {
>if (cpu == 0 || cpu0_err)
>pr_info("enabled on all CPUs, ...")
> 
> The patch avoids this by clearing cpu0_err in watchdog_nmi_disable().
> 
> Signed-off-by: Ulrich Obergfell 
> Signed-off-by: Andrew Jones 
> Signed-off-by: Don Zickus 
> ---
>  kernel/watchdog.c |3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 6d0a891..0838685 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -522,6 +522,9 @@ static void watchdog_nmi_disable(unsigned int cpu)
>   /* should be in cleanup, but blocks oprofile */
>   perf_event_release_kernel(event);
>   }
> + if (cpu == 0)
> + /* watchdog_nmi_enable() expects this to be zero initially. */
> + cpu0_err = 0;

Looks good except a small detail: two-line blocks need curly 
braces too, even if it's just a single C statement. I've fixed 
this up in the commit.
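
I.e., the hunk becomes (sketch of the brace fix-up only):

    if (cpu == 0) {
            /* watchdog_nmi_enable() expects this to be zero initially. */
            cpu0_err = 0;
    }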

thanks,

Ingo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] softlockup: make detector be aware of task switch of processes hogging cpu

2014-08-18 Thread Ingo Molnar
* Don Zickus  wrote:

> From: chai wen 
> 
> For now, soft lockup detector warns once for each case of process softlockup.
> But the thread 'watchdog/n' may not always get the cpu at the time slot 
> between
> the task switch of two processes hogging that cpu to reset soft_watchdog_warn.
> 
> An example would be two processes hogging the cpu.  Process A causes the
> softlockup warning and is killed manually by a user.  Process B immediately
> becomes the new process hogging the cpu preventing the softlockup code from
> resetting the soft_watchdog_warn variable.
> 
> This case is a false negative of "warn only once for a process", as there may
> be a different process that is going to hog the cpu.  Resolve this by
> saving/checking the pid of the hogging process and use that to reset
> soft_watchdog_warn too.
> 
> Signed-off-by: chai wen 
> [modified the comment and changelog to be more specific]
> Signed-off-by: Don Zickus 
> ---
>  kernel/watchdog.c |   20 ++--
>  1 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 4c2e11c..6d0a891 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
>  static DEFINE_PER_CPU(bool, soft_watchdog_warn);
>  static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
>  static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
> +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved);
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>  static DEFINE_PER_CPU(bool, hard_watchdog_warn);
>  static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
> @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>*/
>   duration = is_softlockup(touch_ts);
>   if (unlikely(duration)) {
> + pid_t pid = task_pid_nr(current);
> +
>   /*
>* If a virtual machine is stopped by the host it can look to
>* the watchdog like a soft lockup, check to see if the host
> @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>   return HRTIMER_RESTART;
>  
>   /* only warn once */
> - if (__this_cpu_read(soft_watchdog_warn) == true)
> + if (__this_cpu_read(soft_watchdog_warn) == true) {
> +
> + /*
> +  * Handle the case where multiple processes are
> +  * causing softlockups but the duration is small
> +  * enough that the softlockup detector cannot
> +  * reset itself in time.  Use pids to detect this.
> +  */
> + if (__this_cpu_read(softlockup_warn_pid_saved) != pid) {
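
(The hunk above is cut off; judging from the changelog, the pid check
presumably completes along these lines, a sketch rather than the actual
patch:

	if (__this_cpu_read(softlockup_warn_pid_saved) != pid) {
		/* A different task now hogs this cpu: re-arm the warning. */
		__this_cpu_write(soft_watchdog_warn, false);
		__touch_watchdog();
	}
	return HRTIMER_RESTART;

with softlockup_warn_pid_saved presumably updated to the new pid at the
point where a warning is actually printed.)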

So I agree with the motivation of this improvement, but is this 
implementation namespace-safe?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [questions] about using vfio to assign sr-iov vf to vm

2014-08-18 Thread Zhang Haoyu

>> >> >> Hi, all
>> >> >> I'm using VFIO to assign intel 82599 VFs to a VM, and now I encounter
>> >> >> a problem: the 82599 PF and its VFs belong to the same iommu_group,
>> >> >> but I only want to assign some VFs to one VM, and other VFs to
>> >> >> another VM, so how do I unbind (some of) the VFs but not the PF?
>> >> >> I read the kernel doc vfio.txt, but I'm not sure whether I should
>> >> >> unbind all of the devices which belong to one iommu_group.
>> >> >> If so, because the PF and its VFs belong to the same iommu_group,
>> >> >> unbinding the PF makes its VFs disappear as well.
>> >> >> I think I misunderstand something,
>> >> >> any advice?
>> >> >
>> >> >This occurs when the PF is installed behind components in the system
>> >> >that do not support PCIe Access Control Services (ACS).  The IOMMU group
>> >> >contains both the PF and the VF because upstream transactions can be
>> >> >re-routed downstream by these non-ACS components before being translated
>> >> >by the IOMMU.  Please provide 'sudo lspci -vvv', 'lspci -n', and kernel
>> >> >version and we might be able to give you some advice on how to work
>> >> >around the problem.  Thanks,
>> >> >
>> >> The intel 82599 (02:00.0 or 02:00.1) is behind the pci bridge (00:01.1);
>> >> does the 00:01.1 PCI bridge support ACS?
>> >
>> >It does not and that's exactly the problem.  We must assume that the
>> >root port can redirect a transaction from a subordinate device back to
>> >another subordinate device without IOMMU translation when ACS support is
>> >not present.  If you had a device plugged in below 00:01.0, we'd also
>> >need to assume that non-IOMMU translated peer-to-peer between devices
>> >behind either function, 00:01.0 or 00:01.1, is possible.
>> >
>> >Intel has indicated that processor root ports for all Xeon class
>> >processors should support ACS and have verified isolation for PCH based
>> >root ports allowing us to support quirks in place of ACS support.  I'm
>> >not aware of any efforts at Intel to verify isolation capabilities of
>> >root ports on client processors.  They are however aware that lack of
>> >ACS is a limiting factor for usability of VT-d, and I hope that we'll
>> >see future products with ACS support.
>> >
>> >Chances are good that the PCH root port at 00:1c.0 is supported by an
>> >ACS quirk, but it seems that your system has a PCIe switch below the
>> >root port.  If the PCIe switch downstream ports support ACS, then you
>> >may be able to move the 82599 to the empty slot at bus 07 to separate
>> >the VFs into different IOMMU groups.  Thanks,
>> >
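
(One way to verify the resulting grouping is the standard sysfs interface
described in Documentation/vfio.txt:

	# find /sys/kernel/iommu_groups/ -type l

each symlink maps a device to its IOMMU group, so after moving the card you
can check whether the VFs now land in separate groups.)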
>> Thanks, Alex,
>> how can I tell whether a PCI bridge/device supports the ACS capability?
>> 
>> I ran "lspci -vvv -s | grep -i ACS", and nothing matched.
>> # lspci -vvv -s 00:1c.0
>> 00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family 
>> PCI Express Root Port 1 (rev b5) (prog-if 00 [Normal decode])
>
>
>Ideally there would be capabilities for it, something like:
>
>Capabilities [xxx] Access Control Services...
>
>But, Intel failed to provide this, so we enable "effective" ACS
>capabilities via a quirk:
>
>drivers/pci/quirks.c:
>/*
> * Many Intel PCH root ports do provide ACS-like features to disable peer
> * transactions and validate bus numbers in requests, but do not provide an
> * actual PCIe ACS capability.  This is the list of device IDs known to fall
> * into that category as provided by Intel in Red Hat bugzilla 1037684.
> */
>static const u16 pci_quirk_intel_pch_acs_ids[] = {
>/* Ibexpeak PCH */
>0x3b42, 0x3b43, 0x3b44, 0x3b45, 0x3b46, 0x3b47, 0x3b48, 0x3b49,
>0x3b4a, 0x3b4b, 0x3b4c, 0x3b4d, 0x3b4e, 0x3b4f, 0x3b50, 0x3b51,
>/* Cougarpoint PCH */
>0x1c10, 0x1c11, 0x1c12, 0x1c13, 0x1c14, 0x1c15, 0x1c16, 0x1c17,
>0x1c18, 0x1c19, 0x1c1a, 0x1c1b, 0x1c1c, 0x1c1d, 0x1c1e, 0x1c1f,
>/* Pantherpoint PCH */
>0x1e10, 0x1e11, 0x1e12, 0x1e13, 0x1e14, 0x1e15, 0x1e16, 0x1e17,
>0x1e18, 0x1e19, 0x1e1a, 0x1e1b, 0x1e1c, 0x1e1d, 0x1e1e, 0x1e1f,
>/* Lynxpoint-H PCH */
>0x8c10, 0x8c11, 0x8c12, 0x8c13, 0x8c14, 0x8c15, 0x8c16, 0x8c17,
>0x8c18, 0x8c19, 0x8c1a, 0x8c1b, 0x8c1c, 0x8c1d, 0x8c1e, 0x8c1f,
>/* Lynxpoint-LP PCH */
>0x9c10, 0x9c11, 0x9c12, 0x9c13, 0x9c14, 0x9c15, 0x9c16, 0x9c17,
>0x9c18, 0x9c19, 0x9c1a, 0x9c1b,
>/* Wildcat PCH */
>0x9c90, 0x9c91, 0x9c92, 0x9c93, 0x9c94, 0x9c95, 0x9c96, 0x9c97,
>0x9c98, 0x9c99, 0x9c9a, 0x9c9b,
>/* Patsburg (X79) PCH */
>0x1d10, 0x1d12, 0x1d14, 0x1d16, 0x1d18, 0x1d1a, 0x1d1c, 0x1d1e,
>};
>
>Hopefully if you run 'lspci -n', you'll see your device ID listed among
>these.  We don't currently have any quirks for PCIe switches, so if your
>IOMMU group is still bigger than it should be, that may be the reason.
>Thanks,
>
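
(As a concrete illustration, assume a Cougar Point system like the one quoted
above; a hypothetical 'lspci -n' line such as

	00:1c.0 0604: 8086:1c10 (rev b5)

carries device ID 0x1c10, which appears in the Cougarpoint entries of the
quirk table, so that root port would get the ACS-like treatment.)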
Using device-specific mechanisms to enable and verify ACS-like capability is
okay, but with regard to those devices which completely don't support ACS-like
capabilities, what should we do?

Re: [PATCH 1/1] KVM: SVM: add rdmsr support for AMD event registers

2014-08-18 Thread Paolo Bonzini
On 13/08/2014 18:06, Wei Huang wrote:
> Currently KVM only supports RDMSR for the K7_EVNTSEL0 and K7_PERFCTR0
> MSRs. Reading the remaining MSRs triggers KVM to inject #GP into the
> guest VM. This causes a warning message "Failed to access perfctr
> msr (MSR c0010001 is )" on an AMD host. This patch
> adds RDMSR support for all K7_EVNTSELn and K7_PERFCTRn registers
> and thus suppresses the warning message.
> 
> Signed-off-by: Wei Huang 
> ---
>  arch/x86/kvm/x86.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ef432f8..3f10ca2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2399,7 +2399,13 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>   case MSR_K7_HWCR:
>   case MSR_VM_HSAVE_PA:
>   case MSR_K7_EVNTSEL0:
> + case MSR_K7_EVNTSEL1:
> + case MSR_K7_EVNTSEL2:
> + case MSR_K7_EVNTSEL3:
>   case MSR_K7_PERFCTR0:
> + case MSR_K7_PERFCTR1:
> + case MSR_K7_PERFCTR2:
> + case MSR_K7_PERFCTR3:
>   case MSR_K8_INT_PENDING_MSG:
>   case MSR_AMD64_NB_CFG:
>   case MSR_FAM10H_MMIO_CONF_BASE:
> 
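
(Background, quoting msr-index.h from memory rather than from this patch:

	#define MSR_K7_EVNTSEL0		0xc0010000	/* ... MSR_K7_EVNTSEL3 = 0xc0010003 */
	#define MSR_K7_PERFCTR0		0xc0010004	/* ... MSR_K7_PERFCTR3 = 0xc0010007 */

so the guest's read of MSR c0010001, i.e. K7_EVNTSEL1, previously fell
through to the default #GP path that produced the warning above.)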

Thanks, applying for 3.18.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Avoid emulating instructions on #UD mistakenly

2014-08-18 Thread Paolo Bonzini
On 13/08/2014 16:21, Nadav Amit wrote:
> Correction: the word “never” in the message is too harsh.
> Nonetheless, there is a regression bug. I encountered it with the
> “wrfsbase” instruction.

So KVM is emulating wrfsbase even if the host doesn't support it?

I'm swapping the order of the two operands of &&, since the first one will
almost always be true and the second one will almost always be false; that
way the test short-circuits on the common path.

Also, there's now no need to test EmulateOnUD in the condition below.  Does the
below look good to you?

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 37a83b24e040..ef117b842334 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4394,11 +4394,11 @@ done_prefixes:
 
ctxt->execute = opcode.u.execute;
 
-	if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
+	if (unlikely(ctxt->ud) && likely(!(ctxt->d & EmulateOnUD)))
 		return EMULATION_FAILED;
 
 	if (unlikely(ctxt->d &
-		     (NotImpl|EmulateOnUD|Stack|Op3264|Sse|Mmx|Intercept|CheckPerm))) {
+		     (NotImpl|Stack|Op3264|Sse|Mmx|Intercept|CheckPerm))) {
 		/*
 		 * These are copied unconditionally here, and checked unconditionally
 		 * in x86_emulate_insn.
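
(Worked through for the reported case, assuming a host without FSGSBASE: the
guest executes wrfsbase, the CPU raises #UD, ctxt->ud is set, wrfsbase's
opcode entry lacks EmulateOnUD, so decode now returns EMULATION_FAILED and
KVM reinjects #UD into the guest instead of mis-emulating the instruction.)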

 

Paolo

> Nadav
> 
> On Aug 13, 2014, at 4:50 PM, Nadav Amit  wrote:
> 
>> Commit d40a6898e5 mistakenly caused instructions which are not marked as
>> EmulateOnUD to be emulated upon #UD exception. The commit caused the check of
>> whether the instruction flags include EmulateOnUD to never be evaluated. As a
>> result, instructions whose emulation is broken may be emulated.  This fix
>> moves the evaluation of EmulateOnUD so that it is actually evaluated.
>>
>> Signed-off-by: Nadav Amit 
>> ---
>> arch/x86/kvm/emulate.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index 56657b0..37a83b2 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -4394,6 +4394,9 @@ done_prefixes:
>>  
>>  	ctxt->execute = opcode.u.execute;
>>  
>> +	if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
>> +		return EMULATION_FAILED;
>> +
>>  	if (unlikely(ctxt->d &
>>  		     (NotImpl|EmulateOnUD|Stack|Op3264|Sse|Mmx|Intercept|CheckPerm))) {
>>  		/*
>> @@ -4406,9 +4409,6 @@ done_prefixes:
>>  		if (ctxt->d & NotImpl)
>>  			return EMULATION_FAILED;
>>  
>> -		if (!(ctxt->d & EmulateOnUD) && ctxt->ud)
>> -			return EMULATION_FAILED;
>> -
>>  		if (mode == X86EMUL_MODE_PROT64 && (ctxt->d & Stack))
>>  			ctxt->op_bytes = 8;
>>
>> -- 
>> 1.9.1
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Increase the number of fixed MTRR regs to 10

2014-08-18 Thread Wanpeng Li
On Mon, Aug 18, 2014 at 09:39:39AM +0300, Nadav Amit wrote:
>This should have been a benign patch. I'll try to get a windows 7
>installation disk and check ASAP.
>

In addition, it can only be reproduced on 32-bit win7 with MP enabled; the
UP case does not reproduce it.

Regards,
Wanpeng Li 

>Nadav
>
>> On 18 Aug 2014, at 05:17, Wanpeng Li  wrote:
>> 
>> Hi Nadav,
>>> On Wed, Jun 18, 2014 at 05:21:19PM +0300, Nadav Amit wrote:
>>> Recent Intel CPUs have 10 variable range MTRRs. Since operating systems
>>> sometimes make assumptions about CPUs while ignoring capability MSRs, it is
>>> better for KVM to be consistent with recent CPUs. Reporting more MTRRs than
>>> actually supported has no functional implications.
>>> 
>>> Signed-off-by: Nadav Amit 
>>> ---
>>> arch/x86/include/asm/kvm_host.h | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>> 
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 4931415..0bab29d 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -95,7 +95,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>>> #define KVM_REFILL_PAGES 25
>>> #define KVM_MAX_CPUID_ENTRIES 80
>>> #define KVM_NR_FIXED_MTRR_REGION 88
>>> -#define KVM_NR_VAR_MTRR 8
>>> +#define KVM_NR_VAR_MTRR 10
>> 
>> We observed an obvious regression caused by this commit: a 32-bit win7
>> guest shows a blue screen during boot.
>> 
>> Regards,
>> Wanpeng Li 
>> 
>>> #define ASYNC_PF_PER_VCPU 64
>>> 
>>> -- 
>>> 1.9.1
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
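

(On the capability MSR mentioned in the changelog: the architectural way for
a guest to discover the variable-range MTRR count is IA32_MTRRCAP rather
than a hard-coded assumption. A minimal sketch, assuming kernel-style
helpers:

	#define MSR_MTRRcap	0x000000fe

	u64 cap = native_read_msr(MSR_MTRRcap);
	unsigned int vcnt = cap & 0xff;	/* bits 7:0 = variable-range MTRR count */

a guest that honors MTRRcap simply sees whatever count KVM reports.)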