Re: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted

2015-09-02 Thread Nakajima, Jun
On Wed, Sep 2, 2015 at 3:38 PM, Steve Rutherford  wrote:
> On Thu, Aug 13, 2015 at 09:31:48AM +0200, Paolo Bonzini wrote:
> Pinging this thread.
>
> Should I put together a patch to make split irqchip work properly with the 
> old TMR behavior?

Yes, please.

Intel® 64 and IA-32 Architectures Software Developer’s Manual:

24.11.4 Software Access to Related Structures

In addition to data in the VMCS region itself, VMX non-root operation
can be controlled by data structures that are referenced by pointers in
a VMCS (for example, the I/O bitmaps). While the pointers to these data
structures are parts of the VMCS, the data structures themselves are
not. They are not accessible using VMREAD and VMWRITE but by ordinary
memory writes.

Software should ensure that each such data structure is modified only
when no logical processor with a current VMCS that references it is in
VMX non-root operation. Doing otherwise may lead to unpredictable
behavior (including behaviors identified in Section 24.11.1).


29.6 POSTED-INTERRUPT PROCESSING
...
Use of the posted-interrupt descriptor differs from that of other data
structures that are referenced by pointers in a VMCS. There is a general
requirement that software ensure that each such data structure is
modified only when no logical processor with a current VMCS that
references it is in VMX non-root operation. That requirement does not
apply to the posted-interrupt descriptor. There is a requirement,
however, that such modifications be done using locked read-modify-write
instructions.


>
>>
>>
>> On 13/08/2015 08:35, Zhang, Yang Z wrote:
>> >> You may be right. It is safe if no future hardware plans to use
>> >> it. Let me check with our hardware team to see whether it will be
>> >> used or not in future.
>> >
>> > After checking with Jun, there is no guarantee that the guest running
>> > on another CPU will operate properly if the hypervisor modifies the
>> > vTMR from another CPU. So the hypervisor should not do it.
>>
>> I guess I can cause a vmexit on level-triggered interrupts, it's not a
>> big deal, but no weasel words, please.
>>
>> What's going to break, and where is it documented?
>>
>> Paolo
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Jun
Intel Open Source Technology Center


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Nakajima, Jun
On Fri, Sep 19, 2014 at 3:06 PM, Andy Lutomirski  wrote:
> On Fri, Sep 19, 2014 at 3:05 PM, Theodore Ts'o  wrote:
>> On Fri, Sep 19, 2014 at 09:40:42AM -0700, H. Peter Anvin wrote:
>>>
>>> There is a huge disadvantage to the fact that CPUID is a user space
>>> instruction, though.
>>
>> But if the goal is to provide something like getrandom(2) direct from
>> the Host OS, it's not necessarily harmful to allow the Guest ring 3
>> code to be able to fetch randomness in that way.  The hypervisor can
>> implement rate limiting to protect against the guest using this too
>> frequently, but this is something that you should be doing for guest
>> ring 0 code anyway, since from the POV of the hypervisor Guest ring 0
>> is not necessarily any more trusted than Guest ring 3.
>
> On the other hand, the guest kernel might not want the guest ring 3 to
> be able to get random numbers.
>

But the RDSEED instruction, for example, is available at user level,
and I'm not sure the kernel can do anything about that.

-- 
Jun
Intel Open Source Technology Center


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Nakajima, Jun
On Thu, Sep 18, 2014 at 6:28 PM, Andy Lutomirski  wrote:
> On Thu, Sep 18, 2014 at 6:03 PM, Andy Lutomirski  wrote:
>> On Thu, Sep 18, 2014 at 5:49 PM, Nakajima, Jun  
>> wrote:
>>> On Thu, Sep 18, 2014 at 3:07 PM, Andy Lutomirski  
>>> wrote:
>>>
>>>> So, as a concrete straw-man:
>>>>
>>>> CPUID leaf 0x4800 would return a maximum leaf number in EAX (e.g.
>>>> 0x4801) along with a signature value (e.g. "CrossHVPara\0") in
>>>> EBX, ECX, and EDX.
>>>>
>>>> CPUID 0x4801.EAX would contain an MSR number to read to get a
>>>> random number if supported and zero if not supported.
>>>>
>>>> Questions:
>>>>
>>>> 1. Can we use a fixed MSR number?  This would be a little bit simpler,
>>>> but it would depend on getting a wider MSR range from Intel.
>>>>
>>>
>>> Why do you need a wider MSR range if you always detect the feature by
>>> CPUID.0x4801?
>>> Or are you still trying to avoid the detection by CPUID?
>>
>> Detecting the feature is one thing, but figuring out the MSR index is
>> another.  We could shove the index into the cpuid leaf, but that seems
>> unnecessarily indirect.  I'd much rather just say that CPUID leaves
>> *and* MSR indexes 0x4800-0x4800 or so are reserved for the
>> cross-HV mechanism, but we can't do that without either knowingly
>> violating the SDM assignments or asking Intel to consider allocating
>> more MSR indexes.
>>
>> Also, KVM is already conflicting with the SDM right now in its MSR
>> choice :(  I *think* that KVM could be changed to fix that, but 256
>> MSRs is rather confining given that KVM currently implements its own
>> MSR index *and* part of the Hyper-V index.
>
> Correction and update:
>
> KVM currently implements its own MSRs and, optionally, some of the
> Hyper-V MSRs.  By my count, Linux knows about 68 Hyper-V MSRs (in a
> header file), and there are currently 7 KVM MSRs, so over 1/4 of the
> available MSR indices are taken (and even more would be taken if KVM
> were to move its MSRs into the correct range).
>

I slept on it, and I think using the CPUID instruction alone would be
simple and efficient:
- We have a huge space for CPUID leaves
- CPUID also works at user level
- It can take an additional 32-bit parameter (ECX), and returns 4
32-bit values (EAX, EBX, ECX, and EDX).  RDMSR, for example, returns a
64-bit value.

Basically we can use it to implement a hypercall (rather than VMCALL).

For example,
- CPUID 0x4801.EAX would return the feature presence (e.g. in
EBX), and the result in EDX:EAX (if present) at the same time, or
- CPUID 0x4801.EAX would return the feature presence only, and
CPUID 0x4802.EAX (acts like a hypercall) returns up to 4 32-bit
values.

-- 
Jun
Intel Open Source Technology Center


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-18 Thread Nakajima, Jun
On Thu, Sep 18, 2014 at 3:07 PM, Andy Lutomirski  wrote:

> So, as a concrete straw-man:
>
> CPUID leaf 0x4800 would return a maximum leaf number in EAX (e.g.
> 0x4801) along with a signature value (e.g. "CrossHVPara\0") in
> EBX, ECX, and EDX.
>
> CPUID 0x4801.EAX would contain an MSR number to read to get a
> random number if supported and zero if not supported.
>
> Questions:
>
> 1. Can we use a fixed MSR number?  This would be a little bit simpler,
> but it would depend on getting a wider MSR range from Intel.
>

Why do you need a wider MSR range if you always detect the feature by
CPUID.0x4801?
Or are you still trying to avoid the detection by CPUID?

-- 
Jun
Intel Open Source Technology Center


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-18 Thread Nakajima, Jun
On Thu, Sep 18, 2014 at 12:07 PM, Andy Lutomirski  wrote:

> Might Intel be willing to extend that range to 0x4000 -
> 0x400f?  And would Microsoft be okay with using this mechanism for
> discovery?

So, for CPUID, the SDM (Table 3-17. Information Returned by CPUID) says today:
"No existing or future CPU will return processor identification or
feature information if the initial EAX value is in the range 40000000H
to 4FFFFFFFH."

We can define a cross-VM CPUID range from there. The CPUID can return
the index of the MSR if needed.

-- 
Jun
Intel Open Source Technology Center


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-18 Thread Nakajima, Jun
On Thu, Sep 18, 2014 at 10:20 AM, KY Srinivasan  wrote:
>
>
>> -----Original Message-----
>> From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo
>> Bonzini
>> Sent: Thursday, September 18, 2014 10:18 AM
>> To: Nakajima, Jun; KY Srinivasan
>> Cc: Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; Niels
>> Ferguson; Andy Lutomirski; David Hepkin; H. Peter Anvin; Jake Oshins; Linux
>> Virtualization
>> Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
>>
>> On 18/09/2014 19:13, Nakajima, Jun wrote:
>> > In terms of the address for the MSR, I suggest that you choose one
>> > from the range between 40000000H - 400000FFH. The SDM (35.1
>> > ARCHITECTURAL MSRS) says "All existing and future processors will not
>> > implement any features using any MSR in this range." Hyper-V already
>> > defines many synthetic MSRs in this range, and I think it would be
>> > reasonable for you to pick one for this to avoid a conflict?
>>
>> KVM is not using any MSR in that range.
>>
>> However, I think it would be better to have the MSR (and perhaps CPUID)
>> outside the hypervisor-reserved ranges, so that it becomes architecturally
>> defined.  In some sense it is similar to the HYPERVISOR CPUID feature.
>
> Yes, given that we want this to be hypervisor agnostic.
>

Actually, that MSR address range has been reserved for that purpose, along with:
- CPUID.EAX=1 -> ECX bit 31 (always returns 0 on bare metal)
- CPUID.EAX=4000_00xxH leaves (i.e. HYPERVISOR CPUID)


-- 
Jun
Intel Open Source Technology Center


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-18 Thread Nakajima, Jun
On Thu, Sep 18, 2014 at 9:36 AM, KY Srinivasan  wrote:
>
> I am copying other Hyper-V engineers to this discussion.
>

Thanks, K.Y.

In terms of the address for the MSR, I suggest that you choose one
from the range between 40000000H - 400000FFH. The SDM (35.1
ARCHITECTURAL MSRS) says "All existing and
future processors will not implement any features using any MSR in
this range." Hyper-V already defines many synthetic MSRs in this
range, and I think it would be reasonable for you to pick one for this
to avoid a conflict?

-- 
Jun
Intel Open Source Technology Center


Re: Freebsd VM Hang while Bootup on KVM, processor Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz

2014-09-09 Thread Nakajima, Jun
On Tue, Sep 9, 2014 at 4:12 AM, Venkateswara Rao Nandigam
 wrote:
> I have tried Freebsd10.0 64bit VM on the KVM Host running RHEL 6.4, processor 
> Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz.
>
> The Freebsd VM hangs at the "booting... " prompt.
>
> If I boot the host kernel with "nosmep", then the FreeBSD VM boots up fine.
> I know Xeon v2 processors have the smep feature.
>
> Any ideas/solutions on how to boot the FreeBSD VM with the smep option
> enabled in the host kernel?
>

Does it boot on bare metal?


-- 
Jun
Intel Open Source Technology Center


Re: Integrity in untrusted environments

2014-07-31 Thread Nakajima, Jun
On Thu, Jul 31, 2014 at 2:25 PM, Shiva V  wrote:
> Hello,
>  I am exploring ideas to implement a service inside a virtual machine on
> untrusted hypervisors under current cloud infrastructures.
>  Particularly, I am interested in how one can verify the integrity of the
> service in an environment where the hypervisor is not trusted. This is my setup.
>
> 1. I have two virtual machines. (Normal client VM's).
> 2. VM-A is executing a service and VM-B wants to verify its integrity.
> 3. Both are executing on untrusted hypervisor.
>
> Though Intel SGX would solve this by using the concept of enclaves, it's not
> publicly available yet.

Just a clarification: the concept of enclaves and the specs of Intel SGX
are publicly available.

See the following, for example:
https://software.intel.com/en-us/intel-isa-extensions

-- 
Jun
Intel Open Source Technology Center


Re: Verifying Execution Integrity in Untrusted hypervisors

2014-07-28 Thread Nakajima, Jun
On Mon, Jul 28, 2014 at 1:27 PM, Paolo Bonzini  wrote:
> On 28/07/2014 20:31, Jan Kiszka wrote:
>> The hypervisor has full control of and insight into the guest vCPU
>> state. Only protecting some portions of guest memory seems insufficient.
>>
>> We rather need encryption of every data that leaves the CPU or moves
>> from guest to host mode (and decryption the other way around). I guess
>> that would have quite some performance impact and is far from being easy
>> to integrate into modern processors. But, who knows...
>
> Intel SGX sounds somewhat like what you describe, but I'm not sure how
> it's going to be virtualized.
>

Right. It's possible to virtualize (or pass through) SGX without
losing the security feature.
With SGX, you can create secure (encrypted) islands of processes in
VMs as well. But I'm not sure if it's useful for solving the problem
described.

-- 
Jun
Intel Open Source Technology Center


Re: KVM exit on UD interception

2014-05-05 Thread Nakajima, Jun
On Mon, May 5, 2014 at 11:48 AM, Alexandru Duţu  wrote:
> Thank you Jun! I see that, in the case of VMX, if KVM does not emulate the
> instruction that produced a UD exception, it just queues the exception
> and returns 1. After that KVM will still try to enter virtualized
> execution and so forth, with the execution probably finishing with a DF
> and a shutdown. It does not seem that KVM, in the case of VMX, will exit
> immediately on UD.
>
> I am not sure what you meant with MOVBE emulation.

I meant:

commit 84cffe499b9418d6c3b4de2ad9599cc2ec50c607
Author: Borislav Petkov 
Date:   Tue Oct 29 12:54:56 2013 +0100

kvm: Emulate MOVBE

This basically came from the need to be able to boot 32-bit Atom SMP
guests on an AMD host, i.e. a host which doesn't support MOVBE. As a
matter of fact, qemu has since recently received MOVBE support but we
cannot share that with kvm emulation and thus we have to do this in the
host. We're waay faster in kvm anyway. :-)

So, we piggyback on the #UD path and emulate the MOVBE functionality.
With it, an 8-core SMP guest boots in under 6 seconds.

Also, requesting MOVBE emulation needs to happen explicitly to work,
i.e. qemu -cpu n270,+movbe...

Just FYI, a fairly straight-forward boot of a MOVBE-enabled 3.9-rc6+
kernel in kvm executes MOVBE ~60K times.

Signed-off-by: Andre Przywara 
Signed-off-by: Borislav Petkov 
Signed-off-by: Paolo Bonzini 


-- 
Jun
Intel Open Source Technology Center


Re: KVM exit on UD interception

2014-05-05 Thread Nakajima, Jun
On Mon, May 5, 2014 at 8:56 AM, Alexandru Duţu  wrote:
> Dear all,
>
> It seems that currently, on UD interception KVM does not exit
> completely. Virtualized execution finishes, KVM executes
> ud_intercept() after which it enters virtualized execution again.

You might want to take a look at the VMX side (to port it to
SVM). The MOVBE emulation, for example, should be helpful.

>
> I am working on accelerating with virtualized execution a simulator
> that emulates system calls. Essentially doing virtualized execution
> without an OS kernel. In order to make this work, I had to modify
> the KVM kernel module so that ud_intercept() returns 0 and not 1,
> which breaks KVM's __vcpu_run loop. This is necessary as I need to trap
> syscall instructions, exit virtualized execution with a UD exception,
> emulate the system call in the simulator and after the system call is
> done enter back in virtualized mode and start execution with the help
> of KVM.
>
> So by modifying ud_intercept() to return 0, I got all this to work. Is
> it possible to achieve the same effect (exit on undefined opcode)
> without modifying ud_intercept()?
>
> It seems that re-entering virtualized execution on UD interception
> gives the user the flexibility of running binaries with newer
> instructions on older hardware, if kvm is able to emulate the newer
> instructions. I do not fully understand the details of this scenario,
> is there such a scenario or is it likely that ud_interception() will
> change?
>
> Thank you in advance!
>
> Best regards,
> Alex
> --

-- 
Jun
Intel Open Source Technology Center


Re: [PATCH v2 0/2] kvm: x86: Emulate MSR_PLATFORM_INFO

2013-06-18 Thread Nakajima, Jun
On Tue, Jun 18, 2013 at 8:16 AM, Gleb Natapov  wrote:
> On Tue, Jun 18, 2013 at 04:05:08PM +0200, Paolo Bonzini wrote:
>> On 05/06/2013 10:42, Gleb Natapov wrote:
>> >> > These patches add an emulated MSR_PLATFORM_INFO that kvm guests
>> >> > can read as described in section 14.3.2.4 of the Intel SDM.
>> >> > The relevant changes and details are in [2/2]; [1/2] makes vendor_intel
>> >> > generic. There are at least two known applications that fail to run
>> >> > because of this MSR missing - Sandra and vTune.
>> > So I really want Intel's opinion on this. Right now it is impossible to
>> > implement the MSR correctly in the face of migration (maybe with TSC
>> > scaling it will be possible), and while it is unimplemented, an application
>> > that tries to use it fails; but if we implement it, the application will
>> > just produce an incorrect result without any means for the user to detect it.
>>
>> Jun, ping?  (Perhaps Gleb you want to ask a more specific question though).
>>
>> I don't think this is much different from any other RDTSC usage in
>> applications (they will typically do their calibration manually, and do
>> it just once).  I'm applying it to queue.
>>
> And we do not support applications that use RDTSC directly! If we could
> catch those it would be good from a support point of view, so the way
> MSR_PLATFORM_INFO behaves now is better than the proposed alternative.

Is it reasonable or possible to expose MSR_PLATFORM_INFO fully and then
disable migration? Some use cases (like VTune) don't need live
migration.


--
Jun
Intel Open Source Technology Center


Re: [PATCH v3 05/13] nEPT: MMU context for nested EPT

2013-05-21 Thread Nakajima, Jun
On Tue, May 21, 2013 at 1:50 AM, Xiao Guangrong
 wrote:
> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
>> From: Nadav Har'El 
>>
>> KVM's existing shadow MMU code already supports nested TDP. To use it, we
>> need to set up a new "MMU context" for nested EPT, and create a few callbacks
>> for it (nested_ept_*()). This context should also use the EPT versions of
>> the page table access functions (defined in the previous patch).
>> Then, we need to switch back and forth between this nested context and the
>> regular MMU context when switching between L1 and L2 (when L1 runs this L2
>> with EPT).
>>
>> Signed-off-by: Nadav Har'El 
>> Signed-off-by: Jun Nakajima 
>> Signed-off-by: Xinhao Xu 
>> ---
>>  arch/x86/kvm/mmu.c | 38 ++
>>  arch/x86/kvm/mmu.h |  1 +
>>  arch/x86/kvm/vmx.c | 54 
>> +-
>>  3 files changed, 92 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 6c1670f..37f8d7f 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -3653,6 +3653,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct 
>> kvm_mmu *context)
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
>>
>> +int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
>> +{
>> + ASSERT(vcpu);
>> + ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
>> +
>> + context->shadow_root_level = kvm_x86_ops->get_tdp_level();
>
> That means L1 guest always uses page-walk length == 4? But in your previous 
> patch,
> it can be 2.

We want to support "page-walk length == 4" only.

>
>> +
>> + context->nx = is_nx(vcpu); /* TODO: ? */
>
> Hmm? EPT always support NX.
>
>> + context->new_cr3 = paging_new_cr3;
>> + context->page_fault = EPT_page_fault;
>> + context->gva_to_gpa = EPT_gva_to_gpa;
>> + context->sync_page = EPT_sync_page;
>> + context->invlpg = EPT_invlpg;
>> + context->update_pte = EPT_update_pte;
>> + context->free = paging_free;
>> + context->root_level = context->shadow_root_level;
>> + context->root_hpa = INVALID_PAGE;
>> + context->direct_map = false;
>> +
>> + /* TODO: reset_rsvds_bits_mask() is not built for EPT, we need
>> +something different.
>> +  */
>
> Exactly. :)
>
>> + reset_rsvds_bits_mask(vcpu, context);
>> +
>> +
>> + /* TODO: I copied these from kvm_init_shadow_mmu, I don't know why
>> +they are done, or why they write to vcpu->arch.mmu and not context
>> +  */
>> + vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
>> + vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
>> + vcpu->arch.mmu.base_role.smep_andnot_wp =
>> + kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) &&
>> + !is_write_protection(vcpu);
>
> I guess we need not care these since the permission of EPT page does not 
> depend
> on these.

Right. I'll clean up this.

>
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu);
>> +
>>  static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
>>  {
>>   int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 2adcbc2..8fc94dd 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -54,6 +54,7 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 
>> addr, u64 sptes[4]);
>>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
>>  int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool 
>> direct);
>>  int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
>> +int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
>>
>>  static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
>>  {
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index fb9cae5..a88432f 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -1045,6 +1045,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct 
>> vmcs12 *vmcs12,
>>   return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
>>  }
>>
>> +static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
>> +{
>> + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
>> +}
>> +
>>  static inline bool is_exception(u32 intr_info)
>>  {
>>   return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
>> @@ -7311,6 +7316,46 @@ static void vmx_set_supported_cpuid(u32 func, struct 
>> kvm_cpuid_entry2 *entry)
>>   entry->ecx |= bit(X86_FEATURE_VMX);
>>  }
>>
>> +/* Callbacks for nested_ept_init_mmu_context: */
>> +
>> +static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
>> +{
>> + /* return the page table to be shadowed - in our case, EPT12 */
>> + return get_vmcs12(vcpu)->ept_pointer;
>> +}
>> +
>> +static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
>> + struct x86_exception *fault)
>> +{
>> + struct vmcs12 *vmcs12;
>> + nest

Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h

2013-05-21 Thread Nakajima, Jun
On Tue, May 21, 2013 at 4:05 AM, Xiao Guangrong
 wrote:
> On 05/21/2013 05:01 PM, Gleb Natapov wrote:
>> On Tue, May 21, 2013 at 04:30:13PM +0800, Xiao Guangrong wrote:
> @@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu 
> *vcpu, gva_t vaddr,
>
>return gpa;
>  }
> +#endif

 Strange!

 Why does nested ept not need these functions? How to emulate the 
 instruction faulted on L2?
>>>
>>> Sorry, i misunderstood it. Have found the reason out.
>>>
>> You can write it down here for future reviewers :)
>
> Okay.
>
> The functions used to translate L2's gva to L1's gpa are
> paging32_gva_to_gpa_nested and paging64_gva_to_gpa_nested, which are
> created with PTTYPE == 32 and PTTYPE == 64, respectively.
>
>

Back to your comments on PT_MAX_FULL_LEVELS:
> + #ifdef CONFIG_X86_64
> + #define PT_MAX_FULL_LEVELS 4
> + #define CMPXCHG cmpxchg
> + #else
> + #define CMPXCHG cmpxchg64
> +#define PT_MAX_FULL_LEVELS 2
I don't think we need to support nEPT on 32-bit hosts.  So, I plan to
remove such code. What do you think?

--
Jun
Intel Open Source Technology Center


Re: [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page()

2013-05-21 Thread Nakajima, Jun
Sure. Thanks for the suggestion.


On Tue, May 21, 2013 at 1:15 AM, Xiao Guangrong
 wrote:
> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
>> From: Nadav Har'El 
>>
>> Since link_shadow_page() is used by a routine in mmu.c, add an
>> EPT-specific link_shadow_page() in paging_tmp.h, rather than moving
>> it.
>>
>> Signed-off-by: Nadav Har'El 
>> Signed-off-by: Jun Nakajima 
>> Signed-off-by: Xinhao Xu 
>> ---
>>  arch/x86/kvm/paging_tmpl.h | 20 
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 4c45654..dc495f9 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -461,6 +461,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, 
>> struct guest_walker *gw,
>>   }
>>  }
>>
>> +#if PTTYPE == PTTYPE_EPT
>> +static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
>> +{
>> + u64 spte;
>> +
>> + spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
>> + VMX_EPT_EXECUTABLE_MASK;
>> +
>> + mmu_spte_set(sptep, spte);
>> +}
>> +#endif
>
> The only difference between this function and the current link_shadow_page()
> is shadow_accessed_mask. Can we add a parameter to eliminate this difference,
> some like:
>
> static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool 
> accessed)
> {
> u64 spte;
>
> spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
>shadow_user_mask | shadow_x_mask;
>
> if (accessed)
> spte |= shadow_accessed_mask;
>
> mmu_spte_set(sptep, spte);
> }
>
> ?
>
> --



-- 
Jun
Intel Open Source Technology Center


Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-05-18 Thread Nakajima, Jun
On Mon, May 13, 2013 at 5:25 AM, Gleb Natapov  wrote:
> Please use --no-chain-reply-to option to "git send-email" for nicer
> email threading and there is something wrong with Signed-off chain for
> the patches. The first Signed-off-by: is by Nadav, but you appears to be
> the author of the patches for git. AFAIK Nadav is the author, so you
> need to add proper From: in your patch submission. If you'll fix the
> authorship in your git "git format-patch" will do it for you.

I have been out of town, and I have just re-submitted the v3 patches
using those options. I wrote patches 12/13 and 13/13, so I didn't
add "From: Nadav" to them.

--
Jun
Intel Open Source Technology Center


Re: [nVMX with: v3.9-11789-ge0fd9af] Stack trace when L2 guest is rebooted.

2013-05-10 Thread Nakajima, Jun
On Fri, May 10, 2013 at 9:33 AM, Jan Kiszka  wrote:
> On 2013-05-10 17:39, Kashyap Chamarthy wrote:
>> On Fri, May 10, 2013 at 8:54 PM, Jan Kiszka  wrote:
>>>
>>> On 2013-05-10 17:12, Jan Kiszka wrote:
 On 2013-05-10 15:00, Kashyap Chamarthy wrote:
> Heya,
>
> This is on Intel Haswell.
>
> First, some version info:
>
> L0, L1 -- both of them have same versions of kernel, qemu:
>

I tried to reproduce such a problem, and I found that L2 (Linux) hangs in
SeaBIOS, after the line "iPXE (http://ipxe.org) ...". It happens with or
without VMCS shadowing (and even without my virtual EPT patches). I didn't
see this problem until I updated the L1 kernel to the latest (e.g.
3.9.0) from 3.7.0. L0 uses the kvm.git, next branch. It's possible
that the L1 kernel exposed a bug in the nested virtualization, as we
have seen such cases before.

--
Jun
Intel Open Source Technology Center


Re: [PATCH 02/11] nEPT: Add EPT tables support to paging_tmpl.h

2013-05-03 Thread Nakajima, Jun
Thanks for the comments.

This patch was mostly just a mechanical rebase of the original patch,
and I'm going to clean it up.

On Thu, May 2, 2013 at 4:54 PM, Marcelo Tosatti  wrote:
> On Thu, Apr 25, 2013 at 11:43:22PM -0700, Jun Nakajima wrote:
>> This is the first patch in a series which adds nested EPT support to KVM's
>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
>> to set its own cr3 and take its own page faults without either of L0 or L1
>> getting involved. This often significantly improves L2's performance over the
>> previous two alternatives (shadow page tables over EPT, and shadow page
>> tables over shadow page tables).
>>
>> This patch adds EPT support to paging_tmpl.h.
>>
>> paging_tmpl.h contains the code for reading and writing page tables. The code
>> for 32-bit and 64-bit tables is very similar, but not identical, so
>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
>> with PTTYPE=64, and this generates the two sets of similar functions.
>>
>> There are subtle but important differences between the format of EPT tables
>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
>> third set of functions to read the guest EPT table and to write the shadow
>> EPT table.
>>
>> So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions 
>> (prefixed
>> with "EPT") which correctly read and write EPT tables.
>>
>> Signed-off-by: Nadav Har'El 
>> Signed-off-by: Jun Nakajima 
>> Signed-off-by: Xinhao Xu 
>> ---
>>  arch/x86/kvm/mmu.c |  35 ++--
>>  arch/x86/kvm/paging_tmpl.h | 133 ++---
>>  2 files changed, 130 insertions(+), 38 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 956ca35..cb9c6fd 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2480,26 +2480,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu 
>> *vcpu, gfn_t gfn,
>>   return gfn_to_pfn_memslot_atomic(slot, gfn);
>>  }
>>
>> -static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
>> -   struct kvm_mmu_page *sp, u64 *spte,
>> -   u64 gpte)
>> -{
>> - if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
>> - goto no_present;
>> -
>> - if (!is_present_gpte(gpte))
>> - goto no_present;
>> -
>> - if (!(gpte & PT_ACCESSED_MASK))
>> - goto no_present;
>> -
>> - return false;
>> -
>> -no_present:
>> - drop_spte(vcpu->kvm, spte);
>> - return true;
>> -}
>> -
>>  static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
>>   struct kvm_mmu_page *sp,
>>   u64 *start, u64 *end)
>> @@ -3399,16 +3379,6 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, 
>> unsigned access,
>>   return false;
>>  }
>>
>> -static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
>> -{
>> - unsigned access;
>> -
>> - access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
>> - access &= ~(gpte >> PT64_NX_SHIFT);
>> -
>> - return access;
>> -}
>> -
>>  static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, 
>> unsigned gpte)
>>  {
>>   unsigned index;
>> @@ -3418,6 +3388,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, 
>> unsigned level, unsigned gp
>>   return mmu->last_pte_bitmap & (1 << index);
>>  }
>>
>> +#define PTTYPE_EPT 18 /* arbitrary */
>> +#define PTTYPE PTTYPE_EPT
>> +#include "paging_tmpl.h"
>> +#undef PTTYPE
>> +
>
> This breaks
>
> #if PTTYPE == 64
> if (walker->level == PT32E_ROOT_LEVEL) {
> pte = mmu->get_pdptr(vcpu, (addr >> 30) & 3);
> trace_kvm_mmu_paging_element(pte, walker->level);
>
> At walk_addr_generic.

This code path is not required for EPT page walk, as far as I
understand. Or am I missing something?

>
>>  #define PTTYPE 64
>>  #include "paging_tmpl.h"
>>  #undef PTTYPE
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 105dd5b..e13b6c5 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -50,6 +50,22 @@
>>   #define PT_LEVEL_BITS PT32_LEVEL_BITS
>>   #define PT_MAX_FULL_LEVELS 2
>>   #define CMPXCHG cmpxchg
>> +#elif PTTYPE == PTTYPE_EPT
>> + #define pt_element_t u64
>> + #define guest_walker guest_walkerEPT
>> + #define FNAME(name) EPT_##name
>> + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
>> + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
>> + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>> + #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>> + #define PT_LEVEL_BITS PT64_LEVEL_BITS
>> + #ifdef CONFIG_X86_64
>> + #define PT_MAX_FULL_LEVELS 4
>> + #define CMPXCHG cmpxchg
>> + #else
>> + #de

Re: [PATCH 11/11] nEPT: Provide the correct exit qualification upon EPT

2013-04-29 Thread Nakajima, Jun
On Mon, Apr 29, 2013 at 8:37 AM, Paolo Bonzini  wrote:
> Il 26/04/2013 08:43, Jun Nakajima ha scritto:
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index e13b6c5..bd370e7 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -349,7 +349,12 @@ error:
>>
>>   walker->fault.vector = PF_VECTOR;
>>   walker->fault.error_code_valid = true;
>> +#if PTTYPE != PTTYPE_EPT
>>   walker->fault.error_code = errcode;
>> +#else
>> + /* Reuse bits [2:0] of EPT violation */
>> + walker->fault.error_code = vcpu->arch.exit_qualification & 0x7;
>> +#endif
>>   walker->fault.address = addr;
>>   walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
>>
>
> I'm not sure that this is a step in the right direction.
>
> errcode is dropped completely, but it would be needed to rebuild bits
> 3:5 of the exit qualification.
>
> Perhaps it is better to access vcpu->arch.exit_qualification in
> nested_ept_inject_page_fault, and mix it with the error code from
> walker->fault to compute bits 3:5?

Yes. We need to generate those bits from the walk of the guest EPT
page tables, and combine them.

I'll update the patches, fixing the issues in Xinhao's patch.

--
Jun
Intel Open Source Technology Center


Re: [Bug 53611] New: nVMX: Add nested EPT

2013-04-26 Thread Nakajima, Jun
On Thu, Apr 25, 2013 at 11:26 PM, Jan Kiszka  wrote:

> That's great but - as Gleb already said - unfortunately not yet usable.
> I'd like to rebase my fixes and enhancements (unrestricted guest mode
> specifically) on top these days, and also run some tests with a non-KVM
> guest. So, if git send-email is not yet working there, I would also be
> happy about a public git repository.
>

I re-submitted the patches last night using git send-email this time.
We had some email problems back then, and I needed to use a workaround
(imap-send), which didn't work well.

-- 
Jun
Intel Open Source Technology Center


Re: [PATCH v10 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt

2013-04-26 Thread Nakajima, Jun
On Fri, Apr 26, 2013 at 2:29 AM, Yangminqiang  wrote:

> > Ivytown or newer platform supported it.
>
> Ivytown? Do you mean Ivy Bridge?
>

Ivy Town is the codename of "Ivy Bridge-based servers".

--
Jun
Intel Open Source Technology Center


Re: [Bug 53611] New: nVMX: Add nested EPT

2013-04-25 Thread Nakajima, Jun
On Wed, Apr 24, 2013 at 8:55 AM, Nakajima, Jun  wrote:
> Sorry about the slow progress. We've been distracted by some higher-priority
> work. The patches are ready (i.e. working), but we are cleaning them
> up. I'll send what we have today.

So, I have sent them, and frankly we are still cleaning up.  Please
bear with us.
We are also sending one more patchset to deal with EPT
misconfiguration, but Linux should run in L2 on top of L1 KVM.

--
Jun
Intel Open Source Technology Center


[PATCH 11/12] Move the routines to paging_tmpl.h to make them different for virtual EPT.

2013-04-25 Thread Nakajima, Jun
Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/kvm/mmu.c
---
 arch/x86/kvm/mmu.c | 30 --
 1 file changed, 30 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 34e406e2..99bfc5e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2480,26 +2480,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct
kvm_vcpu *vcpu, gfn_t gfn,
  return gfn_to_pfn_memslot_atomic(slot, gfn);
 }

-static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
-  struct kvm_mmu_page *sp, u64 *spte,
-  u64 gpte)
-{
- if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
- goto no_present;
-
- if (!is_present_gpte(gpte))
- goto no_present;
-
- if (!(gpte & PT_ACCESSED_MASK))
- goto no_present;
-
- return false;
-
-no_present:
- drop_spte(vcpu->kvm, spte);
- return true;
-}
-
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 struct kvm_mmu_page *sp,
 u64 *start, u64 *end)
@@ -3399,16 +3379,6 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t
gfn, unsigned access,
  return false;
 }

-static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
-{
- unsigned access;
-
- access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
- access &= ~(gpte >> PT64_NX_SHIFT);
-
- return access;
-}
-
 static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level,
unsigned gpte)
 {
  unsigned index;
--
1.8.2.1.610.g562af5b


[PATCH 12/12] Provide the correct exit qualification upon EPT violation to L1 VMM.

2013-04-25 Thread Nakajima, Jun
Since vcpu_vmx is contained in vmx.c, use kvm_vcpu_arch so that we can
use the exit qualification in paging_tmpl.h.

Signed-off-by: Jun Nakajima 

modified:   arch/x86/include/asm/kvm_host.h
modified:   arch/x86/kvm/paging_tmpl.h
modified:   arch/x86/kvm/vmx.c
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/paging_tmpl.h  | 4 
 arch/x86/kvm/vmx.c  | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4979778..5d1fdf2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -504,6 +504,8 @@ struct kvm_vcpu_arch {
  * instruction.
  */
  bool write_fault_to_shadow_pgtable;
+
+ unsigned long exit_qualification;
 };

 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 6226b51..0da6044 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -349,7 +349,11 @@ error:

  walker->fault.vector = PF_VECTOR;
  walker->fault.error_code_valid = true;
+#if PTTYPE != PTTYPE_EPT
  walker->fault.error_code = errcode;
+#else
 + walker->fault.error_code = vcpu->arch.exit_qualification & 0x7; /*
exit_qualification */
+#endif
  walker->fault.address = addr;
  walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 95304cc..61e2853 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -425,6 +425,7 @@ struct vcpu_vmx {
  ktime_t entry_time;
  s64 vnmi_blocked_time;
  u32 exit_reason;
+ unsigned long exit_qualification;

  bool rdtscp_enabled;

@@ -5074,6 +5075,8 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
  /* ept page table is present? */
  error_code |= (exit_qualification >> 3) & 0x1;

+vcpu->arch.exit_qualification = exit_qualification;
+
  return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }

--
1.8.2.1.610.g562af5b


[PATCH 10/12] Subject: [PATCH 10/10] nEPT: Miscellaneous cleanups

2013-04-25 Thread Nakajima, Jun
Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/include/asm/vmx.h
modified:   arch/x86/kvm/vmx.c
---
 arch/x86/include/asm/vmx.h | 44 
 arch/x86/kvm/vmx.c |  3 +--
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ce54f3..5838be1 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -254,6 +254,50 @@ enum vmcs_field {
  HOST_RIP= 0x6c16,
 };

+#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000
+
+#define EXIT_REASON_EXCEPTION_NMI   0
+#define EXIT_REASON_EXTERNAL_INTERRUPT  1
+#define EXIT_REASON_TRIPLE_FAULT2
+
+#define EXIT_REASON_PENDING_INTERRUPT   7
+#define EXIT_REASON_NMI_WINDOW 8
+#define EXIT_REASON_TASK_SWITCH 9
+#define EXIT_REASON_CPUID   10
+#define EXIT_REASON_HLT 12
+#define EXIT_REASON_INVD13
+#define EXIT_REASON_INVLPG  14
+#define EXIT_REASON_RDPMC   15
+#define EXIT_REASON_RDTSC   16
+#define EXIT_REASON_VMCALL  18
+#define EXIT_REASON_VMCLEAR 19
+#define EXIT_REASON_VMLAUNCH20
+#define EXIT_REASON_VMPTRLD 21
+#define EXIT_REASON_VMPTRST 22
+#define EXIT_REASON_VMREAD  23
+#define EXIT_REASON_VMRESUME24
+#define EXIT_REASON_VMWRITE 25
+#define EXIT_REASON_VMOFF   26
+#define EXIT_REASON_VMON27
+#define EXIT_REASON_CR_ACCESS   28
+#define EXIT_REASON_DR_ACCESS   29
+#define EXIT_REASON_IO_INSTRUCTION  30
+#define EXIT_REASON_MSR_READ31
+#define EXIT_REASON_MSR_WRITE   32
+#define EXIT_REASON_INVALID_STATE 33
+#define EXIT_REASON_MWAIT_INSTRUCTION   36
+#define EXIT_REASON_MONITOR_INSTRUCTION 39
+#define EXIT_REASON_PAUSE_INSTRUCTION   40
+#define EXIT_REASON_MCE_DURING_VMENTRY 41
+#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
+#define EXIT_REASON_APIC_ACCESS 44
+#define EXIT_REASON_EPT_VIOLATION   48
+#define EXIT_REASON_EPT_MISCONFIG   49
+#define EXIT_REASON_INVEPT 50
+#define EXIT_REASON_WBINVD 54
+#define EXIT_REASON_XSETBV 55
+#define EXIT_REASON_INVPCID 58
+
 /*
  * Interruption-information format
  */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 10f2a69..95304cc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -616,7 +616,6 @@ static void nested_release_page_clean(struct page *page)
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
 struct kvm_segment *var, int seg);
@@ -6320,7 +6319,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)

  if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
 !(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-get_vmcs12(vcpu), vcpu)))) {
+ get_vmcs12(vcpu)))) {
  if (vmx_interrupt_allowed(vcpu)) {
  vmx->soft_vnmi_blocked = 0;
  } else if (vmx->vnmi_blocked_time > 10LL &&
--
1.8.2.1.610.g562af5b


[PATCH 09/12] Subject: [PATCH 09/10] nEPT: Documentation

2013-04-25 Thread Nakajima, Jun
Update the documentation to no longer say that nested EPT is not supported.

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   Documentation/virtual/kvm/nested-vmx.txt
---
 Documentation/virtual/kvm/nested-vmx.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/nested-vmx.txt
b/Documentation/virtual/kvm/nested-vmx.txt
index 8ed937d..cdf7839 100644
--- a/Documentation/virtual/kvm/nested-vmx.txt
+++ b/Documentation/virtual/kvm/nested-vmx.txt
@@ -38,8 +38,8 @@ The current code supports running Linux guests under
KVM guests.
 Only 64-bit guest hypervisors are supported.

 Additional patches for running Windows under guest KVM, and Linux under
-guest VMware server, and support for nested EPT, are currently running in
-the lab, and will be sent as follow-on patchsets.
+guest VMware server, are currently running in the lab, and will be sent as
+follow-on patchsets.


 Running nested VMX
--
1.8.2.1.610.g562af5b


[PATCH 08/12] Subject: [PATCH 08/10] nEPT: Nested INVEPT

2013-04-25 Thread Nakajima, Jun
If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table for
L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course
of this modification already calls INVEPT. Therefore, when L1 calls INVEPT,
we don't really need to do anything. In particular we *don't* need to call
the real INVEPT again. All we do in our INVEPT is verify the validity of the
call, and its parameters, and then do nothing.

In KVM Forum 2010, Dong et al. presented "Nested Virtualization Friendly KVM"
and classified our current nested EPT implementation as "shadow-like virtual
EPT". They recommended instead a different approach, which they called "VTLB-like
virtual EPT". If we had taken that alternative approach, INVEPT would have had
a bigger role: L0 would only rebuild the shadow EPT table when L1 calls INVEPT.

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/include/asm/vmx.h
modified:   arch/x86/kvm/vmx.c
---
 arch/x86/include/asm/vmx.h |  4 ++-
 arch/x86/kvm/vmx.c | 83 ++
 2 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index b6fbf86..0ce54f3 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -376,7 +376,9 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT (1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT (1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT (1ull << 17)
-#define VMX_EPT_AD_BIT(1ull << 21)
+#define VMX_EPT_INVEPT_BIT (1ull << 20)
+#define VMX_EPT_AD_BIT (1ull << 21)
+#define VMX_EPT_EXTENT_INDIVIDUAL_BIT (1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT (1ull << 26)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a5e14d1..10f2a69 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5878,6 +5878,87 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
  return 1;
 }

+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+ u32 vmx_instruction_info;
+ unsigned long type;
+ gva_t gva;
+ struct x86_exception e;
+ struct {
+ u64 eptp, gpa;
+ } operand;
+
+ if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+!(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ if (!nested_vmx_check_permission(vcpu))
+ return 1;
+
+ if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ /* According to the Intel VMX instruction reference, the memory
+ * operand is read even if it isn't needed (e.g., for type==global)
+ */
+ vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+ if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+ vmx_instruction_info, &gva))
+ return 1;
+ if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+ sizeof(operand), &e)) {
+ kvm_inject_page_fault(vcpu, &e);
+ return 1;
+ }
+
+ type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+ switch (type) {
+ case VMX_EPT_EXTENT_GLOBAL:
+ if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+ nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+ else {
+ /*
+ * Do nothing: when L1 changes EPT12, we already
+ * update EPT02 (the shadow EPT table) and call INVEPT.
+ * So when L1 calls INVEPT, there's nothing left to do.
+ */
+ nested_vmx_succeed(vcpu);
+ }
+ break;
+ case VMX_EPT_EXTENT_CONTEXT:
+ if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
+ nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+ else {
+ /* Do nothing */
+ nested_vmx_succeed(vcpu);
+ }
+ break;
+ case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
+ if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT))
+ nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+ else {
+ /* Do nothing */
+ nested_vmx_succeed(vcpu);
+ }
+ break;
+ default:
+ nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+ }
+
+ skip_emulated_instruction(vcpu);
+ return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -5922,6 +6003,7 @@ static int (*const
kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
  [EXIT_REASON_PAUSE_INSTRUCTION]   = handle_pause,
  [EXIT_REASON_MWAIT_INSTRUCTION]  = handle_invalid_op,
  [EXIT_REASON_MONITOR_INSTRUCTION] = handle_invalid_op,
+ [EXIT_REASON_INVEPT]  = handle_invept,
 };

 static const int kvm_vmx_max_exit_handlers =
@@ -6106,6 +6188,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
  case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
  case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
  case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
+ case EXIT_REASON_INVEPT:
  /*
  * VMX instructions trap unconditionally. This allows

[PATCH 07/12] Subject: [PATCH 07/10] nEPT: Advertise EPT to L1

2013-04-25 Thread Nakajima, Jun
Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/kvm/vmx.c
---
 arch/x86/kvm/vmx.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0e99b15..a5e14d1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2026,6 +2026,7 @@ static u32 nested_vmx_secondary_ctls_low,
nested_vmx_secondary_ctls_high;
 static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
  /*
@@ -2101,6 +2102,18 @@ static __init void nested_vmx_setup_ctls_msrs(void)
  nested_vmx_secondary_ctls_low = 0;
  nested_vmx_secondary_ctls_high &=
  SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+ if (enable_ept) {
+ /* nested EPT: emulate EPT also to L1 */
+ nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+ nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+ nested_vmx_ept_caps |=
+ VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+ VMX_EPT_EXTENT_CONTEXT_BIT |
+ VMX_EPT_EXTENT_INDIVIDUAL_BIT;
+ nested_vmx_ept_caps &= vmx_capability.ept;
+ } else
+ nested_vmx_ept_caps = 0;
+
 }

 static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
@@ -2200,8 +2213,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu
*vcpu, u32 msr_index, u64 *pdata)
  nested_vmx_secondary_ctls_high);
  break;
  case MSR_IA32_VMX_EPT_VPID_CAP:
- /* Currently, no nested ept or nested vpid */
- *pdata = 0;
+ /* Currently, no nested vpid support */
+ *pdata = nested_vmx_ept_caps;
  break;
  default:
  return 0;
--
1.8.2.1.610.g562af5b


[PATCH 06/12] Subject: [PATCH 06/10] nEPT: Some additional comments

2013-04-25 Thread Nakajima, Jun
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/kvm/vmx.c
---
 arch/x86/kvm/vmx.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d4bfd32..0e99b15 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6126,7 +6126,20 @@ static bool nested_vmx_exit_handled(struct
kvm_vcpu *vcpu)
  return nested_cpu_has2(vmcs12,
  SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
  case EXIT_REASON_EPT_VIOLATION:
+ /*
+ * L0 always deals with the EPT violation. If nested EPT is
+ * used, and the nested mmu code discovers that the address is
+ * missing in the guest EPT table (EPT12), the EPT violation
+ * will be injected with nested_ept_inject_page_fault()
+ */
+ return 0;
  case EXIT_REASON_EPT_MISCONFIG:
+ /*
+ * L2 never uses directly L1's EPT, but rather L0's own EPT
+ * table (shadow on EPT) or a merged EPT table that L0 built
+ * (EPT on EPT). So any problems with the structure of the
+ * table is L0's fault.
+ */
  return 0;
  case EXIT_REASON_WBINVD:
  return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING);
--
1.8.2.1.610.g562af5b


[PATCH 05/12] Subject: [PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3

2013-04-25 Thread Nakajima, Jun
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/kvm/x86.c
---
 arch/x86/kvm/x86.c | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e172132..c34590d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -659,17 +659,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
  */
  }

- /*
- * Does the new cr3 value map to physical memory? (Note, we
- * catch an invalid cr3 even in real-mode, because it would
- * cause trouble later on when we turn on paging anyway.)
- *
- * A real CPU would silently accept an invalid cr3 and would
- * attempt to use it - with largely undefined (and often hard
- * to debug) behavior on the guest side.
- */
- if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
- return 1;
  vcpu->arch.cr3 = cr3;
  __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
  vcpu->arch.mmu.new_cr3(vcpu);
--
1.8.2.1.610.g562af5b


[PATCH 04/12] Subject: [PATCH 04/10] nEPT: Fix cr3 handling in nested exit and entry

2013-04-25 Thread Nakajima, Jun
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/kvm/vmx.c
---
 arch/x86/kvm/vmx.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f2fd79d..d4bfd32 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7162,10 +7162,26 @@ static void prepare_vmcs02(struct kvm_vcpu
*vcpu, struct vmcs12 *vmcs12)
  vmx_set_cr4(vcpu, vmcs12->guest_cr4);
  vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));

- /* shadow page tables on either EPT or shadow page tables */
+ /*
+ * Note that kvm_set_cr3() and kvm_mmu_reset_context() will do the
+ * right thing, and set GUEST_CR3 and/or EPT_POINTER in all supported
+ * settings: 1. shadow page tables on shadow page tables, 2. shadow
+ * page tables on EPT, 3. EPT on EPT.
+ */
  kvm_set_cr3(vcpu, vmcs12->guest_cr3);
  kvm_mmu_reset_context(vcpu);

+ /*
+ * Additionally, except when L0 is using shadow page tables, L1 or
+ * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+ */
+ if (enable_ept) {
+ vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+ vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+ vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+ vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+ }
+
  kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
  kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -7397,6 +7413,25 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12)
  vmcs12->guest_pending_dbg_exceptions =
  vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);

+ /*
+ * In some cases (usually, nested EPT), L2 is allowed to change its
+ * own CR3 without exiting. If it has changed it, we must keep it.
+ * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+ * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+ */
+ if (enable_ept)
+ vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+ /*
+ * Additionally, except when L0 is using shadow page tables, L1 or
+ * L2 control guest_cr3 for L2, so save their PDPTEs
+ */
+ if (enable_ept) {
+ vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+ vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+ vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+ vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+ }
+
  /* TODO: These cannot have changed unless we have MSR bitmaps and
  * the relevant bit asks not to trap the change */
  vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
--
1.8.2.1.610.g562af5b
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/12] Subject: [PATCH 03/10] nEPT: MMU context for nested EPT

2013-04-25 Thread Nakajima, Jun
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/kvm/mmu.c
modified:   arch/x86/kvm/mmu.h
modified:   arch/x86/kvm/vmx.c
---
 arch/x86/kvm/mmu.c | 38 
 arch/x86/kvm/mmu.h |  1 +
 arch/x86/kvm/vmx.c | 56 +++---
 3 files changed, 92 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 91cac19..34e406e2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3674,6 +3674,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
struct kvm_mmu *context)
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);

+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
+{
+ ASSERT(vcpu);
+ ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+ context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+
+ context->nx = is_nx(vcpu); /* TODO: ? */
+ context->new_cr3 = paging_new_cr3;
+ context->page_fault = EPT_page_fault;
+ context->gva_to_gpa = EPT_gva_to_gpa;
+ context->sync_page = EPT_sync_page;
+ context->invlpg = EPT_invlpg;
+ context->update_pte = EPT_update_pte;
+ context->free = paging_free;
+ context->root_level = context->shadow_root_level;
+ context->root_hpa = INVALID_PAGE;
+ context->direct_map = false;
+
+ /* TODO: reset_rsvds_bits_mask() is not built for EPT, we need
+   something different.
+ */
+ reset_rsvds_bits_mask(vcpu, context);
+
+
+ /* TODO: I copied these from kvm_init_shadow_mmu, I don't know why
+   they are done, or why they write to vcpu->arch.mmu and not context
+ */
+ vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
+ vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
+ vcpu->arch.mmu.base_role.smep_andnot_wp =
+ kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) &&
+ !is_write_protection(vcpu);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu);
+
 static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
  int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6987108..19dd5ab 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -54,6 +54,7 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu
*vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr,
bool direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);

 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9e0ec9d..f2fd79d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -912,12 +912,16 @@ static inline bool nested_cpu_has2(struct vmcs12
*vmcs12, u32 bit)
  (vmcs12->secondary_vm_exec_control & bit);
 }

-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
- struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
  return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }

+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+ return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
  return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -6873,6 +6877,46 @@ static void vmx_set_supported_cpuid(u32 func,
struct kvm_cpuid_entry2 *entry)
  entry->ecx |= bit(X86_FEATURE_VMX);
 }

+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+ /* return the page table to be shadowed - in our case, EPT12 */
+ return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+ struct x86_exception *fault)
+{
+ struct vmcs12 *vmcs12;
+ nested_vmx_vmexit(vcpu);
+ vmcs12 = get_vmcs12(vcpu);
+ /*
+ * Note no need to set vmcs12->vm_exit_reason as it is already copied
+ * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+ */
+ vmcs12->exit_qualification = fault->error_code;
+ vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+ int r = kvm_init_shadow_EPT_mmu(vcpu, &vcpu->arch.mmu);
+
+ vcpu->arch.mmu.set_cr3   = vmx_set_cr3;
+ vcpu->arch.mmu.get_cr3   = nested_ept_get_cr3;
+ vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+ vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
+
+ return r;
+}
+
+static void nested_ept_un

[PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h

2013-04-25 Thread Nakajima, Jun
This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).

This patch adds EPT support to paging_tmpl.h.

paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.

There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.

So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
with "EPT") which correctly read and write EPT tables.

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/kvm/mmu.c
modified:   arch/x86/kvm/paging_tmpl.h
---
 arch/x86/kvm/mmu.c |   5 ++
 arch/x86/kvm/paging_tmpl.h | 135 ++---
 2 files changed, 131 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 956ca35..91cac19 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3418,6 +3418,11 @@ static inline bool is_last_gpte(struct kvm_mmu
*mmu, unsigned level, unsigned gp
  return mmu->last_pte_bitmap & (1 << index);
 }

+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 105dd5b..6226b51 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -50,6 +50,22 @@
  #define PT_LEVEL_BITS PT32_LEVEL_BITS
  #define PT_MAX_FULL_LEVELS 2
  #define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+ #define pt_element_t u64
+ #define guest_walker guest_walkerEPT
+ #define FNAME(name) EPT_##name
+ #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+ #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+ #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+ #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+ #define PT_LEVEL_BITS PT64_LEVEL_BITS
+ #ifdef CONFIG_X86_64
+ #define PT_MAX_FULL_LEVELS 4
+ #define CMPXCHG cmpxchg
+ #else
+ #define CMPXCHG cmpxchg64
+ #define PT_MAX_FULL_LEVELS 2
+ #endif
 #else
  #error Invalid PTTYPE value
 #endif
@@ -80,6 +96,7 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
  return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
 }

+#if PTTYPE != PTTYPE_EPT
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
pt_element_t __user *ptep_user, unsigned index,
pt_element_t orig_pte, pt_element_t new_pte)
@@ -102,7 +119,52 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu
*vcpu, struct kvm_mmu *mmu,

  return (ret != orig_pte);
 }
+#endif
+
+static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+ unsigned access;
+
+#if PTTYPE == PTTYPE_EPT
+ /* We rely here that ACC_WRITE_MASK==VMX_EPT_WRITABLE_MASK */
+ access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
+ ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
+#else
+ access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+ access &= ~(gpte >> PT64_NX_SHIFT);
+#endif
+
+ return access;
+}
+
+static inline int FNAME(is_present_gpte)(unsigned long pte)
+{
+#if PTTYPE == PTTYPE_EPT
+ return pte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+ VMX_EPT_EXECUTABLE_MASK);
+#else
+ return is_present_gpte(pte);
+#endif
+}
+
+static inline int FNAME(check_write_user_access)(struct kvm_vcpu *vcpu,
+   bool write_fault, bool user_fault,
+   unsigned long pte)
+{
+#if PTTYPE == PTTYPE_EPT
+ if (unlikely(write_fault && !(pte & VMX_EPT_WRITABLE_MASK)
+ && (user_fault || is_write_protection(vcpu))))
+ return false;
+ return true;
+#else
+ u32 access = ((kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0)
+| (write_fault ? PFERR_WRITE_MASK : 0);
+
+ return !permission_fault(vcpu->arch.walk_mmu, vcpu->arch.access, access);
+#endif
+}

+#if PTTYPE != PTTYPE_EPT
 static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
  struct kvm_mmu *mmu,
  struct guest_walker *walker,
@@ -139,6 +201,7 @@ static int
FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
  }
  return 0;
 }
+#endif

 /*
  * Fetch a guest pte for a guest virtual address
@@ -147,7 +210,6 @@ static int FNAME(walk_addr_generic)(struct
guest_walker *walker,
 struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 gva_t addr

[PATCH 01/10] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-04-25 Thread Nakajima, Jun
Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577,
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.

Signed-off-by: Nadav Har'El 
Signed-off-by: Jun Nakajima 

modified:   arch/x86/kvm/vmx.c
---
 arch/x86/kvm/vmx.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..9e0ec9d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2057,6 +2057,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #else
  nested_vmx_exit_ctls_high = 0;
 #endif
+ nested_vmx_exit_ctls_high |= VM_EXIT_LOAD_IA32_EFER;

  /* entry controls */
  rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2064,6 +2065,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
  nested_vmx_entry_ctls_low = 0;
  nested_vmx_entry_ctls_high &=
  VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
+ nested_vmx_entry_ctls_high |= VM_ENTRY_LOAD_IA32_EFER;

  /* cpu-based controls */
  rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
@@ -7050,10 +7052,18 @@ static void prepare_vmcs02(struct kvm_vcpu
*vcpu, struct vmcs12 *vmcs12)
  vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
  vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);

- /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
- vmcs_write32(VM_EXIT_CONTROLS,
- vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
- vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+ /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+ * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
+ * bits are further modified by vmx_set_efer() below.
+ */
+ vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+ /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+ * emulated by vmx_set_efer(), below.
+ */
+ vmcs_write32(VM_ENTRY_CONTROLS,
+ (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+ ~VM_ENTRY_IA32E_MODE) |
  (vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));

  if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
--
1.8.2.1.610.g562af5b
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 53611] New: nVMX: Add nested EPT

2013-04-24 Thread Nakajima, Jun
On Wed, Apr 24, 2013 at 12:25 AM, Jan Kiszka  wrote:
>>
>> I don't have a full picture (already asked you to post / git-push your
>> intermediate state), but nested related states typically go to
>> nested_vmx, thus vcpu_vmx.
>
> Ping regarding publication. I'm about to redo your porting work as we
> are making no progress.
>

Sorry about the slow progress. We've been distracted by some higher-priority
things. The patches are ready (i.e. working), but we are cleaning them
up. I'll send what we have today.

--
Jun
Intel Open Source Technology Center


Re: [Bug 53611] New: nVMX: Add nested EPT

2013-03-21 Thread Nakajima, Jun
On Mon, Mar 4, 2013 at 8:45 PM, Nakajima, Jun  wrote:
> I have some updates on this. We rebased the patches to the latest KVM
> (L0). It turned out that the version of L1 KVM/Linux matters. At that
> time, actually I used v3.7 kernel for L1, and the L2 didn't work as I
> described above. If I use v3.5 or older for L1, L2 works with the EPT
> patches. So, I guess some changes made to v3.6 might have exposed a
> bug with the nested EPT patches or elsewhere. We are looking at the
> changes to root-cause it.
>

Finally I've had more time to work on this, and I think I've fixed
it. The problem was that the exit qualification for an EPT violation
(to L1) was not accurate enough, and I needed to save the exit
qualification upon an EPT violation somewhere. Today, that information is
converted to error_code (see below), and we lose the information. We
need to use at least the lower 3 bits when injecting an EPT violation to
the L1 VMM. I tried to use the upper bytes of error_code to pass part
of the exit qualification, but it didn't work well. Any suggestion for
a place to store the value? kvm_vcpu?

   ...
/* It is a write fault? */
error_code = exit_qualification & (1U << 1);
/* ept page table is present? */
error_code |= (exit_qualification >> 3) & 0x1;

return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);

--
Jun
Intel Open Source Technology Center


Re: [Bug 53611] New: nVMX: Add nested EPT

2013-03-04 Thread Nakajima, Jun
On Tue, Feb 26, 2013 at 11:43 AM, Jan Kiszka  wrote:
> On 2013-02-26 15:11, Nadav Har'El wrote:
>> On Thu, Feb 14, 2013, Nakajima, Jun wrote about "Re: [Bug 53611] New: nVMX: 
>> Add nested EPT":
>>> We have started looking at the patches first. But I couldn't
>>> reproduce the results by simply applying the original patches to v3.6:
>>> - L2 Ubuntu 12.04 (64-bit)  (smp 2)
>>> - L1 Ubuntu 12.04 (64-bit) KVM (smp 2)
>>> - L0 Ubuntu 12.04 (64-bit)-based. kernel/KVM is v3.6 + patches (the
>>> ones in nept-v2.tgz).
>>> https://bugzilla.kernel.org/attachment.cgi?id=93101
>>>
>>> Without the patches, the L2 guest works. With it, it hangs at boot
>>> time (just black screen):
>>> - EPT was detected by L1 KVM.
>>> - UP L2 didn't help.
>>> - Looks like it's looping at EPT_walk_addr_generic at the same address in L0.
>>>
>>> Will take a closer look. It would be helpful if the test configuration
>>> (e.g. kernel/commit id used, L1/L2 guests) was documented as well.
>>
>> I sent the patches in August 1st, and they applied to commit
>> ade38c311a0ad8c32e902fe1d0ae74d0d44bc71e from a week earlier.
>>
>> In most of my tests, L1 and L2 were old images - L1 had Linux 2.6.33,
>> while L2 had Linux 2.6.28. In most of my tests both L1 and L2 were UP.
>>
>> I've heard another report of my patch not working with newer L1/L2 -
>> the report said that L2 failed to boot (like you reported), and also
>> that L1 became unstable (running anything in it gave a memory fault).
>> So it is very likely that this code still has bugs - but since I already
>> know of errors and holes that need to be plugged (see the announcement file
>> together with the patches), it's not very surprising :( These patches
>> definitely need some lovin', but it's easier than starting from scratch.
>
> FWIW, I'm playing with them on top of kvm-3.6-2 (second pull request for
> 3.6) for a while. They work OK for my use case (static mapping) but
> apparently lock up L2 when starting KVM on KVM, just as reported. I
> didn't look into any details there, still busy with fixing other issues
> like CR0/CR4 handling (which I came across while adding unrestricted
> guest support on top of EPT).

I have some updates on this. We rebased the patches to the latest KVM
(L0). It turned out that the version of L1 KVM/Linux matters. At that
time, actually I used v3.7 kernel for L1, and the L2 didn't work as I
described above. If I use v3.5 or older for L1, L2 works with the EPT
patches. So, I guess some changes made to v3.6 might have exposed a
bug with the nested EPT patches or elsewhere. We are looking at the
changes to root-cause it.

>
> Given that I'm porting now patches between that branch and "next" back
> and forth (I depend on EPT), it would be really great if someone
> familiar with the KVM MMU (or enough time) could port the series to the
> current git head. That would not solve remaining bugs but could trigger
> more development, maybe also help me jumping into this.
>
> Thanks,
> Jan
>


-- 
Jun
Intel Open Source Technology Center


Re: [Bug 53611] New: nVMX: Add nested EPT

2013-02-14 Thread Nakajima, Jun
On Tue, Feb 12, 2013 at 11:43 PM, Jan Kiszka  wrote:
>
> On 2013-02-12 20:13, Nakajima, Jun wrote:
> > I looked at your (old) patches, and they seem to be very useful
> > although some of them require rebasing or rewriting. We are interested
> > in completing the nested-VMX features.
>
> That's great news. Can you estimate when you will be able to work on it?
>

We have started looking at the patches first. But I couldn't
reproduce the results by simply applying the original patches to v3.6:
- L2 Ubuntu 12.04 (64-bit)  (smp 2)
- L1 Ubuntu 12.04 (64-bit) KVM (smp 2)
- L0 Ubuntu 12.04 (64-bit)-based. kernel/KVM is v3.6 + patches (the
ones in nept-v2.tgz).
https://bugzilla.kernel.org/attachment.cgi?id=93101

Without the patches, the L2 guest works. With it, it hangs at boot
time (just black screen):
- EPT was detected by L1 KVM.
- UP L2 didn't help.
- Looks like it's looping at EPT_walk_addr_generic at the same address in L0.

Will take a closer look. It would be helpful if the test configuration
(e.g. kernel/commit id used, L1/L2 guests) was documented as well.

> I will have a use case for nEPT soon - testing purposes. But working
> into the KVM MMU and doing the port myself may unfortunately consume too
> much time here.
>
> Jan
>

--
Jun
Intel Open Source Technology Center


Re: [Bug 53611] New: nVMX: Add nested EPT

2013-02-12 Thread Nakajima, Jun
On Mon, Feb 11, 2013 at 5:27 AM, Nadav Har'El  wrote:
> Hi,
>
> On Mon, Feb 11, 2013, Jan Kiszka wrote about "Re: [Bug 53611] New: nVMX: Add 
> nested EPT":
>> On 2013-02-11 13:49, bugzilla-dae...@bugzilla.kernel.org wrote:
>> > https://bugzilla.kernel.org/show_bug.cgi?id=53611
>> >Summary: nVMX: Add nested EPT
>
> Yikes, I didn't realize that these bugzilla edits all get spammed to the
> entire mailing list :( Sorry about those...
>
>> I suppose they do not apply anymore as well. Do you have a recent tree
>> around somewhere or plan to resume work on it?
>
> Unfortunately, no - I did not have time to work on these patches since
> August.
>
> The reason I'm now stuffing these things into the bug tracker is that
> at the end of this month I am leaving IBM to a new job, so I'm pretty
> sure I won't have time myself to continue any work on nested VMX, and
> would like for the missing nested-VMX features to be documented in case
> someone else comes along and wants to improve it. So unfortunately, you
> should expect more of this bugzilla spam on the mailing list...
>

I looked at your (old) patches, and they seem to be very useful
although some of them require rebasing or rewriting. We are interested
in completing the nested-VMX features.

-- 
Jun
Intel Open Source Technology Center


RE: KVM call agenda for Sept 21

2010-09-20 Thread Nakajima, Jun
Avi Kivity wrote on Mon, 20 Sep 2010 at 09:50:55:

>   On 09/20/2010 06:44 PM, Chris Wright wrote:
>> Please send in any agenda items you are interested in covering.
>> 
>  nested vmx: the resurrection.  Nice to see it progressing again, but
> there's still a lot of ground to cover.  Perhaps we can involve Intel to
> speed things up?
> 
Hi, Avi

What are you looking for?

Jun
___
Intel Open Source Technology Center





RE: KVM performance vs. Xen

2009-04-29 Thread Nakajima, Jun
On 4/29/2009 7:41:50 AM, Andrew Theurer wrote:
> I wanted to share some performance data for KVM and Xen.  I thought it
> would be interesting to share some performance results especially
> compared to Xen, using a more complex situation like heterogeneous
> server consolidation.
>
> The Workload:
> The workload is one that simulates a consolidation of servers on to a
> single host.  There are 3 server types: web, imap, and app (j2ee).  In
> addition, there are other "helper" servers which are also
> consolidated: a db server, which helps out with the app server, and an
> nfs server, which helps out with the web server (a portion of the docroot is 
> nfs mounted).
> There is also one other server that is simply idle.  All 6 servers
> make up one set.  The first 3 server types are sent requests, which in
> turn may send requests to the db and nfs helper servers.  The request
> rate is throttled to produce a fixed amount of work.  In order to
> increase utilization on the host, more sets of these servers are used.
> The clients which send requests also have a response time requirement
> which is monitored.  The following results have passed the response
> time requirements.
>
> The host hardware:
> A 2 socket, 8 core Nehalem with SMT, and EPT enabled, lots of disks,
> 4 x 1 GB Ethernet
>
> The host software:
> Both Xen and KVM use the same host Linux OS, SLES11.  KVM uses the
> 2.6.27.19-5-default kernel and Xen uses the 2.6.27.19-5-xen kernel.  I
> have tried 2.6.29 for KVM, but results are actually worse.  KVM
> modules are rebuilt with kvm-85.  Qemu is also from kvm-85.  Xen
> version is "3.3.1_18546_12-3.1".
>
> The guest software:
> All guests are RedHat 5.3.  The same disk images are used but
> different kernels. Xen uses the RedHat Xen kernel and KVM uses 2.6.29
> with all paravirt build options enabled.  Both use PV I/O drivers.  Software 
> used:
> Apache, PHP, Java, Glassfish, Postgresql, and Dovecot.
>

Just for clarification. So are you using PV (Xen) Linux on Xen, not HVM? Is 
that 32-bit or 64-bit?

 .
Jun Nakajima | Intel Open Source Technology Center


RE: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

2008-11-26 Thread Nakajima, Jun
On 11/26/2008 8:58:59 AM, Greg KH wrote:
> On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
> > This patch integrates the IGB driver with the SR-IOV core. It shows
> > how the SR-IOV API is used to support the capability. Obviously
> > people does not need to put much effort to integrate the PF driver
> > with SR-IOV core. All SR-IOV standard stuff are handled by SR-IOV
> > core and PF driver once it gets the necessary information (i.e.
> > number of Virtual
> > Functions) from the callback function.
> >
> > ---
> >  drivers/net/igb/igb_main.c |   30 ++
> >  1 files changed, 30 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
> > index bc063d4..b8c7dc6 100644
> > --- a/drivers/net/igb/igb_main.c
> > +++ b/drivers/net/igb/igb_main.c
> > @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *,
> > struct e1000_hw *, int, u16);  static int igb_vmm_control(struct
> > igb_adapter *, bool);  static int igb_set_vf_mac(struct net_device
> > *, int, u8*);  static void igb_mbox_handler(struct igb_adapter *);
> > +static int igb_virtual(struct pci_dev *, int);
> >  #endif
> >
> >  static int igb_suspend(struct pci_dev *, pm_message_t); @@ -184,6
> > +185,9 @@ static struct pci_driver igb_driver = {  #endif
> > .shutdown = igb_shutdown,
> > .err_handler = &igb_err_handler,
> > +#ifdef CONFIG_PCI_IOV
> > +   .virtual = igb_virtual
> > +#endif
>
> #ifdef should not be needed, right?
>

Good point. I think this is because the driver is expected to build on older 
kernels also, but the problem is that the driver (and probably others) is 
broken unless the kernel is built with CONFIG_PCI_IOV because of the following 
hunk, for example.

However, we don't want to use #ifdef for the (*virtual) field in the header. 
One option would be to define a constant like the following along with those 
changes.
#define PCI_DEV_IOV

Any better idea?

Thanks,
 .
Jun Nakajima | Intel Open Source Technology Center


@@ -259,6 +266,7 @@ struct pci_dev {
struct list_head msi_list;
 #endif
struct pci_vpd *vpd;
+   struct pci_iov *iov;
 };

 extern struct pci_dev *alloc_pci_dev(void); @@ -426,6 +434,7 @@ struct 
pci_driver {
int  (*resume_early) (struct pci_dev *dev);
int  (*resume) (struct pci_dev *dev);   /* Device woken 
up */
void (*shutdown) (struct pci_dev *dev);
+   int (*virtual) (struct pci_dev *dev, int nr_virtfn);
struct pm_ext_ops *pm;
struct pci_error_handlers *err_handler;
struct device_driverdriver;



RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-06 Thread Nakajima, Jun
On 11/6/2008 2:38:40 PM, Anthony Liguori wrote:
> Matthew Wilcox wrote:
> > [Anna, can you fix your word-wrapping please?  Your lines appear to
> > be infinitely long which is most unpleasant to reply to]
> >
> > On Thu, Nov 06, 2008 at 05:38:16PM +, Fischer, Anna wrote:
> >
> > > > Where would the VF drivers have to be associated?  On the "pci_dev"
> > > > level or on a higher one?
> > > >
> > > A VF appears to the Linux OS as a standard (full, additional) PCI
> > > device. The driver is associated in the same way as for a normal
> > > PCI device. Ideally, you would use SR-IOV devices on a virtualized
> > > system, for example, using Xen. A VF can then be assigned to a
> > > guest domain as a full PCI device.
> > >
> >
> > It's not clear thats the right solution.  If the VF devices are
> > _only_ going to be used by the guest, then arguably, we don't want
> > to create pci_devs for them in the host.  (I think it _is_ the right
> > answer, but I want to make it clear there's multiple opinions on this).
> >
>
> The VFs shouldn't be limited to being used by the guest.
>
> SR-IOV is actually an incredibly painful thing.  You need to have a VF
> driver in the guest, do hardware pass through, have a PV driver stub
> in the guest that's hypervisor specific (a VF is not usable on it's
> own), have a device specific backend in the VMM, and if you want to do
> live migration, have another PV driver in the guest that you can do
> teaming with.  Just a mess.

Actually "a PV driver stub in the guest" _was_ correct; I admit that I stated 
so at a virt mini summit more than half a year ago ;-). But things have 
changed, and such a stub is no longer required (at least in our 
implementation). The major benefit of VF drivers now is that they are 
VMM-agnostic.

>
> What we would rather do in KVM, is have the VFs appear in the host as
> standard network devices.  We would then like to back our existing PV
> driver to this VF directly bypassing the host networking stack.  A key
> feature here is being able to fill the VF's receive queue with guest
> memory instead of host kernel memory so that you can get zero-copy
> receive traffic.  This will perform just as well as doing passthrough
> (at
> least) and avoid all that ugliness of dealing with SR-IOV in the guest.
>
> This eliminates all of the mess of various drivers in the guest and
> all the associated baggage of doing hardware passthrough.
>
> So IMHO, having VFs be usable in the host is absolutely critical
> because I think it's the only reasonable usage model.

As Eddie said, VMDq is better for this model, and the feature is already 
available today. It is much simpler because it was designed for such purposes. 
It does not require hardware pass-through (e.g. VT-d) or VFs as a PCI device, 
either.

>
> Regards,
>
> Anthony Liguori
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in the
> body of a message to [EMAIL PROTECTED] More majordomo info at
> http://vger.kernel.org/majordomo-info.html
 .
Jun Nakajima | Intel Open Source Technology Center


RE: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-07 Thread Nakajima, Jun
On 10/3/2008 5:35:39 PM, H. Peter Anvin wrote:
> Nakajima, Jun wrote:
> >
> > What's the significance of supporting multiple interfaces to the
> > same guest simultaneously, i.e. _runtime_? We don't want the guests
> > to run on such a literal Frankenstein machine. And practically,
> > such testing/debugging would be good only for Halloween :-).
> >
>
> By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU,
> since at very least they export Intel-derived and AMD-derived interfaces.
>  This is in other words, a ridiculous claim.

The big difference here is that you could create a VM at runtime (by combining 
the existing interfaces) that did not exist before (or was not tested before). 
For example, a hypervisor could show hyper-v, osx-v (if any), linux-v, etc., 
and a guest could create a VM with hyper-v MMU, osx-v interrupt handling, 
Linux-v timer, etc. And such combinations/variations can grow exponentially.

Or are you suggesting that multiple interfaces be _available_ to guests at 
runtime but the guest chooses one of them?

> -hpa
>
 .
Jun Nakajima | Intel Open Source Technology Center


RE: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-03 Thread Nakajima, Jun
On 10/3/2008 4:30:29 PM, H. Peter Anvin wrote:
> Nakajima, Jun wrote:
> > What it means is that their hypervisor returns the interface signature
> > (i.e. "Hv#1"), and that defines the interface. If we use "Lv_1", for
> > example, we can define the interface 0x40000002 through 0x400000FF for
> > Linux.
> > Since leaves 0x40000000 and 0x40000001 are separate, we can decouple
> > the hypervisor vendor from the interface it supports.
>
> Right so far.
>
> > This also allows a hypervisor to support multiple interfaces.
>
> Wrong.
>
> This isn't a two-way interface.  It's a one-way interface, and it
> *SHOULD BE*; exposing different information depending on what is
> running is a hack that is utterly tortorous at best.

What I mean is that a hypervisor (with a single vendor id) can support multiple 
interfaces, exposing a single interface to each guest that would expect a 
specific interface at runtime.

>
> >
> > In fact, both Xen and KVM are using the leaf 0x40000001 for
> > different purposes today (Xen: Xen version number, KVM: KVM
> > para-virtualization features). But I don't think this would break
> > their existing binaries mainly because they would need to expose the 
> > interface explicitly now.
> >
> > > > > This further underscores my belief that using 0x40xx for
> > > > > anything "standards-based" at all is utterly futile, and that
> > > > > this space should be treated as vendor identification and the
> > > > > rest as vendor-specific. Any hope of creating a standard
> > > > > that's actually usable needs to be outside this space, e.g. in
> > > > > the 0x40xx space I proposed earlier.
> > > > Actually I'm not sure I'm following your logic. Are you saying
> > > > using that 0x40xx for anything "standards-based" is utterly
> > > > futile because Microsoft said "the range is hypervisor
> > > > vendor-neutral"? Or you were not sure what they meant there. If
> > > > we are not clear, we can ask them.
> > > >
> > > What I'm saying is that Microsoft is effectively squatting on the
> > > 0x40xx space with their definition.  As written, it's not even
> > > clear that it will remain consistent between *their own*
> > > hypervisors, even less anyone else's.
> >
> > I hope the above clarified your concern. You can google-search a
> > more detailed public spec. Let me know if you want to know a specific URL.
> >
>
> No, it hasn't "clarified my concern" in any way.  It's exactly
> *underscoring* it.  In other words, I consider 0x40xx unusable for
> anything that is standards-based.  The interfaces everyone is
> currently using aren't designed to export multiple interfaces; they're
> designed to tell the guest which *one* interface is exported.  That is
> fine, we just need to go elsewhere.
>
> -hpa

What's the significance of supporting multiple interfaces to the same guest 
simultaneously, i.e. at _runtime_? We don't want the guests to run on such a 
literal Frankenstein machine. And practically, such testing/debugging would 
be good only for Halloween :-).

The interface space can be distinct, but the contents are defined and 
implemented independently, thus you might find overlaps, inconsistency, etc. 
among the interfaces. And why is runtime "multiple interfaces" required for a 
standards-based interface?

Jun Nakajima | Intel Open Source Technology Center


RE: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-03 Thread Nakajima, Jun
On 10/1/2008 6:24:26 PM, H. Peter Anvin wrote:
> Nakajima, Jun wrote:
> > >
> > > All I have seen out of Microsoft only covers CPUID levels
> > > 0x4000 as a vendor identification leaf and 0x4001 as a
> > > "hypervisor identification leaf", but you might have access to other 
> > > information.
> >
> > No, it says "Leaf 0x4001 as hypervisor vendor-neutral interface
> > identification, which determines the semantics of leaves from
> > 0x4002 through 0x40FF." The Leaf 0x4000 returns vendor
> > identifier signature (i.e. hypervisor identification) and the
> > hypervisor CPUID leaf range, as in the proposal.
> >
>

Resuming the thread :-)

> In other words, 0x4002+ is vendor-specific space, based on the
> hypervisor specified in 0x4001 (in theory); in practice both
> 0x4000:0x4001 since M$ seem to use clever identifiers as
> "Hypervisor 1".

What it means is that their hypervisor returns the interface signature (i.e. "Hv#1"), 
and that defines the interface. If we use "Lv_1", for example, we can define 
the interface 0x4002 through 0x40FF for Linux. Since leaf 0x4000 
and 0x4001 are separate, we can decouple the hypervisor vendor from the 
interface it supports. This also allows a hypervisor to support multiple 
interfaces.

And whether a guest wants to use the interface without checking the vendor id 
is a different thing. For Linux, we don't want to hardcode the vendor ids in 
the upstream code, at least for such a generic interface.

So I think we need to modify the proposal:

Hypervisor interface identification Leaf:
Leaf 0x4001.

This leaf returns the interface signature that the hypervisor 
implements.
# EAX: "Lv_1" (or something)
# EBX, ECX, EDX: Reserved.

Lv_1 interface Leaves:
Leaf range 0x4002 - 0x40FF.

In fact, both Xen and KVM are using the leaf 0x4001 for different purposes 
today (Xen: Xen version number, KVM: KVM para-virtualization features). But I 
don't think this would break their existing binaries mainly because they would 
need to expose the interface explicitly now.

>
> > > This further underscores my belief that using 0x40xx for
> > > anything "standards-based" at all is utterly futile, and that this
> > > space should be treated as vendor identification and the rest as
> > > vendor-specific. Any hope of creating a standard that's actually
> > > usable needs to be outside this space, e.g. in the 0x40xx
> > > space I proposed earlier.
> >
> > Actually I'm not sure I'm following your logic. Are you saying using
> > that 0x40xx for anything "standards-based" is utterly futile
> > because Microsoft said "the range is hypervisor vendor-neutral"? Or
> > you were not sure what they meant there. If we are not clear, we can
> > ask them.
> >
>
> What I'm saying is that Microsoft is effectively squatting on the
> 0x40xx space with their definition.  As written, it's not even
> clear that it will remain consistent between *their own* hypervisors,
> even less anyone else's.

I hope the above clarified your concern. You can google-search a more detailed 
public spec. Let me know if you want to know a specific URL.

>
> -hpa
>
Jun Nakajima | Intel Open Source Technology Center


RE: [RFC] CPUID usage for interaction between Hypervisors and Linux.

2008-10-01 Thread Nakajima, Jun
On 10/1/2008 3:46:45 PM, H. Peter Anvin wrote:
> Alok Kataria wrote:
> > > No, that's always a terrible idea.  Sure, it's necessary to deal
> > > with some backward-compatibility issues, but we shouldn't even
> > > consider a new interface which assumes this kind of thing.  We
> > > want properly enumerable interfaces.
> >
> > The reason we still have to do this is because, Microsoft has
> > already defined a CPUID format which is way different than what you
> > or I are proposing (with the current case of 256 leaves being
> > available). And I doubt they would change the way they deal with it on 
> > their OS.
> > Any proposal that we go with, we will have to export different CPUID
> > interface from the hypervisor for the 2 OS in question.
> >
> > So I think this is something that we will have to do anyway, and it is
> > not worth bringing up in the discussion.
>
> No, that's a good hint that what "you and I" are proposing is utterly
> broken and exactly underscores what I have been stressing about
> noncompliant hypervisors.
>
> All I have seen out of Microsoft only covers CPUID levels 0x4000
> as a vendor identification leaf and 0x4001 as a "hypervisor
> identification leaf", but you might have access to other information.

No, it says "Leaf 0x4001 as hypervisor vendor-neutral interface 
identification, which determines the semantics of leaves from 0x4002 
through 0x40FF." The Leaf 0x4000 returns vendor identifier signature 
(i.e. hypervisor identification) and the hypervisor CPUID leaf range, as in the 
proposal.

>
> This further underscores my belief that using 0x40xx for anything
> "standards-based" at all is utterly futile, and that this space should
> be treated as vendor identification and the rest as vendor-specific.
> Any hope of creating a standard that's actually usable needs to be
> outside this space, e.g. in the 0x40xx space I proposed earlier.
>

Actually I'm not sure I'm following your logic. Are you saying using that 
0x40xx for anything "standards-based" is utterly futile because Microsoft 
said "the range is hypervisor vendor-neutral"? Or you were not sure what they 
meant there. If we are not clear, we can ask them.


> -hpa
Jun Nakajima | Intel Open Source Technology Center

RE: [PATCH 2/2] VMX: Reinject real mode exception

2008-07-14 Thread Nakajima, Jun
On 7/14/2008 3:04:17 AM, Avi Kivity wrote:
> Nakajima, Jun wrote:
> > On 7/13/2008 8:31:44 AM, Avi Kivity wrote:
> >
> > > Avi Kivity wrote:
> > >
> > > > Well, xen and bochs do not push an error code for real mode #GP.
> > > > I tried running the attached test program but it doesn't work on
> > > > real hardware (it does work on bochs).
> > > >
> > > Jun, perhaps you can clarify? do #GP exceptions in real-mode push
> > > an error code?
> > >
> >
> > Avi,
> >
> > Exceptions in real-mode do not push an error code in the stack.
>
> Thanks.  You might consider updating the documentation, for example
> #DF states that an error code of 0 is always pushed.
>

If you are looking at the description of Chapter 5, at the top it says "This 
chapter describes the interrupt and exception-handling mechanism when operating 
in protected mode on an Intel 64 or IA-32 processor. ... Chapter 15, "8086 
Emulation," describes information specific to interrupt and exception 
mechanisms in real-address and virtual-8086 mode."
Jun Nakajima | Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/2] VMX: Reinject real mode exception

2008-07-13 Thread Nakajima, Jun
On 7/13/2008 8:31:44 AM, Avi Kivity wrote:
> Avi Kivity wrote:
> >
> > Well, xen and bochs do not push an error code for real mode #GP.  I
> > tried running the attached test program but it doesn't work on real
> > hardware (it does work on bochs).
> >
>
> Jun, perhaps you can clarify? do #GP exceptions in real-mode push an
> error code?

Avi,

Exceptions in real-mode do not push an error code in the stack. In vm86 mode 
#GP exceptions push an error code, triggering a protected-mode handler in the 
monitor, as you know. Is it possible that the guest is actually using vm86 mode?

>
> --
> error compiling committee.c: too many arguments to function
>
Jun Nakajima | Intel Open Source Technology Center


RE: [patch 0/7] force the TSC unreliable by reporting C2 state

2008-06-18 Thread Nakajima, Jun
On 6/18/2008 3:57:16 PM, john stultz wrote:
>
> On Wed, 2008-06-18 at 19:41 -0300, Marcelo Tosatti wrote:
> > On Wed, Jun 18, 2008 at 04:42:40PM -0500, Anthony Liguori wrote:
> > > Marcelo Tosatti wrote:
> > > > On Wed, Jun 18, 2008 at 04:02:39PM -0500, Anthony Liguori wrote:
> > > >
> > > > > > > Have we yet determined why the TSC is so unstable in the first
> > > > > > > place?   In theory, it should be relatively stable on
> > > > > > > single-node Intel and  Barcelona chips.
> > > > > > >
> > > > > > If the host enters C2/C3, or changes CPU frequency, it
> > > > > > becomes unreliable as a clocksource and there's no guarantee
> > > > > > the guest will detect that.
> > > > > >
> > > > > On Intel, the TSC should be fixed-frequency for basically all
> > > > > shipping  processors supporting VT.  Starting with 10h
> > > > > (Barcelona), I believe AMD  also has a fixed frequency TSC.
> > > > >
> > > >
> > > > But still stops ticking in C2/C3 state, I suppose?
> > > >
> > >
> > > I don't know for sure but the TSC is not tied to the CPU clock so
> > > I would be surprised if it did.  I think that that would defeat
> > > the utility of a fixed-frequency TSC.
> >
> > Well, Linux assumes that TSC stops ticking on C2/C3.
> >
> > Section 18.10 of Intel says:
> >
> > "The specific processor configuration determines the behavior.
> > Constant TSC behavior ensures that the duration of each clock tick
> > is uniform and supports the use of the TSC as a wall clock timer
> > even if the processor core changes frequency. This is the
> > architectural behavior moving forward."
> >
> > However it does not mention C2/C3.
> >
> > Could someone confirm either way?
>
> My understanding: On most systems, the TSC halts in C3. C2 may also
> halt the TSC, but that seems to depend on the BIOS.

TSC stops counting in the H/W C3. The C-states reported by BIOS may not 
necessarily be mapped to H/W C-states; C2 used by BIOS may be C3 for H/W.

>
> thanks
> -john
>
Jun Nakajima | Intel Open Source Technology Center