Re: [intel-sgx-kernel-dev] [PATCH v5 06/11] intel_sgx: driver for Intel Software Guard Extensions

2017-11-15 Thread Sean Christopherson
On Tue, 2017-11-14 at 22:28 +0200, Jarkko Sakkinen wrote:
> On Tue, Nov 14, 2017 at 09:55:06AM -0800, Sean Christopherson wrote:
> > 
> > What do you mean by bottlenecks?  Assuming you're referring to performance
> > bottlenecks, this statement is flat out false.  Moving the launch enclave
> > into
> > the kernel introduces performance bottlenecks, e.g. as implemented, a single
> > LE
> > services all EINIT requests and is protected by a mutex.  That is the very
> > definition of a bottleneck.
> I guess the text does not do a good job describing what I meant. Maybe I
> should refine it? Your argument about mutex is correct.
> 
> The use of "bottleneck" does not specifically refer to performance. I'm
> worried about splitting the tasks needed to launch an enclave between
> kernel and user space. It could become difficult to manage when more
> SGX features are added. That is what I was referring when I used the
> word "bottleneck".
> 
> I suppose you think I should refine the commit message?
> 
> About the perf bottleneck. Given that all the data is already in
> sgx_le_ctx the driver could for example have own LE process for every
> opened /dev/sgx. Is your comment also suggesting to refine this or
> could it be postponed?

More that I don't understand why the driver doesn't allow userspace to provide
an EINIT token, and reciprocally, doesn't provide the token back to userspace. 
IMO, the act of generating an EINIT token is orthogonal to deciding whether or
not to run the enclave.  Running code in a kernel-owned enclave is not specific
to SGX, e.g. paranoid kernels could run other sensitive tasks in an enclave.
Being forced to run an enclave to generate an EINIT token is an unfortunate
speed bump that exists purely because hardware doesn't provide the option to
disable launch control entirely.

In other words, accepting a token via the IOCTL doesn't mean the driver has to
use it, e.g. it can always ignore the token, enforce periodic reverification,
check that the token was created by the driver, etc...  And using the token
doesn't preclude the driver from re-running its verification checks outside of
the launch enclave.
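
Roughly what I have in mind, as a purely illustrative sketch (the struct and
field names here are made up, not a proposed ABI): the init ioctl takes an
optional token pointer that doubles as an output, and the kernel decides how
much to trust it.

#include <linux/types.h>

struct sgx_enclave_init_sketch {
	__u64 addr;		/* enclave base address */
	__u64 sigstruct;	/* userspace pointer to the SIGSTRUCT */
	__u64 einittoken;	/* in/out: userspace pointer to an EINITTOKEN;
				 * zero means "kernel, generate one" */
};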


> The driver architecture already allows to scale this but it is not
> nearly as bad issue as the one Dave pointed out.
> 
> /Jarkko
> ___
> intel-sgx-kernel-dev mailing list
> intel-sgx-kernel-...@lists.01.org
> https://lists.01.org/mailman/listinfo/intel-sgx-kernel-dev


Re: [PATCH v5 11/11] intel_sgx: driver documentation

2017-11-27 Thread Sean Christopherson
On Tue, 2017-11-21 at 01:08 +0200, Jarkko Sakkinen wrote:
> On Sat, Nov 18, 2017 at 12:34:33AM +0100, Thomas Gleixner wrote:
> > 
> > This is architecural. From the cursory read of that series it seems there
> > are two parts to it:
> > 
> >   1) The actual core handling, which should be in arch/x86 because that
> >  hardly qualifies as a 'platform' device driver.
> > 
> >   2) The user space interface, which can be separated out perhaps.
> > 
> > I don't know how intertwingled they are, but that's hard to tell from the
> > actual patches w/o doing a deep inspection. Jarkko should be able to answer
> > that.
> > 
> > Thanks,
> > 
> > tglx
> Darren, tglx,
> 
> You can leave user space device as separate module as sgx_ioctl.c merely
> calls stuff that I have inside sgx_encl.c. VMA creation is bound to file
> operations.
> 
> My questions would be:
> 
> 1. What is your recommendation on the deployment under arch/x86?
> 2. Which parts should be compilable as a LKM? Only the user interface
>    or both parts?
> 
> /Jarkko

To enable KVM and a cgroup for EPC accounting, at a minimum arch/x86 needs to
manage the EPC pages (alloc/free/lrus/reclaim/etc...) and LE hash MSRs.  IMO,
ideally everything else would be left in the device driver, e.g. anything
involving ENCLS.  Keeping the majority of the driver out of arch/x86 minimizes
the footprint in arch/x86 and thereby the size of KVM's dependency required to
virtualize SGX, and allows the various SGX pieces, e.g. arch, driver and KVM, to
evolve more independently.

Preferably the arch/x86 code would not be a loadable module, e.g. to simplify
KVM support.

I have a branch based on Jarkko's patches (I believe it's up-to-date with v5)
that implements what I described.  I'd be happy to send RFC patches if that
would help.


Branches for those interested:

https://github.com/sean-jc/linux.git sgx/arch   - move core EPC to arch/x86
https://github.com/sean-jc/linux.git sgx/kvm    - KVM support for SGX
https://github.com/sean-jc/linux.git sgx/lc     - KVM support for Launch Control
https://github.com/sean-jc/linux.git sgx/cgroup - EPC cgroup


branch relationships:

    Jarkko's patches
            |
            |
         sgx/arch
         /      \
    sgx/kvm   sgx/cgroup
       /
   sgx/lc


Re: [PATCH v5 11/11] intel_sgx: driver documentation

2017-11-27 Thread Sean Christopherson
+ Cc: KVM, Paolo and Radim

On Mon, 2017-11-27 at 09:03 -0800, Sean Christopherson wrote:
> On Tue, 2017-11-21 at 01:08 +0200, Jarkko Sakkinen wrote:
> > 
> > On Sat, Nov 18, 2017 at 12:34:33AM +0100, Thomas Gleixner wrote:
> > > 
> > > 
> > > This is architecural. From the cursory read of that series it seems there
> > > are two parts to it:
> > > 
> > >   1) The actual core handling, which should be in arch/x86 because that
> > >  hardly qualifies as a 'platform' device driver.
> > > 
> > >   2) The user space interface, which can be separated out perhaps.
> > > 
> > > I don't know how intertwingled they are, but that's hard to tell from the
> > > actual patches w/o doing a deep inspection. Jarkko should be able to
> > > answer
> > > that.
> > > 
> > > Thanks,
> > > 
> > >   tglx
> > Darren, tglx,
> > 
> > You can leave user space device as separate module as sgx_ioctl.c merely
> > calls stuff that I have inside sgx_encl.c. VMA creation is bound to file
> > operations.
> > 
> > My questions would be:
> > 
> > 1. What is your recommendation on the deployment under arch/x86?
> > 2. Which parts should be compilable as a LKM? Only the user interface
> >    or both parts?
> > 
> > /Jarkko
> To enable KVM and a cgroup for EPC accounting, at a minimum arch/x86 needs to
> manage the EPC pages (alloc/free/lrus/reclaim/etc...) and LE hash MSRs.  IMO,
> ideally everything else would be left in the device driver, e.g. anything
> involving ENCLS.  Keeping the majority of the driver out of arch/x86 minimizes
> the footprint in arch/x86 and thereby the size of KVM's dependency required to
> virtualize SGX, and allows the various SGX pieces, e.g. arch, driver and KVM,
> to evolve more independently.
> 
> Preferably the arch/x86 code would not be a loadable module, e.g. to simplify
> KVM support.
> 
> I have a branch based on Jarkko's patches (I believe it's up-to-date with v5)
> that implements what I described.  I'd be happy to send RFC patches if that
> would help.
> 
> 
> Branches for those interested:
> 
> https://github.com/sean-jc/linux.git sgx/arch   - move core EPC to arch/x86
> https://github.com/sean-jc/linux.git sgx/kvm    - KVM support for SGX
> https://github.com/sean-jc/linux.git sgx/lc     - KVM support for Launch
> Control
> https://github.com/sean-jc/linux.git sgx/cgroup - EPC cgroup
> 
> 
> branch relationships:
> 
>     Jarkko's patches
>             |
>             |
>          sgx/arch
>          /      \
>     sgx/kvm   sgx/cgroup
>        /
>    sgx/lc



Re: [PATCH v6 03/11] x86: define IA32_FEATURE_CONTROL.SGX_ENABLE

2017-11-28 Thread Sean Christopherson
On Sat, 2017-11-25 at 21:29 +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopher...@intel.com>
> 
> When IA32_FEATURE_CONTROL.SGX_ENABLE and IA32_FEATURE_CONTROL.LOCK are
> set by the pre-boot firmware, SGX is usable by the OS.

This implies that only pre-boot firmware can write feature control, which is not
true.  What about:

    SGX instructions (ENCLS and ENCLU) are usable if and only if SGX_ENABLE is
    set in the IA32_FEATURE_CONTROL MSR and said MSR is locked.

> Signed-off-by: Sean Christopherson <sean.j.christopher...@intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakki...@linux.intel.com>
> ---
>  arch/x86/include/asm/msr-index.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-
> index.h
> index 17f5c12e1afd..b35cb98b5d60 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -435,6 +435,7 @@
>  #define FEATURE_CONTROL_LOCKED   (1<<0)
>  #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX (1<<1)
>  #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX (1<<2)
> +#define FEATURE_CONTROL_SGX_ENABLE  (1<<18)
>  #define FEATURE_CONTROL_LMCE (1<<20)
>  
>  #define MSR_IA32_APICBASE 0x001b


Re: [PATCH v6 04/11] x86: define IA32_FEATUE_CONTROL.SGX_LC

2017-11-28 Thread Sean Christopherson
On Tue, 2017-11-28 at 09:16 -0800, Sean Christopherson wrote:
> On Sat, 2017-11-25 at 21:29 +0200, Jarkko Sakkinen wrote:
> > 
> > When IA32_FEATURE_CONTROL.SGX_LC identifies that the root key for
> > enclave signatures can be configured either by the OS or pre-boot
> > firmware.
> > 
> > If this the case, IA32_SGXLEPUBKEYHASHn MSRs (0 < n < 4) can be used
> > to
> > set the SHA256 of the root key. IA32_FEATURE_CONTROL bit 17 controls
> > whether the MSRs are writable by the OS. The pre-boot firmware can
> > decided whether to  set this bit before setting
> > IA32_FEATURE_CONTROL.LOCK.
> The commit message (feature control bit) doesn't match the patch (CPUID
> bit).

Also, assuming this message is destined for the commit that adds SGX_LC
to feature control, I think it should first and foremost describe the
hardware behavior.  The firmware vs OS interaction and use cases are
valuable to document but IMO should come after the hardware description.

And though it's not documented in the SDM, I think it's worthwhile to
describe the SGX activation sequence and its relationship with the SGX
MSRs, e.g. the LE hash MSRs are writable prior to SGX activation.
Without that information, it's unclear as to how the LE hash MSRs could
be different than Intel's reset value.

So, maybe something like this?

    After SGX is activated[1] the IA32_SGXLEPUBKEYHASHn MSRs are writable
    if and only if SGX_LC is set in the IA32_FEATURE_CONTROL MSR and the
    IA32_FEATURE_CONTROL MSR is locked, otherwise they are read-only.

    For example, firmware can allow the OS to change the launch enclave
    root key by setting IA32_FEATURE_CONTROL.SGX_LC, and thus give the
    OS complete control over the enclaves it runs.  Alternatively,
    firmware can clear IA32_FEATURE_CONTROL.SGX_LC to lock down the root
    key and restrict the OS to running enclaves signed with the root key
    or whitelisted/trusted by a launch enclave (which must be signed with
    the root key).

    [1] SGX related bits in IA32_FEATURE_CONTROL cannot be set until SGX
        is activated, e.g. by firmware.  SGX activation is triggered by
        setting bit 0 in MSR 0x7a.  Until SGX is activated, the LE hash
        MSRs are writable, e.g. to allow firmware to lock down the LE
        root key with a non-Intel value.
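
As a hedged sketch of the post-activation rule (reusing the hash MSR and
launch control defines added later in this series; the helper name is
illustrative, not actual driver code):

static int sgx_write_le_pubkey_hash(const u64 hash[4])
{
	u64 fc;
	int i;

	rdmsrl(MSR_IA32_FEATURE_CONTROL, fc);

	/* After activation the hash MSRs are writable iff SGX_LC is set
	 * and IA32_FEATURE_CONTROL is locked. */
	if (!(fc & FEATURE_CONTROL_LOCKED) ||
	    !(fc & FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE))
		return -EPERM;

	for (i = 0; i < 4; i++)
		wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, hash[i]);

	return 0;
}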

> > 
> > 
> > Signed-off-by: Jarkko Sakkinen <jarkko.sakki...@linux.intel.com>
> > ---
> >  arch/x86/include/asm/cpufeatures.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/x86/include/asm/cpufeatures.h
> > b/arch/x86/include/asm/cpufeatures.h
> > index 31a7d1c0f204..43130f3c18a1 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -298,6 +298,7 @@
> >  #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors
> > of DW/QW */
> >  #define X86_FEATURE_LA57   (16*32+16) /* 5-level page tables */
> >  #define X86_FEATURE_RDPID  (16*32+22) /* RDPID instruction */
> > +#define X86_FEATURE_SGX_LC (16*32+30) /* supports SGX launch
> > configuration */
> >  
> >  /* AMD-defined CPU features, CPUID level 0x8007 (ebx), word 17 */
> >  #define X86_FEATURE_OVERFLOW_RECOV (17*32+0) /* MCA overflow recovery
> > support */


Re: [PATCH v6 04/11] x86: define IA32_FEATUE_CONTROL.SGX_LC

2017-11-28 Thread Sean Christopherson
On Sat, 2017-11-25 at 21:29 +0200, Jarkko Sakkinen wrote:
> When IA32_FEATURE_CONTROL.SGX_LC identifies that the root key for
> enclave signatures can be configured either by the OS or pre-boot
> firmware.
> 
> If this the case, IA32_SGXLEPUBKEYHASHn MSRs (0 < n < 4) can be used
> to
> set the SHA256 of the root key. IA32_FEATURE_CONTROL bit 17 controls
> whether the MSRs are writable by the OS. The pre-boot firmware can
> decided whether to  set this bit before setting
> IA32_FEATURE_CONTROL.LOCK.

The commit message (feature control bit) doesn't match the patch (CPUID
bit).

> 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  arch/x86/include/asm/cpufeatures.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h
> b/arch/x86/include/asm/cpufeatures.h
> index 31a7d1c0f204..43130f3c18a1 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -298,6 +298,7 @@
>  #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors
> of DW/QW */
>  #define X86_FEATURE_LA57 (16*32+16) /* 5-level page tables */
>  #define X86_FEATURE_RDPID(16*32+22) /* RDPID instruction */
> +#define X86_FEATURE_SGX_LC   (16*32+30) /* supports SGX launch
> configuration */
>  
>  /* AMD-defined CPU features, CPUID level 0x8007 (ebx), word 17 */
>  #define X86_FEATURE_OVERFLOW_RECOV (17*32+0) /* MCA overflow recovery
> support */


Re: [PATCH v6 05/11] x86: add SGX MSRs to msr-index.h

2017-11-28 Thread Sean Christopherson
On Sat, 2017-11-25 at 21:29 +0200, Jarkko Sakkinen wrote:
> From: Haim Cohen 
> 
> These MSRs hold the SHA256 checksum of the currently configured root
> key for enclave signatures.

The commit message doesn't talk about the launch control bit in the
feature control MSR.

> 
> Signed-off-by: Haim Cohen 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  arch/x86/include/asm/msr-index.h | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr-index.h
> b/arch/x86/include/asm/msr-index.h
> index b35cb98b5d60..22e27d46d046 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -436,6 +436,7 @@
>  #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX (1<<1)
>  #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX (1<<2)
>  #define FEATURE_CONTROL_SGX_ENABLE  (1<<18)
> +#define FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE (1<<17)
>  #define FEATURE_CONTROL_LMCE (1<<20)
>  
>  #define MSR_IA32_APICBASE 0x001b
> @@ -502,6 +503,12 @@
>  #define PACKAGE_THERM_INT_LOW_ENABLE (1 << 1)
>  #define PACKAGE_THERM_INT_PLN_ENABLE (1 << 24)
>  
> +/* Intel SGX MSRs */
> +#define MSR_IA32_SGXLEPUBKEYHASH0 0x008C
> +#define MSR_IA32_SGXLEPUBKEYHASH1 0x008D
> +#define MSR_IA32_SGXLEPUBKEYHASH2 0x008E
> +#define MSR_IA32_SGXLEPUBKEYHASH3 0x008F
> +
>  /* Thermal Thresholds Support */
>  #define THERM_INT_THRESHOLD0_ENABLE(1 << 15)
>  #define THERM_SHIFT_THRESHOLD08


Re: [PATCH v6 06/11] intel_sgx: driver for Intel Software Guard Extensions

2017-11-28 Thread Sean Christopherson
> diff --git a/arch/x86/include/asm/sgx_arch.h b/arch/x86/include/asm/sgx_arch.h
> new file mode 100644
> index ..515676031006
> --- /dev/null
> +++ b/arch/x86/include/asm/sgx_arch.h
> @@ -0,0 +1,268 @@

[...]

> +
> +#ifndef _ASM_X86_SGX_ARCH_H
> +#define _ASM_X86_SGX_ARCH_H

Still missing #include .

> +
> +#define SGX_SSA_GPRS_SIZE  182
> +#define SGX_SSA_MISC_EXINFO_SIZE   16


Re: [PATCH v6 04/11] x86: define IA32_FEATUE_CONTROL.SGX_LC

2017-11-28 Thread Sean Christopherson
On Tue, 2017-11-28 at 23:24 +0200, Jarkko Sakkinen wrote:
> On Tue, Nov 28, 2017 at 10:53:24PM +0200, Jarkko Sakkinen wrote:
> > 
> > > 
> > > So, maybe something like this?
> > > 
> > >     After SGX is activated[1] the IA32_SGXLEPUBKEYHASHn MSRs are writable
> > >     if and only if SGX_LC is set in the IA32_FEATURE_CONTROL MSR and the
> > >     IA32_FEATURE_CONTROL MSR is locked, otherwise they are read-only.
> > > 
> > >     For example, firmware can allow the OS to change the launch enclave
> > >     root key by setting IA32_FEATURE_CONTROL.SGX_LC, and thus give the
> > >     OS complete control over the enclaves it runs.  Alternatively,
> > >     firmware can clear IA32_FEATURE_CONTROL.SGX_LC to lock down the root
> > >     key and restrict the OS to running enclaves signed with the root key
> > >     or whitelisted/trusted by a launch enclave (which must be signed with
> > >     the root key).
> > > 
> > >     [1] SGX related bits in IA32_FEATURE_CONTROL cannot be set until SGX
> > >         is activated, e.g. by firmware.  SGX activation is triggered by
> > >         setting bit 0 in MSR 0x7a.  Until SGX is activated, the LE hash
> > >         MSRs are writable, e.g. to allow firmware to lock down the LE
> > >         root key with a non-Intel value.
> > Thanks I'll use this as a basis and move most of the crappy commit
> > message to the commit (with some editing) that defines the MSRs.
> Not sure after all if I'm following this.
> 
> IA32_FEATURE_CONTROL[17] contols whether the MSRs are writable or not
> after the feature control MSR is locked. SGX_LC means just that the
> CPU supports the launch configuration.
> 
> /Jarkko

My comments were referring to improving the commit message for defining
IA32_FEATURE_CONTROL.SGX_LC, i.e. bit 17, not the CPUID bit.


Re: [PATCH v6 04/11] x86: define IA32_FEATUE_CONTROL.SGX_LC

2017-11-28 Thread Sean Christopherson
On Tue, 2017-11-28 at 23:55 +0200, Jarkko Sakkinen wrote:
> On Tue, Nov 28, 2017 at 01:33:14PM -0800, Sean Christopherson wrote:
> > 
> > On Tue, 2017-11-28 at 23:24 +0200, Jarkko Sakkinen wrote:
> > > 
> > > On Tue, Nov 28, 2017 at 10:53:24PM +0200, Jarkko Sakkinen wrote:
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > So, maybe something like this?
> > > > > 
> > > > >     After SGX is activated[1] the IA32_SGXLEPUBKEYHASHn MSRs are 
> > > > > writable
> > > > >     if and only if SGX_LC is set in the IA32_FEATURE_CONTROL MSR and 
> > > > > the
> > > > >     IA32_FEATURE_CONTROL MSR is locked, otherwise they are read-only.
> > > > > 
> > > > >     For example, firmware can allow the OS to change the launch 
> > > > > enclave
> > > > >     root key by setting IA32_FEATURE_CONTROL.SGX_LC, and thus give the
> > > > >     OS complete control over the enclaves it runs.  Alternatively,
> > > > >     firmware can clear IA32_FEATURE_CONTROL.SGX_LC to lock down the 
> > > > > root
> > > > >     key and restrict the OS to running enclaves signed with the root 
> > > > > key
> > > > >     or whitelisted/trusted by a launch enclave (which must be signed 
> > > > > with
> > > > >     the root key).
> > > > > 
> > > > >     [1] SGX related bits in IA32_FEATURE_CONTROL cannot be set until 
> > > > > SGX
> > > > >         is activated, e.g. by firmware.  SGX activation is triggered 
> > > > > by
> > > > >         setting bit 0 in MSR 0x7a.  Until SGX is activated, the LE 
> > > > > hash
> > > > >         MSRs are writable, e.g. to allow firmware to lock down the LE
> > > > >         root key with a non-Intel value.
> > > > Thanks I'll use this as a basis and move most of the crappy commit
> > > > message to the commit (with some editing) that defines the MSRs.
> > > Not sure after all if I'm following this.
> > > 
> > > IA32_FEATURE_CONTROL[17] contols whether the MSRs are writable or not
> > > after the feature control MSR is locked. SGX_LC means just that the
> > > CPU supports the launch configuration.
> > > 
> > > /Jarkko
> > My comments were referring to improving the commit message for defining
> > IA32_FEATURE_CONTROL.SGX_LC, i.e. bit 17, not the CPUID bit.
> My bad but SGX_LC is referring here to the CPUID bit.
> 
> In SGX chapters there is no specific name for IA32_FEATURE_CONTROL[17].
> I would call it something else than SGX_LC. Maybe SGX_LC_WRITABLE.
> 
> /Jarkko

What about SGX_LC_ENABLE?  The title in the MSR section of the SDM is
"SGX Launch Control Enable", and it's more consistent with the other
bits defined in feature control.  I'd also prefer that name for the
actual #define too, SGX_LAUNCH_CONTROL_ENABLE is overly verbose IMO.


Re: [PATCH v6 04/11] x86: define IA32_FEATUE_CONTROL.SGX_LC

2017-11-28 Thread Sean Christopherson
On Tue, 2017-11-28 at 23:40 +0200, Jarkko Sakkinen wrote:
> On Tue, Nov 28, 2017 at 11:24:07PM +0200, Jarkko Sakkinen wrote:
> > 
> > On Tue, Nov 28, 2017 at 10:53:24PM +0200, Jarkko Sakkinen wrote:
> > > 
> > > > 
> > > > So, maybe something like this?
> > > > 
> > > >     After SGX is activated[1] the IA32_SGXLEPUBKEYHASHn MSRs are 
> > > > writable
> > > >     if and only if SGX_LC is set in the IA32_FEATURE_CONTROL MSR and the
> > > >     IA32_FEATURE_CONTROL MSR is locked, otherwise they are read-only.
> > > > 
> > > >     For example, firmware can allow the OS to change the launch enclave
> > > >     root key by setting IA32_FEATURE_CONTROL.SGX_LC, and thus give the
> > > >     OS complete control over the enclaves it runs.  Alternatively,
> > > >     firmware can clear IA32_FEATURE_CONTROL.SGX_LC to lock down the root
> > > >     key and restrict the OS to running enclaves signed with the root key
> > > >     or whitelisted/trusted by a launch enclave (which must be signed 
> > > > with
> > > >     the root key).
> > > > 
> > > >     [1] SGX related bits in IA32_FEATURE_CONTROL cannot be set until SGX
> > > >         is activated, e.g. by firmware.  SGX activation is triggered by
> > > >         setting bit 0 in MSR 0x7a.  Until SGX is activated, the LE hash
> > > >         MSRs are writable, e.g. to allow firmware to lock down the LE
> > > >         root key with a non-Intel value.
> > > Thanks I'll use this as a basis and move most of the crappy commit
> > > message to the commit (with some editing) that defines the MSRs.
> > Not sure after all if I'm following this.
> > 
> > IA32_FEATURE_CONTROL[17] contols whether the MSRs are writable or not
> > after the feature control MSR is locked. SGX_LC means just that the
> > CPU supports the launch configuration.
> > 
> > /Jarkko
> I used this commit message with some minor editing in the commit that
> defines the MSRs and squashed commits that define cpuid level 7 bits.
> Can you peer check the commit messages? They are in the le branch.
> 
> /Jarkko

The commit defines FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE in addition
to the LE hash MSRs, which is why my suggestion referred to "SGX_LC" and
not simply bit 17.  I used "SGX_LC" instead of the full name because
that's what your original commit message used (though it was attached
to the CPUID patch, thus all the confusion).

Anyways, I think the commit should have a blurb about defining bit 17,
and then refer to SGX_LAUNCH_CONTROL_ENABLE (or some variation) rather
than bit 17 when talking about its effects on SGX.


Re: [PATCH v6 06/11] intel_sgx: driver for Intel Software Guard Extensions

2017-11-30 Thread Sean Christopherson
On Sat, Nov 25, 2017 at 09:29:24PM +0200, Jarkko Sakkinen wrote:
> +static void *sgx_try_alloc_page(void)
> +{
> +   struct sgx_epc_bank *bank;
> +   void *page = NULL;
> +   int i;
> +
> +   for (i = 0; i < sgx_nr_epc_banks; i++) {
> +   bank = &sgx_epc_banks[i];
> +
> +   down_write(&bank->lock);

Is a R/W semaphore actually preferable to a spinlock?  Concurrent
free calls don't seem that interesting/beneficial because freeing
an enclave's pages isn't multiplexed across multiple CPUs, unlike
the allocation of EPC pages.

As a whole, I'm not a fan of packing the EPC page pointers into an
array rather than encapsulating them in a struct+list.  The primary
benefit I see for the array approach is that it saves ~8 bytes per
free EPC page, but at a cost of increased memory usage for in-use
pages and severely restricting the ability to enhance/modify how
EPC pages are tracked, reclaimed, etc...

The main issue is that the array approach relies on the caller to
handle reclaim.  This effectively makes it impossible to reclaim
pages from multiple processes, requires other consumers e.g. KVM
to implement their own reclaim logic and kthread, and prevents
cgroup accounting because the cgroup can't initiate reclaim.

> +
> +   if (atomic_read(&bank->free_cnt))
> +   page = bank->pages[atomic_dec_return(&bank->free_cnt)];
> +
> +   up_write(>lock);
> +
> +   if (page)
> +   break;
> +   }
> +
> +   if (page)
> +   atomic_dec(&sgx_nr_free_pages);
> +
> +   return page;
> +}
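
For comparison, a minimal sketch of the struct+list approach (all names are
illustrative, not the actual driver's): each EPC page gets a small descriptor
that can sit on a per-bank free list or a global active/LRU list, which is
what lets a central reclaimer or a cgroup initiate reclaim instead of relying
on the caller.

#include <linux/list.h>
#include <linux/spinlock.h>

struct sgx_epc_page {
	unsigned long pa;		/* physical address of the EPC page */
	void *owner;			/* opaque back-pointer for reclaim */
	struct list_head list;		/* free list or active (LRU) list */
};

static LIST_HEAD(sgx_free_page_list);
static DEFINE_SPINLOCK(sgx_free_list_lock);

static struct sgx_epc_page *sgx_alloc_epc_page(void *owner)
{
	struct sgx_epc_page *page = NULL;

	spin_lock(&sgx_free_list_lock);
	if (!list_empty(&sgx_free_page_list)) {
		page = list_first_entry(&sgx_free_page_list,
					struct sgx_epc_page, list);
		list_del_init(&page->list);
		page->owner = owner;
	}
	spin_unlock(&sgx_free_list_lock);

	return page;
}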
 


Re: [intel-sgx-kernel-dev] [PATCH v7 4/8] intel_sgx: driver for Intel Software Guard Extensions

2017-12-12 Thread Sean Christopherson
On Thu, 2017-12-07 at 18:05 +0200, Jarkko Sakkinen wrote:
> On Thu, Dec 07, 2017 at 02:46:39PM +, Christopherson, Sean J wrote:
> > 
> > > 
> > > + for (i = 0; i < 2; i++) {
> > > + va_page = list_first_entry(&encl->va_pages,
> > > +    struct sgx_va_page, list);
> > > + va_offset = sgx_alloc_va_slot(va_page);
> > > + if (va_offset < PAGE_SIZE)
> > > + break;
> > > +
> > > + list_move_tail(&va_page->list, &encl->va_pages);
> > > + }
> > This is broken, there is no guarantee that the next VA page will have
> > a free slot.  You have to walk over all VA pages to guarantee a slot
> > is found, e.g. this caused EWB and ELDU errors.
> I did run some extensive stress tests on this and did not experience any
> issues. Full VA pages are always put to the end. Please point me to the
> test where this breaks so that I can fix the issue if it persists.
> 
> > 
> > Querying list.next to determine if an encl_page is resident in the EPC
> > is ugly and unintuitive, and depending on list's internal state seems
> > dangerous.  Why not use a flag in the encl_page, e.g. as in the patch
> > I submitted almost 8 months ago for combining epc_page and va_page into
> > a union?  And, the encl's SGX_ENCL_SECS_EVICTED flag can be dropped if
> > a flag is added to indicate whether or not any encl_page is resident in
> > the EPC.
> > 
> > https://lists.01.org/pipermail/intel-sgx-kernel-dev/2017-April/000570.html
> I think it is better to just zero list entry and do list_empty test. You
> correct that checking that with poison is ugly.

Except this whole approach breaks if you do list_del_init instead of
list_del.  Inferring the residency of a page based on whether or not
it's on a list AND how the page was removed from said list is fragile.
And, the lack of an explicit flag makes it quite painful to debug any
issues, e.g. it's difficult to identify the call site of list_del.

Case in point, I spent the better part of a day debugging a #PF BUG
in sgx_eldu because it tried to directly deference an EPC page.  The
list check in sgx_fault_page failed to detect an already-faulted page
because sgx_isolate_pages calls list_del and releases the enclave's
mutex long before the page is actually evicted.


[  656.093093] BUG: unable to handle kernel paging request at 000480f23000
[  656.095157] IP: sgx_eldu+0xc1/0x3c0 [intel_sgx]
[  656.095760] PGD 469f6a067 P4D 469f6a067 PUD 0 
[  656.096371] Oops:  [#1] SMP
[  656.096818] Modules linked in: intel_sgx scsi_transport_iscsi bridge stp llc
[  656.097747] CPU: 3 PID: 5362 Comm: lsdt Not tainted 4.14.0+ #5
[  656.098514] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
02/06/2015
[  656.099472] task: a0af5c1b9d80 task.stack: acd9473e
[  656.100233] RIP: 0010:sgx_eldu+0xc1/0x3c0 [intel_sgx]
[  656.100843] RSP: :acd9473e3c40 EFLAGS: 00010286
[  656.101491] RAX: 000480f23000 RBX: acd94a29d000 RCX: 
[  656.102369] RDX:  RSI: a0af54424b90 RDI: 000485224000
[  656.103225] RBP: acd9473e3cf0 R08: ef4f5180c59c R09: a0af54424b68
[  656.104102] R10: acd9473e3ab8 R11: 0040 R12: ef4f513e7980
[  656.104970] R13: a0af693fe5e0 R14: ef4f5180c580 R15: a0af6c885a00
[  656.105851] FS:  7f42ea7fc700() GS:a0af7fcc() 
knlGS:
[  656.106767] CS:  0010 DS:  ES:  CR0: 80050033
[  656.107470] CR2: 000480f23000 CR3: 000467fc6004 CR4: 003606e0
[  656.108244] DR0:  DR1:  DR2: 
[  656.109060] DR3:  DR6: fffe0ff0 DR7: 0400
[  656.109880] Call Trace:
[  656.110224]  ? __wake_up_common_lock+0x8e/0xc0
[  656.110740]  sgx_fault_page+0x1d5/0x390 [intel_sgx]
[  656.111319]  ? sgx_fault_page+0x1d5/0x390 [intel_sgx]
[  656.111917]  sgx_vma_fault+0x17/0x40 [intel_sgx]
[  656.112517]  __do_fault+0x1c/0x60
[  656.112916]  __handle_mm_fault+0x98c/0xeb0
[  656.113385]  ? set_next_entity+0x109/0x6e0
[  656.113876]  handle_mm_fault+0xcc/0x1c0
[  656.114423]  __do_page_fault+0x262/0x4f0
[  656.114956]  do_page_fault+0x2e/0xe0
[  656.115488]  do_async_page_fault+0x1a/0x80
[  656.116071]  async_page_fault+0x22/0x30
[  656.118384] RIP: 0033:0x5db36e
[  656.120406] RSP: 002b:7f42ea7fbbf0 EFLAGS: 0202
[  656.121970] RAX: 0003 RBX: 7f42e624e000 RCX: 005db36e
[  656.123512] RDX:  RSI:  RDI: 
[  656.125023] RBP: 7f42ea7fbc40 R08:  R09: 
[  656.126369] R10:  R11:  R12: 
[  656.127581] R13:  R14:  R15: 
[  656.128812] Code: 02 00 00 48 c7 85 68 ff ff ff 00 00 00 00 31 db 80 7d 8c 00
[  656.132076] RIP: sgx_eldu+0xc1/0x3c0 [intel_sgx] RSP: acd9473e3c40
[  656.133211] CR2: 000480f23000

Re: [intel-sgx-kernel-dev] [PATCH v7 4/8] intel_sgx: driver for Intel Software Guard Extensions

2017-12-12 Thread Sean Christopherson
On Fri, 2017-12-08 at 07:31 -0800, Christopherson, Sean J wrote:
> Jarkko Sakkinen  wrote:
> > On Thu, Dec 07, 2017 at 02:46:39PM +, Christopherson, Sean J wrote:
> > > > + for (i = 0; i < 2; i++) {
> > > > + va_page = list_first_entry(&encl->va_pages,
> > > > +struct sgx_va_page, list);
> > > > + va_offset = sgx_alloc_va_slot(va_page);
> > > > + if (va_offset < PAGE_SIZE)
> > > > + break;
> > > > +
> > > > + list_move_tail(&va_page->list, &encl->va_pages);
> > > > + }
> > > 
> > > This is broken, there is no guarantee that the next VA page will have
> > > a free slot.  You have to walk over all VA pages to guarantee a slot
> > > is found, e.g. this caused EWB and ELDU errors.
> > 
> > I did run some extensive stress tests on this and did not experience any
> > issues. Full VA pages are always put to the end. Please point me to the
> > test where this breaks so that I can fix the issue if it persists.
> 
> Three VA pages in the enclave: A, B and C.  Evict all pages in the
> enclave, i.e. consume all slots in A, B and C.  The list can be in
> any order at this point, but for the sake of argument let's say the
> order is C->A->B, i.e. C was originally the last VA page in the list.
> Fault in page X, whose VA is in B.  Evict X.  This code looks at C
> and A, and finds no available slot, but continues with VA page A and
> a va_offset of PAGE_SIZE.

So it looks like you avoid the described case by moving B to the head of
the list in sgx_eldu.  The bug I am seeing is still straightforward to
theorize:

1. Three VA pages.  List = A->B->C
2. Fill A and B, use one entry in C.  List = C->B->A
3. ELDU, freeing a slot in B.  List = B->C->A
4. EWB, consuming the last slot in B.  List = B->C->A
5. ELDU, freeing a slot in A.  List = A->B->C
6. EWB, consuming the last slot in A.  List = A->B->C
7. ELDU, but both A and B are full
8. Explode
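
The fix is to scan every VA page rather than just the first two list entries,
e.g. something along these lines (the types and sgx_alloc_va_slot() are the
ones from the patch, the function name is illustrative):

static struct sgx_va_page *sgx_find_va_slot(struct sgx_encl *encl,
					    unsigned int *va_offset)
{
	struct sgx_va_page *va_page;
	unsigned int offset;

	list_for_each_entry(va_page, &encl->va_pages, list) {
		offset = sgx_alloc_va_slot(va_page);
		if (offset < PAGE_SIZE) {
			*va_offset = offset;
			return va_page;
		}
	}

	return NULL;	/* every VA page is full */
}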


Re: [intel-sgx-kernel-dev] [PATCH v5 08/11] intel_sgx: in-kernel launch enclave

2017-11-14 Thread Sean Christopherson
On Mon, 2017-11-13 at 21:45 +0200, Jarkko Sakkinen wrote:
> This commits implements the in-kernel launch enclave. It is wrapped into
> a user space program that reads SIGSTRUCT instances from stdin and
> outputs launch tokens to stdout.
> 
> The commit also adds enclave signing tool that is used by kbuild to
> measure and sign the launch enclave.
> 
> CONFIG_INTEL_SGX_SIGNING_KEY points to a PEM-file for the 3072-bit RSA
> key that is used as the LE public key pair. The default location is:
> 
>   drivers/platform/x86/intel_sgx/intel_sgx_signing_key.pem

Unless there is some conflict you are worried about, "signing_key.pem" is
preferable as the default name so that the key is ignored via the top-level
.gitignore.  The intel_sgx dir should also have a .gitignore to exclude the
other LE-related output files:

drivers/platform/x86/intel_sgx/le/enclave/sgx_le.ss
drivers/platform/x86/intel_sgx/le/enclave/sgxsign
drivers/platform/x86/intel_sgx/le/sgx_le_proxy

> If the default key does not exist kbuild will generate a random key and
> place it to this location. KBUILD_SGX_SIGN_PIN can be used to specify
> the passphrase for the LE public key.
> 
> TinyCrypt (https://github.com/01org/tinycrypt) is used as AES
> implementation, which is not timing resistant. Eventually this needs to
> be replaced with AES-NI based implementation that could be either
> 
> - re-use existing AES-NI code in the kernel
> - have its own hand written code
> 
> Signed-off-by: Jarkko Sakkinen 
> ---



Re: [intel-sgx-kernel-dev] [PATCH v5 06/11] intel_sgx: driver for Intel Software Guard Extensions

2017-11-14 Thread Sean Christopherson
On Mon, 2017-11-13 at 21:45 +0200, Jarkko Sakkinen wrote:
> Intel SGX is a set of CPU instructions that can be used by applications
> to set aside private regions of code and data.  The code outside the
> enclave is disallowed to access the memory inside the enclave by the CPU
> access control.
> 
> SGX driver provides a ioctl API for loading and initializing enclaves.
> Address range for enclaves is reserved with mmap() and they are
> destroyed with munmap(). Enclave construction, measurement and
> initialization is done with the provided the ioctl API.
> 
> The driver implements also a swapper thread ksgxswapd for EPC pages
> backed by a private shmem file. Currently it has a limitation of not
> swapping VA pages but there is nothing preventing to implement it later
> on. Now it was scoped out in order to keep the implementation simple.
> 
> The parameter struct for SGX_IOC_ENCLAVE_INIT does not contain a
> parameter to supply a launch token. Generating and using tokens is best
> to be kept in the control of the kernel because it has direct binding to
> the IA32_SGXPUBKEYHASHx MSRs (a core must have MSRs set to the same
> value as the signer of token).
> 
> By giving user space any role in the launch process is a risk for
> introducing bottlenecks as kernel must exhibit behavior that user space
> launch daemon depends on

What do you mean by bottlenecks?  Assuming you're referring to performance
bottlenecks, this statement is flat out false.  Moving the launch enclave into
the kernel introduces performance bottlenecks, e.g. as implemented, a single LE
services all EINIT requests and is protected by a mutex.  That is the very
definition of a bottleneck.

The kernel will never be as performant as userspace when it comes to EINIT
tokens because userspace can make informed decisions based on its usage model,
e.g. number of LEs (or LE threads) to spawn, LE and token lifecycles, LE and
token thread safety, etc...

> , properietary risks (closed launch daemons on
> closed platforms) 

This justifies the need for the kernel to be able to generate launch tokens, but
I don't think allowing userspace to also provide its own tokens adds any
proprietary risks.

> and stability risks as there would be division of
> semantics between user space and kernel.
> 

What exactly are the stability risks?  The token itself is architecturally
defined and isn't fundamentally different than e.g. the sigstruct.  Writing the
LE hash MSRs as needed, e.g. for userspace LEs, isn't difficult.

> Signed-off-by: Jarkko Sakkinen 
> ---



Re: [intel-sgx-kernel-dev] [PATCH v5 10/11] intel_sgx: glue code for in-kernel LE

2017-11-14 Thread Sean Christopherson
On Mon, 2017-11-13 at 21:45 +0200, Jarkko Sakkinen wrote:
> --- a/drivers/platform/x86/intel_sgx/sgx_main.c
> +++ b/drivers/platform/x86/intel_sgx/sgx_main.c
> @@ -88,6 +88,37 @@ u64 sgx_encl_size_max_64;
>  u64 sgx_xfrm_mask = 0x3;
>  u32 sgx_misc_reserved;
>  u32 sgx_xsave_size_tbl[64];
> +bool sgx_unlocked_msrs;
> +u64 sgx_le_pubkeyhash[4];
> +
> +static DECLARE_RWSEM(sgx_file_sem);
> +
> +static int sgx_open(struct inode *inode, struct file *file)
> +{
> + int ret;
> +
> + down_read(&sgx_file_sem);
> +
> + ret = sgx_le_start(&sgx_le_ctx);
> + if (ret) {
> + up_read(&sgx_file_sem);
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int sgx_release(struct inode *inode, struct file *file)
> +{
> + up_read(&sgx_file_sem);
> +
> + if (down_write_trylock(&sgx_file_sem)) {
> + sgx_le_stop(&sgx_le_ctx);
> + up_write(&sgx_file_sem);
> + }
> +
> + return 0;
> +}

This semaphore approach is broken due to the LE process using an anon inode for
/dev/sgx, which results in sgx_release being called without an accompanying call
to sgx_open.  This causes deadlocks due to a semaphore underrun.

https://lists.01.org/pipermail/intel-sgx-kernel-dev/2017-November/000901.html

[  242.659272] INFO: task lsdt:9425 blocked for more than 120 seconds.
[  242.659783]   Not tainted 4.14.0-rc4+ #18
[  242.660063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
[  242.660558] lsdtD0  9425  1 0x0004
[  242.660559] Call Trace:
[  242.660564]  __schedule+0x3c2/0x8b0
[  242.660567]  schedule+0x36/0x80
[  242.660568]  rwsem_down_read_failed+0x10a/0x170
[  242.660569]  call_rwsem_down_read_failed+0x18/0x30
[  242.660570]  ? call_rwsem_down_read_failed+0x18/0x30
[  242.660571]  down_read+0x20/0x40
[  242.660572]  sgx_open+0x19/0x40 [intel_sgx]
[  242.660574]  chrdev_open+0xbf/0x1b0
[  242.660576]  do_dentry_open+0x1f8/0x300
[  242.660577]  ? cdev_put+0x30/0x30
[  242.660578]  vfs_open+0x4f/0x70
[  242.660579]  path_openat+0x2ae/0x13a0
[  242.660581]  ? mem_cgroup_uncharge_swap+0x60/0x90
[  242.660582]  do_filp_open+0x99/0x110
[  242.660583]  ? __check_object_size+0xfc/0x1a0
[  242.660585]  ? __alloc_fd+0xb0/0x170
[  242.660586]  do_sys_open+0x124/0x210
[  242.660587]  ? do_sys_open+0x124/0x210
[  242.660588]  SyS_open+0x1e/0x20
[  242.660589]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  242.660590] RIP: 0033:0x7f426cf9ec7d
[  242.660591] RSP: 002b:7f426b31ea60 EFLAGS: 0293 ORIG_RAX: 
[  242.660592] RAX: ffda RBX: 00c4200ba000 RCX: 7f426cf9ec7d
[  242.660592] RDX:  RSI: 0002 RDI: 0068cca7
[  242.660593] RBP: 7f426b31ec10 R08: 00f6bc30 R09: 
[  242.660593] R10: 7f426478 R11: 0293 R12: 0001
[  242.660594] R13:  R14: 7f426d31b13d R15: 7f42640008c0

>  #ifdef CONFIG_COMPAT
>  long sgx_compat_ioctl(struct file *filep, unsigned int cmd, unsigned long
> arg)
> @@ -141,8 +172,10 @@ static unsigned long sgx_get_unmapped_area(struct file
> *file,
>   return addr;
>  }
>  
> -static const struct file_operations sgx_fops = {
> +const struct file_operations sgx_fops = {
>   .owner  = THIS_MODULE,
> + .open   = sgx_open,
> + .release= sgx_release,
>   .unlocked_ioctl = sgx_ioctl,
>  #ifdef CONFIG_COMPAT
>   .compat_ioctl   = sgx_compat_ioctl,



Re: [PATCH v11 11/13] intel_sgx: ptrace() support

2018-06-11 Thread Sean Christopherson
On Fri, 2018-06-08 at 11:34 -0700, Dave Hansen wrote:
> On 06/08/2018 10:09 AM, Jarkko Sakkinen wrote:
> > 
> > +   ret = sgx_edbgrd(encl, entry, align, data);
> > +   if (ret)
> > +   break;
> > +   if (write) {
> > +   memcpy(data + offset, buf + i, cnt);
> > +   ret = sgx_edbgwr(encl, entry, align, data);
> > +   if (ret)
> > +   break;
> > +   }
> > +   else
> > +   memcpy(buf + i,data + offset, cnt);
> > +   }
> The SGX instructions like "edbgrd" would be great to put on a license plate,
> but we can do better in the kernel.  Can you give these reasonable
> english names, please?  sgx_debug_write(), maybe?

IMO the function names for ENCLS leafs are appropriate.  The real
issue is the lack of documentation of the ENCLS helpers and their
naming conventions.

The sgx_ functions, e.g. sgx_edbgrd(), are essentially direct
invocations of the specific leaf, i.e. they are dumb wrappers to
the lower level leaf functions, e.g. __edbgrd().  The wrappers exist
primarily to deal with the boilerplate necessary to access a page in
the EPC.  sgx_ conveys that the function contains the preamble
and/or postamble needed to execute its leaf, but otherwise does not
contain any logic.
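
Roughly this shape (the signatures below are assumptions for illustration,
not the driver's actual prototypes):

static int sgx_edbgrd_sketch(struct sgx_epc_page *epc_page,
			     unsigned long offset, void *data)
{
	void *va;
	int ret;

	va = sgx_get_page(epc_page);		/* preamble: map the EPC page */
	ret = __edbgrd(va + offset, data);	/* dumb leaf invocation, no logic */
	sgx_put_page(va);			/* postamble: unmap */

	return ret;
}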

Functions with actual logic do have English names, e.g.
sgx_encl_init(), sgx_encl_add_page(), sgx_encl_modify_pages() etc...

> Note that we have plenty of incomprehensible instruction names in the
> kernel like "wrpkru", but we do our best to keep them as confined as
> possible and make sure they don't hurt code readability.


Re: [intel-sgx-kernel-dev] [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache

2018-06-11 Thread Sean Christopherson
On Sat, 2018-06-09 at 22:32 -0700, Andy Lutomirski wrote:
> On Fri, Jun 8, 2018 at 10:22 AM Jarkko Sakkinen
>  wrote:
> > 
> > 
> > SGX has a set of data structures to maintain information about the enclaves
> > and their security properties. BIOS reserves a fixed size region of
> > physical memory for these structures by setting Processor Reserved Memory
> > Range Registers (PRMRR). This memory area is called Enclave Page Cache
> > (EPC).
> > 
> > 
> > +/**
> > + * sgx_einit - EINIT an enclave with the appropriate LE pubkey hash
> > + * @sigstruct: a pointer to the enclave's sigstruct
> > + * @token: a pointer to the enclave's EINIT token
> > + * @secs_page: a pointer to the enclave's SECS EPC page
> > + * @le_pubkey_hash:the desired LE pubkey hash for EINIT
> > + */
> > +int sgx_einit(struct sgx_sigstruct *sigstruct, struct sgx_einittoken 
> > *token,
> > + struct sgx_epc_page *secs_page, u64 le_pubkey_hash[4])
> > +{
> > +   u64 __percpu *cache;
> > +   void *secs;
> > +   int i, ret;
> > +
> > +   secs = sgx_get_page(secs_page);
> > +
> > +   if (!sgx_lc_enabled) {
> I'm confused.  What does this code path do?  It kind of looks like the
> driver will load and just malfunction if we don't have write access to
> the MSRs.  What is the intended behavior?

The driver will also allow itself to load if the MSRs are read-only,
but only if the MSRs' pubkey hash matches that of its launch enclave,
i.e. the system has been pre-configured for the kernel's LE.  Whether
or not that is a valid scenario is probably a different discussion.
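
Conceptually the read-only case boils down to something like this
(hypothetical helper, not actual driver code):

static bool sgx_le_pubkey_hash_matches(const u64 le_hash[4])
{
	u64 val;
	int i;

	/* With locked, read-only MSRs the driver can only work if they
	 * already contain the hash of its own launch enclave. */
	for (i = 0; i < 4; i++) {
		rdmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, val);
		if (val != le_hash[i])
			return false;
	}

	return true;
}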

> > +   ret = __einit(sigstruct, token, secs);
> > +   goto out;
> > +   }
> > +
> > +   cache = per_cpu(sgx_le_pubkey_hash_cache, smp_processor_id());
> > +
> > +   preempt_disable();
> > +   for (i = 0; i < 4; i++) {
> > +   if (le_pubkey_hash[i] == cache[i])
> > +   continue;
> > +
> > +   wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, le_pubkey_hash[i]);
> > +   cache[i] = le_pubkey_hash[i];
> > +   }
> > +   ret = __einit(sigstruct, token, secs);
> > +   preempt_enable();
> > +
> > +out:
> > +   sgx_put_page(secs);
> > +   return ret;
> > +}
> > +EXPORT_SYMBOL(sgx_einit);
> > +


Re: [PATCH v11 06/13] crypto: aesni: add minimal build option for SGX LE

2018-06-11 Thread Sean Christopherson
On Fri, 2018-06-08 at 10:27 -0700, Dave Hansen wrote:
> On 06/08/2018 10:09 AM, Jarkko Sakkinen wrote:
> > 
> > --- a/arch/x86/crypto/aesni-intel_asm.S
> > +++ b/arch/x86/crypto/aesni-intel_asm.S
> > @@ -45,6 +45,8 @@
> >  #define MOVADQ movaps
> >  #define MOVUDQ movups
> >  
> > +#ifndef AESNI_INTEL_MINIMAL
> > +
> >  #ifdef __x86_64__
> >  
> >  # constants in mergeable sections, linker can reorder and merge
> > @@ -133,6 +135,8 @@ ALL_F:  .octa 0x
> >  #define keysize 2*15*16(%arg1)
> >  #endif
> >  
> > +#endif /* AESNI_INTEL_MINIMAL */
> > +
> I'd really prefer that these get moved into a separate file rather than
> a scattered set of #ifdefs.  This just seem fragile to me.
> 
> Can we have a "aesni-intel_asm-minimal.S"?  Or, at least bunch the
> minimal set of things *together*?

A separate file doesn't seem appropriate because there is no criteria
for including code in the "minimal" build beyond "this code happens to
be needed by SGX".  I considered having SGX somewhere in the define
but opted for AESNI_INTEL_MINIMAL on the off chance that the minimal
build was useful for something other than SGX.

I'm not opposed to bunching the minimal stuff together, my intent was
simply to disturb the code as little as possible.


Re: [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache

2018-06-19 Thread Sean Christopherson
On Tue, Jun 19, 2018 at 05:57:53PM +0300, Jarkko Sakkinen wrote:
> On Fri, Jun 08, 2018 at 11:24:12AM -0700, Dave Hansen wrote:
> > On 06/08/2018 10:09 AM, Jarkko Sakkinen wrote:
> > > +static __init bool sgx_is_enabled(bool *lc_enabled)
> > >  {
> > >   unsigned long fc;
> > >  
> > > @@ -41,12 +466,26 @@ static __init bool sgx_is_enabled(void)
> > >   if (!(fc & FEATURE_CONTROL_SGX_ENABLE))
> > >   return false;
> > >  
> > > + *lc_enabled = !!(fc & FEATURE_CONTROL_SGX_LE_WR);
> > > +
> > >   return true;
> > >  }
> > 
> > I'm baffled why lc_enabled is connected to the enclave page cache.
> 
> KVM works only with writable MSRs. Driver works both with writable
> and read-only MSRs.

That's not true, KVM can/will support SGX regardless of whether or not
Launch Control (LC) is available and/or enabled.  KVM does need to
know whether or not LC is enabled.

Back to Dave's question, LC isn't connected to the EPC management,
the LC code should be split into a separate patch.
 
> Thanks, I'll try my best to deal with all this :-)
> 
> /Jarkko


Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave

2018-06-25 Thread Sean Christopherson
On Mon, Jun 25, 2018 at 05:00:05PM -0400, Nathaniel McCallum wrote:
> On Thu, Jun 21, 2018 at 5:21 PM Sean Christopherson
>  wrote:
> >
> > On Thu, Jun 21, 2018 at 03:11:18PM -0400, Nathaniel McCallum wrote:
> > > If this is acceptable for everyone, my hope is the following:
> > >
> > > 1. Intel would split the existing code into one of the following
> > > schemas (I don't care which):
> > >   A. three parts: UEFI module, FLC-only kernel driver and user-space
> > > launch enclave
> > >   B. two parts: UEFI module (including launch enclave) and FLC-only
> > > kernel driver
> >
> > To make sure I understand correctly...
> >
> > The UEFI module would lock the LE MSRs with a public key hardcoded
> > into both the UEFI module and the kernel at build time?
> >
> > And for the kernel, it would only load its SGX driver if FLC is
> > supported and the MSRs are locked to the expected key?
> >
> > IIUC, this approach will cause problems for virtualization.  Running
> > VMs with different LE keys would require the bare metal firmware to
> > configure the LE MSRs to be unlocked, which would effectively make
> > using SGX in the host OS mutually exlusive with exposing SGX to KVM
> > guests.  Theoretically it would be possible for KVM to emulate the
> > guest's LE and use the host's LE to generate EINIT tokens, but
> > emulating an enclave would likely require a massive amount of code
> > and/or complexity.
> 
> How is this different from any other scenario where you lock the LE
> MSRs? Unless Intel provides hardware support between the LE MSRs and
> the VMX instructions, I don't see any way around this besides letting
> any launch enclave run.

It's not.  All prior discussions have effectively required unlocked
MSRs, so that's my baseline.

> > > 2. Intel would release a reproducible build of the GPL UEFI module
> > > sources signed with a SecureBoot trusted key and provide an
> > > acceptable[0] binary redistribution license.
> > >
> > > 3. The kernel community would agree to merge the kernel driver given
> > > the above criteria (and, obviously, acceptable kernel code).
> > >
> > > The question of how to distribute the UEFI module and possible launch
> > > enclave remains open. I see two options: independent distribution and
> > > bundling it in linux-firmware. The former may be a better
> > > technological fit since the UEFI module will likely need to be run
> > > before the kernel (and the boot loader; and shim). However, the latter
> > > has the benefit of already being a well-known entity to our downstream
> > > distributors. I could go either way on this.
> >
> > Writing and locking the LE MSRs effectively needs to be done before
> > running the bootloader/kernel/etc...  Delaying activation would
> > require, at a minimum, leaving IA32_FEATURE_CONTROL unlocked since
> > IA32_FEATURE_CONTROL's SGX bits can't be set until SGX is activated.
> >
> > > I know this plan is more work for everyone involved, but I think it
> > > manages to actually maximize both security and freedom.
> > >
> > > [0]: details here -
> > > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/README#n19
> > > On Thu, Jun 21, 2018 at 11:29 AM Neil Horman  wrote:
> > > >
> > > > On Thu, Jun 21, 2018 at 08:32:25AM -0400, Nathaniel McCallum wrote:
> > > > > On Wed, Jun 20, 2018 at 5:02 PM Sean Christopherson
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, Jun 20, 2018 at 11:39:00AM -0700, Jethro Beekman wrote:
> > > > > > > On 2018-06-20 11:16, Jethro Beekman wrote:
> > > > > > > > > This last bit is also repeated in different words in Table 
> > > > > > > > > 35-2 and
> > > > > > > > > Section 42.2.2. The MSRs are *not writable* before the 
> > > > > > > > > write-lock bit
> > > > > > > > > itself is locked. Meaning the MSRs are either locked with 
> > > > > > > > > Intel's key
> > > > > > > > > hash, or not locked at all.
> > > > > > >
> > > > > > > Actually, this might be a documentation bug. I have some test 
> > > > > > > hardware and I
> > > > > > > was able to configure the MSRs in the BIOS and then read the MSRs 
> > > > > > > after boot
> > > > > > > like this:
> > >

Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave

2018-06-20 Thread Sean Christopherson
On Wed, Jun 20, 2018 at 11:39:00AM -0700, Jethro Beekman wrote:
> On 2018-06-20 11:16, Jethro Beekman wrote:
> > > This last bit is also repeated in different words in Table 35-2 and
> > > Section 42.2.2. The MSRs are *not writable* before the write-lock bit
> > > itself is locked. Meaning the MSRs are either locked with Intel's key
> > > hash, or not locked at all.
> 
> Actually, this might be a documentation bug. I have some test hardware and I
> was able to configure the MSRs in the BIOS and then read the MSRs after boot
> like this:
> 
> MSR 0x3a 0x00040005
> MSR 0x8c 0x20180620
> MSR 0x8d 0x20180620
> MSR 0x8e 0x20180620
> MSR 0x8f 0x20180620
> 
> Since this is not production hardware, it could also be a CPU bug of course.
> 
> If it is indeed possible to configure AND lock the MSR values to non-Intel
> values, I'm very much in favor of Nathaniels proposal to treat the launch
> enclave like any other firmware blob.

It's not a CPU or documentation bug (though the latter is arguable).
SGX has an activation step that is triggered by doing a WRMSR(0x7a)
with bit 0 set.  Until SGX is activated, the SGX related bits in
IA32_FEATURE_CONTROL cannot be set, i.e. SGX can't be enabled.  But,
the LE hash MSRs are fully writable prior to activation, e.g. to
allow firmware to lock down the LE key with a non-Intel value.

So yes, it's possible to lock the MSRs to a non-Intel value.  The
obvious caveat is that whatever blob is used to write the MSRs would
need to be executed prior to activation.
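
Firmware-side, the flow described above looks roughly like this (a hedged
sketch; the activation MSR macro and function names are illustrative, only
the 0x7a MSR number and bit 0 come from the description above):

#define SGX_ACTIVATION_MSR	0x7a		/* illustrative name for MSR 0x7a */
#define SGX_ACTIVATE_BIT	(1ULL << 0)

static void lock_le_hash_and_activate(const u64 hash[4])
{
	u64 val;
	int i;

	/* The LE hash MSRs are freely writable until SGX is activated. */
	for (i = 0; i < 4; i++)
		wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, hash[i]);

	/* Trigger SGX activation; from here on the hash MSRs follow the
	 * SGX_LC / feature-control-lock rules. */
	rdmsrl(SGX_ACTIVATION_MSR, val);
	wrmsrl(SGX_ACTIVATION_MSR, val | SGX_ACTIVATE_BIT);
}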

As for the SDM, it's a documentation... omission?  SGX activation
is intentionally omitted from the SDM.  The intended usage model is
that firmware will always do the activation (if it wants SGX enabled),
i.e. post-firmware software will only ever "see" SGX as disabled or
in the fully activated state, and so the SDM doesn't describe SGX
behavior prior to activation.  I believe the activation process, or
at least what is required from firmware, is documented in the BIOS
writer's guide.
 
> Jethro Beekman | Fortanix
> 




Re: [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache

2018-06-20 Thread Sean Christopherson
On Fri, 2018-06-08 at 19:09 +0200, Jarkko Sakkinen wrote:
> SGX has a set of data structures to maintain information about the enclaves
> and their security properties. BIOS reserves a fixed size region of
> physical memory for these structures by setting Processor Reserved Memory
> Range Registers (PRMRR). This memory area is called Enclave Page Cache
> (EPC).
> 
> This commit implements the basic routines to allocate and free pages from
> different EPC banks. There is also a swapper thread ksgxswapd for EPC pages
> that gets woken up by sgx_alloc_page() when we run below the low watermark.
> The swapper thread continues swapping pages up until it reaches the high
> watermark.
> 
> Each subsystem that uses SGX must provide a set of callbacks for EPC
> pages that are used to reclaim, block and write an EPC page. Kernel
> takes the responsibility of maintaining LRU cache for them.
> 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  arch/x86/include/asm/sgx.h  |  67 +
>  arch/x86/include/asm/sgx_arch.h | 224 
>  arch/x86/kernel/cpu/intel_sgx.c | 443 +++-
>  3 files changed, 732 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/include/asm/sgx_arch.h

...

> +struct sgx_pcmd {
> + struct sgx_secinfo secinfo;
> + uint64_t enclave_id;
> + uint8_t reserved[40];
> + uint8_t mac[16];
> +};

sgx_pcmd has a 128-byte alignment requirement.  I think it's
worth specifying here as sgx_pcmd is small enough that it could
be put on the stack, e.g. by KVM when trapping and executing
ELD* on behalf of a guest VM.

In fact, it probably makes sense to add alignment attributes
to all SGX structs for self-documentation purposes, even though
many of them will never be allocated statically or on the stack.
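
E.g. the quoted struct with a self-documenting alignment attribute (a sketch;
the 128-byte figure is the architectural PCMD alignment mentioned above):

struct sgx_pcmd {
	struct sgx_secinfo secinfo;
	uint64_t enclave_id;
	uint8_t reserved[40];
	uint8_t mac[16];
} __attribute__((__aligned__(128)));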

> +
> +#define SGX_MODULUS_SIZE 384
> +
> +struct sgx_sigstruct_header {
> + uint64_t header1[2];
> + uint32_t vendor;
> + uint32_t date;
> + uint64_t header2[2];
> + uint32_t swdefined;
> + uint8_t reserved1[84];
> +};
> +
> +struct sgx_sigstruct_body {
> + uint32_t miscselect;
> + uint32_t miscmask;
> + uint8_t reserved2[20];
> + uint64_t attributes;
> + uint64_t xfrm;
> + uint8_t attributemask[16];
> + uint8_t mrenclave[32];
> + uint8_t reserved3[32];
> + uint16_t isvprodid;
> + uint16_t isvsvn;
> +} __attribute__((__packed__));
> +
> +struct sgx_sigstruct {
> + struct sgx_sigstruct_header header;
> + uint8_t modulus[SGX_MODULUS_SIZE];
> + uint32_t exponent;
> + uint8_t signature[SGX_MODULUS_SIZE];
> + struct sgx_sigstruct_body body;
> + uint8_t reserved4[12];
> + uint8_t q1[SGX_MODULUS_SIZE];
> + uint8_t q2[SGX_MODULUS_SIZE];
> +};
> +
> +struct sgx_sigstruct_payload {
> + struct sgx_sigstruct_header header;
> + struct sgx_sigstruct_body body;
> +};
> +
> +struct sgx_einittoken_payload {
> + uint32_t valid;
> + uint32_t reserved1[11];
> + uint64_t attributes;
> + uint64_t xfrm;
> + uint8_t mrenclave[32];
> + uint8_t reserved2[32];
> + uint8_t mrsigner[32];
> + uint8_t reserved3[32];
> +};
> +
> +struct sgx_einittoken {
> + struct sgx_einittoken_payload payload;
> + uint8_t cpusvnle[16];
> + uint16_t isvprodidle;
> + uint16_t isvsvnle;
> + uint8_t reserved2[24];
> + uint32_t maskedmiscselectle;
> + uint64_t maskedattributesle;
> + uint64_t maskedxfrmle;
> + uint8_t keyid[32];
> + uint8_t mac[16];
> +};
> +
> +struct sgx_report {
> + uint8_t cpusvn[16];
> + uint32_t miscselect;
> + uint8_t reserved1[28];
> + uint64_t attributes;
> + uint64_t xfrm;
> + uint8_t mrenclave[32];
> + uint8_t reserved2[32];
> + uint8_t mrsigner[32];
> + uint8_t reserved3[96];
> + uint16_t isvprodid;
> + uint16_t isvsvn;
> + uint8_t reserved4[60];
> + uint8_t reportdata[64];
> + uint8_t keyid[32];
> + uint8_t mac[16];
> +};
> +
> +struct sgx_targetinfo {
> + uint8_t mrenclave[32];
> + uint64_t attributes;
> + uint64_t xfrm;
> + uint8_t reserved1[4];
> + uint32_t miscselect;
> + uint8_t reserved2[456];
> +};
> +
> +struct sgx_keyrequest {
> + uint16_t keyname;
> + uint16_t keypolicy;
> + uint16_t isvsvn;
> + uint16_t reserved1;
> + uint8_t cpusvn[16];
> + uint64_t attributemask;
> + uint64_t xfrmmask;
> + uint8_t keyid[32];
> + uint32_t miscmask;
> + uint8_t reserved2[436];
> +};



Re: [PATCH v11 08/13] x86, sgx: added ENCLS wrappers

2018-06-20 Thread Sean Christopherson
On Fri, 2018-06-08 at 19:09 +0200, Jarkko Sakkinen wrote:
> This commit adds wrappers for Intel(R) SGX ENCLS opcode functionality.
> 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  arch/x86/include/asm/sgx.h | 198 +
>  1 file changed, 198 insertions(+)
> 
> diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
> index fa3e6e0eb8af..a2f727f85b91 100644
> --- a/arch/x86/include/asm/sgx.h
> +++ b/arch/x86/include/asm/sgx.h
> @@ -10,6 +10,10 @@
>  #ifndef _ASM_X86_SGX_H
>  #define _ASM_X86_SGX_H
>  
> +#include 
> +#include 
> +#include 
> +#include 
>  #include 
>  
>  #define SGX_CPUID 0x12
> @@ -20,6 +24,200 @@ enum sgx_cpuid {
>   SGX_CPUID_EPC_BANKS = 2,
>  };
>  
> +enum sgx_commands {

This should be something like "sgx_encls_leafs" and probably moved
to sgx_arch.h (as Dave alluded to, these are architectural values).
"sgx_commands" is not accurate (they're only the cpl0 "commands")
and will have collision issues in the future, e.g. with the ENCLV
instruction and its leafs.

> + ECREATE = 0x0,
> + EADD= 0x1,
> + EINIT   = 0x2,
> + EREMOVE = 0x3,
> + EDGBRD  = 0x4,
> + EDGBWR  = 0x5,
> + EEXTEND = 0x6,

Even though it's not used in the code (yet...), I think ELDB,
leaf 0x7, should be defined here for completeness.

> + ELDU= 0x8,
> + EBLOCK  = 0x9,
> + EPA = 0xA,
> + EWB = 0xB,
> + ETRACK  = 0xC,
> + EAUG= 0xD,
> + EMODPR  = 0xE,
> + EMODT   = 0xF,
> +};
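
A minimal sketch of what the renamed enum could look like with the
missing leaf added (illustrative only; the values and spellings are
taken straight from the quoted patch):

enum sgx_encls_leafs {
	ECREATE	= 0x0,
	EADD	= 0x1,
	EINIT	= 0x2,
	EREMOVE	= 0x3,
	EDGBRD	= 0x4,
	EDGBWR	= 0x5,
	EEXTEND	= 0x6,
	ELDB	= 0x7,
	ELDU	= 0x8,
	EBLOCK	= 0x9,
	EPA	= 0xA,
	EWB	= 0xB,
	ETRACK	= 0xC,
	EAUG	= 0xD,
	EMODPR	= 0xE,
	EMODT	= 0xF,
};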



Re: [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache

2018-06-20 Thread Sean Christopherson
On Fri, 2018-06-08 at 19:09 +0200, Jarkko Sakkinen wrote:
> SGX has a set of data structures to maintain information about the enclaves
> and their security properties. BIOS reserves a fixed size region of
> physical memory for these structures by setting Processor Reserved Memory
> Range Registers (PRMRR). This memory area is called Enclave Page Cache
> (EPC).
> 
> This commit implements the basic routines to allocate and free pages from
> different EPC banks. There is also a swapper thread ksgxswapd for EPC pages
> that gets woken up by sgx_alloc_page() when we run below the low watermark.
> The swapper thread continues swapping pages up until it reaches the high
> watermark.
> 
> Each subsystem that uses SGX must provide a set of callbacks for EPC
> pages that are used to reclaim, block and write an EPC page. Kernel
> takes the responsibility of maintaining LRU cache for them.
> 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  arch/x86/include/asm/sgx.h  |  67 +
>  arch/x86/include/asm/sgx_arch.h | 224 
>  arch/x86/kernel/cpu/intel_sgx.c | 443 +++-
>  3 files changed, 732 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/include/asm/sgx_arch.h

...

> diff --git a/arch/x86/kernel/cpu/intel_sgx.c b/arch/x86/kernel/cpu/intel_sgx.c
> index db6b315334f4..ae2b5c5b455f 100644
> --- a/arch/x86/kernel/cpu/intel_sgx.c
> +++ b/arch/x86/kernel/cpu/intel_sgx.c
> @@ -14,14 +14,439 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
> +#define SGX_NR_TO_SCAN   16
> +#define SGX_NR_LOW_PAGES 32
> +#define SGX_NR_HIGH_PAGES 64
> +
>  bool sgx_enabled __ro_after_init = false;
>  EXPORT_SYMBOL(sgx_enabled);
> +bool sgx_lc_enabled __ro_after_init;
> +EXPORT_SYMBOL(sgx_lc_enabled);
> +atomic_t sgx_nr_free_pages = ATOMIC_INIT(0);
> +EXPORT_SYMBOL(sgx_nr_free_pages);
> +struct sgx_epc_bank sgx_epc_banks[SGX_MAX_EPC_BANKS];
> +EXPORT_SYMBOL(sgx_epc_banks);
> +int sgx_nr_epc_banks;
> +EXPORT_SYMBOL(sgx_nr_epc_banks);
> +LIST_HEAD(sgx_active_page_list);
> +EXPORT_SYMBOL(sgx_active_page_list);
> +DEFINE_SPINLOCK(sgx_active_page_list_lock);
> +EXPORT_SYMBOL(sgx_active_page_list_lock);

I don't think we should be exporting anything other than sgx_enabled
and sgx_lc_enabled.  The only external use of a symbol that can't be
trivially (re)moved is in the driver's sgx_pm_suspend() in sgx_main.c,
which uses the sgx_active_page_list to invalidate enclaves.  And that
behavior seems unsafe, e.g. an enclave could theoretically have zero
pages on the active list and so could be missed in the suspend flow.

> +static struct task_struct *ksgxswapd_tsk;
> +static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);


Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave

2018-06-21 Thread Sean Christopherson
On Thu, Jun 21, 2018 at 03:11:18PM -0400, Nathaniel McCallum wrote:
> If this is acceptable for everyone, my hope is the following:
> 
> 1. Intel would split the existing code into one of the following
> schemas (I don't care which):
>   A. three parts: UEFI module, FLC-only kernel driver and user-space
> launch enclave
>   B. two parts: UEFI module (including launch enclave) and FLC-only
> kernel driver

To make sure I understand correctly...

The UEFI module would lock the LE MSRs with a public key hardcoded
into both the UEFI module and the kernel at build time?

And for the kernel, it would only load its SGX driver if FLC is
supported and the MSRs are locked to the expected key?

IIUC, this approach will cause problems for virtualization.  Running
VMs with different LE keys would require the bare metal firmware to
configure the LE MSRs to be unlocked, which would effectively make
using SGX in the host OS mutually exclusive with exposing SGX to KVM
guests.  Theoretically it would be possible for KVM to emulate the
guest's LE and use the host's LE to generate EINIT tokens, but
emulating an enclave would likely require a massive amount of code
and/or complexity.

> 2. Intel would release a reproducible build of the GPL UEFI module
> sources signed with a SecureBoot trusted key and provide an
> acceptable[0] binary redistribution license.
> 
> 3. The kernel community would agree to merge the kernel driver given
> the above criteria (and, obviously, acceptable kernel code).
> 
> The question of how to distribute the UEFI module and possible launch
> enclave remains open. I see two options: independent distribution and
> bundling it in linux-firmware. The former may be a better
> technological fit since the UEFI module will likely need to be run
> before the kernel (and the boot loader; and shim). However, the latter
> has the benefit of already being a well-known entity to our downstream
> distributors. I could go either way on this.

Writing and locking the LE MSRs effectively needs to be done before
running the bootloader/kernel/etc...  Delaying activation would
require, at a minimum, leaving IA32_FEATURE_CONTROL unlocked since
IA32_FEATURE_CONTROL's SGX bits can't be set until SGX is activated.

> I know this plan is more work for everyone involved, but I think it
> manages to actually maximize both security and freedom.
> 
> [0]: details here -
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/README#n19
> On Thu, Jun 21, 2018 at 11:29 AM Neil Horman  wrote:
> >
> > On Thu, Jun 21, 2018 at 08:32:25AM -0400, Nathaniel McCallum wrote:
> > > On Wed, Jun 20, 2018 at 5:02 PM Sean Christopherson
> > >  wrote:
> > > >
> > > > On Wed, Jun 20, 2018 at 11:39:00AM -0700, Jethro Beekman wrote:
> > > > > On 2018-06-20 11:16, Jethro Beekman wrote:
> > > > > > > This last bit is also repeated in different words in Table 35-2 
> > > > > > > and
> > > > > > > Section 42.2.2. The MSRs are *not writable* before the write-lock 
> > > > > > > bit
> > > > > > > itself is locked. Meaning the MSRs are either locked with Intel's 
> > > > > > > key
> > > > > > > hash, or not locked at all.
> > > > >
> > > > > Actually, this might be a documentation bug. I have some test 
> > > > > hardware and I
> > > > > was able to configure the MSRs in the BIOS and then read the MSRs 
> > > > > after boot
> > > > > like this:
> > > > >
> > > > > MSR 0x3a 0x00040005
> > > > > MSR 0x8c 0x20180620
> > > > > MSR 0x8d 0x20180620
> > > > > MSR 0x8e 0x20180620
> > > > > MSR 0x8f 0x20180620
> > > > >
> > > > > Since this is not production hardware, it could also be a CPU bug of 
> > > > > course.
> > > > >
> > > > > If it is indeed possible to configure AND lock the MSR values to 
> > > > > non-Intel
> > > > > values, I'm very much in favor of Nathaniels proposal to treat the 
> > > > > launch
> > > > > enclave like any other firmware blob.
> > > >
> > > > It's not a CPU or documentation bug (though the latter is arguable).
> > > > SGX has an activation step that is triggered by doing a WRMSR(0x7a)
> > > > with bit 0 set.  Until SGX is activated, the SGX related bits in
> > > > IA32_FEATURE_CONTROL cannot be set, i.e. SGX can't be enabled.  But,
> > > > the LE hash MSRs are fully writable prior to SGX activation.

[PATCH] x86/pkeys: Explicitly treat PK #PF on kernel address as a bad area

2018-08-07 Thread Sean Christopherson
Kernel addresses are always mapped with _PAGE_USER=0, i.e. PKRU
isn't enforced, and so we should never see X86_PF_PK set on a
kernel address fault.  WARN once to capture the issue in case we
somehow don't die, e.g. the access originated in userspace.

Remove a similar check and its comment from spurious_fault_check().
The intent of the comment (and later code[1]) was simply to document
that spurious faults due to protection keys should be impossible, but
that's irrelevant and a bit of a red herring since we should never
get a protection keys fault on a kernel address regardless of the
kernel's TLB flushing behavior.

[1] 
http://lists-archives.com/linux-kernel/28407455-x86-pkeys-new-page-fault-error-code-bit-pf_pk.html

Signed-off-by: Sean Christopherson 
Cc: Dave Hansen 
---
There's no indication that this condition has ever been encountered.
I came across the code in spurious_fault_check() and was confused as
to why we would unconditionally treat a protection keys fault as
spurious when the comment explicitly stated that such a case should
be impossible.

Dave Hansen suggested adding a WARN_ON_ONCE in spurious_fault_check(),
but it seemed more appropriate to freak out on any protection keys
fault on a kernel address since that would imply a hardware issue or
kernel bug.  I omitted a Suggested-by since this isn't necessarily
what Dave had in mind.

 arch/x86/mm/fault.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 2aafa6ab6103..f19a55972136 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1040,12 +1040,6 @@ static int spurious_fault_check(unsigned long 
error_code, pte_t *pte)
 
if ((error_code & X86_PF_INSTR) && !pte_exec(*pte))
return 0;
-   /*
-* Note: We do not do lazy flushing on protection key
-* changes, so no spurious fault will ever set X86_PF_PK.
-*/
-   if ((error_code & X86_PF_PK))
-   return 1;
 
return 1;
 }
@@ -1241,6 +1235,14 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
 * protection error (error_code & 9) == 0.
 */
if (unlikely(fault_in_kernel_space(address))) {
+   /*
+* We should never encounter a protection keys fault on a
+* kernel address as kernel addresses are always mapped with
+* _PAGE_USER=0, i.e. PKRU isn't enforced.
+*/
+   if (WARN_ON_ONCE(error_code & X86_PF_PK))
+   goto bad_kernel_address;
+
if (!(error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
if (vmalloc_fault(address) >= 0)
return;
@@ -1253,6 +1255,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
/* kprobes don't want to hook the spurious faults: */
if (kprobes_fault(regs))
return;
+
+bad_kernel_address:
/*
 * Don't take the mm semaphore here. If we fixup a prefetch
 * fault we could otherwise deadlock:
-- 
2.18.0



[PATCH] x86/speculation/l1tf: Exempt zeroed PTEs from XOR conversion

2018-08-16 Thread Sean Christopherson
clear_page() does not undergo the XOR logic to invert the address
bits, i.e. PTE, PMD and PUD entries that have not been individually
written will have val=0 and so will trigger __pte_needs_invert().
As a result, {pte,pmd,pud}_pfn() will return the wrong PFN value,
i.e. all ones (adjusted by the max PFN mask) instead of zero.
A zeroed entry is ok because the page at physical address 0 is
reserved early in boot specifically to mitigate L1TF, so explicitly
exempt them from the inversion when reading the PFN.

Manifested as an unexpected mprotect(..., PROT_NONE) failure when
called on a VMA that has VM_PFNMAP and was mmap'd to as something
other than PROT_NONE but never used.  mprotect() sends the PROT_NONE
request down prot_none_walk(), which walks the PTEs to check the PFNs.
prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
-EACCES because it thinks mprotect() is trying to adjust a high MMIO
address.

Fixes: 6b28baca9b1f ("x86/speculation/l1tf: Protect PROT_NONE PTEs against 
speculation")
Signed-off-by: Sean Christopherson 
Cc: Andi Kleen 
Cc: Thomas Gleixner 
Cc: Josh Poimboeuf 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Dave Hansen 
Cc: Greg Kroah-Hartman 
---
 arch/x86/include/asm/pgtable.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e4ffa565a69f..f21a1df4ca89 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -195,21 +195,24 @@ static inline u64 protnone_mask(u64 val);
 static inline unsigned long pte_pfn(pte_t pte)
 {
phys_addr_t pfn = pte_val(pte);
-   pfn ^= protnone_mask(pfn);
+   if (pfn)
+   pfn ^= protnone_mask(pfn);
return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
 }
 
 static inline unsigned long pmd_pfn(pmd_t pmd)
 {
phys_addr_t pfn = pmd_val(pmd);
-   pfn ^= protnone_mask(pfn);
+   if (pfn)
+   pfn ^= protnone_mask(pfn);
return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
 }
 
 static inline unsigned long pud_pfn(pud_t pud)
 {
phys_addr_t pfn = pud_val(pud);
-   pfn ^= protnone_mask(pfn);
+   if (pfn)
+   pfn ^= protnone_mask(pfn);
return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;
 }
 
-- 
2.18.0



[PATCH 1/2] KVM: vmx: Add defines for SGX ENCLS exiting

2018-08-14 Thread Sean Christopherson
Hardware support for basic SGX virtualization adds a new execution
control (ENCLS_EXITING), VMCS field (ENCLS_EXITING_BITMAP) and exit
reason (ENCLS), that enables a VMM to intercept specific ENCLS leaf
functions, e.g. to inject faults when the VMM isn't exposing SGX to
a VM.  When ENCLS_EXITING is enabled, the VMM can set/clear bits in
the bitmap to intercept/allow ENCLS leaf functions in non-root, e.g.
setting bit 2 in the ENCLS_EXITING_BITMAP will cause ENCLS[EINIT]
to VMExit(ENCLS).

Note: EXIT_REASON_ENCLS was previously added by commit 1f5199927034
("KVM: VMX: add missing exit reasons").

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/vmx.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 6aa8499e1f62..2665c10ece4c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -74,6 +74,7 @@
 #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 #define SECONDARY_EXEC_ENABLE_VMFUNC		0x00002000
 #define SECONDARY_EXEC_SHADOW_VMCS		0x00004000
+#define SECONDARY_EXEC_ENCLS_EXITING		0x00008000
 #define SECONDARY_EXEC_RDSEED_EXITING		0x00010000
 #define SECONDARY_EXEC_ENABLE_PML		0x00020000
 #define SECONDARY_EXEC_XSAVES			0x00100000
@@ -213,6 +214,8 @@ enum vmcs_field {
VMWRITE_BITMAP_HIGH = 0x2029,
XSS_EXIT_BITMAP = 0x202C,
XSS_EXIT_BITMAP_HIGH= 0x202D,
+   ENCLS_EXITING_BITMAP= 0x202E,
+   ENCLS_EXITING_BITMAP_HIGH   = 0x202F,
TSC_MULTIPLIER  = 0x2032,
TSC_MULTIPLIER_HIGH = 0x2033,
GUEST_PHYSICAL_ADDRESS  = 0x2400,
-- 
2.18.0
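
A hypothetical usage sketch, not part of this patch, of how a VMM
might consume the new defines to trap all ENCLS leafs for a guest
that isn't allowed to use SGX.  vmx_set_encls_intercept() and the
sgx_allowed flag are made-up placeholders; the vmcs_*() calls are the
usual VMCS accessors:

static void vmx_set_encls_intercept(bool sgx_allowed)
{
	u32 secondary = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);

	if (!sgx_allowed) {
		/*
		 * All ones exits on every ENCLS leaf so the VMM can e.g.
		 * inject #UD; setting only bit 2 would intercept just
		 * ENCLS[EINIT].
		 */
		secondary |= SECONDARY_EXEC_ENCLS_EXITING;
		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, secondary);
		vmcs_write64(ENCLS_EXITING_BITMAP, -1ull);
	}
}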



Re: [PATCH] x86/speculation/l1tf: Exempt zeroed PTEs from XOR conversion

2018-08-17 Thread Sean Christopherson
On Fri, Aug 17, 2018 at 09:13:51AM -0700, Linus Torvalds wrote:
> On Thu, Aug 16, 2018 at 1:47 PM Sean Christopherson
>  wrote:
> >
> > Fixes: 6b28baca9b1f ("x86/speculation/l1tf: Protect PROT_NONE PTEs against 
> > speculation")
> 
> This seems wrong.
> 
> That commit doesn't invert a cleared page table entry, because that
> commit still required _PAGE_PROTNONE being set for a pte to be
> inverted.
> 
> I'm assuming the real culprit is commit f22cc87f6c1f
> ("x86/speculation/l1tf: Invert all not present mappings") which made
> it look at _just_ the present bit.
> 
> And yeah, that was wrong.
> 
> So I really think a much better patch would be the appended one-liner.
> 
> Note - it's whitespace-damaged by cut-and-paste, but it should be
> obvious enough to apply by hand.
> 
> Can you test this one instead?

Checking for a non-zero val in __pte_needs_invert() also resolves the
issue.  I shied away from that change because prot_none_walk() doesn't
pass the full PTE to __pte_needs_invert(), it only passes the pgprot_t
bits.  This works because PAGE_NONE sets the global and accessed bits,
but it made me nervous nonetheless.

>  Linus
> ---
> 
>  arch/x86/include/asm/pgtable-invert.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/pgtable-invert.h
> b/arch/x86/include/asm/pgtable-invert.h
> index 44b1203ece12..821438e91b77 100644
> --- a/arch/x86/include/asm/pgtable-invert.h
> +++ b/arch/x86/include/asm/pgtable-invert.h
> @@ -6,7 +6,7 @@
> 
>  static inline bool __pte_needs_invert(u64 val)
>  {
> -   return !(val & _PAGE_PRESENT);
> +   return val && !(val & _PAGE_PRESENT);
>  }
> 
>  /* Get a mask to xor with the page table entry to get the correct pfn. */


Re: SEV guest regression in 4.18

2018-08-23 Thread Sean Christopherson
On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
> On 22/08/2018 22:11, Brijesh Singh wrote:
> > 
> > Yes, this is one of approach I have in mind. It will avoid splitting
> > the larger pages; I am thinking that early in boot code we can lookup
> > for this special section and decrypt it in-place and probably maps with
> > C=0. Only downside, it will increase data section footprint a bit
> > because we need to align this section to PMD_SIZE.
> 
> If you can ensure it doesn't span a PMD, maybe it does not need to be
> aligned; you could establish a C=0 mapping of the whole 2M around it.

Wouldn't that result in exposing/leaking whatever code/data happened
to reside on the same 2M page (or corrupting it if the entire page
isn't decrypted)?  Or are you suggesting that we'd also leave the
encrypted mapping intact?  If it's the latter...

Does hardware include the C-bit in the cache tag?  I.e are the C=0 and
C=1 variations of the same PA treated as different cache lines?  If
so, we could also treat the unencrypted variation as a separate PA by
defining it to be (ACTUAL_PA | (1 << x86_phys_bits)), (re)adjusting
x86_phys_bits if necessary to get the kernel to allow the address.
init_memory_mapping() could then alias every PA with an unencrypted
VA mapping, which would allow the kernel to access any PA unencrypted
by using virt_to_phys() and phys_to_virt() to translate an encrypted
VA to an unencrypted VA.  It would mean doubling INIT_PGD_PAGE_COUNT,
but that'd be a one-time cost regardless of how many pages needed to
be accessed with C=0.

> Paolo


Re: SEV guest regression in 4.18

2018-08-24 Thread Sean Christopherson
On Fri, Aug 24, 2018 at 10:41:27AM -0500, Brijesh Singh wrote:
> 
> 
> On 08/23/2018 11:16 AM, Paolo Bonzini wrote:
> >On 23/08/2018 17:29, Sean Christopherson wrote:
> >>On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote:
> >>>On 22/08/2018 22:11, Brijesh Singh wrote:
> >>>>
> >>>>Yes, this is one of approach I have in mind. It will avoid splitting
> >>>>the larger pages; I am thinking that early in boot code we can lookup
> >>>>for this special section and decrypt it in-place and probably maps with
> >>>>C=0. Only downside, it will increase data section footprint a bit
> >>>>because we need to align this section to PMD_SIZE.
> >>>
> >>>If you can ensure it doesn't span a PMD, maybe it does not need to be
> >>>aligned; you could establish a C=0 mapping of the whole 2M around it.
> >>
> >>Wouldn't that result in exposing/leaking whatever code/data happened
> >>to reside on the same 2M page (or corrupting it if the entire page
> >>isn't decrypted)?  Or are you suggesting that we'd also leave the
> >>encrypted mapping intact?
> >
> >Yes, exactly the latter, because...
> 
> 
> Hardware does not enforce coherency between the encrypted and
> unencrypted mapping for the same physical page. So, creating a
> two mapping of same physical address will lead a possible data
> corruption.

But couldn't we avoid corruption by ensuring data accessed via the
unencrypted mapping is cache line aligned and sized?  The CPU could
speculatively bring the encrypted version into the cache but it
should never get into a modified state (barring a software bug, but
that would be a problem regardless of encryption).

> Note, SME creates two mapping of the same physical address to perform
> in-place encryption of kernel and initrd images; this is a special case
> and APM documents steps on how to do this.
> 
> 
> >
> >>Does hardware include the C-bit in the cache tag?
> >
> >... the C-bit is effectively part of the physical address and hence of
> >the cache tag.  The kernel is already relying on this to properly
> >encrypt/decrypt pages, if I remember correctly.
> >
> >Paolo
> >


Re: [PATCH v13 09/13] x86/sgx: Enclave Page Cache (EPC) memory manager

2018-08-28 Thread Sean Christopherson
On Tue, Aug 28, 2018 at 02:26:36PM -0700, Dave Hansen wrote:
> On 08/28/2018 02:22 PM, Sean Christopherson wrote:
> > On Tue, Aug 28, 2018 at 07:07:33AM -0700, Dave Hansen wrote:
> >> On 08/28/2018 01:35 AM, Jarkko Sakkinen wrote:
> >>> On Mon, Aug 27, 2018 at 02:15:34PM -0700, Dave Hansen wrote:
> >>>> On 08/27/2018 11:53 AM, Jarkko Sakkinen wrote:
> >>>>> +struct sgx_epc_page_ops {
> >>>>> +   bool (*get)(struct sgx_epc_page *epc_page);
> >>>>> +   void (*put)(struct sgx_epc_page *epc_page);
> >>>>> +   bool (*reclaim)(struct sgx_epc_page *epc_page);
> >>>>> +   void (*block)(struct sgx_epc_page *epc_page);
> >>>>> +   void (*write)(struct sgx_epc_page *epc_page);
> >>>>> +};
> >>>> Why do we need a fancy, slow (retpoline'd) set of function pointers when
> >>>> we only have one user of these (the SGX driver)?
> >>> KVM has its own implementation for these operations.
> >>
> >> That belongs in the changelog.
> >>
> >> Also, where is the implementation?  How can we assess this code that was
> >> built to create an abstraction without both of the users?
> > 
> > I can provide an early preview of the KVM reclaim code, but honestly
> > I think that would do more harm than good.  The VMX architecture for
> > EPC reclaim is complex, even for SGX standards.  Opening that can of
> > worms would likely derail this discussion.  That being said, this
> > abstraction isn't exactly what KVM will need, but it's pretty close
> > and gives us something to build on.
> 
> Please remove the abstraction code.  We don't introduce infrastructure
> which no one will use.

The infrastructure is used in the sense that it allows us to split the
userspace-facing code, i.e. the driver, into a separate module.  This
in turn allows virtualization of SGX without having to load the driver
or building it in the first place, e.g. to virtualize SGX on a system
that doesn't meet the driver's requirements.

We could eliminate the abstraction by moving the EPC management code
into the driver, but that would directly conflict with past feedback
and would need to be completely undone to enable KVM.  The abstraction
could be dumbed down to a single function, but as mentioned earlier,
that comes with its own costs.  I can dive into exactly what we lose
with a single function approach if this is a sticking point.


Re: [PATCH v13 07/13] x86/sgx: Add data structures for tracking the EPC pages

2018-08-28 Thread Sean Christopherson
On Tue, Aug 28, 2018 at 09:53:11AM -0700, Dave Hansen wrote:
> >>> + sgx_nr_epc_banks++;
> >>> + }
> >>> +
> >>> + if (!sgx_nr_epc_banks) {
> >>> + pr_err("There are zero EPC banks.\n");
> >>> + return -ENODEV;
> >>> + }
> >>> +
> >>> + return 0;
> >>> +}
> >>
> >> Does this support hot-addition of a bank?  If not, why not?
> ...
> > I'm not aware that we would have an ACPI specification for SGX so this
> > is all I have at the moment (does not show any ACPI event for
> > hotplugging).
> 
> So you're saying the one platform you looked at don't support hotplug.
> I was looking for a more broad statement about SGX.

Hardware doesn't support hotplug of EPC as the EPC size and location
are locked during activation of SGX.  And IIRC, activation of SGX must
be synchronized across all CPUs in a multi-socket platform, e.g. you
can't late-enable SGX on a socket and do hotplugging that way.

In a virtualized environment there are no such restrictions.  I am not
aware of any explicit requirements or use cases for supporting hotplug
of EPC, but that's probably only because virtualization of SGX is
fairly nascent.


Re: [PATCH v13 09/13] x86/sgx: Enclave Page Cache (EPC) memory manager

2018-08-28 Thread Sean Christopherson
On Tue, Aug 28, 2018 at 07:07:33AM -0700, Dave Hansen wrote:
> On 08/28/2018 01:35 AM, Jarkko Sakkinen wrote:
> > On Mon, Aug 27, 2018 at 02:15:34PM -0700, Dave Hansen wrote:
> >> On 08/27/2018 11:53 AM, Jarkko Sakkinen wrote:
> >>> +struct sgx_epc_page_ops {
> >>> + bool (*get)(struct sgx_epc_page *epc_page);
> >>> + void (*put)(struct sgx_epc_page *epc_page);
> >>> + bool (*reclaim)(struct sgx_epc_page *epc_page);
> >>> + void (*block)(struct sgx_epc_page *epc_page);
> >>> + void (*write)(struct sgx_epc_page *epc_page);
> >>> +};
> >> Why do we need a fancy, slow (retpoline'd) set of function pointers when
> >> we only have one user of these (the SGX driver)?
> > KVM has its own implementation for these operations.
> 
> That belongs in the changelog.
> 
> Also, where is the implementation?  How can we assess this code that was
> built to create an abstraction without both of the users?

I can provide an early preview of the KVM reclaim code, but honestly
I think that would do more harm than good.  The VMX architecture for
EPC reclaim is complex, even for SGX standards.  Opening that can of
worms would likely derail this discussion.  That being said, this
abstraction isn't exactly what KVM will need, but it's pretty close
and gives us something to build on.

Regardless, this layer of indirection is justifiable even with a
single implementation.  Reclaiming an EPC page is not a simple matter
of copying the data somewhere else and marking the page not present.
Actual eviction requires a reference to the Secure Enclave Control
Structure (SECS) of the enclave that owns the page.  Software needs
to ensure it doesn't violate the hardware-enforced access rules, e.g.
most reclaim activities need to be done under a per-enclave lock.
Not all pages have the same reclaim rules, e.g. an SECS can only
be reclaimed after all its child pages have been reclaimed, a VA page
doesn't need to be blocked, etc...  And the list goes on...

All of the tracking metadata and logic about what can be evicted when
resides in the driver component[1], e.g. the core SGX code doesn't
even have a software representation of an enclave.  Alternatively the
driver could provide a single "do_reclaim" function if we wanted to
avoid a fancy abstraction layer, and in fact that's how I initially
implemented the abstraction, but that approach has its own warts and
in a lot of ways makes the end result more complex.

Ultimately, moving pages in/out of the EPC is so abysmally slow that
the raw performance of software is almost completely irrelevant.  The
algorithms certainly matter, but optimizing away things like function
pointers definitely takes a back seat to just about everything else.

[1] Early versions of SGX support tried to put all EPC management in
the driver, but that approach caused major problems for KVM (even
without reclaim support) and received a less than enthusiastic
response from the community.


Re: SEV guest regression in 4.18

2018-08-22 Thread Sean Christopherson
On Wed, Aug 22, 2018 at 10:14:17AM +0200, Borislav Petkov wrote:
> Dropping Pavel as it bounces.
> 
> On Tue, Aug 21, 2018 at 11:07:38AM -0500, Brijesh Singh wrote:
> > The tsc_early_init() is called before setup_arch() -> init_mem_mapping.
> 
> Ok, I see it, thanks for explaining.
> 
> So back to your original ideas - I'm wondering whether we should define
> a chunk of memory which the hypervisor and guest can share and thus
> communicate over... Something ala SEV-ES also with strictly defined
> layout and put all those variables there. And then the guest can map
> decrypted.

What about creating a data section specifically for shared memory?
The section would be PMD aligned and sized so that it could be mapped
appropriately without having to fracture the page.  Then define a
macro to easily declare data in the new section, a la __read_mostly.
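
A rough sketch of the idea; the ".data..shared" section, the __shared
macro and the linker symbols below are nothing but placeholders:

	/* vmlinux.lds.S: PMD-aligned and PMD-padded so the whole section
	 * can be mapped with C=0 without splitting a large page.
	 */
	. = ALIGN(PMD_SIZE);
	__start_data_shared = .;
	*(.data..shared)
	. = ALIGN(PMD_SIZE);
	__end_data_shared = .;

	/* mem_encrypt.h: tag a variable as guest/hypervisor shared. */
	#define __shared __attribute__((__section__(".data..shared")))

	/* usage, a la __read_mostly */
	static struct pvclock_wall_clock wall_clock __shared;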
 
> There might be something similar though, I dunno.
> 
> Maybe Paolo has a better idea...
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 
> 21284 (AG Nürnberg)
> -- 


Re: [PATCH v2 2/3] x86/mm: add .data..decrypted section to hold shared variables

2018-08-29 Thread Sean Christopherson
On Tue, Aug 28, 2018 at 05:12:56PM -0500, Brijesh Singh wrote:
> kvmclock defines few static variables which are shared with hypervisor
> during the kvmclock initialization.
> 
> When SEV is active, memory is encrypted with a guest-specific key, and
> if guest OS wants to share the memory region with hypervisor then it must
> clear the C-bit before sharing it. Currently, we use
> kernel_physical_mapping_init() to split large pages before clearing the
> C-bit on shared pages. But the kernel_physical_mapping_init fails when
> called from the kvmclock initialization (mainly because memblock allocator
> was not ready).
> 
> The '__decrypted' can be used to define a shared variable; the variables
> will be put in the .data..decrypted section. This section is mapped with
> C=0 early in the boot, we also ensure that the initialized values are
> updated to match with C=0 (i.e perform an in-place decryption). The
> .data..decrypted section is PMD aligned and sized so that we avoid the
> need to split the large pages when mapping this section.

What about naming the attribute (and section) '__unencrypted' instead
of '__decrypted'?  The attribute should be a property describing how
the data must be accessed, it shouldn't imply anything regarding the
history of the data.  Decrypted implies that data was once encrypted,
whereas unencrypted simply states that the data is stored in plain
text.  All data that has been decrypted is also unencrypted, but the
reverse does not hold true.


Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing enclaves

2018-08-31 Thread Sean Christopherson
On Fri, Aug 31, 2018 at 03:17:03PM +0300, Jarkko Sakkinen wrote:
> On Wed, Aug 29, 2018 at 07:33:54AM +, Huang, Kai wrote:
> > [snip..]
> > 
> > > > >
> > > > > @@ -38,6 +39,18 @@ static LIST_HEAD(sgx_active_page_list);  static
> > > > > DEFINE_SPINLOCK(sgx_active_page_list_lock);
> > > > >  static struct task_struct *ksgxswapd_tsk;  static
> > > > > DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
> > > > > +static struct notifier_block sgx_pm_notifier; static u64
> > > > > +sgx_pm_cnt;
> > > > > +
> > > > > +/* The cache for the last known values of IA32_SGXLEPUBKEYHASHx
> > > > > +MSRs
> > > > > for each
> > > > > + * CPU. The entries are initialized when they are first used by
> > > > > sgx_einit().
> > > > > + */
> > > > > +struct sgx_lepubkeyhash {
> > > > > + u64 msrs[4];
> > > > > + u64 pm_cnt;
> > > >
> > > > May I ask why do we need pm_cnt here? In fact why do we need suspend
> > > > staff (namely, sgx_pm_cnt above, and related code in this patch) here
> > > > in this patch? From the patch commit message I don't see why we need
> > > > PM staff here. Please give comment why you need PM staff, or you may
> > > > consider to split the PM staff to another patch.
> > > 
> > > Refining the commit message probably makes more sense because without PM
> > > code sgx_einit() would be broken. The MSRs have been reset after waking 
> > > up.
> > > 
> > > Some kind of counter is required to keep track of the power cycle. When 
> > > going
> > > to sleep the sgx_pm_cnt is increased. sgx_einit() compares the current 
> > > value of
> > > the global count to the value in the cache entry to see whether we are in 
> > > a new
> > > power cycle.
> > 
> > You mean reset to Intel default? I think we can also just reset the
> > cached MSR values on each power cycle, which would be simpler, IMHO?
> 
> I don't really see that much difference in the complexity.

Tracking the validity of the cache means we're hosed if we miss any
condition that causes the MSRs to be reset.  I think we're better off
assuming the cache can be stale at any time, i.e. don't track power
cycles and instead handle EINIT failure due to INVALID_TOKEN by writing
the cache+MSRs with the desired hash and retrying EINIT.  EINIT is
interruptible and its latency is extremely variable in any case, e.g.
tens of thousands of cycles, so this rarely-hit "slow path" probably
wouldn't affect the worst case latency of EINIT.
 
> > I think we definitely need some code to handle S3-S5, but should be in
> > separate patches, since I think the major impact of S3-S5 is entire
> > EPC being destroyed. I think keeping pm_cnt is not sufficient enough
> > to handle such case?
> 
> The driver has SGX_POWER_LOST_ENCLAVE for ioctls and it deletes the TCS
> entries.
> 
> > > This brings up one question though: how do we deal with VM host going to 
> > > sleep?
> > > VM guest would not be aware of this.
> > 
> > IMO VM just gets "sudden loss of EPC" after suspend & resume in host.
> > SGX driver and SDK should be able to handle "sudden loss of EPC", ie,
> > co-working together to re-establish the missing enclaves.
> 
> This is not about EPC. It is already dealt by the driver. I'm concerned
> about the MSR cache as it would mess up.
> 
> But I guess this logic is part of the KVM code anyway now that I think
> more of it.

Ya, I expect that VMMs will preserve VM's virtual pubkey MSRs, though
there might be scenarios where a VM has direct access to the hardware
MSRs.  This is probably a moot point since I don't think we want to
assume the kernel's cache is 100% accurate, regardless of environment.


Re: [RFC][PATCH 5/5] [PATCH 5/5] kvm-ept-idle: enable module

2018-09-04 Thread Sean Christopherson
On Sat, Sep 01, 2018 at 07:28:23PM +0800, Fengguang Wu wrote:
> Signed-off-by: Fengguang Wu 
> ---
>  arch/x86/kvm/Kconfig  | 11 +++
>  arch/x86/kvm/Makefile |  4 
>  2 files changed, 15 insertions(+)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 1bbec387d289..4c6dec47fac6 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -96,6 +96,17 @@ config KVM_MMU_AUDIT
>This option adds a R/W kVM module parameter 'mmu_audit', which allows
>auditing of KVM MMU events at runtime.
>  
> +config KVM_EPT_IDLE
> + tristate "KVM EPT idle page tracking"

KVM_MMU_IDLE_INTEL might be a more user friendly name, I doubt that
all Kconfig users would immediately associate EPT with Intel's two
dimensional paging.  And it meshes nicely with KVM_MMU_IDLE as a base
name if we ever want to move common functionality to its own module,
as well as all of the other KVM_MMU_* nomenclature.

> + depends on KVM_INTEL
> + depends on PROC_PAGE_MONITOR
> + ---help---
> +   Provides support for walking EPT to get the A bits on Intel
> +   processors equipped with the VT extensions.
> +
> +   To compile this as a module, choose M here: the module
> +   will be called kvm-ept-idle.
> +
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
>  source drivers/vhost/Kconfig
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index dc4f2fdf5e57..5cad0590205d 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -19,6 +19,10 @@ kvm-y  += x86.o mmu.o emulate.o 
> i8259.o irq.o lapic.o \
>  kvm-intel-y  += vmx.o pmu_intel.o
>  kvm-amd-y+= svm.o pmu_amd.o
>  
> +kvm-ept-idle-y   += ept_idle.o
> +
>  obj-$(CONFIG_KVM)+= kvm.o
>  obj-$(CONFIG_KVM_INTEL)  += kvm-intel.o
>  obj-$(CONFIG_KVM_AMD)+= kvm-amd.o
> +
> +obj-$(CONFIG_KVM_EPT_IDLE)   += kvm-ept-idle.o
> -- 
> 2.15.0
> 
> 
> 


Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing enclaves

2018-09-04 Thread Sean Christopherson
On Tue, Sep 04, 2018 at 06:30:21PM +0300, Jarkko Sakkinen wrote:
> On Tue, Sep 04, 2018 at 07:54:51AM -0700, Sean Christopherson wrote:
> > I don't see any value in trying to rule out specific causes of
> > INVALID_TOKEN, but we should only retry EINIT if ret==INVALID_TOKEN
> > and RDMSR(HASH0) != sgx_lepubkeyhash[0].  Only the first MSR needs to
> > be checked for validity as they're a package deal, i.e. they'll all be
> > valid or all be reset.  There shouldn't be a limit on retry attempts,
> > e.g. the MSRs could theoretically be reset between WRMSR and EINIT.
> 
> Why is doing rdmsrs necessary? With the INVALID_TOKEN error we know we
> are out-of-sync i.e. have been sleeping and then one just needs to do
> wrmsrs.

As Kai mentioned, INVALID_TOKEN is returned for other reasons, e.g. a
production enclave trying to use a debug token or reserved bits set in
the token.  And in the KVM case, the hash and token are provided by
the guest, so it's entirely possible the enclave/token is not signed
with the key specified in the hash.  RDMSR is relatively inexpensive
compared to the overall cost of EINIT.  Though of course EINIT failure
isn't exactly a fast path, so I'm ok if you want to opt for simplicity
and retry on INVALID_TOKEN without checking the MSRs, just make sure
to add a comment indicating we're intentionally not checking the MSRs.
 
> I think one retry should be enough given that VMM traps EINIT. One retry
> is needed to take care of the guest itself (or host if we are running on
> bare metal) having been in a sleep state.

Assuming we do RDMSR(hash0), that should be sufficient to prevent
infinite retry and it protects against the MSRs being lost between
WRMSR and EINIT during retry.  That being said, I'm ok retrying only
once, especially if you want to omit the RDMSR.  Disabling preemption
should prevent the kernel from suspending between WRMSR and EINIT,
I'm just being paranoid.
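
Roughly what that retry flow could look like (a sketch only; __einit(),
SGX_INVALID_EINITTOKEN and the MSR index are whatever names the series
ends up using):

static int sgx_einit(struct sgx_sigstruct *sigstruct,
		     struct sgx_einittoken *token, void *secs,
		     u64 lepubkeyhash[4])
{
	int i, ret;

	preempt_disable();
	ret = __einit(sigstruct, token, secs);
	if (ret == SGX_INVALID_EINITTOKEN) {
		/*
		 * Assume the cached hash is stale, e.g. the MSRs were
		 * reset by suspend; rewrite them from the cache and retry
		 * once.  Intentionally not re-reading the MSRs to check.
		 */
		for (i = 0; i < 4; i++)
			wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, lepubkeyhash[i]);
		ret = __einit(sigstruct, token, secs);
	}
	preempt_enable();
	return ret;
}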


Re: [PATCH v13 07/13] x86/sgx: Add data structures for tracking the EPC pages

2018-09-04 Thread Sean Christopherson
On Mon, Sep 03, 2018 at 05:41:53PM +0300, Andy Shevchenko wrote:
> On Mon, Aug 27, 2018 at 9:58 PM Jarkko Sakkinen
>  wrote:
> 
> > +   va = ioremap_cache(addr, size);
> > +   if (!va)
> > +   return -ENOMEM;
> 
> I'm not sure this is a right API. Do we operate with memory? Does it
> have I/O side effects?
> If no, memremap() would be better to use.

Preserving __iomem is desirable.  There aren't side effects per se,
but direct non-enclave accesses to the EPC get abort page semantics so
the kernel shouldn't be directly dereferencing a pointer to the EPC.
Though by that argument, sgx_epc_bank.va, sgx_epc_addr's return and
all ENCLS helpers should be tagged __iomem.

For documentation purposes, maybe it would be better to use __private
or "#define __sgx_epc __iomem" and use that?


Re: [PATCH v13 07/13] x86/sgx: Add data structures for tracking the EPC pages

2018-09-04 Thread Sean Christopherson
On Tue, Sep 04, 2018 at 09:01:15PM +0300, Andy Shevchenko wrote:
> On Tue, Sep 4, 2018, Sean Christopherson wrote:
> 
> > > > +   va = ioremap_cache(addr, size);
> > > > +   if (!va)
> > > > +   return -ENOMEM;
> > >
> > > I'm not sure this is a right API. Do we operate with memory? Does it
> > > have I/O side effects?
> > > If no, memremap() would be better to use.
> >
> > Preserving __iomem is desirable.  There aren't side effects per se,
> > but direct non-enclave accesses to the EPC get abort page semantics so
> > the kernel shouldn't be directly dereferencing a pointer to the EPC.
> > Though by that argument, sgx_epc_bank.va, sgx_epc_addr's return and
> > all ENCLS helpers should be tagged __iomem.
> 
> Why?
> Does it related to *any* I/O?

No, hence my other comment that __private or a new tag altogether may
be more appropriate.  The noderef attribute is what we truly care
about.


Re: [RFC][PATCH 2/5] [PATCH 2/5] proc: introduce /proc/PID/idle_bitmap

2018-09-04 Thread Sean Christopherson
On Sat, Sep 01, 2018 at 07:28:20PM +0800, Fengguang Wu wrote:
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index da3dbfa09e79..732a502acc27 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -305,6 +305,7 @@ extern const struct file_operations 
> proc_pid_smaps_rollup_operations;
>  extern const struct file_operations proc_tid_smaps_operations;
>  extern const struct file_operations proc_clear_refs_operations;
>  extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_mm_idle_operations;
>  
>  extern unsigned long task_vsize(struct mm_struct *);
>  extern unsigned long task_statm(struct mm_struct *,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index dfd73a4616ce..376406a9cf45 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1564,6 +1564,69 @@ const struct file_operations proc_pagemap_operations = 
> {
>   .open   = pagemap_open,
>   .release= pagemap_release,
>  };
> +
> +/* will be filled when kvm_ept_idle module loads */
> +struct file_operations proc_ept_idle_operations = {
> +};
> +EXPORT_SYMBOL_GPL(proc_ept_idle_operations);

Exposing EPT outside of VMX specific code is wrong, e.g. this should
be something like proc_kvm_idle_operations.  This is a common theme
for all of the patches.  Only the low level bits that are EPT specific
should be named as such, everything else should be encapsulated via
KVM or some other appropriate name. 

> +static ssize_t mm_idle_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct task_struct *task = file->private_data;
> + ssize_t ret = -ESRCH;

No need for @ret, just return the error directly at the end.  And
-ESRCH isn't appropriate for a task that exists but doesn't have an
associated KVM object.

> +
> + // TODO: implement mm_walk for normal tasks
> +
> + if (task_kvm(task)) {
> + if (proc_ept_idle_operations.read)
> + return proc_ept_idle_operations.read(file, buf, count, 
> ppos);
> + }

Condensing the task_kvm and ops check into a single if saves two lines
per instance, e.g.:

if (task_kvm(task) && proc_ept_idle_operations.read)
return proc_ept_idle_operations.read(file, buf, count, ppos);
> +
> + return ret;
> +}
> +
> +
> +static int mm_idle_open(struct inode *inode, struct file *file)
> +{
> + struct task_struct *task = get_proc_task(inode);
> +
> + if (!task)
> + return -ESRCH;
> +
> + file->private_data = task;
> +
> + if (task_kvm(task)) {
> + if (proc_ept_idle_operations.open)
> + return proc_ept_idle_operations.open(inode, file);
> + }
> +
> + return 0;
> +}
> +
> +static int mm_idle_release(struct inode *inode, struct file *file)
> +{
> + struct task_struct *task = file->private_data;
> +
> + if (!task)
> + return 0;
> +
> + if (task_kvm(task)) {
> + if (proc_ept_idle_operations.release)
> + return proc_ept_idle_operations.release(inode, file);
> + }
> +
> + put_task_struct(task);
> + return 0;
> +}
> +
> +const struct file_operations proc_mm_idle_operations = {
> + .llseek = mem_lseek, /* borrow this */
> + .read   = mm_idle_read,
> + .open   = mm_idle_open,
> + .release= mm_idle_release,
> +};
> +
>  #endif /* CONFIG_PROC_PAGE_MONITOR */
>  
>  #ifdef CONFIG_NUMA
> -- 
> 2.15.0
> 
> 
> 


Re: [PATCH] x86/pkeys: Explicitly treat PK #PF on kernel address as a bad area

2018-09-04 Thread Sean Christopherson
Cc-ing Jann and Andy.

On Tue, Aug 07, 2018 at 10:29:20AM -0700, Sean Christopherson wrote:
> Kernel addresses are always mapped with _PAGE_USER=0, i.e. PKRU
> isn't enforced, and so we should never see X86_PF_PK set on a
> kernel address fault.  WARN once to capture the issue in case we
> somehow don't die, e.g. the access originated in userspace.
> 
> Remove a similar check and its comment from spurious_fault_check().
> The intent of the comment (and later code[1]) was simply to document
> that spurious faults due to protection keys should be impossible, but
> that's irrelevant and a bit of a red herring since we should never
> get a protection keys fault on a kernel address regardless of the
> kernel's TLB flushing behavior.
> 
> [1] 
> http://lists-archives.com/linux-kernel/28407455-x86-pkeys-new-page-fault-error-code-bit-pf_pk.html
> 
> Signed-off-by: Sean Christopherson 
> Cc: Dave Hansen 
> ---
> There's no indication that this condition has ever been encountered.
> I came across the code in spurious_fault_check() and was confused as
> to why we would unconditionally treat a protection keys fault as
> spurious when the comment explicitly stated that such a case should
> be impossible.
> 
> Dave Hansen suggested adding a WARN_ON_ONCE in spurious_fault_check(),
> but it seemed more appropriate to freak out on any protection keys
> fault on a kernel address since that would imply a hardware issue or
> kernel bug.  I omitted a Suggested-by since this isn't necessarily
> what Dave had in mind.
> 
>  arch/x86/mm/fault.c | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 2aafa6ab6103..f19a55972136 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1040,12 +1040,6 @@ static int spurious_fault_check(unsigned long 
> error_code, pte_t *pte)
>  
>   if ((error_code & X86_PF_INSTR) && !pte_exec(*pte))
>   return 0;
> - /*
> -  * Note: We do not do lazy flushing on protection key
> -  * changes, so no spurious fault will ever set X86_PF_PK.
> -  */
> - if ((error_code & X86_PF_PK))
> - return 1;
>  
>   return 1;
>  }
> @@ -1241,6 +1235,14 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
>* protection error (error_code & 9) == 0.
>*/
>   if (unlikely(fault_in_kernel_space(address))) {
> + /*
> +  * We should never encounter a protection keys fault on a
> +  * kernel address as kernel addresses are always mapped with
> +  * _PAGE_USER=0, i.e. PKRU isn't enforced.
> +  */
> + if (WARN_ON_ONCE(error_code & X86_PF_PK))
> + goto bad_kernel_address;
> +
>   if (!(error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
>   if (vmalloc_fault(address) >= 0)
>   return;
> @@ -1253,6 +1255,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
>   /* kprobes don't want to hook the spurious faults: */
>   if (kprobes_fault(regs))
>   return;
> +
> +bad_kernel_address:
>   /*
>* Don't take the mm semaphore here. If we fixup a prefetch
>* fault we could otherwise deadlock:
> -- 
> 2.18.0
> 


Re: [PATCH v4 4/4] x86/kvm: use __decrypted attribute in shared variables

2018-09-04 Thread Sean Christopherson
On Mon, Sep 03, 2018 at 08:29:42PM -0500, Brijesh Singh wrote:
> Commit: 368a540e0232 (x86/kvmclock: Remove memblock dependency)
> caused SEV guest regression. When SEV is active, we map the shared
> variables (wall_clock and hv_clock_boot) with C=0 to ensure that both
> the guest and the hypervisor are able to access the data. To map the
> variables we use kernel_physical_mapping_init() to split the large pages,
> but splitting large pages requires allocating a new PMD, which fails now
> that kvmclock initialization is called early during boot.
> 
> Recently we added a special .data..decrypted section to hold the shared
> variables. This section is mapped with C=0 early during boot. Use
> __decrypted attribute to put the wall_clock and hv_clock_boot in
> .data..decrypted section so that they are mapped with C=0.
> 
> Signed-off-by: Brijesh Singh 
> Reviewed-by: Tom Lendacky 
> Fixes: 368a540e0232 ("x86/kvmclock: Remove memblock dependency")
> Cc: Tom Lendacky 
> Cc: k...@vger.kernel.org
> Cc: Thomas Gleixner 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: linux-kernel@vger.kernel.org
> Cc: Paolo Bonzini 
> Cc: Sean Christopherson 
> Cc: k...@vger.kernel.org
> Cc: "Radim Krčmář" 
> ---
>  arch/x86/kernel/kvmclock.c | 30 +-
>  1 file changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 1e67646..08f5f8a 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -61,8 +62,8 @@ early_param("no-kvmclock-vsyscall", 
> parse_no_kvmclock_vsyscall);
>   (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))
>  
>  static struct pvclock_vsyscall_time_info
> - hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
> -static struct pvclock_wall_clock wall_clock;
> + hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __decrypted 
> __aligned(PAGE_SIZE);
> +static struct pvclock_wall_clock wall_clock __decrypted;
>  static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
>  
>  static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
> @@ -267,10 +268,29 @@ static int kvmclock_setup_percpu(unsigned int cpu)
>   return 0;
>  
>   /* Use the static page for the first CPUs, allocate otherwise */
> - if (cpu < HVC_BOOT_ARRAY_SIZE)
> + if (cpu < HVC_BOOT_ARRAY_SIZE) {
>   p = &hv_clock_boot[cpu];
> - else
> - p = kzalloc(sizeof(*p), GFP_KERNEL);
> + } else {
> + int rc;
> + unsigned int sz = sizeof(*p);
> +
> + if (sev_active())
> + sz = PAGE_ALIGN(sz);

Hmm, again we're wasting a fairly sizable amount of memory since each
CPU is doing a separate 4k allocation.  What if we defined an auxiliary
array in __decrypted to be used for cpus > HVC_BOOT_ARRAY_SIZE when
SEV is active?  struct pvclock_vsyscall_time_info is 32 bytes so we
could handle the max of 8192 CPUs with 256kb of data (252kb if you
subtract the pre-existing 4k page), i.e. the SEV case wouldn't need
additional memory beyond the 2mb page that's reserved for __decrypted.
The non-SEV case could do free_kernel_image_pages() on the unused
array (which would need to be page sized) so it wouldn't waste memory.
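
A minimal sketch of that idea (hv_clock_aux is a made-up name, the
sizing/page-alignment details and the non-SEV freeing are hand-waved,
and the early-out checks of the quoted function are omitted):

static struct pvclock_vsyscall_time_info
	hv_clock_aux[NR_CPUS - HVC_BOOT_ARRAY_SIZE] __decrypted __aligned(PAGE_SIZE);

static int kvmclock_setup_percpu(unsigned int cpu)
{
	struct pvclock_vsyscall_time_info *p;

	if (cpu < HVC_BOOT_ARRAY_SIZE)
		p = &hv_clock_boot[cpu];
	else if (sev_active())
		/* Already mapped C=0, no runtime allocation needed. */
		p = &hv_clock_aux[cpu - HVC_BOOT_ARRAY_SIZE];
	else
		p = kzalloc(sizeof(*p), GFP_KERNEL);

	per_cpu(hv_clock_per_cpu, cpu) = p;
	return p ? 0 : -ENOMEM;
}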

> +
> + p = kzalloc(sz, GFP_KERNEL);
> +
> + /*
> +  * The physical address of per-cpu variable will be shared with
> +  * the hypervisor. Let's clear the C-bit before we assign the
> +  * memory to per_cpu variable.
> +  */
> + if (p && sev_active()) {
> + rc = set_memory_decrypted((unsigned long)p, sz >> 
> PAGE_SHIFT);
> + if (rc)

@p is being leaked if set_memory_decrypted() fails.

> + return rc;
> + memset(p, 0, sz);
> + }
> + }
>  
>   per_cpu(hv_clock_per_cpu, cpu) = p;
>   return p ? 0 : -ENOMEM;
> -- 
> 2.7.4
> 


Re: [PATCH v13 11/13] platform/x86: Intel SGX driver

2018-09-07 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 05:50:01PM -0700, Joe Perches wrote:
> On Thu, 2018-09-06 at 19:35 +0200, Miguel Ojeda wrote:
> > > Which one is right and why the kernel tree is polluted with C99-headers
> > > when they do not pass checkpatch.pl?
> 
> checkpatch ignores c99 headers since 2016.

Jarkko was referring to c99 comments for the SPDX license.  checkpatch
explicitly requires c-style comments for headers and assembly files as
dictated by Documentation/process/license-rules.rst.  

$ grep -r SPDX **/*.h | grep \/\/ | wc -l
665

$ grep -r SPDX **/*.S | grep \/\/ | wc -l
22

$ git show 9f3a89926d6df
commit 9f3a89926d6dfc30a4fd1bbcb92cc7b218d3786d
Author: Rob Herring 
Date:   Tue Apr 10 16:33:13 2018 -0700

checkpatch.pl: add SPDX license tag check

Add SPDX license tag check based on the rules defined in
Documentation/process/license-rules.rst.  To summarize, SPDX license
tags should be on the 1st line (or 2nd line in scripts) using the
appropriate comment style for the file type.

Link: http://lkml.kernel.org/r/20180202154026.15298-1-r...@kernel.org
Signed-off-by: Rob Herring 
Signed-off-by: Joe Perches 
Acked-by: Greg Kroah-Hartman 
Acked-by: Philippe Ombredanne 
Cc: Andy Whitcroft 
Cc: Joe Perches 
Cc: Thomas Gleixner 
Cc: Igor Stoppa 
Cc: Jonathan Corbet 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index b464a4c3f863..0f022b56f117 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2257,6 +2257,8 @@ sub process {

my $camelcase_file_seeded = 0;

+   my $checklicenseline = 1;
+
sanitise_line_reset();
my $line;
foreach my $rawline (@rawlines) {
@@ -2448,6 +2450,7 @@ sub process {
} else {
$check = $check_orig;
}
+   $checklicenseline = 1;
next;
}

@@ -2911,6 +2914,30 @@ sub process {
}
}

+# check for using SPDX license tag at beginning of files
+   if ($realline == $checklicenseline) {
+   if ($rawline =~ /^[ \+]\s*\#\!\s*\//) {
+   $checklicenseline = 2;
+   } elsif ($rawline =~ /^\+/) {
+   my $comment = "";
+   if ($realfile =~ /\.(h|s|S)$/) {
+   $comment = '/*';
+   } elsif ($realfile =~ /\.(c|dts|dtsi)$/) {
+   $comment = '//';
+   } elsif (($checklicenseline == 2) || $realfile 
=~ /\.(sh|pl|py|awk|tc)$/) {
+   $comment = '#';
+   } elsif ($realfile =~ /\.rst$/) {
+   $comment = '..';
+   }
+
+   if ($comment !~ /^$/ &&
+   $rawline !~ /^\+\Q$comment\E 
SPDX-License-Identifier: /) {
+   WARN("SPDX_LICENSE_TAG",
+"Missing or malformed 
SPDX-License-Identifier tag in line $checklicenseline\n" . $herecurr);
+   }
+   }
+   }
+
 # check we are in a valid source file if not then ignore this hunk
next if ($realfile !~ /\.(h|c|s|S|sh|dtsi|dts)$/);





Re: [PATCH v13 09/13] x86/sgx: Enclave Page Cache (EPC) memory manager

2018-09-04 Thread Sean Christopherson
On Tue, Sep 04, 2018 at 06:38:03PM +0300, Jarkko Sakkinen wrote:
> On Mon, Sep 03, 2018 at 10:02:16PM +0300, Andy Shevchenko wrote:
> > On Mon, Aug 27, 2018 at 9:58 PM Jarkko Sakkinen
> >  wrote:
> > >
> > > +   WARN(ret < 0, "sgx: cannot free page, reclaim in-progress");
> > > +   WARN(ret > 0, "sgx: EREMOVE returned %d (0x%x)", ret, ret);
> > 
> > I'm not sure (though it's easy to check) that you need sgx: prefix
> > here. WARN() might take pr_fmt() if defined.
> 
> Sean, you took care of this one. Was it so that WARN() does not respect
> pr_fmt?

Yep, WARN() doesn't respect pr_fmt.
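
I.e. the "sgx: " prefix has to be spelled out by hand in WARN() even
when pr_fmt() is defined for the file (quick illustration, not taken
from the patch):

#define pr_fmt(fmt) "sgx: " fmt

pr_err("cannot free page, reclaim in-progress\n");	/* "sgx: cannot ..." */
WARN(ret, "sgx: EREMOVE returned %d (0x%x)", ret, ret);	/* manual prefix */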


Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing enclaves

2018-09-04 Thread Sean Christopherson
On Mon, Sep 03, 2018 at 04:45:14PM -0700, Huang, Kai wrote:
> > -Original Message-
> > From: linux-sgx-ow...@vger.kernel.org [mailto:linux-sgx-
> > ow...@vger.kernel.org] On Behalf Of Jarkko Sakkinen
> > Sent: Tuesday, September 4, 2018 7:19 AM
> > To: Christopherson, Sean J 
> > Cc: Huang, Kai ; platform-driver-...@vger.kernel.org;
> > x...@kernel.org; nhor...@redhat.com; linux-kernel@vger.kernel.org;
> > t...@linutronix.de; suresh.b.sid...@intel.com; Ayoun, Serge
> > ; h...@zytor.com; npmccal...@redhat.com;
> > mi...@redhat.com; linux-...@vger.kernel.org; Hansen, Dave
> > 
> > Subject: Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing 
> > enclaves
> > 
> > On Fri, Aug 31, 2018 at 11:15:09AM -0700, Sean Christopherson wrote:
> > > On Fri, Aug 31, 2018 at 03:17:03PM +0300, Jarkko Sakkinen wrote:
> > > > On Wed, Aug 29, 2018 at 07:33:54AM +, Huang, Kai wrote:
> > > > > [snip..]
> > > > >
> > > > > > > >
> > > > > > > > @@ -38,6 +39,18 @@ static LIST_HEAD(sgx_active_page_list);
> > > > > > > > static DEFINE_SPINLOCK(sgx_active_page_list_lock);
> > > > > > > >  static struct task_struct *ksgxswapd_tsk;  static
> > > > > > > > DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
> > > > > > > > +static struct notifier_block sgx_pm_notifier; static u64
> > > > > > > > +sgx_pm_cnt;
> > > > > > > > +
> > > > > > > > +/* The cache for the last known values of
> > > > > > > > +IA32_SGXLEPUBKEYHASHx MSRs
> > > > > > > > for each
> > > > > > > > + * CPU. The entries are initialized when they are first
> > > > > > > > + used by
> > > > > > > > sgx_einit().
> > > > > > > > + */
> > > > > > > > +struct sgx_lepubkeyhash {
> > > > > > > > +   u64 msrs[4];
> > > > > > > > +   u64 pm_cnt;
> > > > > > >
> > > > > > > May I ask why do we need pm_cnt here? In fact why do we need
> > > > > > > suspend staff (namely, sgx_pm_cnt above, and related code in
> > > > > > > this patch) here in this patch? From the patch commit message
> > > > > > > I don't see why we need PM staff here. Please give comment why
> > > > > > > you need PM staff, or you may consider to split the PM staff to 
> > > > > > > another
> > patch.
> > > > > >
> > > > > > Refining the commit message probably makes more sense because
> > > > > > without PM code sgx_einit() would be broken. The MSRs have been 
> > > > > > reset
> > after waking up.
> > > > > >
> > > > > > Some kind of counter is required to keep track of the power
> > > > > > cycle. When going to sleep the sgx_pm_cnt is increased.
> > > > > > sgx_einit() compares the current value of the global count to
> > > > > > the value in the cache entry to see whether we are in a new power 
> > > > > > cycle.
> > > > >
> > > > > You mean reset to Intel default? I think we can also just reset
> > > > > the cached MSR values on each power cycle, which would be simpler,
> > IMHO?
> > > >
> > > > I don't really see that much difference in the complexity.
> > >
> > > Tracking the validity of the cache means we're hosed if we miss any
> > > condition that causes the MSRs to be reset.  I think we're better off
> > > assuming the cache can be stale at any time, i.e. don't track power
> > > cycles and instead handle EINIT failure due to INVALID_TOKEN by writing
> > > the cache+MSRs with the desired hash and retrying EINIT.  EINIT is
> > > interruptible and its latency is extremely variable in any case, e.g.
> > > tens of thousands of cycles, so this rarely-hit "slow path" probably
> > > wouldn't affect the worst case latency of EINIT.
> > 
> > Sounds a good refiniment. Pretty good solution to heal from host sleep on 
> > the
> > guest VM and then there is no need for driver changes.
> 
> To me either way should be OK, keeping MSR cache or retrying EINIT, since 
> EINIT should not be in performance critical path I think.
> 
> But INVALID_TOKEN is not only returned when MSRs are mismatched, so do you 
> plan to check to rule out other cases that cause INVALID_TOKEN before 
> retrying EINIT, or unconditionally  retry EINIT? And we should only retry 
> once?

I don't see any value in trying to rule out specific causes of
INVALID_TOKEN, but we should only retry EINIT if ret==INVALID_TOKEN
and RDMSR(HASH0) != sgx_lepubkeyhash[0].  Only the first MSR needs to
be checked for validity as they're a package deal, i.e. they'll all be
valid or all be reset.  There shouldn't be a limit on retry attempts,
e.g. the MSRs could theoretically be reset between WRMSR and EINIT.
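
In (pseudo)code, the retry would look something like the sketch below.  The
names here (__sgx_einit(), sgx_update_lepubkeyhash_msrs(), the INVALID_TOKEN
code and the HASH0 MSR define) are illustrative stand-ins for whatever the
driver actually uses, i.e. this only shows the shape of the loop, not a patch:

/*
 * Retry EINIT only when the failure is INVALID_TOKEN and the MSRs no
 * longer hold our hash, i.e. they were reset by a power transition.
 * No retry limit, since the MSRs could theoretically be reset between
 * the WRMSR and the EINIT.
 */
static int sgx_einit_with_retry(struct sgx_sigstruct *sigstruct,
                                struct sgx_einittoken *token,
                                struct sgx_epc_page *secs,
                                u64 lepubkeyhash[4])
{
        int ret;

        for (;;) {
                ret = __sgx_einit(sigstruct, token, secs, lepubkeyhash);
                if (ret != SGX_INVALID_EINITTOKEN)
                        return ret;

                /* The hash MSRs are a package deal; checking one suffices. */
                if (native_read_msr(MSR_IA32_SGXLEPUBKEYHASH0) == lepubkeyhash[0])
                        return ret;

                /* MSRs were reset; rewrite the cache+MSRs and try again. */
                sgx_update_lepubkeyhash_msrs(lepubkeyhash);
        }
}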

> 
> Thanks,
> -Kai
> > 
> > /Jarkko


Re: [PATCH v5 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-06 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 06:43:02AM -0500, Brijesh Singh wrote:
> Currently, the per-cpu pvclock data is allocated dynamically when
> cpu > HVC_BOOT_ARRAY_SIZE. The physical address of this variable is
> shared between the guest and the hypervisor hence it must be mapped as
> unencrypted (ie. C=0) when SEV is active.
> 
> When SEV is active, we will be wasting fairly sizeable amount of memory
> since each CPU will be doing a separate 4k allocation so that it can clear
> C-bit. Let's define few extra static page sized array of pvclock data.
> In the preparatory stage of CPU hotplug, use the element of this static
> array to avoid the dynamic allocation. This array will be put in
> the .data..decrypted section so that its mapped with C=0 during the boot.
> 
> In the non-SEV case, this static page will be unused and freed by
> free_decrypted_mem().
> 
> Signed-off-by: Brijesh Singh 
> Suggested-by: Sean Christopherson 
> Cc: Tom Lendacky 
> Cc: k...@vger.kernel.org
> Cc: Thomas Gleixner 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: linux-kernel@vger.kernel.org
> Cc: Paolo Bonzini 
> Cc: Sean Christopherson 
> Cc: k...@vger.kernel.org
> Cc: "Radim Krčmář" 
> ---
>  arch/x86/include/asm/mem_encrypt.h |  4 
>  arch/x86/kernel/kvmclock.c | 22 +++---
>  arch/x86/kernel/vmlinux.lds.S  |  3 +++
>  arch/x86/mm/init.c |  3 +++
>  arch/x86/mm/mem_encrypt.c  | 10 ++
>  5 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mem_encrypt.h 
> b/arch/x86/include/asm/mem_encrypt.h
> index 802b2eb..aa204af 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -48,11 +48,13 @@ int __init early_set_memory_encrypted(unsigned long 
> vaddr, unsigned long size);
>  
>  /* Architecture __weak replacement functions */
>  void __init mem_encrypt_init(void);
> +void __init free_decrypted_mem(void);
>  
>  bool sme_active(void);
>  bool sev_active(void);
>  
>  #define __decrypted __attribute__((__section__(".data..decrypted")))
> +#define __decrypted_hvclock 
> __attribute__((__section__(".data..decrypted_hvclock")))
>  
>  #else/* !CONFIG_AMD_MEM_ENCRYPT */
>  
> @@ -80,6 +82,7 @@ static inline int __init
>  early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 
> 0; }
>  
>  #define __decrypted
> +#define __decrypted_hvclock
>  
>  #endif   /* CONFIG_AMD_MEM_ENCRYPT */
>  
> @@ -93,6 +96,7 @@ early_set_memory_encrypted(unsigned long vaddr, unsigned 
> long size) { return 0;
>  #define __sme_pa_nodebug(x)  (__pa_nodebug(x) | sme_me_mask)
>  
>  extern char __start_data_decrypted[], __end_data_decrypted[];
> +extern char __start_data_decrypted_hvclock[];
>  
>  #endif   /* __ASSEMBLY__ */
>  
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 376fd3a..5b88773 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -65,6 +65,13 @@ static struct pvclock_vsyscall_time_info
>  static struct pvclock_wall_clock wall_clock __decrypted;
>  static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
>  
> +
> +/* This should cover upto 512 VCPUS (first 64 are covered by 
> hv_clock_boot[]). */
> +#define HVC_DECRYPTED_ARRAY_SIZE \
> + ((PAGE_SIZE * 7)  / sizeof(struct pvclock_vsyscall_time_info))

I think we can define the size relative to NR_CPUS rather than picking
an arbitrary number of pages, maybe with a BUILD_BUG_ON to make sure
the total size won't require a second 2MB page for __decrypted.

#define HVC_DECRYPTED_ARRAY_SIZE  \
PAGE_ALIGN((NR_CPUS - HVC_BOOT_ARRAY_SIZE) * \
   sizeof(struct pvclock_vsyscall_time_info))
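
For what it's worth, the BUILD_BUG_ON could be as simple as the check below,
dropped into kvmclock_init() (hv_clock_dec and wall_clock are the names from
this patch, SZ_2M comes from <linux/sizes.h>, and the 2MB figure assumes
.data..decrypted reserves a single large page):

        /* Everything placed in .data..decrypted must fit in its one 2MB page. */
        BUILD_BUG_ON(sizeof(hv_clock_boot) + sizeof(hv_clock_dec) +
                     sizeof(wall_clock) > SZ_2M);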

> +static struct pvclock_vsyscall_time_info
> + hv_clock_dec[HVC_DECRYPTED_ARRAY_SIZE] 
> __decrypted_hvclock;
> +
>  static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
>  {
>   return &this_cpu_read(hv_clock_per_cpu)->pvti;
> @@ -267,10 +274,19 @@ static int kvmclock_setup_percpu(unsigned int cpu)
>   return 0;
>  
>   /* Use the static page for the first CPUs, allocate otherwise */
> - if (cpu < HVC_BOOT_ARRAY_SIZE)
> + if (cpu < HVC_BOOT_ARRAY_SIZE) {
>   p = &hv_clock_boot[cpu];
> - else
> - p = kzalloc(sizeof(*p), GFP_KERNEL);
> + } else {
> + /*
> +  * When SEV is active, use the static pages from
> +  * .data..decrypted_hvclock section. The pages are already
> +  * mapped with C=0.
> +  */
> + if (

Re: [PATCH v5 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-06 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 02:24:23PM +0200, Borislav Petkov wrote:
> On Thu, Sep 06, 2018 at 06:43:02AM -0500, Brijesh Singh wrote:
> > Currently, the per-cpu pvclock data is allocated dynamically when
> > cpu > HVC_BOOT_ARRAY_SIZE. The physical address of this variable is
> > shared between the guest and the hypervisor hence it must be mapped as
> > unencrypted (ie. C=0) when SEV is active.
> > 
> > When SEV is active, we will be wasting fairly sizeable amount of memory
> > since each CPU will be doing a separate 4k allocation so that it can clear
> > C-bit. Let's define few extra static page sized array of pvclock data.
> > In the preparatory stage of CPU hotplug, use the element of this static
> > array to avoid the dynamic allocation. This array will be put in
> > the .data..decrypted section so that its mapped with C=0 during the boot.
> > 
> > In the non-SEV case, this static page will be unused and freed by
> > free_decrypted_mem().
> > 
> > Signed-off-by: Brijesh Singh 
> > Suggested-by: Sean Christopherson 
> > Cc: Tom Lendacky 
> > Cc: k...@vger.kernel.org
> > Cc: Thomas Gleixner 
> > Cc: Borislav Petkov 
> > Cc: "H. Peter Anvin" 
> > Cc: linux-kernel@vger.kernel.org
> > Cc: Paolo Bonzini 
> > Cc: Sean Christopherson 
> > Cc: k...@vger.kernel.org
> > Cc: "Radim Krčmář" 
> > ---
> >  arch/x86/include/asm/mem_encrypt.h |  4 
> >  arch/x86/kernel/kvmclock.c | 22 +++---
> >  arch/x86/kernel/vmlinux.lds.S  |  3 +++
> >  arch/x86/mm/init.c |  3 +++
> >  arch/x86/mm/mem_encrypt.c  | 10 ++
> >  5 files changed, 39 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/mem_encrypt.h 
> > b/arch/x86/include/asm/mem_encrypt.h
> > index 802b2eb..aa204af 100644
> > --- a/arch/x86/include/asm/mem_encrypt.h
> > +++ b/arch/x86/include/asm/mem_encrypt.h
> > @@ -48,11 +48,13 @@ int __init early_set_memory_encrypted(unsigned long 
> > vaddr, unsigned long size);
> >  
> >  /* Architecture __weak replacement functions */
> >  void __init mem_encrypt_init(void);
> > +void __init free_decrypted_mem(void);
> >  
> >  bool sme_active(void);
> >  bool sev_active(void);
> >  
> >  #define __decrypted __attribute__((__section__(".data..decrypted")))
> > +#define __decrypted_hvclock 
> > __attribute__((__section__(".data..decrypted_hvclock")))
> 
> So are we going to be defining a decrypted section for every piece of
> machinery now?
> 
> That's a bit too much in my book.
> 
> Why can't you simply free everything in .data..decrypted on !SVE guests?

That would prevent adding __decrypted to existing declarations, e.g.
hv_clock_boot, which would be ugly in its own right.  A more generic
solution would be to add something like __decrypted_exclusive to mark
data that is used if and only if SEV is active, and then free the
SEV-only data when SEV is disabled.

Originally, my thought was that this would be a one-off case and the
array could be freed directly in kvmclock_init(), e.g.:

static struct pvclock_vsyscall_time_info
hv_clock_aux[HVC_AUX_ARRAY_SIZE] __decrypted __aligned(PAGE_SIZE);

...

void __init kvmclock_init(void)
{
u8 flags;

if (!sev_active())
free_init_pages("unused decrypted",
(unsigned long)hv_clock_aux,
(unsigned long)hv_clock_aux + sizeof(hv_clock_aux));

> 
> -- 
> Regards/Gruss,
> Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 
> 21284 (AG Nürnberg)
> -- 


Re: [PATCH v5 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-06 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 06:50:41AM -0700, Sean Christopherson wrote:
> On Thu, Sep 06, 2018 at 02:24:23PM +0200, Borislav Petkov wrote:
> > On Thu, Sep 06, 2018 at 06:43:02AM -0500, Brijesh Singh wrote:
> > > Currently, the per-cpu pvclock data is allocated dynamically when
> > > cpu > HVC_BOOT_ARRAY_SIZE. The physical address of this variable is
> > > shared between the guest and the hypervisor hence it must be mapped as
> > > unencrypted (ie. C=0) when SEV is active.
> > > 
> > > When SEV is active, we will be wasting fairly sizeable amount of memory
> > > since each CPU will be doing a separate 4k allocation so that it can clear
> > > C-bit. Let's define few extra static page sized array of pvclock data.
> > > In the preparatory stage of CPU hotplug, use the element of this static
> > > array to avoid the dynamic allocation. This array will be put in
> > > the .data..decrypted section so that its mapped with C=0 during the boot.
> > > 
> > > In the non-SEV case, this static page will be unused and freed by
> > > free_decrypted_mem().
> > > 
> > > diff --git a/arch/x86/include/asm/mem_encrypt.h 
> > > b/arch/x86/include/asm/mem_encrypt.h
> > > index 802b2eb..aa204af 100644
> > > --- a/arch/x86/include/asm/mem_encrypt.h
> > > +++ b/arch/x86/include/asm/mem_encrypt.h
> > > @@ -48,11 +48,13 @@ int __init early_set_memory_encrypted(unsigned long 
> > > vaddr, unsigned long size);
> > >  
> > >  /* Architecture __weak replacement functions */
> > >  void __init mem_encrypt_init(void);
> > > +void __init free_decrypted_mem(void);
> > >  
> > >  bool sme_active(void);
> > >  bool sev_active(void);
> > >  
> > >  #define __decrypted __attribute__((__section__(".data..decrypted")))
> > > +#define __decrypted_hvclock 
> > > __attribute__((__section__(".data..decrypted_hvclock")))
> > 
> > So are we going to be defining a decrypted section for every piece of
> > machinery now?
> > 
> > That's a bit too much in my book.
> > 
> > Why can't you simply free everything in .data..decrypted on !SVE guests?
> 
> That would prevent adding __decrypted to existing declarations, e.g.
> hv_clock_boot, which would be ugly in its own right.  A more generic
> solution would be to add something like __decrypted_exclusive to mark
> data that is used if and only if SEV is active, and then free the
> SEV-only data when SEV is disabled.

Oh, and we'd need to make sure __decrypted_exclusive is freed when
!CONFIG_AMD_MEM_ENCRYPT, and preferably !sev_active() since the big
array is used only if SEV is active.  This patch unconditionally
defines hv_clock_dec but only frees it if CONFIG_AMD_MEM_ENCRYPT=y &&
!mem_encrypt_active().


Re: [PATCH v5 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-06 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 04:43:42PM +0200, Borislav Petkov wrote:
> On Thu, Sep 06, 2018 at 06:50:41AM -0700, Sean Christopherson wrote:
> > That would prevent adding __decrypted to existing declarations, e.g.
> > hv_clock_boot, which would be ugly in its own right.  A more generic
> > solution would be to add something like __decrypted_exclusive to mark
> 
> I still don't understand why can't there be only a single __decrypted
> section and to free that whole section on !SEV.

Wouldn't that result in @hv_clock_boot being incorrectly freed in the
!SEV case?


Re: [PATCH v5 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-06 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 05:19:38PM +0200, Borislav Petkov wrote:
> On Thu, Sep 06, 2018 at 07:56:40AM -0700, Sean Christopherson wrote:
> > Wouldn't that result in @hv_clock_boot being incorrectly freed in the
> > !SEV case?
> 
> Ok, maybe I'm missing something but why do we need 4K per CPU? Why can't
> we map all those pages which contain the clock variable, decrypted in
> all guests' page tables?
> 
> Basically
> 
> (NR_CPUS * sizeof(struct pvclock_vsyscall_time_info)) / 4096
> 
> pages.
> 
> For the !SEV case then nothing changes.

The 4k per CPU refers to the dynamic allocation in Brijesh's original
patch.   Currently, @hv_clock_boot is a single 4k page to limit the
amount of unused memory when 'nr_cpu_ids < NR_CPUS'.  In the SEV case,
dynamically allocating for 'cpu > HVC_BOOT_ARRAY_SIZE' one at a time
means that each CPU allocates a full 4k page to store a single 32-byte
variable.  My thought was that we could simply define a second array
for the SEV case to statically allocate for NR_CPUS since __decrypted
has a big chunk of memory that would be unused anyway[1].  And since
the second array is only used for SEV it can be freed if !SEV.

If we free the array explicitly then we don't need a second section or
attribute.  My comments about __decrypted_exclusive were that if we
did want to go with a second section/attribute, e.g. to have a generic
solution that can be used for other stuff, then we'd have more corner
cases to deal with.  I agree that simpler is better, i.e. I'd vote for
explicitly freeing the second array.  Apologies for not making that
clear from the get-go. 

[1] An alternative solution would be to batch the dynamic allocations,
but that would probably require locking and be more complex.
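
For completeness, [1] could look roughly like the sketch below: decrypt one
page at a time and hand out per-CPU slots from it, instead of burning a 4k
page per CPU.  Everything here (the function name, the locking choice) is
illustrative rather than a proposal; a mutex is usable only because the CPU
hotplug prepare stage already sleeps (it does kzalloc(GFP_KERNEL) today):

static struct pvclock_vsyscall_time_info *hvclock_alloc_decrypted(void)
{
        static DEFINE_MUTEX(lock);
        static struct pvclock_vsyscall_time_info *batch;
        static unsigned int used;
        struct pvclock_vsyscall_time_info *p = NULL;

        mutex_lock(&lock);
        if (!batch || used == PAGE_SIZE / sizeof(*batch)) {
                batch = (void *)get_zeroed_page(GFP_KERNEL);
                if (batch && set_memory_decrypted((unsigned long)batch, 1)) {
                        free_page((unsigned long)batch);
                        batch = NULL;
                }
                if (batch)      /* contents are garbage once the C-bit is flipped */
                        memset(batch, 0, PAGE_SIZE);
                used = 0;
        }
        if (batch)
                p = &batch[used++];
        mutex_unlock(&lock);
        return p;
}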


Re: [PATCH v5 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-06 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 03:20:46PM -0500, Brijesh Singh wrote:
> 
> 
> On 09/06/2018 02:47 PM, Sean Christopherson wrote:
> ...
> 
> >>
> >>Yes, the auxiliary array will be dumped into the regular .bss when
> >>CONFIG_AMD_MEM_ENCRYPT=n. Typically it will be a few k; I am not
> >>sure if it's worth complicating the code to save that extra memory.
> >>Most of the distros have CONFIG_AMD_MEM_ENCRYPT=y anyway.
> >
> >I just realized that we'll try to create a bogus array if 'NR_CPUS <=
> >HVC_BOOT_ARRAY_SIZE'.  A bit ugly, but we could #ifdef away both that
> >and CONFIG_AMD_MEM_ENCRYPT=n in a single shot, e.g.:
> >
> >#if defined(CONFIG_AMD_MEM_ENCRYPT) && NR_CPUS > HVC_BOOT_ARRAY_SIZE
> >#define HVC_AUX_ARRAY_SIZE  \
> > PAGE_ALIGN((NR_CPUS - HVC_BOOT_ARRAY_SIZE) * \
> >sizeof(struct pvclock_vsyscall_time_info))
> >static struct pvclock_vsyscall_time_info
> > hv_clock_aux[HVC_AUX_ARRAY_SIZE] __decrypted __aligned(PAGE_SIZE);
> >#endif
> >
> 
> The HVC_BOOT_ARRAY_SIZE macro uses sizeof(..) and to my understanding
> the sizeof operators are not allowed in '#if'. Anyway, I will try to see
> if it can be used, if not then I will stick to CONFIG_AMD_MEM_ENCRYPT
> check.

Hmm, we'll need something otherwise 'NR_CPUS - HVC_BOOT_ARRAY_SIZE'
will wrap and cause build errors.
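
One way around that is to keep the sizeof() arithmetic at the C level, where
it's legal, and only use the preprocessor for the config gate, e.g. (again
just a sketch, HVC_AUX_NR_CPUS is a made-up name):

#ifdef CONFIG_AMD_MEM_ENCRYPT
/* sizeof() can't be evaluated in #if, but it's fine in an array dimension. */
#define HVC_AUX_NR_CPUS \
        (NR_CPUS > HVC_BOOT_ARRAY_SIZE ? NR_CPUS - HVC_BOOT_ARRAY_SIZE : 1)

static struct pvclock_vsyscall_time_info
        hv_clock_aux[HVC_AUX_NR_CPUS] __decrypted __aligned(PAGE_SIZE);
#endif

The one-element array in the NR_CPUS <= HVC_BOOT_ARRAY_SIZE case wastes a few
bytes, but it keeps the subtraction from wrapping without a second #if.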


Re: [PATCH v5 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-06 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 02:24:32PM -0500, Brijesh Singh wrote:
> 
> 
> On 09/06/2018 01:47 PM, Sean Christopherson wrote:
> >On Thu, Sep 06, 2018 at 01:37:50PM -0500, Brijesh Singh wrote:
> >>
> >>
> >>On 09/06/2018 09:18 AM, Sean Christopherson wrote:
> >>
> >>
> >>>>>
> >>>>>So are we going to be defining a decrypted section for every piece of
> >>>>>machinery now?
> >>>>>
> >>>>>That's a bit too much in my book.
> >>>>>
> >>>>>Why can't you simply free everything in .data..decrypted on !SVE guests?
> >>>>
> >>>>That would prevent adding __decrypted to existing declarations, e.g.
> >>>>hv_clock_boot, which would be ugly in its own right.  A more generic
> >>>>solution would be to add something like __decrypted_exclusive to mark
> >>>>data that is used if and only if SEV is active, and then free the
> >>>>SEV-only data when SEV is disabled.
> >>>
> >>>Oh, and we'd need to make sure __decrypted_exclusive is freed when
> >>>!CONFIG_AMD_MEM_ENCRYPT, and preferably !sev_active() since the big
> >>>array is used only if SEV is active.  This patch unconditionally
> >>>defines hv_clock_dec but only frees it if CONFIG_AMD_MEM_ENCRYPT=y &&
> >>>!mem_encrypt_active().
> >>>
> >>
> >>Again we have to consider the bare metal scenario while doing this. The
> >>aux array you proposed will be added to the decrypted section only when
> >>CONFIG_AMD_MEM_ENCRYPT=y.  If CONFIG_AMD_MEM_ENCRYPT=n then nothing
> >>gets put in the .data..decrypted section. At runtime, if memory
> >>encryption is active then .data..decrypted_hvclock will contain useful
> >>data.
> >>
> >>The __decrypted attribute expands to "" when CONFIG_AMD_MEM_ENCRYPT=n.
> >
> >Right, but won't the data get dumped into the regular .bss in that
> >case, i.e. needs to be freed?
> >
> 
> 
> Yes, the auxiliary array will be dumped into the regular .bss when
> CONFIG_AMD_MEM_ENCRYPT=n. Typically it will be a few k; I am not
> sure if it's worth complicating the code to save that extra memory.
> Most of the distros have CONFIG_AMD_MEM_ENCRYPT=y anyway.

I just realized that we'll try to create a bogus array if 'NR_CPUS <=
HVC_BOOT_ARRAY_SIZE'.  A bit ugly, but we could #ifdef away both that
and CONFIG_AMD_MEM_ENCRYPT=n in a single shot, e.g.:

#if defined(CONFIG_AMD_MEM_ENCRYPT) && NR_CPUS > HVC_BOOT_ARRAY_SIZE
#define HVC_AUX_ARRAY_SIZE  \
PAGE_ALIGN((NR_CPUS - HVC_BOOT_ARRAY_SIZE) * \
   sizeof(struct pvclock_vsyscall_time_info))
static struct pvclock_vsyscall_time_info
hv_clock_aux[HVC_AUX_ARRAY_SIZE] __decrypted __aligned(PAGE_SIZE);
#endif



Re: [PATCH v5 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-06 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 01:37:50PM -0500, Brijesh Singh wrote:
> 
> 
> On 09/06/2018 09:18 AM, Sean Christopherson wrote:
> 
> 
> >>>
> >>>So are we going to be defining a decrypted section for every piece of
> >>>machinery now?
> >>>
> >>>That's a bit too much in my book.
> >>>
> >>>Why can't you simply free everything in .data..decrypted on !SVE guests?
> >>
> >>That would prevent adding __decrypted to existing declarations, e.g.
> >>hv_clock_boot, which would be ugly in its own right.  A more generic
> >>solution would be to add something like __decrypted_exclusive to mark
> >>data that is used if and only if SEV is active, and then free the
> >>SEV-only data when SEV is disabled.
> >
> >Oh, and we'd need to make sure __decrypted_exclusive is freed when
> >!CONFIG_AMD_MEM_ENCRYPT, and preferably !sev_active() since the big
> >array is used only if SEV is active.  This patch unconditionally
> >defines hv_clock_dec but only frees it if CONFIG_AMD_MEM_ENCRYPT=y &&
> >!mem_encrypt_active().
> >
> 
> Again we have to consider the bare metal scenario while doing this. The
> aux array you proposed will be added to the decrypted section only when
> CONFIG_AMD_MEM_ENCRYPT=y.  If CONFIG_AMD_MEM_ENCRYPT=n then nothing
> gets put in the .data..decrypted section. At runtime, if memory
> encryption is active then .data..decrypted_hvclock will contain useful
> data.
> 
> The __decrypted attribute expands to "" when CONFIG_AMD_MEM_ENCRYPT=n.

Right, but won't the data get dumped into the regular .bss in that
case, i.e. needs to be freed?


Re: [PATCH v5 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-06 Thread Sean Christopherson
On Thu, Sep 06, 2018 at 08:33:34PM +0200, Borislav Petkov wrote:
> On Thu, Sep 06, 2018 at 08:54:52AM -0700, Sean Christopherson wrote:
> > My thought was that we could simply define a second array for the SEV
> > case to statically allocate for NR_CPUS since __decrypted has a big
> > chunk of memory that would be ununsed anyways[1]. And since the second
> > array is only used for SEV it can be freed if !SEV.
> 
> Lemme see if I get it straight:
> 
> __decrypted:
> 
>  4K
> 
> __decrypted_XXX:
> 
>  ((num_possible_cpus() * 32) / 4K) pages
> 
> __decrypted_end:
> 
> Am I close?

Yep, though because the 4k chunk in __decrypted is @hv_clock_boot 
that's used for cpus 0-127, __decrypted_XXX would effectively be:

   (((num_possible_cpus() * 32) / 4k) - 1) pages


Re: [PATCH v6 4/5] x86/kvm: use __decrypted attribute in shared variables

2018-09-10 Thread Sean Christopherson
On Mon, 2018-09-10 at 14:04 +0200, Borislav Petkov wrote:
> On Fri, Sep 07, 2018 at 12:57:29PM -0500, Brijesh Singh wrote:
> > 
> > Commit: 368a540e0232 (x86/kvmclock: Remove memblock dependency)
> > caused SEV guest regression.
> When mentioning a commit in the commit message, put it on a separate
> line, like this:
> 
> "Commit
> 
>   368a540e0232 (x86/kvmclock: Remove memblock dependency)
> 
> caused a SEV guest regression."

Heh, that was the original formatting until I asked him to change it:
https://lkml.org/lkml/2018/8/29/809.  Checkpatch throws an error if
there's a newline after 'Commit'.  Though looking at this again, there
shouldn't be a colon after 'Commit' and there should be quotes inside
the parentheses, e.g. this satisfies checkpatch:

Commit 368a540e0232 ("x86/kvmclock: Remove memblock dependency")


Re: [PATCH v6 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-10 Thread Sean Christopherson
On Mon, 2018-09-10 at 08:15 -0500, Brijesh Singh wrote:
> 
> On 9/10/18 7:27 AM, Borislav Petkov wrote:
> > 
> > On Fri, Sep 07, 2018 at 12:57:30PM -0500, Brijesh Singh wrote:
> > > 
> > > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > > index 376fd3a..6086b56 100644
> > > --- a/arch/x86/kernel/kvmclock.c
> > > +++ b/arch/x86/kernel/kvmclock.c
> > > @@ -65,6 +65,15 @@ static struct pvclock_vsyscall_time_info
> > >  static struct pvclock_wall_clock wall_clock __decrypted;
> > >  static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, 
> > > hv_clock_per_cpu);
> > >  
> > > +#ifdef CONFIG_AMD_MEM_ENCRYPT
> > > +/*
> > > + * The auxiliary array will be used when SEV is active. In non-SEV case,
> > > + * it will be freed by free_decrypted_mem().
> > > + */
> > > +static struct pvclock_vsyscall_time_info
> > > + hv_clock_aux[NR_CPUS] __decrypted_aux;
> > Hmm, so worst case that's 64 4K pages:
> > 
> > (8192*32)/4096 = 64 4K pages.
> > We can minimize the worst case memory usage. The number of VCPUs
> > supported by KVM may be less than NR_CPUS, e.g. currently KVM_MAX_VCPUS is
> > set to 288.

KVM_MAX_VCPUS is a property of the host, whereas this code runs in the
guest, e.g. KVM_MAX_VCPUS could be 2048 in the host for all we know.

> (288 * 64)/4096 = 4 4K pages.
> 
> (pvclock_vsyscall_time_info is cache aligned so it will be 64 bytes)

Ah, I was wondering why my calculations were always different than
yours.  I was looking at struct pvclock_vcpu_time_info, which is 32
bytes.

> #if NR_CPUS > KVM_MAX_VCPUS
> #define HV_AUX_ARRAY_SIZE  KVM_MAX_VCPUS
> #else
> #define HV_AUX_ARRAY_SIZE NR_CPUS
> #endif
> 
> static struct pvclock_vsyscall_time_info
>                         hv_clock_aux[HV_AUX_ARRAY_SIZE] __decrypted_aux;



Re: [PATCH v6 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-10 Thread Sean Christopherson
On Mon, 2018-09-10 at 17:53 +0200, Borislav Petkov wrote:
> On Mon, Sep 10, 2018 at 08:15:38AM -0500, Brijesh Singh wrote:
> > 
> > > 
> > > Now, the real question from all this SNAFU is, why can't all those point
> > > to a single struct pvclock_vsyscall_time_info and all CPUs read a single
> > > thing? Why do they have to be per-CPU and thus waste so much memory?
> You forgot to answer to the real question - why do we need those things
> to be perCPU and why can't we use a single instance to share with *all*
> CPUs?

I can't speak to the actual TSC stuff, but...

The pvclock ABI includes a per-vCPU bit, PVCLOCK_GUEST_STOPPED, to indicate
that the VM has been paused by the host.  The guest uses this information to
update its watchdogs to avoid false positives.  I have no idea if there are
use cases for setting STOPPED on a subset of vCPUs, but the ABI allows it so
here we are...
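
For illustration, the per-vCPU nature shows up in how the guest consumes the
flag: each CPU checks and clears the bit in its own pvclock struct, roughly
like the sketch below (modelled on the kvm_check_and_clear_guest_paused()
idea, not copied from any actual implementation):

bool pvclock_vcpu_was_stopped(struct pvclock_vcpu_time_info *src)
{
        /* The STOPPED bit lives in this vCPU's own pvclock structure. */
        if (!(src->flags & PVCLOCK_GUEST_STOPPED))
                return false;

        /* Clear it so the watchdogs don't treat the host pause as a lockup. */
        src->flags &= ~PVCLOCK_GUEST_STOPPED;
        return true;
}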


Re: [PATCH 1/13] KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops

2018-09-10 Thread Sean Christopherson
On Mon, 2018-09-10 at 08:38 +, Tianyu Lan wrote:
> Add flush range call back in the kvm_x86_ops and platform can use it
> to register its associated function. The parameter "kvm_tlb_range"
> accepts a single range and flush list which contains a list of ranges.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  arch/x86/include/asm/kvm_host.h | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e12916e7c2fb..dcdf8cc16388 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -402,6 +402,12 @@ struct kvm_mmu {
>   u64 pdptrs[4]; /* pae */
>  };
>  
> +struct kvm_tlb_range {
> + u64 start_gfn;
> + u64 end_gfn;

IMO this struct and all functions should pass around the number of pages
instead of end_gfn to avoid confusion as to whether end_gfn is inclusive
or exclusive (rough sketch at the end of this message).

> + struct list_head *flush_list;
> +};
> +
>  enum pmc_type {
>   KVM_PMC_GP = 0,
>   KVM_PMC_FIXED,
> @@ -991,6 +997,8 @@ struct kvm_x86_ops {
>  
>   void (*tlb_flush)(struct kvm_vcpu *vcpu, bool invalidate_gpa);
>   int  (*tlb_remote_flush)(struct kvm *kvm);
> + int  (*tlb_remote_flush_with_range)(struct kvm *kvm,
> + struct kvm_tlb_range *range);
>  
>   /*
>    * Flush any TLB entries associated with the given GVA.
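
Something like this, purely to show what I mean by passing the number of
pages (sketch, not a patch):

struct kvm_tlb_range {
        u64 start_gfn;
        u64 pages;      /* number of pages, no inclusive/exclusive ambiguity */
        struct list_head *flush_list;
};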


Re: [PATCH v6 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

2018-09-10 Thread Sean Christopherson
On Mon, 2018-09-10 at 10:10 -0500, Brijesh Singh wrote:
> 
> On 09/10/2018 08:29 AM, Sean Christopherson wrote:
> ...
> 
> > > > > + */
> > > > > +static struct pvclock_vsyscall_time_info
> > > > > + hv_clock_aux[NR_CPUS] __decrypted_aux;
> > > > Hmm, so worst case that's 64 4K pages:
> > > > 
> > > > (8192*32)/4096 = 64 4K pages.
> > > We can minimize the worst case memory usage. The number of VCPUs
> > > supported by KVM may be less than NR_CPUS, e.g. currently KVM_MAX_VCPUS is
> > > set to 288.
> > KVM_MAX_VCPUS is a property of the host, whereas this code runs in the
> > guest, e.g. KVM_MAX_VCPUS could be 2048 in the host for all we know.
> > 
> 
> IIRC, during guest creation time qemu will check the host supported
> VCPUS count. If count is greater than KVM_MAX_VCPUS then it will
> fail to launch guest (or fail to hot plug vcpus). In other words, the
> number of vcpus in a KVM guest will never be > KVM_MAX_VCPUS.
> 
> Am I missing something ?

KVM_MAX_VCPUS is a definition for use in the *host*, it's even defined
in kvm_host.h.  The guest's pvclock code won't get magically recompiled
if KVM_MAX_VCPUS is changed in the host.  KVM_MAX_VCPUS is an arbitrary
value in the sense that there isn't a fundamental hard limit, i.e. the
value can be changed, either for a custom KVM build or in mainline,
e.g. it was bumped in 2016:

commit 682f732ecf7396e9d6fe24d44738966699fae6c0
Author: Radim Krčmář 
Date:   Tue Jul 12 22:09:29 2016 +0200

KVM: x86: bump MAX_VCPUS to 288

288 is in high demand because of Knights Landing CPU.
We cannot set the limit to 640k, because that would be wasting space.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Radim Krčmář 
Signed-off-by: Paolo Bonzini 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 074b5c760327..21a40dc7aad6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -34,7 +34,7 @@
 #include 
 #include 
 
-#define KVM_MAX_VCPUS 255
+#define KVM_MAX_VCPUS 288
 #define KVM_SOFT_MAX_VCPUS 240
 #define KVM_USER_MEM_SLOTS 509
 /* memory slots that are not exposed to userspace */


Re: [RFC][PATCH 5/8] x86/mm: fix exception table comments

2018-09-10 Thread Sean Christopherson
On Fri, 2018-09-07 at 14:51 -0700, Dave Hansen wrote:
> > 
> > > 
> > > +  * Only do the expensive exception table search when we might be at
> > > +  * risk of a deadlock:
> > > +  * 1. We failed to acquire mmap_sem, and
> > > +  * 2. The access was an explicit kernel-mode access
> > > +  *(X86_PF_USER=0).
> > Might be worth reminding the reader that X86_PF_USER will be set in
> > sw_error_code for implicit accesses.  I saw "explicit" and my mind
> > immediately jumped to hw_error_code for whatever reason.  E.g.:
> > 
> > * 2. The access was an explicit kernel-mode access (we set X86_PF_USER
> > *    in sw_error_code for implicit kernel-mode accesses).
> Yeah, that was not worded well.  Is this better?
> 
> > 
> >  * Only do the expensive exception table search when we might be at
> >  * risk of a deadlock:
> >  * 1. We failed to acquire mmap_sem, and
> >  * 2. The access was an explicit kernel-mode access.  An access
> >  *from user-mode will X86_PF_USER=1 set via hw_error_code or
> >  *set in sw_error_code if it were an implicit kernel-mode
> >  *access that originated in user mode.

For me, mentioning hw_error_code just muddies the waters, e.g. why is
hw_error_code mentioned when it's not checked in the code?  Comments
alone won't help someone that's reading this code and doesn't understand
that hardware sets X86_PF_USER for user-mode accesses.  Maybe this?

 * 2. The access was an explicit kernel-mode access.  X86_PF_USER
 *is set in sw_error_code for both user-mode accesses and
 *implicit kernel-mode accesses that originated in user mode.


Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken

2018-08-29 Thread Sean Christopherson
On Wed, Aug 29, 2018 at 07:36:22PM +, Nadav Amit wrote:
> at 10:11 AM, Nadav Amit  wrote:
> 
> > at 1:59 AM, Masami Hiramatsu  wrote:
> > 
> >> On Wed, 29 Aug 2018 01:11:42 -0700
> >> Nadav Amit  wrote:
> >> 
> >>> Use lockdep to ensure that text_mutex is taken when text_poke() is
> >>> called.
> >>> 
> >>> Actually it is not always taken, specifically when it is called by kgdb,
> >>> so take the lock in these cases.
> >> 
> >> Can we really take a mutex in kgdb context?
> >> 
> >> kgdb_arch_remove_breakpoint
> >> <- dbg_deactivate_sw_breakpoints
> >>   <- kgdb_reenter_check
> >>  <- kgdb_handle_exception
> >> <- __kgdb_notify
> >>   <- kgdb_ll_trap
> >> <- do_int3
> >>   <- kgdb_notify
> >> <- die notifier
> >> 
> >> kgdb_arch_set_breakpoint
> >> <- dbg_activate_sw_breakpoints
> >>   <- kgdb_reenter_check
> >>  <- kgdb_handle_exception
> >>  ...
> >> 
> >> Both seems called in exception context, so we can not take a mutex lock.
> >> I think kgdb needs a special path.
> > 
> > You are correct, but I don’t want a special path. Presumably text_mutex is
> > guaranteed not to be taken according to the code.
> > 
> > So I guess the only concern is lockdep. Do you see any problem if I change
> > mutex_lock() into mutex_trylock()? It should always succeed, and I can add a
> > warning and a failure path if it fails for some reason.
> 
> Err.. This will not work. I think I will drop this patch, since I cannot
> find a proper yet simple assertion. Creating special path just for the
> assertion seems wrong.

It's probably worth expanding the comment for text_poke() to call out
the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose
code and comments make it explicitly clear why it's safe for them to
call text_poke() without acquiring the lock.  Might prevent someone
from going down this path again in the future.
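
E.g. the expanded comment could read something like this (wording is only a
suggestion):

/*
 * Note: text_mutex is normally held by callers, but kgdb calls text_poke()
 * from exception context via kgdb_arch_{set,remove}_breakpoint().  That is
 * safe because all other CPUs are stopped at that point, which is also why
 * a bare lockdep assertion on text_mutex can't simply be added here.
 */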


Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing enclaves

2018-08-29 Thread Sean Christopherson
On Wed, Aug 29, 2018 at 12:33:54AM -0700, Huang, Kai wrote:
> [snip..]
> 
> > > >
> > > > @@ -38,6 +39,18 @@ static LIST_HEAD(sgx_active_page_list);  static
> > > > DEFINE_SPINLOCK(sgx_active_page_list_lock);
> > > >  static struct task_struct *ksgxswapd_tsk;  static
> > > > DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
> > > > +static struct notifier_block sgx_pm_notifier; static u64
> > > > +sgx_pm_cnt;
> > > > +
> > > > +/* The cache for the last known values of IA32_SGXLEPUBKEYHASHx
> > > > +MSRs
> > > > for each
> > > > + * CPU. The entries are initialized when they are first used by
> > > > sgx_einit().
> > > > + */
> > > > +struct sgx_lepubkeyhash {
> > > > +   u64 msrs[4];
> > > > +   u64 pm_cnt;
> > >
> > > May I ask why do we need pm_cnt here? In fact why do we need suspend
> > > staff (namely, sgx_pm_cnt above, and related code in this patch) here
> > > in this patch? From the patch commit message I don't see why we need
> > > PM staff here. Please give comment why you need PM staff, or you may
> > > consider to split the PM staff to another patch.
> > 
> > Refining the commit message probably makes more sense because without PM
> > code sgx_einit() would be broken. The MSRs have been reset after waking up.
> > 
> > Some kind of counter is required to keep track of the power cycle. When 
> > going
> > to sleep the sgx_pm_cnt is increased. sgx_einit() compares the current 
> > value of
> > the global count to the value in the cache entry to see whether we are in a 
> > new
> > power cycle.
> 
> You mean reset to Intel default? I think we can also just reset the cached MSR
> values on each power cycle, which would be simpler, IMHO?

Refresh my brain, does hardware reset the MSRs on a transition to S3 or lower?

> I think we definitely need some code to handle S3-S5, but should be in
> separate patches, since I think the major impact of S3-S5 is entire EPC
> being destroyed. I think keeping pm_cnt is not sufficient enough to handle
> such case?
> > 
> > This brings up one question though: how do we deal with VM host going to 
> > sleep?
> > VM guest would not be aware of this.
> 
> IMO VM just gets "sudden loss of EPC" after suspend & resume in host. SGX 
> driver
> and SDK should be able to handle "sudden loss of EPC", ie, co-working 
> together to
> re-establish the missing enclaves.
> 
> Actually supporting "sudden loss of EPC" is a requirement to support live 
> migration
> of VM w/ SGX. Internally long time ago we had a discussion and the decision 
> was we
> should support SGX live migration given two facts:
> 
> 1) losing platform-dependent state is not important. For example, losing sealing 
> key is
> not a problem, as we could get secrets provisioned again from remote. 2) Both 
> windows
> & linux driver commit to support "sudden loss of EPC".
> 
> I don't think we have to support in very first upstream driver, but I think 
> we need
> to support someday.

Actually, we can easily support this in the driver, at least for SGX1
hardware.  SGX2 isn't difficult to handle, but we've intentionally
postponed those patches until SGX1 support is in mainline[1].
Accesses to the EPC after it is lost will cause faults.  Userspace
EPC accesses, e.g. ERESUME, will get a SIGSEGV that the process
should interpret as an "I should restart my enclave" event.  The
SDK already does this.  In the driver, we just need to be aware of
this potential behavior and not freak out.  Specifically, SGX_INVD
needs to not WARN on faults that may have been due to the EPC
being nuked.  I think we can even remove the sgx_encl_pm_notifier()
code altogether.

[1] SGX1 hardware signals a #GP on an access to an invalid EPC page.
SGX2 signals a #PF with the PF_SGX error code bit set.  This is
problematic because the kernel looks at the PTEs for CR2 and sees
nothing wrong, so it thinks it should just restart the
instruction, leading to an infinite fault loop.  Resolving this
is fairly straightforward, but a complete fix requires propagating
PF_SGX down to the ENCLS fixup handler, which means plumbing the
error code through the fixup handlers or smushing PF_SGX into
trapnr.  Since there is a parallel effort to plumb the error code
through the handlers, https://lkml.org/lkml/2018/8/6/924, we opted
to do this in a separate series.
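
As a rough illustration of the SGX2 case in [1] (heavily simplified, and
assuming both an X86_PF_SGX error-code bit and an error-code-aware
fixup_exception(), i.e. the plumbing referenced above; neither exists yet):

static bool try_fixup_encls_fault(struct pt_regs *regs,
                                  unsigned long hw_error_code)
{
        if (!(hw_error_code & X86_PF_SGX))
                return false;

        /* The page tables can never resolve this; go straight to the fixup. */
        return fixup_exception(regs, X86_TRAP_PF, hw_error_code, 0);
}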

> Sean, 
> 
> Would you be able to comment here?
> 
> > 
> > I think the best measure would be to add a new parameter to sgx_einit() that
> > enforces update of the MSRs. The driver can then set this parameter in the 
> > case
> > when sgx_einit() returns SGX_INVALID_LICENSE. This is coherent because the
> > driver requires writable MSRs. It would not be coherent to do it directly 
> > in the
> > core because KVM does not require writable MSRs.
> 
> IMHO this is not required, as I mentioned above.
> 
> And 
> [snip...]
> 
> Thanks,
> -Kai


Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken

2018-08-29 Thread Sean Christopherson
On Wed, Aug 29, 2018 at 08:44:47PM +, Nadav Amit wrote:
> at 1:13 PM, Sean Christopherson  wrote:
> 
> > On Wed, Aug 29, 2018 at 07:36:22PM +, Nadav Amit wrote:
> >> at 10:11 AM, Nadav Amit  wrote:
> >> 
> >>> at 1:59 AM, Masami Hiramatsu  wrote:
> >>> 
> >>>> On Wed, 29 Aug 2018 01:11:42 -0700
> >>>> Nadav Amit  wrote:
> >>>> 
> >>>>> Use lockdep to ensure that text_mutex is taken when text_poke() is
> >>>>> called.
> >>>>> 
> >>>>> Actually it is not always taken, specifically when it is called by kgdb,
> >>>>> so take the lock in these cases.
> >>>> 
> >>>> Can we really take a mutex in kgdb context?
> >>>> 
> >>>> kgdb_arch_remove_breakpoint
> >>>> <- dbg_deactivate_sw_breakpoints
> >>>>  <- kgdb_reenter_check
> >>>> <- kgdb_handle_exception
> >>>><- __kgdb_notify
> >>>>  <- kgdb_ll_trap
> >>>><- do_int3
> >>>>  <- kgdb_notify
> >>>><- die notifier
> >>>> 
> >>>> kgdb_arch_set_breakpoint
> >>>> <- dbg_activate_sw_breakpoints
> >>>>  <- kgdb_reenter_check
> >>>> <- kgdb_handle_exception
> >>>> ...
> >>>> 
> >>>> Both seems called in exception context, so we can not take a mutex lock.
> >>>> I think kgdb needs a special path.
> >>> 
> >>> You are correct, but I don’t want a special path. Presumably text_mutex is
> >>> guaranteed not to be taken according to the code.
> >>> 
> >>> So I guess the only concern is lockdep. Do you see any problem if I change
> >>> mutex_lock() into mutex_trylock()? It should always succeed, and I can 
> >>> add a
> >>> warning and a failure path if it fails for some reason.
> >> 
> >> Err.. This will not work. I think I will drop this patch, since I cannot
> >> find a proper yet simple assertion. Creating special path just for the
> >> assertion seems wrong.
> > 
> > It's probably worth expanding the comment for text_poke() to call out
> > the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose
> > code and comments make it explicitly clear why it's safe for them to
> > call text_poke() without acquiring the lock.  Might prevent someone
> > from going down this path again in the future.
> 
> I thought that the whole point of the patch was to avoid comments, and
> instead enforce the right behavior. I don’t understand well enough kgdb
> code, so I cannot attest it does the right thing. What happens if
> kgdb_do_roundup==0?

As is, the comment is wrong because there are obviously cases where
text_poke() is called without text_mutex being held.  I can't attest
to the kgdb code either.  My thought was to document the exception so
that if someone does want to try and enforce the right behavior they
can dive right into the problem instead of having to learn of the kgdb
gotcha the hard way.  Maybe a FIXME is the right approach?


Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing enclaves

2018-08-29 Thread Sean Christopherson
On Wed, Aug 29, 2018 at 01:58:09PM -0700, Huang, Kai wrote:
> > -Original Message-
> > From: Christopherson, Sean J
> > Sent: Thursday, August 30, 2018 8:34 AM
> > To: Huang, Kai 
> > Cc: Jarkko Sakkinen ; platform-driver-
> > x...@vger.kernel.org; x...@kernel.org; nhor...@redhat.com; linux-
> > ker...@vger.kernel.org; t...@linutronix.de; suresh.b.sid...@intel.com; 
> > Ayoun,
> > Serge ; h...@zytor.com; npmccal...@redhat.com;
> > mi...@redhat.com; linux-...@vger.kernel.org; Hansen, Dave
> > 
> > Subject: Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing 
> > enclaves
> > 
> > On Wed, Aug 29, 2018 at 12:33:54AM -0700, Huang, Kai wrote:
> > > [snip..]
> > >
> > > > > >
> > > > > > @@ -38,6 +39,18 @@ static LIST_HEAD(sgx_active_page_list);
> > > > > > static DEFINE_SPINLOCK(sgx_active_page_list_lock);
> > > > > >  static struct task_struct *ksgxswapd_tsk;  static
> > > > > > DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
> > > > > > +static struct notifier_block sgx_pm_notifier; static u64
> > > > > > +sgx_pm_cnt;
> > > > > > +
> > > > > > +/* The cache for the last known values of IA32_SGXLEPUBKEYHASHx
> > > > > > +MSRs
> > > > > > for each
> > > > > > + * CPU. The entries are initialized when they are first used by
> > > > > > sgx_einit().
> > > > > > + */
> > > > > > +struct sgx_lepubkeyhash {
> > > > > > +   u64 msrs[4];
> > > > > > +   u64 pm_cnt;
> > > > >
> > > > > May I ask why do we need pm_cnt here? In fact why do we need
> > > > > suspend staff (namely, sgx_pm_cnt above, and related code in this
> > > > > patch) here in this patch? From the patch commit message I don't
> > > > > see why we need PM staff here. Please give comment why you need PM
> > > > > staff, or you may consider to split the PM staff to another patch.
> > > >
> > > > Refining the commit message probably makes more sense because
> > > > without PM code sgx_einit() would be broken. The MSRs have been reset
> > after waking up.
> > > >
> > > > Some kind of counter is required to keep track of the power cycle.
> > > > When going to sleep the sgx_pm_cnt is increased. sgx_einit()
> > > > compares the current value of the global count to the value in the
> > > > cache entry to see whether we are in a new power cycle.
> > >
> > > You mean reset to Intel default? I think we can also just reset the
> > > cached MSR values on each power cycle, which would be simpler, IMHO?
> > 
> > Refresh my brain, does hardware reset the MSRs on a transition to S3 or 
> > lower?
> > 
> > > I think we definitely need some code to handle S3-S5, but should be in
> > > separate patches, since I think the major impact of S3-S5 is entire
> > > EPC being destroyed. I think keeping pm_cnt is not sufficient enough
> > > to handle such case?
> > > >
> > > > This brings up one question though: how do we deal with VM host going to
> > sleep?
> > > > VM guest would not be aware of this.
> > >
> > > IMO VM just gets "sudden loss of EPC" after suspend & resume in host.
> > > SGX driver and SDK should be able to handle "sudden loss of EPC", ie,
> > > co-working together to re-establish the missing enclaves.
> > >
> > > Actually supporting "sudden loss of EPC" is a requirement to support
> > > live migration of VM w/ SGX. Internally long time ago we had a
> > > discussion and the decision was we should support SGX live migration given
> > two facts:
> > >
> > > 1) losing platform-dependent state is not important. For example, losing
> > > sealing key is not a problem, as we could get secrets provisioned
> > > again from remote. 2) Both windows & linux driver commit to support 
> > > "sudden
> > loss of EPC".
> > >
> > > I don't think we have to support in very first upstream driver, but I
> > > think we need to support someday.
> > 
> > Actually, we can easily support this in the driver, at least for SGX1 
> > hardware.
> 
> That's my guess too. Just want to check whether we are still on the same page 
> :)
> 
> > SGX2 isn't difficult to handle, but we've intentionally postponed those 
> > patches
> > until SGX1 support is in mainline[1].
> > Accesses to the EPC after it is lost will cause faults.  Userspace EPC 
> > accesses, e.g.
> > ERESUME, will get a SIGSEGV that the process should interpret as an "I 
> > should
> > restart my enclave" event.  The SDK already does this.  In the driver, we 
> > just need
> > to be aware of this potential behavior and not freak out.  Specifically, 
> > SGX_INVD
> > needs to not WARN on faults that may have been due to the EPC being nuked.
> > I think we can even remove the sgx_encl_pm_notifier() code altogether.
> 
> Possibly we still need to do some cleanup, ie, all structures of enclaves, 
> upon resume? 

Not for functional reasons.  The driver will automatically do the
cleanup via SGX_INVD when it next accesses the enclave's pages and
takes a fault, e.g. during reclaim.  Proactively reclaiming the EPC
pages would probably affect performance, though not necessarily in
a good way.  And I think it would be a beneficial 

Re: [PATCH v3 4/4] x86/kvm: use __decrypted attribute in shared variables

2018-08-29 Thread Sean Christopherson
On Wed, Aug 29, 2018 at 01:24:00PM -0500, Brijesh Singh wrote:
> The following commit:
> 
>   368a540e0232 (x86/kvmclock: Remove memblock dependency)

Checkpatch prefers:

Commit 368a540e0232 ("x86/kvmclock: Remove memblock dependency")

That'll also save three lines in the commit message.
 
> caused SEV guest regression. When SEV is active, we map the shared
> variables (wall_clock and hv_clock_boot) with C=0 to ensure that both
> the guest and the hypervisor is able to access the data. To map the

Nit: s/is/are

> variables we use kernel_physical_mapping_init() to split the large pages,
> but this routine fails to allocate a new page. Before the above commit,
> kvmclock initialization was called after memory allocator was available
> but now its called early during boot.

What about something like this to make the issue a bit clearer:

  variables we use kernel_physical_mapping_init() to split the large pages,
  but splitting large pages requires allocating a new PMD, which fails now
  that kvmclock initialization is called early during boot.

> Recently we added a special .data..decrypted section to hold the shared
> variables. This section is mapped with C=0 early during boot. Use
> __decrypted attribute to put the wall_clock and hv_clock_boot in
> .data..decrypted section so that they are mapped with C=0.
> 
> Signed-off-by: Brijesh Singh 
> Fixes: 368a540e0232 ("x86/kvmclock: Remove memblock dependency")
> Cc: Tom Lendacky 
> Cc: k...@vger.kernel.org
> Cc: Thomas Gleixner 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: linux-kernel@vger.kernel.org
> Cc: Paolo Bonzini 
> Cc: Sean Christopherson 
> Cc: k...@vger.kernel.org
> Cc: "Radim Krčmář" 
> ---
>  arch/x86/kernel/kvmclock.c | 30 +-
>  1 file changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 1e67646..08f5f8a 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -61,8 +62,8 @@ early_param("no-kvmclock-vsyscall", 
> parse_no_kvmclock_vsyscall);
>   (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))
>  
>  static struct pvclock_vsyscall_time_info
> - hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
> -static struct pvclock_wall_clock wall_clock;
> + hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __decrypted 
> __aligned(PAGE_SIZE);
> +static struct pvclock_wall_clock wall_clock __decrypted;
>  static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
>  
>  static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
> @@ -267,10 +268,29 @@ static int kvmclock_setup_percpu(unsigned int cpu)
>   return 0;
>  
>   /* Use the static page for the first CPUs, allocate otherwise */
> - if (cpu < HVC_BOOT_ARRAY_SIZE)
> + if (cpu < HVC_BOOT_ARRAY_SIZE) {
>   p = &hv_clock_boot[cpu];
> - else
> - p = kzalloc(sizeof(*p), GFP_KERNEL);
> + } else {
> + int rc;
> + unsigned int sz = sizeof(*p);
> +
> + if (sev_active())
> + sz = PAGE_ALIGN(sz);

This is a definite downside to the section approach.  Unless I missed
something, the section padding goes to waste since we don't have a
mechanism in place to allocate into that section, e.g. as is we're
burning nearly 2mb of data since we're only using 4k of the 2mb page.
And every decrypted allocation can potentially fracture a large page
since the allocator is unaware of the decrypted requirement.  Might
not be an issue for kvmclock since it's a one-time allocation, but
we could suffer death by a thousand cuts if there are scenarios where
a decrypted allocation isn't persistent (VirtIO queues maybe?).

Duplicating the full kernel tables for C=0 accesses doesn't suffer
from these issues.  And I think potential corruption issues due to
mis-{aligned,size} objects can be detected through static analysis,
build assertions and/or runtime checks.

> + p = kzalloc(sz, GFP_KERNEL);

For the SEV case, can't we do a straight kmalloc() since we zero
out the page after decrypting it?

> +
> + /*
> +  * The physical address of per-cpu variable will be shared with
> +  * the hypervisor. Let's clear the C-bit before we assign the
> +  * memory to per_cpu variable.
> +  */
> + if (p && sev_active()) {
> + rc = set_memory_decrypted((unsigned long)p, sz >> 
> PAGE_SHIFT);
> + if (rc)
> + return rc;
> + memset(p, 0, sz);
> + }
> + }
>  
>   per_cpu(hv_clock_per_cpu, cpu) = p;
>   return p ? 0 : -ENOMEM;
> -- 
> 2.7.4
> 


Re: [RFC][PATCH 5/8] x86/mm: fix exception table comments

2018-09-07 Thread Sean Christopherson
On Fri, 2018-09-07 at 12:49 -0700, Dave Hansen wrote:
> From: Dave Hansen 
> 
> The comments here are wrong.  They are too absolute about where
> faults can occur when running in the kernel.  The comments are
> also a bit hard to match up with the code.
> 
> Trim down the comments, and make them more precise.
> 
> Also add a comment explaining why we are doing the
> bad_area_nosemaphore() path here.
> 
> Signed-off-by: Dave Hansen 
> Cc: Sean Christopherson 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: Thomas Gleixner 
> Cc: x...@kernel.org
> Cc: Andy Lutomirski 
> ---
> 
>  b/arch/x86/mm/fault.c |   27 ++-
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-03 arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c~pkeys-fault-warnings-03 2018-09-07 
> 11:21:47.696751898 -0700
> +++ b/arch/x86/mm/fault.c 2018-09-07 11:21:47.700751898 -0700
> @@ -1349,24 +1349,25 @@ void do_user_addr_space_fault(struct pt_
>   flags |= FAULT_FLAG_INSTRUCTION;
>  
>   /*
> -  * When running in the kernel we expect faults to occur only to
> -  * addresses in user space.  All other faults represent errors in
> -  * the kernel and should generate an OOPS.  Unfortunately, in the
> -  * case of an erroneous fault occurring in a code path which already
> -  * holds mmap_sem we will deadlock attempting to validate the fault
> -  * against the address space.  Luckily the kernel only validly
> -  * references user space from well defined areas of code, which are
> -  * listed in the exceptions table.
> +  * Kernel-mode access to the user address space should only occur
> +  * inside well-defined areas of code listed in the exception
> +  * tables.  But, an erroneous kernel fault occurring outside one of
> +  * those areas which also holds mmap_sem might deadlock attempting
> +  * to validate the fault against the address space.
>    *
> -  * As the vast majority of faults will be valid we will only perform
> -  * the source reference check when there is a possibility of a
> -  * deadlock. Attempt to lock the address space, if we cannot we then
> -  * validate the source. If this is invalid we can skip the address
> -  * space check, thus avoiding the deadlock:
> +  * Only do the expensive exception table search when we might be at
> +  * risk of a deadlock:
> +  * 1. We failed to acquire mmap_sem, and
> +  * 2. The access was an explicit kernel-mode access
> +  *(X86_PF_USER=0).

Might be worth reminding the reader that X86_PF_USER will be set in
sw_error_code for implicit accesses.  I saw "explicit" and my mind
immediately jumped to hw_error_code for whatever reason.  E.g.:

* 2. The access was an explicit kernel-mode access (we set X86_PF_USER
*    in sw_error_code for implicit kernel-mode accesses).

>    */
>   if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
>   if (!(sw_error_code & X86_PF_USER) &&
>   !search_exception_tables(regs->ip)) {
> + /*
> +  * Fault from code in kernel from
> +  * which we do not expect faults.
> +  */
>   bad_area_nosemaphore(regs, sw_error_code, address, 
> NULL);
>   return;
>   }
> _


Re: [RFC][PATCH 2/8] x86/mm: break out kernel address space handling

2018-09-07 Thread Sean Christopherson
On Fri, 2018-09-07 at 12:48 -0700, Dave Hansen wrote:
> From: Dave Hansen 
> 
> The page fault handler (__do_page_fault())  basically has two sections:
> one for handling faults in the kernel porttion of the address space
> and another for faults in the user porttion of the address space.

%s/porttion/portion

> But, these two parts don't stick out that well.  Let's make that more
> clear from code separation and naming.  Pull kernel fault
> handling into its own helper, and reflect that naming by renaming
> spurious_fault() -> spurious_kernel_fault().
> 
> Also, rewrite the vmalloc handling comment a bit.  It was a bit
> stale and also glossed over the reserved bit handling.



Re: [PATCH] x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks

2018-07-13 Thread Sean Christopherson
On Wed, Jul 11, 2018 at 07:37:18PM +0200, Vitaly Kuznetsov wrote:
> When we switched from doing rdmsr() to reading FS/GS base values from
> current->thread we completely forgot about legacy 32-bit userspaces which
> we still support in KVM (why?). task->thread.{fsbase,gsbase} are only
> synced for 64-bit processes, calling save_fsgs_for_kvm() and using
> its result from current is illegal for legacy processes.
> 
> There's no ARCH_SET_FS/GS prctls for legacy applications. Base MSRs are,
> however, not always equal to zero. Intel's manual says (3.4.4 Segment
> Loading Instructions in IA-32e Mode):
> 
> "In order to set up compatibility mode for an application, segment-load
> instructions (MOV to Sreg, POP Sreg) work normally in 64-bit mode. An
> entry is read from the system descriptor table (GDT or LDT) and is loaded
> in the hidden portion of the segment register.
> ...
> The hidden descriptor register fields for FS.base and GS.base are
> physically mapped to MSRs in order to load all address bits supported by
> a 64-bit implementation.
> "
> 
> The issue was found by strace test suite where 32-bit ioctl_kvm_run test
> started segfaulting.
> 
> Reported-by: Dmitry V. Levin 
> Bisected-by: Masatake YAMATO 
> Fixes: 42b933b59721 ("x86/kvm/vmx: read MSR_{FS,KERNEL_GS}_BASE from 
> current->thread")
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/vmx.c | 25 +
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 559a12b6184d..65968649b365 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2560,6 +2560,7 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>  #ifdef CONFIG_X86_64
>   int cpu = raw_smp_processor_id();
> + unsigned long fsbase, kernel_gsbase;

Because bikeshedding is fun, what do you think about using fs_base and
kernel_gs_base for these names?  I have a series that touches this
code and also adds local variables for {FS,GS}.base and {FS,GS}.sel.
I used {fs,gs}_base and {fs,gs}_sel to be consistent with the
vmx->host_state nomenclature (the local variables are used to update
the associated vmx->host_state variables), but I'll change my patches
if you have a strong preference for omitting the underscore.

>  #endif
>   int i;
>  
> @@ -2575,12 +2576,20 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
>   vmx->host_state.gs_ldt_reload_needed = vmx->host_state.ldt_sel;
>  
>  #ifdef CONFIG_X86_64
> - save_fsgs_for_kvm();
> - vmx->host_state.fs_sel = current->thread.fsindex;
> - vmx->host_state.gs_sel = current->thread.gsindex;
> -#else
> - savesegment(fs, vmx->host_state.fs_sel);
> - savesegment(gs, vmx->host_state.gs_sel);
> + if (likely(is_64bit_mm(current->mm))) {
> + save_fsgs_for_kvm();
> + vmx->host_state.fs_sel = current->thread.fsindex;
> + vmx->host_state.gs_sel = current->thread.gsindex;
> + fsbase = current->thread.fsbase;
> + kernel_gsbase = current->thread.gsbase;
> + } else {
> +#endif
> + savesegment(fs, vmx->host_state.fs_sel);
> + savesegment(gs, vmx->host_state.gs_sel);
> +#ifdef CONFIG_X86_64
> + fsbase = read_msr(MSR_FS_BASE);
> + kernel_gsbase = read_msr(MSR_KERNEL_GS_BASE);
> + }
>  #endif
>   if (!(vmx->host_state.fs_sel & 7)) {
>   vmcs_write16(HOST_FS_SELECTOR, vmx->host_state.fs_sel);
> @@ -2600,10 +2609,10 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
>   savesegment(ds, vmx->host_state.ds_sel);
>   savesegment(es, vmx->host_state.es_sel);
>  
> - vmcs_writel(HOST_FS_BASE, current->thread.fsbase);
> + vmcs_writel(HOST_FS_BASE, fsbase);
>   vmcs_writel(HOST_GS_BASE, cpu_kernelmode_gs_base(cpu));
>  
> - vmx->msr_host_kernel_gs_base = current->thread.gsbase;
> + vmx->msr_host_kernel_gs_base = kernel_gsbase;
>   if (is_long_mode(&vmx->vcpu))
>   wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
>  #else
> -- 
> 2.14.4
> 


Re: [PATCH v13 09/13] x86/sgx: Enclave Page Cache (EPC) memory manager

2018-09-11 Thread Sean Christopherson
On Mon, 2018-08-27 at 21:53 +0300, Jarkko Sakkinen wrote:
> Add an Enclave Page Cache (EPC) memory manager that can be used to
> allocate and free EPC pages. The swapper thread ksgxswapd reclaims pages
> when the number of free EPC pages goes below %SGX_NR_LOW_PAGES and
> continues until it reaches %SGX_NR_HIGH_PAGES.
> 
> Pages are reclaimed in LRU fashion from a global list. The consumers
> take care of calling EBLOCK (block page from new accesses), ETRACK
> (restart counting the entering hardware threads) and EWB (write page to
> the regular memory) because executing these operations usually (if not
> always) requires doing some subsystem-internal locking operations.
> 
> Signed-off-by: Jarkko Sakkinen 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/include/asm/sgx.h  |  56 --
>  arch/x86/kernel/cpu/intel_sgx.c | 322 
>  2 files changed, 362 insertions(+), 16 deletions(-)

...

> +/**
> + * sgx_reclaim_pages - reclaim EPC pages from the consumers
> + *
> + * Takes a fixed chunk of pages from the global list of consumed EPC pages 
> and
> + * tries to swap them. Only the pages that are either being freed by the
> + * consumer or actively used are skipped.
> + */
> +static void sgx_reclaim_pages(void)
> +{
> + struct sgx_epc_page *chunk[SGX_NR_TO_SCAN + 1];

The array size should simply be SGX_NR_TO_SCAN.  The +1 is a remnant
from the previous version that bounded the for-loops with "!chunk[i]"
check instead of "i < j".  No functional issue, essentially just an
unused array element.

> + struct sgx_epc_page *epc_page;
> + struct sgx_epc_bank *bank;
> + int i, j;
> +
> + spin_lock(&sgx_active_page_list_lock);
> + for (i = 0, j = 0; i < SGX_NR_TO_SCAN; i++) {
> + if (list_empty(&sgx_active_page_list))
> + break;
> +
> + epc_page = list_first_entry(&sgx_active_page_list,
> + struct sgx_epc_page, list);
> + list_del_init(&epc_page->list);
> +
> + if (epc_page->impl->ops->get(epc_page))
> + chunk[j++] = epc_page;
> + else
> + epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
> + }
> + spin_unlock(&sgx_active_page_list_lock);
> +
> + for (i = 0; i < j; i++) {
> + epc_page = chunk[i];
> + if (epc_page->impl->ops->reclaim(epc_page))
> + continue;
> +
> + spin_lock(&sgx_active_page_list_lock);
> + list_add_tail(&epc_page->list, &sgx_active_page_list);
> + spin_unlock(&sgx_active_page_list_lock);
> +
> + epc_page->impl->ops->put(epc_page);
> + chunk[i] = NULL;
> + }
> +
> + for (i = 0; i < j; i++) {
> + epc_page = chunk[i];
> + if (epc_page)
> + epc_page->impl->ops->block(epc_page);
> + }
> +
> + for (i = 0; i < j; i++) {
> + epc_page = chunk[i];
> + if (epc_page) {
> + epc_page->impl->ops->write(epc_page);
> + epc_page->impl->ops->put(epc_page);
> +
> + /*
> +  * Put the page back on the free list only after we
> +  * have put() our reference to the owner of the EPC
> +  * page, otherwise the page could be re-allocated and
> +  * we'd call put() on the wrong impl.
> +  */
> + epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
> +
> + bank = sgx_epc_bank(epc_page);
> + spin_lock(&bank->lock);
> + bank->pages[bank->free_cnt++] = epc_page;
> + spin_unlock(&bank->lock);
> + }
> + }
> +}



Re: [PATCH] X86/KVM: Do not allow DISABLE_EXITS_MWAIT when LAPIC ARAT is not available

2018-04-11 Thread Sean Christopherson
On Wed, 2018-04-11 at 11:16 +0200, KarimAllah Ahmed wrote:
> If the processor does not have an "Always Running APIC Timer" (aka ARAT),
> we should not give guests direct access to MWAIT. The LAPIC timer would
> stop ticking in deep C-states, so any host deadlines would not wakeup the
> host kernel.
> 
> The host kernel intel_idle driver handles this by switching to broadcast
> mode when ARAT is not available and MWAIT is issued with a deep C-state
> that would stop the LAPIC timer. When MWAIT is passed through, we can not
> tell when MWAIT is issued.
> 
> So just disable this capability when LAPIC ARAT is not available. I am not
> even sure whether there are any CPUs with VMX support but no LAPIC ARAT.

ARAT was added on WSM, so NHM, the Core 2 family and a few PSC SKUs
support VMX+MWAIT but not ARAT.  

> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: H. Peter Anvin 
> Cc: x...@kernel.org
> Cc: k...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Reported-by: Wanpeng Li 
> Signed-off-by: KarimAllah Ahmed 
> ---
>  arch/x86/kvm/x86.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b2ff74b..0334b25 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2819,7 +2819,8 @@ static int msr_io(struct kvm_vcpu *vcpu, struct 
> kvm_msrs __user *user_msrs,
>  static inline bool kvm_can_mwait_in_guest(void)
>  {
>   return boot_cpu_has(X86_FEATURE_MWAIT) &&
> - !boot_cpu_has_bug(X86_BUG_MONITOR);
> + !boot_cpu_has_bug(X86_BUG_MONITOR) &&
> + boot_cpu_has(X86_FEATURE_ARAT);
>  }
>  
>  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)


Re: [PATCH 06/10] KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page

2018-04-12 Thread Sean Christopherson
On Thu, Apr 12, 2018 at 04:38:39PM +0200, Paolo Bonzini wrote:
> On 21/02/2018 18:47, KarimAllah Ahmed wrote:
> > +
> > +   if (kvm_vcpu_map(vcpu, 
> > gpa_to_gfn(vmcs12->virtual_apic_page_addr), map))
> > +   vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 
> > gfn_to_gpa(map->pfn));
> 
> This should in principle be pfn_to_hpa.  However, pfn_to_hpa is unused;
> let's just remove it and do "<< PAGE_SHIFT".  Unlike gfn_to_gpa, where
> in principle the shift could be different, pfn_to_hpa is *by definition*
> a shift left by PAGE_SHIFT.

Any reason not to use PFN_PHYS instead of the handcoded shift?
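
E.g. (sketch, with PFN_PHYS() being the generic pfn-to-physical-address
helper):

	vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, PFN_PHYS(map->pfn));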
 
> Paolo


Re: [PATCH v3 4/9] x86/kvm/mmu: introduce guest_mmu

2018-10-05 Thread Sean Christopherson
On Mon, Oct 01, 2018 at 04:20:05PM +0200, Vitaly Kuznetsov wrote:
> When EPT is used for nested guest we need to re-init MMU as shadow
> EPT MMU (nested_ept_init_mmu_context() does that). When we return back
> from L2 to L1 kvm_mmu_reset_context() in nested_vmx_load_cr3() resets
> MMU back to normal TDP mode. Add a special 'guest_mmu' so we can use
> separate root caches; the improved hit rate is not very important for
> single vCPU performance, but it avoids contention on the mmu_lock for
> many vCPUs.
> 
> On the nested CPUID benchmark, with 16 vCPUs, an L2->L1->L2 vmexit
> goes from 42k to 26k cycles.
> 
> Signed-off-by: Vitaly Kuznetsov 

One nit below, otherwise:

Reviewed-by: Sean Christopherson 

> ---
> Changes since v2:
> - move kvm_mmu_free_roots() call to nested_release_vmcs12().
>   [Sean Christopherson]
> - in kvm_calc_shadow_ept_root_page_role() we need to inherit role
>   from root_mmu so fields like .smap_andnot_wp, .smep_andnot_wp, ...
>   remain correctly initialized [pointed out by Sean Christopherson]
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/mmu.c  | 18 +
>  arch/x86/kvm/vmx.c  | 35 +
>  3 files changed, 39 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 404c3438827b..a3829869353b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -539,6 +539,9 @@ struct kvm_vcpu_arch {
>   /* Non-nested MMU for L1 */
>   struct kvm_mmu root_mmu;
>  
> + /* L1 MMU when running nested */
> + struct kvm_mmu guest_mmu;
> +
>   /*
>* Paging state of an L2 guest (used for nested npt)
>*
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9cb2f9307e98..bb6584222848 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4818,6 +4818,9 @@ kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu 
> *vcpu, bool accessed_dirty)
>  {
>   union kvm_mmu_page_role role = vcpu->arch.mmu->base_role;

@role is initialized by the new code below, so this can be removed,
e.g. it's eventually removed by patch 7/9 when the two pieces of
the combined role are initialized independently.

> + /* Role is inherited from root_mmu */
> + role.word = vcpu->arch.root_mmu.base_role.word;
> +
>   role.level = PT64_ROOT_4LEVEL;
>   role.direct = false;
>   role.ad_disabled = !accessed_dirty;
> @@ -4966,8 +4969,10 @@ EXPORT_SYMBOL_GPL(kvm_mmu_load);
>  
>  void kvm_mmu_unload(struct kvm_vcpu *vcpu)
>  {
> - kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOTS_ALL);
> - WARN_ON(VALID_PAGE(vcpu->arch.mmu->root_hpa));
> + kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
> + WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root_hpa));
> + kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
> + WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root_hpa));
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_unload);
>  
> @@ -5406,13 +5411,18 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  
>   vcpu->arch.mmu = &vcpu->arch.root_mmu;
>   vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> +
>   vcpu->arch.root_mmu.root_hpa = INVALID_PAGE;
>   vcpu->arch.root_mmu.translate_gpa = translate_gpa;
> - vcpu->arch.nested_mmu.translate_gpa = translate_nested_gpa;
> -
>   for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>   vcpu->arch.root_mmu.prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
>  
> + vcpu->arch.guest_mmu.root_hpa = INVALID_PAGE;
> + vcpu->arch.guest_mmu.translate_gpa = translate_gpa;
> + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> + vcpu->arch.guest_mmu.prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> +
> + vcpu->arch.nested_mmu.translate_gpa = translate_nested_gpa;
>   return alloc_mmu_pages(vcpu);
>  }
>  
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9a3aece131fb..73f04a8638d3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -8442,8 +8442,10 @@ static void vmx_disable_shadow_vmcs(struct vcpu_vmx 
> *vmx)
>   vmcs_write64(VMCS_LINK_POINTER, -1ull);
>  }
>  
> -static inline void nested_release_vmcs12(struct vcpu_vmx *vmx)
> +static inline void nested_release_vmcs12(struct kvm_vcpu *vcpu)
>  {
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
>   if (vmx->nested.current_vmptr == -1ull)
>   return;
>  
> @@ -8457,10 +8459,12 @@ static inline void nested_release_vmcs12(struct 
> vcpu_vmx *vmx)
>   vmx->nested.posted_intr_nv = -1;
>  
>   

Re: [PATCH v3 7/9] x86/kvm/nVMX: introduce source data cache for kvm_init_shadow_ept_mmu()

2018-10-05 Thread Sean Christopherson
On Mon, Oct 01, 2018 at 04:20:08PM +0200, Vitaly Kuznetsov wrote:
> MMU re-initialization is expensive, in particular,
> update_permission_bitmask() and update_pkru_bitmask() are.
> 
> Cache the data used to setup shadow EPT MMU and avoid full re-init when
> it is unchanged.
> 
> Signed-off-by: Vitaly Kuznetsov 

Nit below, otherwise:

Reviewed-by: Sean Christopherson 

> ---
> Changes since v2:
> - Preserve mmu_role.base in kvm_calc_shadow_ept_root_page_role()
>   [Sean Christopherson]
> - Rename kvm_calc_mmu_role_common() -> kvm_calc_mmu_role_ext() to
>   support the change.
> ---
>  arch/x86/include/asm/kvm_host.h | 14 +
>  arch/x86/kvm/mmu.c  | 53 +++--
>  2 files changed, 52 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1821b0215230..87ddaa1579e7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -274,7 +274,21 @@ union kvm_mmu_page_role {
>  };
>  
>  union kvm_mmu_extended_role {
> +/*
> + * This structure complements kvm_mmu_page_role caching everything needed for
> + * MMU configuration. If nothing in both these structures changed, MMU
> + * re-configuration can be skipped. @valid bit is set on first usage so we 
> don't
> + * treat all-zero structure as valid data.
> + */
>   u32 word;
> + struct {
> + unsigned int valid:1;
> + unsigned int execonly:1;
> + unsigned int cr4_pse:1;
> + unsigned int cr4_pke:1;
> + unsigned int cr4_smap:1;
> + unsigned int cr4_smep:1;
> + };
>  };
>  
>  union kvm_mmu_role {
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 60c916286bf4..d303f722d671 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4708,6 +4708,20 @@ static void paging32E_init_context(struct kvm_vcpu 
> *vcpu,
>   paging64_init_context_common(vcpu, context, PT32E_ROOT_LEVEL);
>  }
>  
> +static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu 
> *vcpu)
> +{
> + union kvm_mmu_extended_role ext = {0};
> +
> + ext.cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0;
> + ext.cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0;
> + ext.cr4_pse = !!is_pse(vcpu);
> + ext.cr4_pke = kvm_read_cr4_bits(vcpu, X86_CR4_PKE) != 0;

These can all be !!kvm_read_cr4_bits(), a la the is_pse() line, which
coincidentally is a wrapper to kvm_read_cr4_bits(vcpu, X86_CR4_PSE).
And I think technically even the "!!" can be omitted, though bitwise
unions are definitely not my area of expertise.
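
E.g. a sketch of the suggested form:

	ext.cr4_smep = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
	ext.cr4_smap = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
	ext.cr4_pse  = !!is_pse(vcpu);
	ext.cr4_pke  = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);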

> +
> + ext.valid = 1;
> +
> + return ext;
> +}
> +
>  static union kvm_mmu_page_role
>  kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu)
>  {
> @@ -4814,19 +4828,23 @@ void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
>  
> -static union kvm_mmu_page_role
> -kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool 
> accessed_dirty)
> +static union kvm_mmu_role
> +kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool 
> accessed_dirty,
> +bool execonly)
>  {
> - union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role.base;
> + union kvm_mmu_role role;
>  
> - /* Role is inherited from root_mmu */
> - role.word = vcpu->arch.root_mmu.base_role.word;
> + /* Base role is inherited from root_mmu */
> + role.base.word = vcpu->arch.root_mmu.mmu_role.base.word;
> + role.ext = kvm_calc_mmu_role_ext(vcpu);
>  
> - role.level = PT64_ROOT_4LEVEL;
> - role.direct = false;
> - role.ad_disabled = !accessed_dirty;
> - role.guest_mode = true;
> - role.access = ACC_ALL;
> + role.base.level = PT64_ROOT_4LEVEL;
> + role.base.direct = false;
> + role.base.ad_disabled = !accessed_dirty;
> + role.base.guest_mode = true;
> + role.base.access = ACC_ALL;
> +
> + role.ext.execonly = execonly;
>  
>   return role;
>  }
> @@ -4835,10 +4853,16 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, 
> bool execonly,
>bool accessed_dirty, gpa_t new_eptp)
>  {
>   struct kvm_mmu *context = vcpu->arch.mmu;
> - union kvm_mmu_page_role root_page_role =
> - kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty);
> + union kvm_mmu_role new_role =
> + kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
> +execonly);
> +
> + __kvm_mmu_new_cr3(vcpu, new_eptp, new_ro

Re: [PATCH v3 8/9] x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed

2018-10-05 Thread Sean Christopherson
On Mon, Oct 01, 2018 at 04:20:09PM +0200, Vitaly Kuznetsov wrote:
> MMU reconfiguration in init_kvm_tdp_mmu()/kvm_init_shadow_mmu() can be
> avoided if the source data used to configure it didn't change; enhance
> MMU extended role with the required fields and consolidate common code in
> kvm_calc_mmu_role_common().
> 
> Signed-off-by: Vitaly Kuznetsov 

Same comments about kvm_read_cr4_bits(), otherwise:

Reviewed-by: Sean Christopherson 

> ---
> Changes since v2:
> - Rename 'mmu_init' parameter to 'base_only' [Sean Christopherson]
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +
>  arch/x86/kvm/mmu.c  | 95 +
>  2 files changed, 63 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 87ddaa1579e7..609811066580 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -284,10 +284,12 @@ union kvm_mmu_extended_role {
>   struct {
>   unsigned int valid:1;
>   unsigned int execonly:1;
> + unsigned int cr0_pg:1;
>   unsigned int cr4_pse:1;
>   unsigned int cr4_pke:1;
>   unsigned int cr4_smap:1;
>   unsigned int cr4_smep:1;
> + unsigned int cr4_la57:1;
>   };
>  };
>  
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index d303f722d671..10b39ff83943 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4712,27 +4712,46 @@ static union kvm_mmu_extended_role 
> kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu)
>  {
>   union kvm_mmu_extended_role ext = {0};
>  
> + ext.cr0_pg = !!is_paging(vcpu);
>   ext.cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0;
>   ext.cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0;
>   ext.cr4_pse = !!is_pse(vcpu);
>   ext.cr4_pke = kvm_read_cr4_bits(vcpu, X86_CR4_PKE) != 0;
> + ext.cr4_la57 = kvm_read_cr4_bits(vcpu, X86_CR4_LA57) != 0;

Can be !!kvm_read_cr4_bits() or maybe just kvm_read_cr4_bits().
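
E.g. (sketch):

	ext.cr4_la57 = !!kvm_read_cr4_bits(vcpu, X86_CR4_LA57);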

>  
>   ext.valid = 1;
>  
>   return ext;
>  }
>  
> -static union kvm_mmu_page_role
> -kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu)
> +static union kvm_mmu_role kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu,
> +bool base_only)
> +{
> + union kvm_mmu_role role = {0};
> +
> + role.base.access = ACC_ALL;
> + role.base.nxe = !!is_nx(vcpu);
> + role.base.cr4_pae = !!is_pae(vcpu);
> + role.base.cr0_wp = is_write_protection(vcpu);
> + role.base.smm = is_smm(vcpu);
> + role.base.guest_mode = is_guest_mode(vcpu);
> +
> + if (base_only)
> + return role;
> +
> + role.ext = kvm_calc_mmu_role_ext(vcpu);
> +
> + return role;
> +}
> +
> +static union kvm_mmu_role
> +kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only)
>  {
> - union kvm_mmu_page_role role = {0};
> + union kvm_mmu_role role = kvm_calc_mmu_role_common(vcpu, base_only);
>  
> - role.guest_mode = is_guest_mode(vcpu);
> - role.smm = is_smm(vcpu);
> - role.ad_disabled = (shadow_accessed_mask == 0);
> - role.level = kvm_x86_ops->get_tdp_level(vcpu);
> - role.direct = true;
> - role.access = ACC_ALL;
> + role.base.ad_disabled = (shadow_accessed_mask == 0);
> + role.base.level = kvm_x86_ops->get_tdp_level(vcpu);
> + role.base.direct = true;
>  
>   return role;
>  }
> @@ -4740,9 +4759,14 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu)
>  static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
>  {
>   struct kvm_mmu *context = vcpu->arch.mmu;
> + union kvm_mmu_role new_role =
> + kvm_calc_tdp_mmu_root_page_role(vcpu, false);
>  
> - context->mmu_role.base.word = mmu_base_role_mask.word &
> -   kvm_calc_tdp_mmu_root_page_role(vcpu).word;
> + new_role.base.word &= mmu_base_role_mask.word;
> + if (new_role.as_u64 == context->mmu_role.as_u64)
> + return;
> +
> + context->mmu_role.as_u64 = new_role.as_u64;
>   context->page_fault = tdp_page_fault;
>   context->sync_page = nonpaging_sync_page;
>   context->invlpg = nonpaging_invlpg;
> @@ -4782,29 +4806,23 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
>   reset_tdp_shadow_zero_bits_mask(vcpu, context);
>  }
>  
> -static union kvm_mmu_page_role
> -kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu)
> -{
> - union kvm_mmu_page_role role = {0};
> - bool smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
>

Re: [PATCH] KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned

2018-10-24 Thread Sean Christopherson
On Sat, Oct 20, 2018 at 11:42:59PM +0200, KarimAllah Ahmed wrote:
> The spec only requires the posted interrupt descriptor address to be
> 64-byte aligned (i.e. bits[0:5] == 0). Using page_address_valid also
> forces the address to be page aligned.
> 
> Only validate that the address does not cross the maximum physical address
> without enforcing a page alignment.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: H. Peter Anvin 
> Cc: x...@kernel.org
> Cc: k...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Fixes: 6de84e581c0 ("nVMX x86: check posted-interrupt descriptor addresss on 
> vmentry of L2")
> Signed-off-by: KarimAllah Ahmed 
> ---
>  arch/x86/kvm/vmx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 30bf860..47962f2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -11668,7 +11668,7 @@ static int nested_vmx_check_apicv_controls(struct 
> kvm_vcpu *vcpu,
>   !nested_exit_intr_ack_set(vcpu) ||
>   (vmcs12->posted_intr_nv & 0xff00) ||
>   (vmcs12->posted_intr_desc_addr & 0x3f) ||
> - (!page_address_valid(vcpu, vmcs12->posted_intr_desc_addr
> + (vmcs12->posted_intr_desc_addr >> cpuid_maxphyaddr(vcpu)))
>   return -EINVAL;

Can you update the comment for this code block?  It has a stale blurb
about "the descriptor address has been already checked in
nested_get_vmcs12_pages" and it'd be nice to state why bits[5:0] must
be zero (your changelog is much more helpful than the current comment).
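
E.g. something along these lines (wording is just a suggestion):

	/*
	 * Bits 15:8 of the notification vector and bits 5:0 of the
	 * descriptor address must be zero, i.e. the descriptor only
	 * needs to be 64-byte aligned, and the address must not
	 * exceed the guest's maximum physical address.
	 */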

With that:

Reviewed-by: Sean Christopherson 

>  
>   /* tpr shadow is needed by all apicv features. */
> -- 
> 2.7.4
> 


Re: [PATCH 2/4] kvm, vmx: move register clearing out of assembly path

2018-10-26 Thread Sean Christopherson
On Wed, Oct 24, 2018 at 10:28:57AM +0200, Julian Stecklina wrote:
> Split the security related register clearing out of the large inline
> assembly VM entry path. This results in two slightly less complicated
> inline assembly statements, where it is clearer what each one does.
> 
> Signed-off-by: Julian Stecklina 
> Reviewed-by: Jan H. Schönherr 
> Reviewed-by: Konrad Jan Miller 
> ---
>  arch/x86/kvm/vmx.c | 33 -
>  1 file changed, 20 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 93562d5..9225099 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -10797,20 +10797,7 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>   "mov %%r13, %c[r13](%0) \n\t"
>   "mov %%r14, %c[r14](%0) \n\t"
>   "mov %%r15, %c[r15](%0) \n\t"
> - "xor %%r8d,  %%r8d \n\t"
> - "xor %%r9d,  %%r9d \n\t"
> - "xor %%r10d, %%r10d \n\t"
> - "xor %%r11d, %%r11d \n\t"
> - "xor %%r12d, %%r12d \n\t"
> - "xor %%r13d, %%r13d \n\t"
> - "xor %%r14d, %%r14d \n\t"
> - "xor %%r15d, %%r15d \n\t"
>  #endif
> -
> - "xor %%eax, %%eax \n\t"
> - "xor %%ebx, %%ebx \n\t"
> - "xor %%esi, %%esi \n\t"
> - "xor %%edi, %%edi \n\t"
>   "pop  %%" _ASM_BP "; pop  %%" _ASM_DX " \n\t"
>   ".pushsection .rodata \n\t"
>   ".global vmx_return \n\t"
> @@ -10847,6 +10834,26 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>  #endif
> );
>  
> + /* Don't let guest register values survive. */
> + asm volatile (
> + ""
> +#ifdef CONFIG_X86_64
> + "xor %%r8d,  %%r8d \n\t"
> + "xor %%r9d,  %%r9d \n\t"
> + "xor %%r10d, %%r10d \n\t"
> + "xor %%r11d, %%r11d \n\t"
> + "xor %%r12d, %%r12d \n\t"
> + "xor %%r13d, %%r13d \n\t"
> + "xor %%r14d, %%r14d \n\t"
> + "xor %%r15d, %%r15d \n\t"
> +#endif
> + :: "a" (0), "b" (0), "S" (0), "D" (0)

Since clearing the GPRs exists to mitigate speculation junk, I think
we should keep the explicit XOR zeroing instead of deferring to the
compiler.  Explicit XORs will ensure the resulting assembly is the
same regardless of compiler, version, target arch, etc..., whereas the
compiler could theoretically use different zeroing methods[1], e.g. on
my system it generates "mov r32,r32" for EBX, ESI and EDI (loading
from EAX after EAX is zeroed).

And FWIW, I find the original code to be more readable since all GPRs
are zeroed with the same method.


[1] As an aside, I was expecting gcc to generate "xor r32,r32" with
-mtune=sandybridge as sandybridge can do mov elimination on xors
that explicitly zero a register but not on generic reg-to-reg mov,
but I was unable to coerce gcc into using xor.

> + : "cc"
> +#ifdef CONFIG_X86_64
> +   , "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
> +#endif
> + );
> +
>   /*
>* We do not use IBRS in the kernel. If this vCPU has used the
>* SPEC_CTRL MSR it may have left it on; save the value and
> -- 
> 2.7.4
> 


Re: [PATCH v14 09/19] x86/mm: x86/sgx: Signal SEGV_SGXERR for #PFs w/ PF_SGX

2018-10-31 Thread Sean Christopherson
On Mon, Oct 01, 2018 at 03:03:30PM -0700, Dave Hansen wrote:
> On 10/01/2018 02:42 PM, Jethro Beekman wrote:
> > 
> > 1) Even though the vDSO function exists, userspace may still call 
> > `ENCLU[EENTER]` manually, so the fault handling as described in the 
> > current patch should also be maintained.
> 
> Why?

Circling back to this question, what if we take the easy way out and
simply signal SIGSEGV without an SGX-specific code?  I.e. treat #PF
with X86_PF_SGX as an access error, no more no less.  That should be
sufficient for userspace to function, albeit with a little more effort,
but presumably no more than would be needed to run on SGX1 hardware.

AFAIK there isn't a way to prevent userspace from manually invoking
EENTER, short of doing some really nasty text poking or PTE swizzling.
We could declare using EENTER as unsupported, but that seems like
cutting off the nose to spite the face.  Supporting userspace EENTER
in a limited capacity would allow people to do whatever crazy tricks
they're wont to do without having to deal with absurd requests for
the vDSO interface.

If we go this route we could also add the vDSO stuff after basic SGX
support is in mainline, obviously with approval from the powers that
be.


Re: [PATCH 4/4] kvm, vmx: remove manually coded vmx instructions

2018-10-26 Thread Sean Christopherson
On Fri, Oct 26, 2018 at 10:46:27AM +, Stecklina, Julian wrote:
> On Wed, 2018-10-24 at 10:44 -0700, Eric Northup wrote:
> > This loses the exception handling from __ex* ->
> > kvm_handle_fault_on_reboot.
> > 
> > If deliberate, this should be called out in changelog.  Has the race
> > which commit 4ecac3fd fixed been mitigated otherwise?
> 
> No, this was not deliberate. Will fix.

This is already in kvm/next as commit 4b1e54786e48 ("KVM/x86: Use
assembly instruction mnemonics instead of .byte streams"), this patch
can be dropped.


Re: RFC: userspace exception fixups

2018-11-02 Thread Sean Christopherson
On Fri, Nov 02, 2018 at 08:02:23PM +0100, Jann Horn wrote:
> On Fri, Nov 2, 2018 at 7:27 PM Sean Christopherson
>  wrote:
> > On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> > > This whole mechanism seems very complicated, and it's not clear
> > > exactly what behavior user code wants.
> >
> > No argument there.  That's why I like the approach of dumping the
> > exception to userspace without trying to do anything intelligent in
> > the kernel.  Userspace can then do whatever it wants AND we don't
> > have to worry about mucking with stacks.
> >
> > One of the hiccups with the VDSO approach is that the enclave may
> > want to use the untrusted stack, i.e. the stack that has the VDSO's
> > stack frame.  For example, Intel's SDK uses the untrusted stack to
> > pass parameters for EEXIT, which means an AEX might occur with what
> > is effectively a bad stack from the VDSO's perspective.
> 
> What exactly does "uses the untrusted stack to pass parameters for
> EEXIT" mean? I guess you're saying that the enclave is writing to
> RSP+[0...some_positive_offset], and the written data needs to be
> visible to the code outside the enclave afterwards?

As is, they actually do it the other way around, i.e. negative offsets
relative to the untrusted %RSP.  Going into the enclave there is no
reserved space on the stack.  The SDK uses EEXIT like a function call,
i.e. pushing parameters on the stack and making an call outside of the
enclave, hence the name out-call.  This allows the SDK to handle any
reasonable out-call without a priori knowledge of the application's
maximum out-call "size".


Rough outline of what happens in a non-faulting case.

1: Userspace executes EENTER

| userspace stack  | 
 <-- %RSP at EENTER


2: Enclave does EEXIT to invoke out-call function


| userspace stack  | 
 <-- %RSP at EENTER
| out-call func ID |
| param1   |
| ...  |
| paramN   |
 <-- %RSP at EEXIT


3: Userspace re-EENTERs enclave after handling EEXIT request


| userspace stack  | 
 <-- %RSP at original EENTER
| out-call func ID |
| param1   |
| ...  |
| paramN   |
 <-- %RSP at post-EEXIT EENTER


4: Enclave cleans up the stack


| userspace stack  | 
 <-- %RSP back at original EENTER



In the faulting case, an AEX can occur while the enclave is pushing
parameters onto the stack for EEXIT.


1: Userspace executes EENTER

| userspace stack  | 
 <-- %RSP at EENTER


2: AEX occurs during enclave prep for EEXIT


| userspace stack  | 
 <-- %RSP at EENTER
| out-call func ID |
| param1   |
| ...  | 
 <-- %RSP at AEX


3: Userspace re-EENTERs enclave to invoke enclave fault handler


| userspace stack  | 
 <-- %RSP at original EENTER
| out-call func ID |
| param1   |
| ...  | 
 <-- %RSP at AEX
| userspace stack  |
 <-- %RSP at EENTER to fault handler


4: Enclave handles the fault, EEXITs back to userspace


| userspace stack  | 
 <-- %RSP at original EENTER
| out-call func ID |
| param1   |
| ...  | 
 <-- %RSP at AEX
| userspace stack  |
 <-- %RSP at EEXIT from fault handler


5: Userspace pops its stack and ERESUMEs back to the enclave

| userspace stack  | 
 <-- %RSP at original EENTER
| out-call func ID |
| param1   |
| ...  | 
 <-- %RSP at ERESUME


6: Enclave finishes its EEXIT to invoke out-call function


| userspace stuff  | 
 <-- %RSP at original EENTER
| out-call func ID |
| param1   |
| ...  |
| paramN   |
 <-- %RSP at EEXIT 
 
> In other words, the vDSO helper would have to not touch the stack
> pointer (only using the 128-byte redzone to store spilled data, at
>

Re: RFC: userspace exception fixups

2018-11-02 Thread Sean Christopherson
On Fri, Nov 02, 2018 at 10:13:23AM -0700, Dave Hansen wrote:
> On 11/2/18 10:06 AM, Sean Christopherson wrote:
> > On Fri, Nov 02, 2018 at 09:56:44AM -0700, Dave Hansen wrote:
> >> On 11/2/18 9:30 AM, Sean Christopherson wrote:
> >>> What if rather than having userspace register an address for fixup, the
> >>> kernel instead unconditionally does fixup on the ENCLU opcode?
> >>
> >> The problem is knowing what to do for the fixup.  If we have a simple
> >> action to take that's universal, like backing up %RIP, or setting some
> >> other register state, it's not bad.
> > 
> > Isn't the EENTER/RESUME behavior universal?  Or am I missing something?
> 
> Could someone write down all the ways we get in and out of the enclave?
> 
> I think we always get in from userspace calling EENTER or ERESUME.  We
> can't ever enter directly from the kernel, like via an IRET from what I
> understand.

Correct, the only way to get into the enclave is EENTER or ERESUME.
My understanding is that even SMIs bounce through the AEX target
before transitioning to SMM.
 
> We get *out* from exceptions, hardware interrupts, or enclave-explicit
> EEXITs.  Did I miss any?  Remind me where the hardware lands the control
> flow in each of those exit cases.

And VMExits.  There are basically two cases: EEXIT and everything else.
EEXIT is a glorified indirect jump, e.g. %RBX holds the target %RIP.
Everything else is an Asynchronous Enclave Exit (AEX).  On an AEX, %RIP
is set to a value specified by EENTER/ERESUME, %RBP and %RSP are
restored to pre-enclave values and all other registers are loaded with
synthetic state.  The actual interrupt/exception/VMExit then triggers,
e.g. the %RIP on the stack for an exception is always the AEX target,
not the %RIP inside the enclave that actually faulted.


Re: RFC: userspace exception fixups

2018-11-02 Thread Sean Christopherson
On Fri, Nov 02, 2018 at 09:56:44AM -0700, Dave Hansen wrote:
> On 11/2/18 9:30 AM, Sean Christopherson wrote:
> > What if rather than having userspace register an address for fixup, the
> > kernel instead unconditionally does fixup on the ENCLU opcode?
> 
> The problem is knowing what to do for the fixup.  If we have a simple
> action to take that's universal, like backing up %RIP, or setting some
> other register state, it's not bad.

Isn't the EENTER/RESUME behavior universal?  Or am I missing something?
 
> Think of our prefetch fixups in the page fault code.  We do some
> instruction decoding to look for them, and then largely return from the
> fault and let the CPU retry.  We know *exactly* what to do for these.
> 
> But, if we need to call arbitrary code, or switch stacks, we need an
> explicit ABI around it *anyway*, because the action to take isn't clear.
> 
> For an enclave exit that's because of a hardware interrupt or page
> fault, life is good.  We really *could* just set %RIP to let ERESUME run
> again, kinda like we do for (some) syscall situations.  But the
> situations for which we can't just call ERESUME, like the out-calls make
> this more challenging.  I think we'd need some explicit new interfaces
> for those.

I don't see how out-calls are a problem.  Once EEXIT completes we're
no longer in the enclave and EPCM faults are no longer a concern, i.e.
we don't need to do fixup.  Every other enclave exit is either an
exception or an interrupt.  And the only way to get back into the
enclave is via ENCLU (EENTER or ERESUME).


Re: RFC: userspace exception fixups

2018-11-02 Thread Sean Christopherson
On Thu, Nov 01, 2018 at 04:22:55PM -0700, Andy Lutomirski wrote:
> On Thu, Nov 1, 2018 at 2:24 PM Linus Torvalds
>  wrote:
> >
> > On Thu, Nov 1, 2018 at 12:31 PM Rich Felker  wrote:
> > >
> > > See my other emails in this thread. You would register the *address*
> > > (in TLS) of a function pointer object pointing to the handler, rather
> > > than the function address of the handler. Then switching handler is
> > > just a single store in userspace, no syscalls involved.
> >
> > Yes.
> >
> > And for just EENTER, maybe that's the right model.
> >
> > If we want to generalize it to other thread-synchronous faults, it
> > needs way more information and a list of handlers, but if we limit the
> > thing to _only_ EENTER getting an SGX fault, then a single "this is
> > the fault handler" address is probably the right thing to do.
> 
> It sounds like you're saying that the kernel should know, *before*
> running any user fixup code, whether the fault in question is one that
> wants a fixup.  Sounds reasonable.
> 
> I think it would be nice, but not absolutely necessary, if user code
> didn't need to poke some value into TLS each time it ran a function
> that had a fixup.  With the poke-into-TLS approach, it looks a lot
> like rseq, and rseq doesn't nest very nicely.  I think we really want
> this mechanism to Just Work.  So we could maybe have a syscall that
> associates a list of fixups with a given range of text addresses.  We
> might want the kernel to automatically zap the fixups when the text in
> question is unmapped.

If this is EENTER specific then nesting isn't an issue.  But I don't
see a simple way to restrict the mechanism to EENTER.

What if rather than having userspace register an address for fixup, the
kernel instead unconditionally does fixup on the ENCLU opcode?  For
example, skip the instruction and put fault info into some combination
of RDX/RSI/RDI (they're cleared on asynchronous enclave exits).

The decode logic is straightforward since ENCLU doesn't have operands,
we'd just have to eat any ignored prefixes.  The intended convention
for EENTER is to have an ENCLU at the AEX target (to automatically do
ERESUME after INTR, etc...), so this would work regardless of whether
the fault happened on EENTER or in the enclave.  EENTER/ERESUME are
the only ENCLU functions that are allowed outside of an enclave so
there's no danger of accidentally crushing something else.

This way we wouldn't need a VDSO blob and we'd enforce the kernel's
ABI, e.g. a library that tried to use signal handling would go off the
rails when the kernel mucked with the registers.  We could even have
the SGX EPC fault handler return VM_FAULT_SIGBUS if the faulting
instruction isn't ENCLU, e.g. to further enforce that the AEX target
needs to be ENCLU.


Userspace would look something like this:

mov tcs, %rbx   /* Thread Control Structure address */
leaq async_exit(%rip), %rcx /* AEX target for EENTER/RESUME */
mov $SGX_EENTER, %rax   /* EENTER leaf */

async_exit:
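/*
 * On an AEX, hardware sets %rip to the AEX target (async_exit) and
 * loads %rax with the ERESUME leaf, so the ENCLU below serves as both
 * the initial EENTER and any post-AEX ERESUME.
 */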
ENCLU

fault_handler:


enclave_exit:   /* EEXIT target */



Re: RFC: userspace exception fixups

2018-11-02 Thread Sean Christopherson
On Fri, Nov 02, 2018 at 04:37:10PM +, Jethro Beekman wrote:
> On 2018-11-02 09:30, Sean Christopherson wrote:
> >... The intended convention for EENTER is to have an ENCLU at the AEX target 
> >...
> >
> >... to further enforce that the AEX target needs to be ENCLU.
> 
> Some SGX runtimes may want to use a different AEX target.

To what end?  Userspace gets no indication as to why the AEX occurred.
And if exceptions are getting transfered to userspace the trampoline
would effectively be handling only INTR, NMI, #MC and EPC #PF.


Re: RFC: userspace exception fixups

2018-11-02 Thread Sean Christopherson
On Fri, Nov 02, 2018 at 04:56:36PM +, Jethro Beekman wrote:
> On 2018-11-02 09:52, Sean Christopherson wrote:
> >On Fri, Nov 02, 2018 at 04:37:10PM +, Jethro Beekman wrote:
> >>On 2018-11-02 09:30, Sean Christopherson wrote:
> >>>... The intended convention for EENTER is to have an ENCLU at the AEX 
> >>>target ...
> >>>
> >>>... to further enforce that the AEX target needs to be ENCLU.
> >>
> >>Some SGX runtimes may want to use a different AEX target.
> >
> >To what end?  Userspace gets no indication as to why the AEX occurred.
> >And if exceptions are getting transfered to userspace the trampoline
> >would effectively be handling only INTR, NMI, #MC and EPC #PF.
> >
> 
> Various reasons...
> 
> Userspace may have established an exception handling convention with the
> enclave (by setting TCS.NSSA > 1) and may want to call EENTER instead of
> ERESUME.

The ERESUME trampoline would only be invoked for exceptions that aren't
transferred to userspace.  On #BR, #UD, etc..., the kernel would fixup
%RIP to effectively point at @fault_handler.  Userspace can then do
whatever it wants to handle the fault, e.g. do EENTER if the fault needs
to be serviced by the enclave.

> Userspace may want fine-grained control over enclave scheduling (e.g.
> SGX-Step)

Uh, isn't SGX-Step an attack on SGX?  Preventing userspace from playing
games with enclave scheduling seems like a good thing.


Re: [PATCH v2 2/3] kvm, vmx: move register clearing out of assembly path

2018-10-29 Thread Sean Christopherson
On Mon, Oct 29, 2018 at 04:40:43PM +0100, Julian Stecklina wrote:
> Split the security related register clearing out of the large inline
> assembly VM entry path. This results in two slightly less complicated
> inline assembly statements, where it is clearer what each one does.
> 
> Signed-off-by: Julian Stecklina 
> Reviewed-by: Jan H. Schönherr 
> Reviewed-by: Konrad Jan Miller 
> Reviewed-by: Jim Mattson 
> ---
>  arch/x86/kvm/vmx.c | 45 -
>  1 file changed, 28 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a6e5a5c..29a2ee7 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -11281,24 +11281,7 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>   "mov %%r13, %c[r13](%0) \n\t"
>   "mov %%r14, %c[r14](%0) \n\t"
>   "mov %%r15, %c[r15](%0) \n\t"
> - /*
> - * Clear host registers marked as clobbered to prevent
> - * speculative use.
> - */
> - "xor %%r8d,  %%r8d \n\t"
> - "xor %%r9d,  %%r9d \n\t"
> - "xor %%r10d, %%r10d \n\t"
> - "xor %%r11d, %%r11d \n\t"
> - "xor %%r12d, %%r12d \n\t"
> - "xor %%r13d, %%r13d \n\t"
> - "xor %%r14d, %%r14d \n\t"
> - "xor %%r15d, %%r15d \n\t"
>  #endif
> -
> - "xor %%eax, %%eax \n\t"
> - "xor %%ebx, %%ebx \n\t"
> - "xor %%esi, %%esi \n\t"
> - "xor %%edi, %%edi \n\t"
>   "pop  %%" _ASM_BP "; pop  %%" _ASM_DX " \n\t"
>   ".pushsection .rodata \n\t"
>   ".global vmx_return \n\t"
> @@ -11336,6 +11319,34 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
> );
>  
>   /*
> +  * Don't let guest register values survive. Registers that cannot
> +  * contain guest values anymore are not touched.

I think it's a good idea to explicitly call out that clearing the GPRs
is done to prevent speculative use.  Simply stating that we don't want
to let guest register values survive doesn't explain *why*.

What about:

/*
 * Explicitly clear (in addition to marking them as clobbered) all GPRs
 * that have not been loaded with host state to prevent speculatively
 * using the guest's values.
 */

> +  */
> + asm volatile (
> + "xor %%eax, %%eax \n\t"
> + "xor %%ebx, %%ebx \n\t"
> + "xor %%esi, %%esi \n\t"
> + "xor %%edi, %%edi \n\t"
> +#ifdef CONFIG_X86_64
> + "xor %%r8d,  %%r8d \n\t"
> + "xor %%r9d,  %%r9d \n\t"
> + "xor %%r10d, %%r10d \n\t"
> + "xor %%r11d, %%r11d \n\t"
> + "xor %%r12d, %%r12d \n\t"
> + "xor %%r13d, %%r13d \n\t"
> + "xor %%r14d, %%r14d \n\t"
> + "xor %%r15d, %%r15d \n\t"
> +#endif
> + ::: "cc"
> +#ifdef CONFIG_X86_64
> +  , "rax", "rbx", "rsi", "rdi"
> +  , "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
> +#else
> +  , "eax", "ebx", "esi", "edi"
> +#endif
> + );
> +
> + /*
>* We do not use IBRS in the kernel. If this vCPU has used the
>* SPEC_CTRL MSR it may have left it on; save the value and
>* turn it off. This is much more efficient than blindly adding
> -- 
> 2.7.4
> 


Re: [PATCH v2 1/3] kvm, vmx: move CR2 context switch out of assembly path

2018-10-29 Thread Sean Christopherson
On Mon, Oct 29, 2018 at 04:40:42PM +0100, Julian Stecklina wrote:
> The VM entry/exit path is a giant inline assembly statement. Simplify it
> by doing CR2 context switching in plain C. Move CR2 restore behind IBRS
> clearing, so we reduce the amount of code we execute with IBRS on.

I think it's worth documenting two things in the changelog:

  - Using {read,write}_cr2() means KVM will use pv_mmu_ops instead of
open coding native_{read,write}_cr2().

  - The CR2 code has been done in assembly since KVM's genesis[1],
which predates the addition of the paravirt ops[2], i.e. KVM isn't
deliberately avoiding the paravirt ops.

The above info makes it trivially easy to review this patch from a
correctness standpoint.  With that:

Reviewed-by: Sean Christopherson 


[1] Commit 6aa8b732ca01 ("[PATCH] kvm: userspace interface")
[2] Commit d3561b7fa0fb ("[PATCH] paravirt: header and stubs for 
paravirtualisation")
 
> Signed-off-by: Julian Stecklina 
> Reviewed-by: Jan H. Schönherr 
> Reviewed-by: Konrad Jan Miller 
> Reviewed-by: Jim Mattson 
> ---
>  arch/x86/kvm/vmx.c | 15 +--
>  1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index ccc6a01..a6e5a5c 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -11212,6 +11212,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>   evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
>   (unsigned long)&current_evmcs->host_rsp : 0;
>  
> + if (read_cr2() != vcpu->arch.cr2)
> + write_cr2(vcpu->arch.cr2);
> +
>   if (static_branch_unlikely(&vmx_l1d_should_flush))
>   vmx_l1d_flush(vcpu);
>  
> @@ -11231,13 +11234,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>   "2: \n\t"
>   __ex("vmwrite %%" _ASM_SP ", %%" _ASM_DX) "\n\t"
>   "1: \n\t"
> - /* Reload cr2 if changed */
> - "mov %c[cr2](%0), %%" _ASM_AX " \n\t"
> - "mov %%cr2, %%" _ASM_DX " \n\t"
> - "cmp %%" _ASM_AX ", %%" _ASM_DX " \n\t"
> - "je 3f \n\t"
> - "mov %%" _ASM_AX", %%cr2 \n\t"
> - "3: \n\t"
>   /* Check if vmlaunch of vmresume is needed */
>   "cmpl $0, %c[launched](%0) \n\t"
>   /* Load guest registers.  Don't clobber flags. */
> @@ -11298,8 +11294,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>   "xor %%r14d, %%r14d \n\t"
>   "xor %%r15d, %%r15d \n\t"
>  #endif
> - "mov %%cr2, %%" _ASM_AX "   \n\t"
> - "mov %%" _ASM_AX ", %c[cr2](%0) \n\t"
>  
>   "xor %%eax, %%eax \n\t"
>   "xor %%ebx, %%ebx \n\t"
> @@ -11331,7 +11325,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>   [r14]"i"(offsetof(struct vcpu_vmx, 
> vcpu.arch.regs[VCPU_REGS_R14])),
>   [r15]"i"(offsetof(struct vcpu_vmx, 
> vcpu.arch.regs[VCPU_REGS_R15])),
>  #endif
> - [cr2]"i"(offsetof(struct vcpu_vmx, vcpu.arch.cr2)),
>   [wordsize]"i"(sizeof(ulong))
> : "cc", "memory"
>  #ifdef CONFIG_X86_64
> @@ -11365,6 +11358,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>   /* Eliminate branch target predictions from guest mode */
>   vmexit_fill_RSB();
>  
> + vcpu->arch.cr2 = read_cr2();
> +
>   /* All fields are clean at this point */
>   if (static_branch_unlikely(&enable_evmcs))
>   current_evmcs->hv_clean_fields |=
> -- 
> 2.7.4
> 


[PATCH] mm/mmu_notifier: rename mmu_notifier_synchronize() to <...>_barrier()

2018-11-05 Thread Sean Christopherson
...and update its comment to explicitly reference its association with
mmu_notifier_call_srcu().

Contrary to its name, mmu_notifier_synchronize() does not synchronize
the notifier's SRCU instance, but rather waits for RCU callbacks to
finish, i.e. it invokes rcu_barrier().  The RCU documentation is
quite clear on this matter, explicitly calling out that rcu_barrier()
does not imply synchronize_rcu().  The misnomer could lead an unwary
developer to incorrectly assume that mmu_notifier_synchronize() can
be used in conjunction with mmu_notifier_unregister_no_release() to
implement a variation of mmu_notifier_unregister() that synchronizes
SRCU without invoking ->release.  A Documentation-allergic and hasty
developer could be further confused by the fact that rcu_barrier() is
indeed a pass-through to synchronize_rcu()... in tiny SRCU.

Signed-off-by: Sean Christopherson 
---
 mm/mmu_notifier.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 5119ff846769..46ebea6483bf 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -35,12 +35,12 @@ void mmu_notifier_call_srcu(struct rcu_head *rcu,
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_call_srcu);
 
-void mmu_notifier_synchronize(void)
+void mmu_notifier_barrier(void)
 {
-   /* Wait for any running method to finish. */
+   /* Wait for any running RCU callbacks (see above) to finish. */
srcu_barrier();
 }
-EXPORT_SYMBOL_GPL(mmu_notifier_synchronize);
+EXPORT_SYMBOL_GPL(mmu_notifier_barrier);
 
 /*
  * This function can't run concurrently against mmu_notifier_register
-- 
2.19.1



[PATCH] x86/cpufeatures: Remove get_scattered_cpuid_leaf()

2018-11-05 Thread Sean Christopherson
get_scattered_cpuid_leaf() was added[1] to help KVM rebuild hardware-
defined leafs that are rearranged by Linux to avoid bloating the
x86_capability array.  Eventually, the last consumer of the function
was removed[2], but the function itself was kept, perhaps even
intentionally as a form of documentation.

Remove get_scattered_cpuid_leaf() as it is currently not used by KVM.
Furthermore, simply rebuilding the "real" leaf does not resolve all of
KVM's woes when it comes to exposing a scattered CPUID feature, i.e.
keeping the function as documentation may be counter-productive in
some scenarios, e.g. when KVM needs to do more than simply expose the
leaf.

[1] 47bdf3378d62 ("x86/cpuid: Provide get_scattered_cpuid_leaf()")
[2] b7b27aa011a1 ("KVM/x86: Update the reverse_cpuid list to include 
CPUID_7_EDX")

Signed-off-by: Sean Christopherson 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
---
 arch/x86/kernel/cpu/cpu.h   |  3 ---
 arch/x86/kernel/cpu/scattered.c | 24 
 2 files changed, 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index da5446acc241..5eb946b9a9f3 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -49,9 +49,6 @@ extern void get_cpu_cap(struct cpuinfo_x86 *c);
 extern void get_cpu_address_sizes(struct cpuinfo_x86 *c);
 extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c);
 extern void init_scattered_cpuid_features(struct cpuinfo_x86 *c);
-extern u32 get_scattered_cpuid_leaf(unsigned int level,
-   unsigned int sub_leaf,
-   enum cpuid_regs_idx reg);
 extern void init_intel_cacheinfo(struct cpuinfo_x86 *c);
 extern void init_amd_cacheinfo(struct cpuinfo_x86 *c);
 extern void init_hygon_cacheinfo(struct cpuinfo_x86 *c);
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 772c219b6889..0631f5328b7f 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -56,27 +56,3 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
set_cpu_cap(c, cb->feature);
}
 }
-
-u32 get_scattered_cpuid_leaf(unsigned int level, unsigned int sub_leaf,
-enum cpuid_regs_idx reg)
-{
-   const struct cpuid_bit *cb;
-   u32 cpuid_val = 0;
-
-   for (cb = cpuid_bits; cb->feature; cb++) {
-
-   if (level > cb->level)
-   continue;
-
-   if (level < cb->level)
-   break;
-
-   if (reg == cb->reg && sub_leaf == cb->sub_leaf) {
-   if (cpu_has(&boot_cpu_data, cb->feature))
-   cpuid_val |= BIT(cb->bit);
-   }
-   }
-
-   return cpuid_val;
-}
-EXPORT_SYMBOL_GPL(get_scattered_cpuid_leaf);
-- 
2.19.1



Re: RFC: userspace exception fixups

2018-11-06 Thread Sean Christopherson
On Tue, 2018-11-06 at 08:57 -0800, Andy Lutomirski wrote:
>
> So I guess the non-enclave code basically can’t trust its stack pointer
> because of these shenanigans. And the AEP code has to live with the fact
> that its RSP is basically arbitrary and probably can’t even be unwound
> by a debugger?

The SDK provides a Python GDB plugin to hook into the out-call flow and
do more stack shenanigans.  From what I can tell it's fudging the stack
to make it look like a normal stack frame so the debugger can do it's
thing.

> And the EENTER code has to deal with the fact that its red zone can be
> blatantly violated by the enclave?

That's my understanding of things.  So yeah, if it wasn't obvious before,
the trusted and untrusted parts of the SDK are very tightly coupled.


Re: [PATCH v16 18/22] platform/x86: Intel SGX driver

2018-11-06 Thread Sean Christopherson
On Tue, 2018-11-06 at 15:45 +0200, Jarkko Sakkinen wrote:
> Intel Software Guard eXtensions (SGX) is a set of CPU instructions that
> can be used by applications to set aside private regions of code and
> data. The code outside the enclave is disallowed to access the memory
> inside the enclave by the CPU access control.
> 
> SGX driver provides an ioctl API for loading and initializing enclaves.
> Address range for enclaves is reserved with mmap() and they are
> destroyed with munmap(). Enclave construction, measurement and
> initialization is done with the provided ioctl API.

...

> +struct sgx_encl {
> + unsigned int flags;
> + uint64_t attributes;
> + uint64_t xfrm;
> + unsigned int page_cnt;
> + unsigned int secs_child_cnt;
> + struct mutex lock;
> + struct mm_struct *mm;
> + struct file *backing;

Is there any particular reason why the kernel manages the backing for
the enclave and the PCMDs?  Could we have userspace provide the backing
either through the ECREATE ioctl() or maybe a completely new ioctl(),
e.g. to give userspace the option to back the enclave with a NVDIMM
instead of RAM?  A separate ioctl() with control flags might give us
some flexibility in the future, e.g. maybe there are use cases where
userspace would prefer to kill enclaves rather than swap EPC.

> + struct kref refcount;
> + unsigned long base;
> + unsigned long size;
> + unsigned long ssaframesize;
> + struct radix_tree_root page_tree;
> + struct list_head add_page_reqs;
> + struct work_struct add_page_work;
> + struct sgx_encl_page secs;
> + struct pid *tgid;
> + struct mmu_notifier mmu_notifier;
> + struct notifier_block pm_notifier;
> +};
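
For the sake of discussion, a backing ioctl() could look something along
these lines (the struct, field names and ioctl number are all made up,
just to illustrate the idea):

	struct sgx_enclave_set_backing {
		__u64	backing_fd;	/* backing for swapped EPC pages,
					 * e.g. memfd or an NVDIMM-backed file */
		__u64	pcmd_fd;	/* backing for the PCMD entries */
		__u64	flags;		/* e.g. "kill the enclave instead
					 * of swapping" */
	};

	#define SGX_IOC_ENCLAVE_SET_BACKING \
		_IOW(SGX_MAGIC, 0x05, struct sgx_enclave_set_backing)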


Re: RFC: userspace exception fixups

2018-11-07 Thread Sean Christopherson
On Wed, Nov 07, 2018 at 07:34:52AM -0800, Sean Christopherson wrote:
> On Tue, Nov 06, 2018 at 05:17:14PM -0800, Andy Lutomirski wrote:
> > On Tue, Nov 6, 2018 at 4:02 PM Sean Christopherson
> >  wrote:
> > >
> > > On Tue, Nov 06, 2018 at 03:39:48PM -0800, Andy Lutomirski wrote:
> > > > On Tue, Nov 6, 2018 at 3:35 PM Sean Christopherson
> > > >  wrote:
> > > > >
> > > > > Sorry if I'm beating a dead horse, but what if we only did fixup on 
> > > > > ENCLU
> > > > > with a specific (ignored) prefix pattern?  I.e. effectively make the 
> > > > > magic
> > > > > fixup opt-in, falling back to signals.  Jamming RIP to skip ENCLU 
> > > > > isn't
> > > > > that far off the architecture, e.g. EENTER stuffs RCX with the next 
> > > > > RIP so
> > > > > that the enclave can EEXIT to immediately after the EENTER location.
> > > > >
> > > >
> > > > How does that even work, though?  On an AEX, RIP points to the ERESUME
> > > > instruction, not the EENTER instruction, so if we skip it we just end
> > > > up in lala land.
> > >
> > > Userspace would obviously need to be aware of the fixup behavior, but
> > > it actually works out fairly nicely to have a separate path for ERESUME
> > > fixup since a fault on EENTER is generally fatal, whereas as a fault on
> > > ERESUME might be recoverable.
> > >
> > 
> > Hmm.
> > 
> > >
> > > do_eenter:
> > > mov tcs, %rbx
> > > lea async_exit, %rcx
> > > mov $EENTER, %rax
> > > ENCLU
> > 
> > Or SOME_SILLY_PREFIX ENCLU?
> 
> Yeah, forgot to include that.
> 
> > >
> > > /*
> > >  * EEXIT or EENTER faulted.  In the latter case, %RAX already holds some
> > >  * fault indicator, e.g. -EFAULT.
> > >  */
> > > eexit_or_eenter_fault:
> > > ret
> > 
> > But userspace wants to know whether it was a fault or not.  So I think
> > we either need two landing pads or we need to hijack a flag bit (are
> > there any known-zeroed flag bits after EEXIT?) to say whether it was a
> > fault.  And, if it was a fault, we should give the vector, the
> > sanitized error code, and possibly CR2.
> 
> As Jethro mentioned, RAX will always be 4 on a successful EEXIT, so we
> can use RAX to indicate a fault.  That's what I was trying to imply with
> EFAULT.  Here's the reg stuffing I use for the POC:
> 
>   regs->ax = EFAULT;
>   regs->di = trapnr;
>   regs->si = error_code;
>   regs->dx = address;
> 
> 
> Well-known RAX values also means the kernel fault handlers only need to
> look for SOME_SILLY_PREFIX ENCLU if RAX==2 || RAX==3, i.e. the fault
> occurred on EENTER or in an enclave (RAX is set to ERESUME's leaf as
part of the asynchronous enclave exit flow).

POC kernel code, 64-bit only.

Limiting this to 64-bit isn't necessary, but it makes the code prettier
and allows using REX as the magic prefix.  I like the idea of using REX
because it seems least likely to be repurposed for yet another new
feature.  I have no idea if 64-bit only will fly with the SDK folks.

Going off comments in similar code related to UMIP, we'd need to figure
out how to handle protection keys.


/* REX with all bits set, ignored by ENCLU. */
#define SGX_DO_ENCLU_FIXUP  0x4F

#define SGX_ENCLU_OPCODE0   0x0F
#define SGX_ENCLU_OPCODE1   0x01
#define SGX_ENCLU_OPCODE2   0xD7

/* ENCLU is a three-byte opcode, plus one byte for the magic prefix. */
#define SGX_ENCLU_FIXUP_INSN_LEN4

static int sgx_detect_enclu(struct pt_regs *regs)
{
unsigned char buf[SGX_ENCLU_FIXUP_INSN_LEN];

/* Look for EENTER or ERESUME in RAX, 64-bit mode only. */
if (!regs || (regs->ax != 2 && regs->ax != 3) || !user_64bit_mode(regs))
return 0;

if (copy_from_user(buf, (void __user *)(regs->ip), sizeof(buf)))
return 0;

if (buf[0] == SGX_DO_ENCLU_FIXUP &&
buf[1] == SGX_ENCLU_OPCODE0 &&
buf[2] == SGX_ENCLU_OPCODE1 &&
buf[3] == SGX_ENCLU_OPCODE2)
return SGX_ENCLU_FIXUP_INSN_LEN;

return 0;
}

bool sgx_fixup_enclu_fault(struct pt_regs *regs, int trapnr,
   unsigned long error_code, unsigned long address)
{
int insn_len;

insn_len = sgx_detect_enclu(regs);
if (!insn_len)
return false;

regs->ip += insn_len;
regs->ax = EFAULT;
regs->di = trapnr;
regs->si = error_code;
regs->dx = address;
return true;
}


Re: [PATCH v16 18/22] platform/x86: Intel SGX driver

2018-11-07 Thread Sean Christopherson
On Wed, Nov 07, 2018 at 06:37:57PM +0200, Jarkko Sakkinen wrote:
> On Tue, Nov 06, 2018 at 08:40:00AM -0800, Sean Christopherson wrote:
> > On Tue, 2018-11-06 at 15:45 +0200, Jarkko Sakkinen wrote:
> > > Intel Software Guard eXtensions (SGX) is a set of CPU instructions that
> > > can be used by applications to set aside private regions of code and
> > > data. The code outside the enclave is disallowed to access the memory
> > > inside the enclave by the CPU access control.
> > > 
> > > SGX driver provides an ioctl API for loading and initializing enclaves.
> > > Address range for enclaves is reserved with mmap() and they are
> > > destroyed with munmap(). Enclave construction, measurement and
> > > initialization is done with the provided ioctl API.
> > 
> > ...
> > 
> > > +struct sgx_encl {
> > > + unsigned int flags;
> > > + uint64_t attributes;
> > > + uint64_t xfrm;
> > > + unsigned int page_cnt;
> > > + unsigned int secs_child_cnt;
> > > + struct mutex lock;
> > > + struct mm_struct *mm;
> > > + struct file *backing;
> > 
> > Is there any particular reason why the kernel manages the backing for
> > the enclave and the PCMDs?  Could we have userspace provide the backing
> > either through the ECREATE ioctl() or maybe a completely new ioctl(),
> > e.g. to give userspace the option to back the enclave with a NVDIMM
> > instead of RAM?  A separate ioctl() with control flags might give us
> > some flexibility in the future, e.g. maybe there are use cases where
> > userspace would prefer to kill enclaves rather than swap EPC.
> 
> Not really, except that no one has complained. The very first swapping
> code that I implemented used a VMA as backing storage. I could take
> pieces of that code to replace the shmem specifics. The difference was
> that the driver did vm_mmap(). Now that you've suggested the above, I
> wonder why it did not come to mind back then to provide the VMA as a
> parameter.
> 
> A single buffer that can hold both PCMD entries and swapped pages in its
> address space would probably be the best way to do it. I would add that
> as a field to struct sgx_enclave_create. If we want the kill behavior,
> you could signal that with a NULL value.

What do we gain by a single buffer vs. separate buffers?  The ioctl()
would be slightly smaller but it seems like the actual code would be
more complex.
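
As a strawman for the separate-buffers option (purely illustrative: the
struct, the 0x05 ioctl number and the field layout below don't exist
anywhere, only SGX_MAGIC is meant to be the driver's existing ioctl magic):

#include <linux/ioctl.h>
#include <linux/types.h>

struct sgx_enclave_set_backing {
	__u64 backing;	/* user VA backing swapped EPC page contents */
	__u64 pcmd;	/* user VA backing the corresponding PCMD entries */
	__u64 flags;	/* e.g. a "kill instead of swap" bit */
};

#define SGX_IOC_ENCLAVE_SET_BACKING \
	_IOW(SGX_MAGIC, 0x05, struct sgx_enclave_set_backing)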

The enclave build process also utilizes the backing as temp storage
to avoid having to alloc kernel memory when queueing pages to be added
by the worker thread (which reminds me that I wanted to document why a
worker thread is used).  Keeping this behavior would effectively make
providing backing mandatory.

Are there any potential complications with ENCLS consuming userspace
pointers?  We'd have to wrap them with user_access_{begin,end}() and
probably tweak the fixup, but I assume having the fixup handler means
we're generally ok?
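
As a concrete (untested) sketch of what I mean, assuming the
user_access_begin(ptr, len) form and an __eadd() ENCLS[EADD] wrapper along
the lines of the driver's; the function name and the exact hook point are
made up:

static int sgx_eadd_from_user(struct sgx_pageinfo *pginfo, void *epc_va,
			      void __user *src, unsigned long len)
{
	int ret;

	/* Open the user access window for the source page ENCLS will read. */
	if (!user_access_begin(src, len))
		return -EFAULT;

	/* ENCLS[EADD] reads the source page through the user pointer in pginfo. */
	ret = __eadd(pginfo, epc_va);

	user_access_end();

	return ret;
}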


Re: [PATCH] mm/mmu_notifier: rename mmu_notifier_synchronize() to <...>_barrier()

2018-11-05 Thread Sean Christopherson
On Mon, Nov 05, 2018 at 12:18:33PM -0800, Andrew Morton wrote:
> On Mon,  5 Nov 2018 11:29:55 -0800 Sean Christopherson 
>  wrote:
> 
> > ...and update its comment to explicitly reference its association with
> > mmu_notifier_call_srcu().
> > 
> > Contrary to its name, mmu_notifier_synchronize() does not synchronize
> > the notifier's SRCU instance, but rather waits for RCU callbacks to
> > finish, i.e. it invokes rcu_barrier().  The RCU documentation is
> > quite clear on this matter, explicitly calling out that rcu_barrier()
> > does not imply synchronize_rcu().  The misnomer could lead an unwary
> > developer to incorrectly assume that mmu_notifier_synchronize() can
> > be used in conjunction with mmu_notifier_unregister_no_release() to
> > implement a variation of mmu_notifier_unregister() that synchronizes
> > SRCU without invoking ->release.  A Documentation-allergic and hasty
> > developer could be further confused by the fact that rcu_barrier() is
> > indeed a pass-through to synchronize_rcu()... in tiny SRCU.
> 
> Fair enough.
> 
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -35,12 +35,12 @@ void mmu_notifier_call_srcu(struct rcu_head *rcu,
> >  }
> >  EXPORT_SYMBOL_GPL(mmu_notifier_call_srcu);
> >  
> > -void mmu_notifier_synchronize(void)
> > +void mmu_notifier_barrier(void)
> >  {
> > -   /* Wait for any running method to finish. */
> > +   /* Wait for any running RCU callbacks (see above) to finish. */
> > srcu_barrier();
> >  }
> > -EXPORT_SYMBOL_GPL(mmu_notifier_synchronize);
> > +EXPORT_SYMBOL_GPL(mmu_notifier_barrier);
> >  
> >  /*
> >   * This function can't run concurrently against mmu_notifier_register
> 
> But as it has no callers, why retain it?

I was hesitant to remove it altogether since it was explicitly added to
complement mmu_notifier_call_srcu()[1] even though the initial user of
mmu_notifier_call_srcu() didn't use mmu_notifier_synchronize()[2].  I
assume there was a good reason for adding the barrier function, but
maybe that's a bad assumption.

[1] b972216e27d1 ("mmu_notifier: add call_srcu and sync function for listener to delay call and sync")
[2] https://lore.kernel.org/patchwork/patch/515318/


Re: RFC: userspace exception fixups

2018-11-06 Thread Sean Christopherson
On Tue, Nov 06, 2018 at 03:39:48PM -0800, Andy Lutomirski wrote:
> On Tue, Nov 6, 2018 at 3:35 PM Sean Christopherson
>  wrote:
> >
> > On Tue, Nov 06, 2018 at 03:00:56PM -0800, Andy Lutomirski wrote:
> > >
> > >
> > > >> On Nov 6, 2018, at 1:59 PM, Sean Christopherson 
> > > >>  wrote:
> > > >>
> > > >>> On Tue, 2018-11-06 at 13:41 -0800, Andy Lutomirski wrote:
> > > >> Sean, how does the current SDK AEX handler decide whether to do
> > > >> EENTER, ERESUME, or just bail and consider the enclave dead?  It seems
> > > >> like the *CPU* could give a big hint, but I don't see where there is
> > > >> any architectural indication of why the AEX code got called or any
> > > >> obvious way for the user code to know whether the exit was fixed up by
> > > >> the kernel?
> > > >
> > > > The SDK "unconditionally" does ERESUME at the AEP location, but that's
> > > > a bit misleading because its signal handler may muck with the context's
> > > > RIP, e.g. to abort the enclave on a fatal fault.
> > > >
> > > > On an event/exception from within an enclave, the event is immediately
> > > > delivered after loading synthetic state and changing RIP to the AEP.
> > > > In other words, jamming CPU state is essentially a bunch of vectoring
> > > > ucode preamble, but from software's perspective it's a normal event
> > > > that happens to point at the AEP instead of somewhere in the enclave.
> > > > And because the signals the SDK cares about are all synchronous, the
> > > > SDK can simply hardcode ERESUME at the AEP since all of the fault logic
> > > > resides in its signal handler.  IRQs and whatnot simply trampoline back
> > > > into the enclave.
> > > >
> > > > Userspace can do something funky instead of ERESUME, but only *after*
> > > > IRET/RSM/VMRESUME has returned to the AEP location, and in Linux's
> > > > case, after the trap handler has run.
> > > >
> > > > Jumping back a bit, how much do we care about preventing userspace
> > > > from doing stupid things?
> > >
> > > My general feeling is that userspace should be allowed to do apparently
> > > stupid things. For example, as far as the kernel is concerned, Wine and
> > > DOSEMU are just user programs that do stupid things. Linux generally tries
> > > to provide a reasonably complete view of architectural behavior. This is
> > > in contrast to, say, Windows, where IIUC doing an unapproved WRFSBASE may
> > > cause very odd behavior indeed. So magic fixups that do non-architectural
> > > things are not so great.
> >
> > Sorry if I'm beating a dead horse, but what if we only did fixup on ENCLU
> > with a specific (ignored) prefix pattern?  I.e. effectively make the magic
> > fixup opt-in, falling back to signals.  Jamming RIP to skip ENCLU isn't
> > that far off the architecture, e.g. EENTER stuffs RCX with the next RIP so
> > that the enclave can EEXIT to immediately after the EENTER location.
> >
> 
> How does that even work, though?  On an AEX, RIP points to the ERESUME
> instruction, not the EENTER instruction, so if we skip it we just end
> up in lala land.

Userspace would obviously need to be aware of the fixup behavior, but
it actually works out fairly nicely to have a separate path for ERESUME
fixup since a fault on EENTER is generally fatal, whereas a fault on
ERESUME might be recoverable.


do_eenter:
	mov	tcs, %rbx
	lea	async_exit, %rcx	/* AEP: AEX events resume here */
	mov	$EENTER, %rax
	ENCLU

/*
 * EEXIT or EENTER faulted.  In the latter case, %RAX already holds some
 * fault indicator, e.g. -EFAULT.
 */
eexit_or_eenter_fault:
	ret

async_exit:
	ENCLU				/* %RAX holds ERESUME after an AEX */

fixup_handler:
	/* Reached only if the kernel fixed up a fault on the ERESUME. */

> How averse would everyone be to making enclave entry be a syscall?
> The user code would do sys_sgx_enter_enclave(), and the kernel would
> stash away the register state (vm86()-style), point RIP to the vDSO's
> ENCLU instruction, point RCX to another vDSO ENCLU instruction, and
> SYSRET.  The trap handlers would understand what's going on and
> restore register state accordingly.

Wouldn't that blast away any stack changes made by the enclave?


Re: RFC: userspace exception fixups

2018-11-07 Thread Sean Christopherson
On Wed, Nov 07, 2018 at 04:27:58PM -0500, Rich Felker wrote:
> On Tue, Nov 06, 2018 at 03:26:16PM -0800, Sean Christopherson wrote:
> > On Tue, Nov 06, 2018 at 06:17:30PM -0500, Rich Felker wrote:
> > > On Tue, Nov 06, 2018 at 11:02:11AM -0800, Andy Lutomirski wrote:
> > > > On Tue, Nov 6, 2018 at 10:41 AM Dave Hansen  
> > > > wrote:
> > > > >
> > > > > On 11/6/18 10:20 AM, Andy Lutomirski wrote:
> > > > > > I almost feel like the right solution is to call into SGX on its own
> > > > > > private stack or maybe even its own private address space.
> > > > >
> > > > > Yeah, I had the same gut feeling.  Couldn't the debugger even treat 
> > > > > the
> > > > > enclave like its own "thread" with its own stack and its own set of
> > > > > registers and context?  That seems like a much more workable model 
> > > > > than
> > > > > trying to weave it together with the EENTER context.
> > > > 
> > > > So maybe the API should be, roughly
> > > > 
> > > > sgx_exit_reason_t sgx_enter_enclave(pointer_to_enclave, struct
> > > > host_state *state);
> > > > sgx_exit_reason_t sgx_resume_enclave(same args);
> > > > 
> > > > where host_state is something like:
> > > > 
> > > > struct host_state {
> > > >   unsigned long bp, sp, ax, bx, cx, dx, si, di;
> > > > };
> > > > 
> > > > and the values in host_state explicitly have nothing to do with the
> > > > actual host registers.  So, if you want to use the outcall mechanism,
> > > > you'd allocate some memory, point sp to that memory, call
> > > > sgx_enter_enclave(), and then read that memory to do the outcall.
> > > > 
> > > > Actually implementing this would be distinctly nontrivial, and would
> > > > almost certainly need some degree of kernel help to avoid an explosion
> > > > when a signal gets delivered while we have host_state.sp loaded into
> > > > the actual SP register.  Maybe rseq could help with this?
> > > > 
> > > > The ISA here is IMO not well thought through.
> > > 
> > > Maybe I'm mistaken about some fundamentals here, but my understanding
> > > of SGX is that the whole point is that the host application and the
> > > code running in the enclave are mutually adversarial towards one
> > > another. Do any or all of the proposed protocols here account for this
> > > and fully protect the host application from malicious code in the
> > > enclave? It seems that having control over the register file on exit
> > > from the enclave is fundamentally problematic but I assume there must
> > > be some way I'm missing that this is fixed up.
> > 
> > SGX provides protections for the enclave but not the other way around.
> > The kernel has all of its normal non-SGX protections in place, but the
> > enclave can certainly wreak havoc on its userspace process.  The basic
> > design idea is that the enclave is a specialized .so that gets extra
> > security protections but is still effectively part of the overall
> > application, e.g. it has full access to its host userspace process'
> > virtual memory.
> 
> In that case it seems like the only way to use SGX that's not a gaping
> security hole is to run the SGX enclave in its own fully-seccomp (or
> equivalent) process, with no host application in the same address
> space. Since the host application can't see the contents of the
> enclave to make any determination of whether it's safe to run, running
> it in the same address space only makes sense if the cpu provides
> protection against unwanted accesses to the host's memory from the
> enclave -- and according to you, it doesn't.

The enclave's code (and any initial data) isn't encrypted until the
pages are loaded into the Enclave Page Cache (EPC), which can only
be done by the kernel (via ENCLS[EADD]).  In other words, both the
kernel and userspace can vet the code/data before running an enclave.

Practically speaking, an enclave will be coupled with an untrusted
userspace runtime, i.e. its loader.  Enclaves are also measured
as part of their build process, and so the enclave loader needs to
know which pages to add to the measurement, and in what order.  I
guess technically speaking an enclave could have zero pages added
to its measurement, but that'd probably be a big red flag that said
enclave is up to something fishy.


Re: RFC: userspace exception fixups

2018-11-08 Thread Sean Christopherson
On Thu, Nov 08, 2018 at 12:10:30PM -0800, Dave Hansen wrote:
> On 11/8/18 12:05 PM, Andy Lutomirski wrote:
> > Hmm.  The idea being that the SDK preserves RBP but not RSP.  That's
> > not the most terrible thing in the world.  But could the SDK live with
> > something more like my suggestion where the vDSO supplies a normal
> > function that takes a struct containing registers that are visible to
> > the enclave?  This would make it extremely awkward for the enclave to
> > use the untrusted stack per se, but it would make it quite easy (I
> > think) for the untrusted part of the SDK to allocate some extra memory
> > and just tell the enclave that *that* memory is the stack.
> 
> I really think the enclave should keep its grubby mitts off the
> untrusted stack.  There are lots of ways to get memory, even with
> stack-like semantics, that don't involve mucking with the stack itself.
> 
> I have not heard a good, hard argument for why there is an absolute
> *need* to store things on the actual untrusted stack.

Convenience and performance are the only arguments I've heard, e.g. so
that allocating memory doesn't require an extra EEXIT->EENTER round trip.

> We could quite easily have the untrusted code just promise to allocate a
> stack-sized virtual area (even derived from the stack rlimit size) and
> pass that into the enclave for parameter use.

I agree more and more the further I dig.  AFAIK there is no need for
the enclave to actually load %rsp.  The initial EENTER can pass in the
base/top of the pseudo-stack and from there the enclave can manage it
purely in software; a rough sketch of that flow is below.
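
Userspace-side sketch of the pseudo-stack idea: allocate an ordinary
anonymous mapping (size could be derived from RLIMIT_STACK) and hand its
bounds to the enclave as plain EENTER parameters, so the enclave never
loads %rsp with untrusted memory.  sgx_enter_enclave() and the parameter
layout are made up for illustration.

#include <stddef.h>
#include <sys/mman.h>

struct sgx_pseudo_stack {
	void *base;	/* lowest address of the scratch area */
	void *top;	/* one past the highest address; the cursor lives in the enclave */
};

/* Illustrative entry wrapper, not an existing API. */
extern long sgx_enter_enclave(void *tcs, struct sgx_pseudo_stack *params);

static long run_enclave(void *tcs)
{
	struct sgx_pseudo_stack ps;
	size_t len = 8 * 1024 * 1024;

	ps.base = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ps.base == MAP_FAILED)
		return -1;
	ps.top = (char *)ps.base + len;

	/* The enclave manages its own cursor within [base, top). */
	return sgx_enter_enclave(tcs, &ps);
}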

