Hi all,

This series contains the RFC design for Xen SGX virtualization support, along
with RFC draft patches.

Intel SGX (Software Guard Extensions) is a new set of instructions and memory
access mechanisms targeting application developers seeking to protect
select code and data from disclosure or modification.

The SGX specification can be found in the latest Intel SDM as Volume 3D:

https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf

The SGX specification is fairly complicated (it takes up the entire Volume 3D),
so it is unrealistic to list all hardware details here. The first part of the
design is a brief SGX introduction, which I think is mandatory for introducing
SGX virtualization support. Part 2 is the design itself, and I put some
references at the end.

In the first part I only introduce the info related to virtualization support,
although this is definitely not the most important part of SGX. Other parts of
SGX (mostly related to cryptography), i.e., enclave measurement, SGX key
architecture, Sealing & Attestation (which is actually a critical feature) are
omitted. Please refer to the SGX specification for detailed info.

In the design there are some particular points where I don't know which
implementation is better. For those I added a question mark (?) at the right
of the menu entry. Your comments on those parts (and other comments as well, of
course) are highly appreciated.

Because SGX has lots of details, I think the design itself can only be high
level, so I also included the RFC patches, which contain lots of details.
Your comments on the patches are also highly appreciated.

The code can also be found in the github repo below:

    # git clone https://github.com/01org/xen-sgx -b rfc-v1

There is also another branch named 4.6-sgx, which is another implementation
based on Xen 4.6. It is old, but it differs from these rfc-v1 patches in some
design choices (e.g., it adds a dedicated hypercall).

Please help to review and give comments. Thanks in advance.

==============================================================================

1. SGX Introduction
    1.1 Overview
        1.1.1 Enclave
        1.1.2 EPC (Enclave Page Cache)
        1.1.3 ENCLS and ENCLU
    1.2 Discovering SGX Capability
        1.2.1 Enumerate SGX via CPUID
        1.2.2 Intel SGX Opt-in Configuration
    1.3 Enclave Life Cycle
        1.3.1 Constructing & Destroying Enclave
        1.3.2 Enclave Entry and Exit
            1.3.2.1 Synchronous Entry and Exit
            1.3.2.2 Asynchronous Enclave Exit
        1.3.3 EPC Eviction and Reload
    1.4 SGX Launch Control
    1.5 SGX Interaction with IA32 and Intel 64 Architecture
2. SGX Virtualization Design
    2.1 High Level Toolstack Changes
        2.1.1 New 'epc' parameter
        2.1.2 New XL commands (?)
        2.1.3 Notify domain's virtual EPC base and size to Xen
        2.1.4 Launch Control Support (?)
    2.2 High Level Hypervisor Changes
        2.2.1 EPC Management (?)
        2.2.2 EPC Virtualization (?)
        2.2.3 Populate EPC for Guest
        2.2.4 New Dedicated Hypercall (?)
        2.2.5 Launch Control Support
        2.2.6 CPUID Emulation
        2.2.7 MSR Emulation
        2.2.8 EPT Violation & ENCLS Trapping Handling
        2.2.9 Guest Suspend & Resume
        2.2.10 Destroying Domain
    2.3 Additional Point: Live Migration, Snapshot Support (?)
3. Reference

1. SGX Introduction

1.1 Overview

1.1.1 Enclave

Intel Software Guard Extensions (SGX) is a set of instructions and memory
access mechanisms that provide secure access for sensitive applications and
data. SGX allows an application to use a particular part of its address space
as an *enclave*, which is a protected area that provides confidentiality and
integrity even in the presence of privileged malware. Accesses to the enclave
memory area from any software not resident in the enclave are prevented,
including those from privileged software. The diagram below illustrates the
presence of enclaves in an application.

        |-----------------------|
        |                       |
        |   |---------------|   |
        |   |   OS kernel   |   |       |-----------------------|
        |   |---------------|   |       |                       |
        |   |               |   |       |   |---------------|   |
        |   |---------------|   |       |   | Entry table   |   |
        |   |   Enclave     |---|-----> |   |---------------|   |
        |   |---------------|   |       |   | Enclave stack |   |
        |   |   App code    |   |       |   |---------------|   |
        |   |---------------|   |       |   | Enclave heap  |   |
        |   |   Enclave     |   |       |   |---------------|   |
        |   |---------------|   |       |   | Enclave code  |   |
        |   |   App code    |   |       |   |---------------|   |
        |   |---------------|   |       |                       |
        |           |           |       |-----------------------|
        |-----------------------|

SGX supports SGX1 and SGX2 extensions. SGX1 provides basic enclave support,
and SGX2 allows additional flexibility in runtime management of enclave
resources and thread execution within an enclave.

1.1.2 EPC (Enclave Page Cache)

Just like normal application memory management, enclave memory management can be
divided into two parts: address space allocation and memory commitment. Address
space allocation is allocating a particular range of linear address space for
the enclave. Memory commitment is assigning actual resources to the enclave.

Enclave Page Cache (EPC) is the physical resource that is committed to
enclaves. EPC is divided into 4K pages; each EPC page is 4K in size and always
aligned to a 4K boundary. Hardware performs additional access control checks to
restrict access to EPC pages. The Enclave Page Cache Map (EPCM) is a secure
structure which holds one entry for each EPC page, and is used by hardware to
track the status of each EPC page (invisible to software). Typically EPC and
EPCM are reserved by BIOS as Processor Reserved Memory, but the actual amount,
size, and layout of EPC are model-specific and dependent on BIOS settings. EPC
is enumerated via the new SGX CPUID leaf, and is reported as reserved memory.

EPC pages can either be invalid or valid. There are 4 valid EPC types in SGX1:
regular EPC page, SGX Enclave Control Structure (SECS) page, Thread Control
Structure (TCS) page, and Version Array (VA) page. SGX2 adds Trimmed EPC page.
Each enclave is associated with one SECS page. Each thread in enclave is
associated with one TCS page. VA page is used in EPC page eviction and reload.
Trimmed EPC page is introduced in SGX2 when particular 4K page in enclave is
going to be freed (trimmed) at runtime after enclave is initialized.

1.1.3 ENCLS and ENCLU

Two new instructions ENCLS and ENCLU are introduced to manage enclave and EPC.
ENCLS can only run in ring 0, while ENCLU can only run in ring 3. Both ENCLS and
ENCLU have multiple leaf functions, with EAX indicating the specific leaf
function.

SGX1 supports below ENCLS and ENCLU leaves:

    ENCLS:
    - ECREATE, EADD, EEXTEND, EINIT, EREMOVE (Enclave build and destroy)
    - EPA, EBLOCK, ETRACK, EWB, ELDU/ELDB (EPC eviction & reload)

    ENCLU:
    - EENTER, EEXIT, ERESUME (Enclave entry, exit, re-enter)
    - EGETKEY, EREPORT (SGX key derivation, attestation)

Additionally, SGX2 supports the ENCLS and ENCLU leaves below, which allow
adding EPC pages to and removing them from an enclave at runtime, after the
enclave is initialized, along with changing page permissions.

    ENCLS:
    - EAUG, EMODT, EMODPR
    
    ENCLU:
    - EACCEPT, EACCEPTCOPY, EMODPE

VMM is able to interfere with ENCLS running in guest (see 1.5.1 VMX Changes for
Supporting SGX Virtualization) but is unable to interfere with ENCLU.

1.2 Discovering SGX Capability

1.2.1 Enumerate SGX via CPUID

If CPUID.0x7.0:EBX.SGX (bit 2) is 1, then processor supports SGX and SGX
capability and resource can be enumerated via new SGX CPUID (0x12).
CPUID.0x12.0x0 reports SGX capability, such as the presence of SGX1, SGX2,
enclave's maximum size for both 32-bit and 64-bit application. CPUID.0x12.0x1
reports the availability of bits that can be set for SECS.ATTRIBUTES.
CPUID.0x12.0x2 reports the EPC resource's base and size. Platform may support
multiple EPC sections, and CPUID.0x12.0x3 and further sub-leaves can be used
to detect the existence of multiple EPC sections (until CPUID reports invalid
EPC).
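
As a rough illustration of the EPC enumeration above (a minimal sketch, not
taken from the RFC patches), the register layout of sub-leaf 0x2 and above can
be decoded as below. The struct and function names are made up for this
example; the bit layout follows my reading of the SDM description of leaf 0x12
and should be checked against it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Raw register values returned by CPUID.0x12 for one sub-leaf (>= 0x2). */
struct sgx_cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/* Decoded EPC section; both fields are in bytes. (Hypothetical type.) */
struct epc_section {
    uint64_t base;
    uint64_t size;
};

/*
 * Decode one EPC sub-leaf. Returns false when the sub-leaf type in
 * EAX[3:0] is not 1, i.e. the enumeration has reached an invalid entry
 * and the caller should stop walking sub-leaves.
 */
static bool sgx_decode_epc_section(const struct sgx_cpuid_regs *r,
                                   struct epc_section *out)
{
    if ((r->eax & 0xf) != 0x1)
        return false;

    /* EAX[31:12] = base bits 31:12, EBX[19:0] = base bits 51:32. */
    out->base = ((uint64_t)(r->ebx & 0xfffff) << 32) |
                (r->eax & 0xfffff000u);
    /* ECX[31:12] = size bits 31:12, EDX[19:0] = size bits 51:32. */
    out->size = ((uint64_t)(r->edx & 0xfffff) << 32) |
                (r->ecx & 0xfffff000u);
    return true;
}
```

A caller would execute CPUID with EAX=0x12 and ECX=2, 3, ... and feed each
result to the decoder until it returns false.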

Refer to 37.7.2 Intel SGX Resource Enumeration Leaves for full description of
SGX CPUID 0x12.

1.2.2 Intel SGX Opt-in Configuration

On processors that support Intel SGX, IA32_FEATURE_CONTROL also provides the
SGX_ENABLE bit (bit 18) to turn SGX on/off. Before system software can enable
and use SGX, BIOS is required to set IA32_FEATURE_CONTROL.SGX_ENABLE = 1 to
opt in to SGX.

Setting SGX_ENABLE follows the rules of IA32_FEATURE_CONTROL.LOCK (bit 0).
Software is considered to have opted into Intel SGX if and only if
IA32_FEATURE_CONTROL.SGX_ENABLE and IA32_FEATURE_CONTROL.LOCK are set to 1.
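
The opt-in rule can be written as a one-line check on the MSR value. This is
only a sketch; the macro and function names are made up here, and reading the
MSR itself is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as given in the text above; names are illustrative. */
#define FEATURE_CONTROL_LOCK        (1ULL << 0)
#define FEATURE_CONTROL_SGX_ENABLE  (1ULL << 18)

/*
 * Software has opted in to SGX if and only if both LOCK and SGX_ENABLE
 * are set in IA32_FEATURE_CONTROL.
 */
static bool sgx_opted_in(uint64_t feature_control)
{
    const uint64_t required = FEATURE_CONTROL_LOCK |
                              FEATURE_CONTROL_SGX_ENABLE;

    return (feature_control & required) == required;
}
```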

The setting of IA32_FEATURE_CONTROL.SGX_ENABLE (bit 18) is not reflected by
SGX CPUID. Enclave instructions will behave differently according to the value
of CPUID.0x7.0x0:EBX.SGX and whether BIOS has opted in to SGX.

Refer to 37.7.1 Intel SGX Opt-in Configuration for more information.

1.3 Enclave Life Cycle

1.3.1 Constructing & Destroying Enclave

An enclave is created via the ENCLS[ECREATE] leaf by privileged software.
Basically ECREATE converts an invalid EPC page into a SECS page, according to a
source SECS structure that resides in normal memory. The source SECS contains
the enclave's info such as base (linear) address, size, enclave attributes,
enclave's measurement, etc.

After ECREATE, for each 4K page of the enclave's linear address space,
privileged software uses EADD and EEXTEND to add one EPC page to it. Enclave
code/data (which resides in normal memory) is loaded into the enclave during
EADD, one 4K page at a time. After all EPC pages are added to the enclave,
privileged software calls EINIT to initialize the enclave, and then the enclave
is ready to run.

While the enclave is being constructed, the enclave measurement, which is a
SHA256 hash value, is also built according to the enclave's size, the code/data
itself and its location in the enclave, etc. The measurement can be used to
uniquely identify the enclave. The SIGSTRUCT passed to the EINIT leaf also
contains the measurement specified by untrusted software, via MRENCLAVE. EINIT
will check the two measurements and will only succeed when the two match.

An enclave is destroyed by running EREMOVE for all of the enclave's EPC pages,
and then for the enclave's SECS. EREMOVE will report the SGX_CHILD_PRESENT
error if it is called for the SECS when there are still regular EPC pages that
haven't been removed from the enclave.

Please refer to SDM chapter 39.1 Constructing an Enclave for more information.

1.3.2 Enclave Entry and Exit

1.3.2.1 Synchronous Entry and Exit

After an enclave is constructed, non-privileged software uses ENCLU[EENTER] to
enter the enclave and run in it. While the process runs in the enclave,
non-privileged software can use ENCLU[EEXIT] to exit from the enclave and
return to normal mode.

1.3.2.2 Asynchronous Enclave Exit

Asynchronous and synchronous events, such as exceptions, interrupts, traps,
SMIs, and VM exits may occur while executing inside an enclave. These events
are referred to as Enclave Exiting Events (EEE). Upon an EEE, the processor
state is securely saved inside the enclave and then replaced by a synthetic
state to prevent leakage of secrets. The process of securely saving state and
establishing the synthetic state is called an Asynchronous Enclave Exit (AEX).

After AEX, non-privileged software uses ENCLU[ERESUME] to re-enter the enclave.
The SGX userspace software maintains a small piece of code (which resides in
normal memory) that basically calls ERESUME to re-enter the enclave. The
address of this piece of code is called the Asynchronous Exit Pointer (AEP).
AEP is specified as a parameter to EENTER and is kept internally in the
enclave. Upon AEX, AEP is pushed onto the stack, and upon returning from EEE
handling, such as via IRET, AEP is loaded into RIP and ERESUME is subsequently
called to re-enter the enclave.

During AEX the processor saves and restores context automatically, therefore
no change to the interrupt handling of the OS kernel and VMM is required. It is
the SGX userspace software's responsibility to set up AEP correctly.

Please refer to SDM chapter 39.2 Enclave Entry and Exit for more information.

1.3.3 EPC Eviction and Reload

SGX also allows privileged software to evict any EPC pages that are used by
an enclave. The idea is the same as normal memory swapping.

The sequence to evict regular EPC pages is:

        1) Select one or multiple regular EPC pages from one enclave
        2) Remove EPT/PT mapping for selected EPC pages
        3) Send IPIs to remote CPUs to flush TLB of selected EPC pages
        4) EBLOCK on selected EPC pages
        5) ETRACK on enclave's SECS page
        6) allocate one available slot (8-byte) in VA page
        7) EWB on selected EPC pages

With EWB taking:

        - VA slot, to store eviction version info.
        - one normal 4K page in memory, to store encrypted content of EPC page.
        - one struct PCMD in memory, to store meta data.

    (VA slot is an 8-byte slot in a VA page, which is a particular EPC page.)
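
Since a VA page is 4K and each slot is 8 bytes, one VA page provides 512 slots.
A minimal sketch of software-side bookkeeping for those slots could look like
the following. The tracker structure and function names are hypothetical; the
VA page itself lives in EPC and is not touched by this code.

```c
#include <stdint.h>

#define VA_SLOT_SIZE       8
#define VA_SLOTS_PER_PAGE  (4096 / VA_SLOT_SIZE)   /* 512 slots */

/* Hypothetical per-VA-page tracker: one bit per slot, set when in use. */
struct va_page_tracker {
    uint64_t used[VA_SLOTS_PER_PAGE / 64];
};

/*
 * Allocate the first free slot. Returns the byte offset of the slot
 * within the VA page, or -1 if all slots are in use.
 */
static int va_slot_alloc(struct va_page_tracker *va)
{
    for (int i = 0; i < VA_SLOTS_PER_PAGE; i++) {
        uint64_t bit = 1ULL << (i % 64);

        if (!(va->used[i / 64] & bit)) {
            va->used[i / 64] |= bit;
            return i * VA_SLOT_SIZE;
        }
    }
    return -1;
}

/* Release a slot once the evicted page has been reloaded. */
static void va_slot_free(struct va_page_tracker *va, int offset)
{
    int i = offset / VA_SLOT_SIZE;

    va->used[i / 64] &= ~(1ULL << (i % 64));
}
```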

And below is the sequence to evict an SECS page or VA page:

        1) locate SECS (or VA) page
        2) remove EPT/PT mapping for SECS (or VA) page
        3) Send IPIs to remote CPUs
        4) allocate one available slot (8-byte) in VA page
        5) EWB on SECS (or VA) page

And for evicting an SECS page, all regular EPC pages that belong to that SECS
must be evicted first, otherwise EWB returns the SGX_CHILD_PRESENT error.

And to reload an EPC page:

        1) ELDU/ELDB on EPC page
        2) setup EPT/PT mapping

With ELDU/ELDB taking:

        - location of SECS page
        - linear address of enclave's 4K page (that we are going to reload to)
        - VA slot (used in EWB)
        - 4K page in memory (used in EWB)
        - struct PCMD in memory (used in EWB)

Please refer to SDM chapter 39.5 EPC and Management of EPC pages for more
information.

*********** Instruction Behavior changes in Enclave

- Illegal instructions inside enclave

            Instruction                 Result              Comment

    CPUID,GETSEC,RDPMC,SGDT,SIDT,SLDT,STR,VMCALL,

1.4 SGX Launch Control

SGX requires running the "Launch Enclave" (LE) before running any other
enclave. This is because LE is the only enclave that does not require an
EINITTOKEN in EINIT. Running any other enclave requires a valid EINITTOKEN,
which contains a MAC of (the first 192 bytes of) the EINITTOKEN, calculated
with the EINITTOKEN key. EINIT verifies the MAC by internally deriving the
EINITTOKEN key, and only an EINITTOKEN with a matching MAC is accepted by
EINIT. The EINITTOKEN key derivation depends on some info from LE. The typical
process is that LE generates the EINITTOKEN for another enclave according to LE
itself and the target enclave, and calculates the MAC by using ENCLU[EGETKEY]
to get the EINITTOKEN key. Only LE is able to get the EINITTOKEN key.

Running LE requires the SHA256 hash of the LE signer's RSA public key (SHA256
of sigstruct->modulus) to be equal to the IA32_SGXLEPUBKEYHASH[0-3] MSRs (the 4
MSRs together make up the 256-bit SHA256 hash value).
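
To illustrate how a 256-bit hash maps onto the four 64-bit MSRs, here is a
small sketch. It assumes the 32 hash bytes are packed in little-endian order,
with MSR 0 holding the first 8 bytes; that byte order is my assumption and
should be verified against the SDM. The function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pack a 32-byte SHA256 digest into four 64-bit values, one per
 * IA32_SGXLEPUBKEYHASHn MSR. Assumption: little-endian packing, with
 * msrs[0] holding hash bytes 0..7.
 */
static void le_hash_to_msrs(const uint8_t hash[32], uint64_t msrs[4])
{
    for (int i = 0; i < 4; i++) {
        msrs[i] = 0;
        for (int b = 0; b < 8; b++)
            msrs[i] |= (uint64_t)hash[i * 8 + b] << (8 * b);
    }
}

/* Check whether the signer hash matches the current MSR values. */
static bool le_hash_matches(const uint8_t hash[32], const uint64_t msrs[4])
{
    uint64_t want[4];

    le_hash_to_msrs(hash, want);
    for (int i = 0; i < 4; i++)
        if (want[i] != msrs[i])
            return false;
    return true;
}
```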

If CPUID.0x7.0x0:EBX.SGX is set, then IA32_SGXLEPUBKEYHASHn are readable. If
CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL[bit 30] is set, then IA32_FEATURE_CONTROL
MSR has SGX_LAUNCH_CONTROL_ENABLE bit (bit 17) available. Setting the
SGX_LAUNCH_CONTROL_ENABLE bit to 1 enables runtime change of
IA32_SGXLEPUBKEYHASHn after IA32_FEATURE_CONTROL is locked. Otherwise,
IA32_SGXLEPUBKEYHASHn are read-only after IA32_FEATURE_CONTROL is locked. After
reset, IA32_SGXLEPUBKEYHASHn default to the SHA256 hash of Intel's default RSA
public key.

The above mechanism allows 3rd parties to run their own LE.

On a physical machine, BIOS will typically provide an option to *lock* or
*unlock* IA32_SGXLEPUBKEYHASHn before transferring control to the OS. BIOS may
also provide an interface for the user to change the default value of
IA32_SGXLEPUBKEYHASHn, but which interfaces are provided is BIOS implementation
dependent.

1.5 SGX Interaction with IA32 and Intel 64 Architecture

SDM Chapter 42 describes SGX interaction with various features in IA32 and
Intel 64 architecture. Below outlines the major ones. Refer to Chapter 42 for
full description of SGX interaction with various IA32 and Intel 64 features.

1.5.1 VMX Changes for Supporting SGX Virtualization

A new 64-bit ENCLS-exiting bitmap control field is added to VMCS (encoding
0202EH) to control VMEXIT on ENCLS leaf functions. And a new "Enable ENCLS
exiting" control bit (bit 15) is defined in the secondary processor-based
VM-execution controls. Setting "Enable ENCLS exiting" to 1 enables the
ENCLS-exiting bitmap control. The ENCLS-exiting bitmap controls which ENCLS
leaves will trigger VMEXIT.

Additionally two new bits are added to indicate whether a VMEXIT (of any kind)
is from an enclave. The two bits below will be set if the VMEXIT is from an
enclave:
    - Bit 27 in the Exit reason field of Basic VM-exit information.
    - Bit 4 in the Interruptibility State of Guest Non-Register State of VMCS.

Refer to 42.5 Interactions with VMX, 27.2.1 Basic VM-Exit Information, and
27.3.4 Saving Non-Register State.
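
The VMX controls described in this section can be sketched as below. This is
an illustration only, not code from the RFC patches: the macro, enum, and
function names are made up, the leaf numbers are the ENCLS EAX values as I
understand them from the SDM, and the rule that leaves 63 and above share
bitmap bit 63 is likewise my reading of the SDM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the text above; names are illustrative. */
#define SECONDARY_EXEC_ENABLE_ENCLS_EXITING  (1u << 15)
#define EXIT_REASON_FROM_ENCLAVE             (1u << 27)

/* A few ENCLS leaf numbers (the value of EAX when ENCLS executes). */
enum encls_leaf {
    ENCLS_ECREATE = 0x0, ENCLS_EADD = 0x1, ENCLS_EINIT   = 0x2,
    ENCLS_EREMOVE = 0x3, ENCLS_EBLOCK = 0x9, ENCLS_EWB   = 0xb,
    ENCLS_ETRACK  = 0xc,
};

/* Would executing this ENCLS leaf in the guest cause a VMEXIT? */
static bool encls_causes_vmexit(uint32_t secondary_ctls,
                                uint64_t encls_exiting_bitmap,
                                uint32_t leaf)
{
    if (!(secondary_ctls & SECONDARY_EXEC_ENABLE_ENCLS_EXITING))
        return false;

    /* Leaves 63 and above are all controlled by bit 63. */
    unsigned int bit = leaf < 63 ? leaf : 63;

    return (encls_exiting_bitmap >> bit) & 1;
}

/* Bit 27 of the exit reason tells whether the exit came from an enclave. */
static bool vmexit_from_enclave(uint32_t exit_reason)
{
    return (exit_reason & EXIT_REASON_FROM_ENCLAVE) != 0;
}

/* The basic exit reason occupies the low 16 bits. */
static uint32_t basic_exit_reason(uint32_t exit_reason)
{
    return exit_reason & 0xffff;
}
```

An exit handler that switches on the raw exit reason would need to mask off
bit 27 first, otherwise exits from enclaves would not match any known reason.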

1.5.2 Interaction with XSAVE

SGX defines a sub-field called X-Feature Request Mask (XFRM) in the attributes
field of SECS. On enclave entry, SGX HW verifies that the features set in
SECS.ATTRIBUTES.XFRM are already enabled in XCR0.
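
Expressed as code, the entry check is a subset test: every feature bit
requested in XFRM must also be set in XCR0. This is a simplified sketch with a
made-up function name; the hardware applies further validity rules to XFRM
that are not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if every feature requested in XFRM is currently enabled in XCR0. */
static bool xfrm_enabled_in_xcr0(uint64_t xfrm, uint64_t xcr0)
{
    return (xfrm & ~xcr0) == 0;
}
```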

Upon AEX, SGX saves the processor extended state and miscellaneous state to
the enclave's state-save area (SSA), and clears the secrets from the processor
extended state that is used by the enclave (to prevent leaking secrets).

Refer to 42.7 Interaction with Processor Extended State and Miscellaneous State

1.5.3 Interaction with S state

When the processor goes into S3-S5 state, EPC is destroyed, and consequently
all enclaves are destroyed as well.

Refer to 42.14 Interaction with S States.

2. SGX Virtualization Design

2.1 High Level Toolstack Changes:

2.1.1 New 'epc' parameter

EPC is a limited resource. In order to use EPC efficiently among all domains,
the administrator should be able to specify a domain's virtual EPC size when
creating the guest. The administrator should also be able to get all domains'
virtual EPC sizes.

For this purpose, a new 'epc = <size>' parameter is added to the XL
configuration file. This parameter specifies the guest's virtual EPC size. The
EPC base address will be calculated by the toolstack internally, according to
the guest's memory size, MMIO size, etc. 'epc' is in units of MB, and any
1MB-aligned value will be accepted.

2.1.2 New XL commands (?)

The administrator should be able to get the physical EPC size, and all
domains' virtual EPC sizes. For this purpose, we can introduce 2 additional
commands:

    # xl sgxinfo

Which will print out physical EPC size, and other SGX info (such as SGX1, SGX2,
etc) if necessary.

    # xl sgxlist <did>

Which will print out particular domain's virtual EPC size, or list all virtual
EPC sizes for all supported domains.

Alternatively, we can also extend existing XL commands by adding a new option:

    # xl info -sgx

Which will print out physical EPC size along with other physinfo. And

    # xl list <did> -sgx

Which will print out domain's virtual EPC size.

Comments?

In my RFC patches I didn't implement the commands as I don't know which
is better. In the github repo I mentioned at the beginning, there's an old
branch in which I implemented 'xl sgxinfo' and 'xl sgxlist', but they are
implemented via a dedicated hypercall for SGX. I am not sure whether that is a
good option, so I didn't include it in my RFC patches.

2.1.3 Notify domain's virtual EPC base and size to Xen

Xen needs to know guest's EPC base and size in order to populate EPC pages for
it. Toolstack notifies EPC base and size to Xen via XEN_DOMCTL_set_cpuid.

2.1.4 Launch Control Support (?)

Xen Launch Control Support is about supporting running multiple domains, with
each running its own LE signed by a different owner (if HW allows, explained
below). As explained in 1.4 SGX Launch Control, EINIT for LE (Launch Enclave)
only succeeds when SHA256(SIGSTRUCT.modulus) matches IA32_SGXLEPUBKEYHASHn,
and EINIT for other enclaves will derive the EINITTOKEN key according to
IA32_SGXLEPUBKEYHASHn. Therefore, to support this, the guest's virtual
IA32_SGXLEPUBKEYHASHn must be written to the physical MSRs before EINIT (which
also means the physical IA32_SGXLEPUBKEYHASHn need to be *unlocked* in BIOS
before booting to OS).

For a physical machine, it is the BIOS writer's decision whether BIOS provides
an interface for the user to specify customized IA32_SGXLEPUBKEYHASHn (they
default to the digest of Intel's signing key after reset). In reality, the OS's
SGX driver may require BIOS to leave the MSRs *unlocked* and actively write the
hash value to the MSRs in order to run EINIT successfully, as in this case the
driver will not depend on BIOS's capability (whether it allows the user to
customize the IA32_SGXLEPUBKEYHASHn value).

The question for Xen is: do we need a new parameter, such as
'lehash=<SHA256>', to specify the default value of the guest's virtual
IA32_SGXLEPUBKEYHASHn? And do we need a new parameter, such as 'lewr', to
specify whether the guest's virtual MSRs are locked or not before handing over
to the guest's OS?

I tend to not introduce 'lehash', as it seems the SGX driver would actively
update the MSRs, and a new parameter would require additional changes in upper
layer software (such as openstack). And 'lewr' is not needed either, as Xen can
always *unlock* the MSRs for the guest.

Comments?

Currently in my RFC patches the above two parameters are not implemented.
Xen hypervisor will always *unlock* the MSRs. Whether there is a 'lehash'
parameter or not doesn't impact Xen hypervisor's emulation of
IA32_SGXLEPUBKEYHASHn. See the Xen hypervisor changes below for details.

2.2 High Level Xen Hypervisor Changes:

2.2.1 EPC Management (?)

Xen hypervisor needs to detect SGX, discover EPC, and manage EPC before
exposing SGX to guests. EPC is detected via SGX CPUID 0x12.0x2. It's possible
that there are multiple EPC sections (enumerated via sub-leaves 0x3 and so on,
until an invalid EPC is reported), but this is only true on multiple-socket
server machines. For server machines there are additional things that also need
to be done, such as NUMA EPC, scheduling, etc. We will support server machines
in the future, but currently we only support one EPC section.

EPC is reported as reserved memory (so it is not reported as normal memory).
EPC must be managed in 4K pages. CPU hardware uses EPCM to track the status of
each EPC page. Xen needs to manage EPC and provide functions to, e.g., alloc
and free EPC pages for guests.

There are two ways to manage EPC: manage EPC separately, or integrate it into
the existing memory management framework.

It is easy to manage EPC separately, as currently EPC is pretty small (~100MB),
and we can even put all pages in a single list. However it is not flexible; for
example, you will have to write new algorithms when EPC becomes larger, e.g.,
GBs. And you have to write new code to support NUMA EPC (although this will not
come in the short term).

Integrating EPC into the existing memory management framework seems more
reasonable, as in this way we can reuse memory management data
structures/algorithms, and it will be more flexible to support larger EPC and
potentially NUMA EPC. But modifying the MM framework has a higher risk of
breaking existing memory management code (potentially more bugs).

In my RFC patches currently we choose to manage EPC separately. A new
structure epc_page is added to represent a single 4K EPC page. A whole array
of struct epc_page will be allocated during EPC initialization, so that the PFN
of an EPC page and its 'struct epc_page' can each be obtained from the other by
adding an offset.
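
A minimal sketch of this scheme follows, to make the PFN <-> struct epc_page
conversion and the single free list concrete. Only the name 'struct epc_page'
comes from the design; its contents, the global variables, and all function
names here are hypothetical and do not reflect the actual RFC patches.

```c
#include <stddef.h>
#include <stdint.h>

struct epc_page {
    struct epc_page *next;      /* hypothetical free-list link */
};

/* Hypothetical globals, set up once during EPC initialization. */
static struct epc_page *epc_frame_table;   /* one entry per EPC page */
static struct epc_page *epc_free_list;
static uint64_t epc_base_pfn;
static uint64_t epc_npages;

static void epc_init(struct epc_page *table, uint64_t base_pfn,
                     uint64_t npages)
{
    epc_frame_table = table;
    epc_base_pfn = base_pfn;
    epc_npages = npages;
    epc_free_list = NULL;

    /* Push pages in reverse so the lowest PFN is allocated first. */
    for (uint64_t i = npages; i-- > 0; ) {
        table[i].next = epc_free_list;
        epc_free_list = &table[i];
    }
}

/* PFN -> struct epc_page: index into the array by offset from base. */
static struct epc_page *epc_pfn_to_page(uint64_t pfn)
{
    if (pfn < epc_base_pfn || pfn - epc_base_pfn >= epc_npages)
        return NULL;
    return &epc_frame_table[pfn - epc_base_pfn];
}

/* struct epc_page -> PFN: array index plus base PFN. */
static uint64_t epc_page_to_pfn(const struct epc_page *pg)
{
    return epc_base_pfn + (uint64_t)(pg - epc_frame_table);
}

static struct epc_page *alloc_epc_page(void)
{
    struct epc_page *pg = epc_free_list;

    if (pg)
        epc_free_list = pg->next;
    return pg;
}

static void free_epc_page(struct epc_page *pg)
{
    pg->next = epc_free_list;
    epc_free_list = pg;
}
```

Real code would additionally need locking around the free list; that is left
out to keep the sketch short.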

But maybe integrating EPC to MM framework is more reasonable. Comments?

2.2.2 EPC Virtualization (?)

This part is about how to populate EPC for guests. We have 3 choices:
    - Static Partitioning
    - Oversubscription
    - Ballooning

Static Partitioning means all EPC pages will be allocated and mapped to the
guest when it is created, and there's no runtime change of page table mappings
for EPC pages. Oversubscription means Xen hypervisor supports EPC page swapping
between domains, meaning Xen is able to evict an EPC page from another domain
and assign it to the domain that needs the EPC. With oversubscription, EPC can
be assigned to a domain on demand, when an EPT violation happens. Ballooning is
similar to memory ballooning. It is basically "Static Partitioning" + "Balloon
driver" in guest.

Static Partitioning is the easiest way in terms of implementation, and there
will be no hypervisor overhead (except EPT overhead of course), because with
"Static Partitioning" there is no EPT violation for EPC, and Xen doesn't need
to turn on ENCLS VMEXIT for the guest, as ENCLS runs perfectly in non-root mode.

Ballooning is "Static Partitioning" + "Balloon driver" in guest. Like "Static
Partitioning", ballooning doesn't need to turn on ENCLS VMEXIT, and doesn't
have EPT violation for EPC either. To support ballooning, we need a balloon
driver in the guest to issue hypercalls to give up or reclaim EPC pages. In
terms of hypercall, we have two choices: 1) Add a new hypercall for EPC
ballooning; 2) Use the existing XENMEM_{increase/decrease}_reservation with a
new memory flag, i.e., XENMEMF_epc. I'll discuss adding a dedicated hypercall
or not in more detail later.

Oversubscription looks nice but it requires a more complicated implementation.
Firstly, as explained in 1.3.3 EPC Eviction and Reload, we need to follow
specific steps to evict EPC pages, and in order to do that, basically Xen needs
to trap ENCLS from the guest and keep track of EPC page status and enclave info
from all guests. This is because:
    - To evict regular EPC page, Xen needs to know SECS location
    - Xen needs to know EPC page type: evicting regular EPC and evicting SECS,
      VA page have different steps.
    - Xen needs to know EPC page status: whether the page is blocked or not.

That info can only be obtained by trapping ENCLS from the guest, and parsing
its parameters (to identify the SECS page, etc). Parsing ENCLS parameters means
we need to know which ENCLS leaf is being trapped, and we need to translate the
guest's virtual address to get the physical address in order to locate the EPC
page. And once ENCLS is trapped, we have to emulate ENCLS in Xen, which means
we need to reconstruct the ENCLS parameters by remapping all of the guest's
virtual addresses to Xen's virtual addresses (gva->gpa->pa->xen_va), as ENCLS
always uses an *effective address*, which must be translatable by the processor
when running ENCLS.

    --------------------------------------------------------------
                |   ENCLS   |
    --------------------------------------------------------------
                |          /|\
    ENCLS VMEXIT|           | VMENTRY
                |           |
               \|/          |

                1) parse ENCLS parameters
                2) reconstruct(remap) guest's ENCLS parameters
                3) run ENCLS on behalf of guest (and skip ENCLS)
                4) on success, update EPC/enclave info, or inject error

And Xen needs to maintain each EPC page's status (type, blocked or not, in
enclave or not, etc). Xen also needs to maintain info on all enclaves from all
guests, in order to find the correct SECS for a regular EPC page, and the
enclave's linear address as well.
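
To give a feel for the amount of state involved, the per-page tracking that
oversubscription would need might look roughly like the sketch below. All
names are hypothetical and nothing like this exists in the RFC patches (which
implement Static Partitioning only). Treating TCS pages like regular pages for
eviction is my assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum epc_page_type {
    EPC_PT_INVALID,
    EPC_PT_SECS,
    EPC_PT_TCS,
    EPC_PT_REG,
    EPC_PT_VA,
};

/* Hypothetical per-EPC-page state Xen would learn by trapping ENCLS. */
struct epc_track {
    enum epc_page_type type;
    bool blocked;               /* EBLOCK has been run on this page */
    struct epc_track *secs;     /* owning enclave's SECS (REG/TCS pages) */
    uint64_t enclave_linaddr;   /* linear address inside the enclave */
    unsigned int nr_children;   /* SECS only: pages still in the enclave */
};

/*
 * Can EWB be issued on this page right now? Regular (and, by assumption,
 * TCS) pages must have been blocked first and need a known SECS; a SECS
 * page must have no children left; a VA page has no precondition here.
 */
static bool epc_ready_for_ewb(const struct epc_track *pg)
{
    switch (pg->type) {
    case EPC_PT_REG:
    case EPC_PT_TCS:
        return pg->blocked && pg->secs != NULL;
    case EPC_PT_SECS:
        return pg->nr_children == 0;
    case EPC_PT_VA:
        return true;
    default:
        return false;
    }
}
```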

So in general, "Static Partitioning" has the simplest implementation, but is
obviously not the best way to use EPC efficiently; "Ballooning" has all the
pros of Static Partitioning but requires a guest balloon driver;
"Oversubscription" is best in terms of flexibility but requires a complicated
hypervisor implementation.

We have implemented "Static Partitioning" in the RFC patches, but need your
feedback on whether it is enough. If not, which one should we do at the next
stage -- Ballooning or Oversubscription? IMO Ballooning may be good enough,
given the fact that currently memory is also "Static Partitioning" +
"Ballooning".

Comments?

2.2.3 Populate EPC for Guest

Toolstack notifies Xen about domain's EPC base and size by XEN_DOMCTL_set_cpuid,
so currently Xen populates all EPC pages for guest in XEN_DOMCTL_set_cpuid,
particularly, in handling XEN_DOMCTL_set_cpuid for CPUID.0x12.0x2. Once Xen
has checked that the values passed from toolstack are valid, Xen will allocate
all EPC pages and set up EPT mappings for the guest.

2.2.4 New Dedicated Hypercall (?)

So far, for all the changes mentioned above, without a dedicated new
hypercall we have to implement them as follows:

    - xl sgxinfo (or xl info -sgx)

    Toolstack can do this by running SGX CPUID directly, along with checking
    host cpu featureset.

    - xl sgxlist (or xl list -sgx)

    This is not quite straightforward. It looks like we would have to extend
    xen_domctl_getdomaininfo. However SGX is an Intel specific feature, so I am
    not sure it's a good idea to extend xen_domctl_getdomaininfo.

    -  Populate EPC for guest

    In XEN_DOMCTL_set_cpuid, Xen populates EPC pages for guest after receiving
    EPC base and size from toolstack.

    - Potential EPC Ballooning

    Need to add new XENMEMF_epc and use existing
    XENMEM_{increase/decrease}_reservation.

With a new hypercall for SGX (i.e., XEN_sgx_op), all of the above can be
consolidated into the hypercall. We can also extend it to a more generic
hypercall for Intel platforms generally (i.e., XEN_intel_op). For example, the
new hypercall would look like:

    #define XEN_INTEL_SGX_physinfo  0x1
    struct xen_sgx_physinfo {
        /* OUT */
        unsigned long total_epc_pages;
        unsigned long free_epc_pages;
    };
    typedef struct xen_sgx_physinfo xen_sgx_physinfo_t;
    DEFINE_XEN_GUEST_HANDLE(xen_sgx_physinfo_t);

    #define XEN_INTEL_SGX_setup_epc 0x2
    struct xen_sgx_setup_epc {
        /* IN */
        domid_t domid;
        unsigned long epc_base_gfn;
        unsigned long total_epc_pages;
    };
    typedef struct xen_sgx_setup_epc xen_sgx_setup_epc_t;
    DEFINE_XEN_GUEST_HANDLE(xen_sgx_setup_epc_t);

    #define XEN_INTEL_SGX_dominfo   0x3
    struct xen_sgx_dominfo {
        /* IN */
        domid_t domid;
        /* OUT */
        unsigned long epc_base_gfn;
        unsigned long total_epc_pages;
    };
    typedef struct xen_sgx_dominfo xen_sgx_dominfo_t;
    DEFINE_XEN_GUEST_HANDLE(xen_sgx_dominfo_t);

    struct xen_sgx_op {
        /* XEN_INTEL_SGX_* */
        int cmd;
        union {
            struct xen_sgx_physinfo physinfo;
            struct xen_sgx_setup_epc setup_epc;
            struct xen_sgx_dominfo dominfo;
        } u;
    };
    typedef struct xen_sgx_op xen_sgx_op_t;
    DEFINE_XEN_GUEST_HANDLE(xen_sgx_op_t);

    /* New arch specific hypercall for Intel platform specific operations,
     * __HYPERVISOR_arch_0 is used by Xen x86 machine check... */
    #define __HYPERVISOR_intel_op  __HYPERVISOR_arch_1
    /* Currently only SGX uses this */
    #define XEN_INTEL_OP_sgx                (0x1 << 1)
    struct xen_intel_op {
        int cmd;    /* XEN_INTEL_OP_*** */
        union {
            struct xen_sgx_op sgx_op;
        } u;
    };
    typedef struct xen_intel_op xen_intel_op_t;
    DEFINE_XEN_GUEST_HANDLE(xen_intel_op_t);


In my RFC patches, the new hypercall is not implemented as I am not sure
whether it is a good idea.

Comments?

2.2.5 Launch Control Support

To support running multiple domains with each running its own LE signed by
a different owner, the physical machine's BIOS must leave
IA32_SGXLEPUBKEYHASHn *unlocked* before handing over to Xen. Xen will trap the
domain's writes to IA32_SGXLEPUBKEYHASHn and keep the values in the vcpu
internally, and write the values to the physical MSRs when the vcpu is
scheduled in. This guarantees that when EINIT runs in the guest, the guest's
virtual IA32_SGXLEPUBKEYHASHn have been written to the physical MSRs.
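
The trap-and-reload scheme can be sketched as below. This is an illustration
under stated assumptions, not the RFC code: the physical MSRs are modelled by
a plain array so the example is self-contained, and the structure, variable,
and function names are all made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_LEPUBKEYHASH 4

/* Stand-ins for the physical MSRs and for the BIOS lock state. */
static uint64_t phys_lepubkeyhash[NR_LEPUBKEYHASH];
static bool phys_lepubkeyhash_locked;

/* Hypothetical per-vcpu SGX state. */
struct vcpu_sgx {
    uint64_t lepubkeyhash[NR_LEPUBKEYHASH];
};

/*
 * Trapped guest write to IA32_SGXLEPUBKEYHASHn. Returns 0 on success,
 * or -1 when an error should be injected into the guest (MSRs locked
 * by BIOS, or bad index).
 */
static int guest_wrmsr_lepubkeyhash(struct vcpu_sgx *v, unsigned int n,
                                    uint64_t val)
{
    if (phys_lepubkeyhash_locked || n >= NR_LEPUBKEYHASH)
        return -1;
    v->lepubkeyhash[n] = val;
    return 0;
}

/* Called when the vcpu is scheduled in: load its values into hardware. */
static void sgx_ctxt_switch_to(const struct vcpu_sgx *v)
{
    if (phys_lepubkeyhash_locked)
        return;
    for (unsigned int i = 0; i < NR_LEPUBKEYHASH; i++)
        phys_lepubkeyhash[i] = v->lepubkeyhash[i];
}
```

Note that in this sketch a trapped write only reaches the hardware at the next
schedule-in; an implementation would presumably also have to cover the case
where the vcpu keeps running and executes EINIT right after the write.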

SGX_LAUNCH_CONTROL_ENABLE bit will always be set in guest's
IA32_FEATURE_CONTROL MSR (see 2.1.4 Launch Control Support).

If the physical IA32_SGXLEPUBKEYHASHn are *locked* by the machine's BIOS, then
only MSR reads are allowed from the guest, and Xen will inject an error for the
guest's MSR writes.

If CPUID.0x7.0x0:ECX.SGX_LAUHCN_CONTROL is not present, then this feature will
not be exposed to guest as well, and SGX_LAUNCH_CONTROL_ENABLE bit is set to 0
(as it is invalid).
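
The trap-and-restore scheme above can be sketched roughly as follows. This is
only a user-space model, not code from the patches: struct sgx_host, struct
sgx_vcpu, the function names and the phys_lepubkeyhash array (a stand-in for
the physical MSRs) are all hypothetical, and returning -1 stands for "inject
an error into the guest".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_SGXLEPUBKEYHASH0  0x8c  /* HASH0..HASH3 are 0x8c..0x8f */
#define NR_LEPUBKEYHASH            4

/* Stand-in for the physical MSRs so the sketch can run in user space. */
uint64_t phys_lepubkeyhash[NR_LEPUBKEYHASH];

/* Hypothetical host-wide state, established once at boot. */
struct sgx_host {
    bool lc_supported;  /* CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL present */
    bool lc_locked;     /* BIOS locked IA32_SGXLEPUBKEYHASHn */
};

/* Hypothetical per-vcpu state; not the layout used in the patches. */
struct sgx_vcpu {
    uint64_t lepubkeyhash[NR_LEPUBKEYHASH];
};

/* A new vcpu starts out seeing whatever the physical MSRs hold. */
void sgx_vcpu_init(struct sgx_vcpu *v)
{
    for (int i = 0; i < NR_LEPUBKEYHASH; i++)
        v->lepubkeyhash[i] = phys_lepubkeyhash[i];
}

/* Trapped guest WRMSR. Returns 0, or -1 if an error should be injected. */
int sgx_guest_wrmsr(const struct sgx_host *h, struct sgx_vcpu *v,
                    uint32_t msr, uint64_t val)
{
    uint32_t idx = msr - MSR_IA32_SGXLEPUBKEYHASH0;

    assert(idx < NR_LEPUBKEYHASH);
    if (!h->lc_supported || h->lc_locked)
        return -1;
    v->lepubkeyhash[idx] = val;  /* only the virtual copy changes here */
    return 0;
}

/* Guest RDMSR (when launch control is exposed) reads the virtual copy. */
uint64_t sgx_guest_rdmsr(const struct sgx_vcpu *v, uint32_t msr)
{
    uint32_t idx = msr - MSR_IA32_SGXLEPUBKEYHASH0;

    assert(idx < NR_LEPUBKEYHASH);
    return v->lepubkeyhash[idx];
}

/* Called when the vcpu is scheduled in, i.e. before it can run EINIT. */
void sgx_ctxt_switch_to(const struct sgx_host *h, const struct sgx_vcpu *v)
{
    if (!h->lc_supported || h->lc_locked)
        return;  /* physical MSRs cannot be written */
    for (int i = 0; i < NR_LEPUBKEYHASH; i++)
        phys_lepubkeyhash[i] = v->lepubkeyhash[i];
}
```

With two vcpus holding different hashes, each sgx_ctxt_switch_to() call leaves
the physical MSRs matching the vcpu that is about to run.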

2.2.6 CPUID Emulation

Most of the native SGX CPUID info can be exposed to the guest, except for the
following two parts:
    - Sub-leaf 0x2 needs to report the domain's virtual EPC base and size,
      instead of the physical EPC info.
    - Sub-leaf 0x1 needs to be consistent with the guest's XCR0. For the
      reason, please refer to 1.5.2 Interaction with XSAVE.
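
As a rough illustration of the two adjustments, the sketch below encodes a
virtual EPC range in the register layout the SDM defines for CPUID leaf 0x12
sub-leaf 0x2, and masks the XFRM bits of sub-leaf 0x1 with the guest's XCR0.
struct cpuid_regs and both function names are hypothetical, and masking with
XCR0 while keeping the x87/SSE bits set is my reading of "consistent with
XCR0", not necessarily what the patches do.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical container for one CPUID result. */
struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/*
 * Sub-leaf 0x2: describe the guest's virtual EPC section.
 *   EAX[3:0]   = 1 (valid EPC section), EAX[31:12] = base bits 31:12
 *   EBX[19:0]  = base bits 51:32
 *   ECX[3:0]   = 1 (protected section), ECX[31:12] = size bits 31:12
 *   EDX[19:0]  = size bits 51:32
 */
void sgx_cpuid_epc_section(uint64_t epc_base_gfn, uint64_t total_epc_pages,
                           struct cpuid_regs *r)
{
    uint64_t base = epc_base_gfn << PAGE_SHIFT;
    uint64_t size = total_epc_pages << PAGE_SHIFT;

    r->eax = (uint32_t)(base & 0xfffff000u) | 0x1;
    r->ebx = (uint32_t)(base >> 32) & 0xfffff;
    r->ecx = (uint32_t)(size & 0xfffff000u) | 0x1;
    r->edx = (uint32_t)(size >> 32) & 0xfffff;
}

/*
 * Sub-leaf 0x1: ECX:EDX report which XFRM bits an enclave may set.
 * Starting from the host values in *r, keep only bits the guest can
 * enable in XCR0. Bits 0 and 1 (x87, SSE) are kept unconditionally here.
 */
void sgx_cpuid_xfrm(uint64_t guest_xcr0, struct cpuid_regs *r)
{
    uint64_t allowed = guest_xcr0 | 0x3;

    r->ecx &= (uint32_t)allowed;
    r->edx &= (uint32_t)(allowed >> 32);
}
```

The epc_base_gfn and total_epc_pages parameters are named after the fields of
struct xen_sgx_dominfo shown earlier.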

2.2.7 MSR Emulation

The SGX_ENABLE bit in IA32_FEATURE_CONTROL is always set if SGX is exposed to
the guest, and the SGX_LAUNCH_CONTROL_ENABLE bit is handled as described in
2.2.5. Any write from the guest to IA32_FEATURE_CONTROL is ignored.

IA32_SGXLEPUBKEYHASHn emulation is described in 2.2.5.
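
A minimal sketch of the IA32_FEATURE_CONTROL rules above, assuming the
architectural bit positions (bit 17 for SGX_LAUNCH_CONTROL_ENABLE, bit 18 for
SGX_ENABLE). The macro and function names are hypothetical, and other_bits
stands for whatever Xen already reports in this MSR for non-SGX features.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE  (1ULL << 17)
#define FEATURE_CONTROL_SGX_ENABLE                 (1ULL << 18)

/* Value the guest sees when it reads IA32_FEATURE_CONTROL. */
uint64_t guest_feature_control_read(uint64_t other_bits, bool sgx_exposed,
                                    bool launch_control_exposed)
{
    uint64_t val = other_bits;

    if (sgx_exposed) {
        val |= FEATURE_CONTROL_SGX_ENABLE;
        if (launch_control_exposed)
            val |= FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE;
    }
    return val;
}

/* Guest writes are dropped: report success and change nothing. */
int guest_feature_control_write(uint64_t val)
{
    (void)val;
    return 0;
}
```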

2.2.8 EPT Violation & ENCLS Trapping Handling

Only needed when Xen supports EPC Oversubscription, as explained above.

2.2.9 Guest Suspend & Resume

On hardware, EPC is destroyed when the system enters S3-S5, so Xen will
destroy the guest's EPC when the guest enters S3-S5. Currently Qemu notifies
Xen of S-state changes via HVM_PARAM_ACPI_S_STATE, at which point Xen will
destroy the EPC if the new S-state is S3-S5.

Specifically, Xen will run EREMOVE on each of the guest's EPC pages. The guest
may not handle EPC suspend & resume correctly, in which case its EPC pages may
still be physically valid, so Xen needs to run EREMOVE to make sure all EPC
pages become invalid. Otherwise further EPC operations in the guest may fault,
as the guest assumes all EPC pages are invalid after it is resumed.

For a SECS page, EREMOVE may fail with SGX_CHILD_PRESENT, in which case Xen
will put the SECS page on a list and call EREMOVE on it again after EREMOVE
has been called on all other EPC pages. This time EREMOVE on the SECS page
will succeed, as all its children (regular EPC pages) have already been
removed.
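
The two-pass EREMOVE can be sketched as below. This is a toy model that runs
in user space: mock_eremove() only imitates the one hardware rule that matters
here (a SECS page cannot be removed while it still has valid children), and
struct epc_page, struct epc and sgx_reset_epc() are hypothetical names rather
than the ones used in the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SGX_CHILD_PRESENT  13   /* ENCLS error code, per the SDM */
#define MAX_EPC_PAGES      64   /* arbitrary bound for this sketch */

/* Toy model of one EPC page, standing in for real EPCM state. */
struct epc_page {
    bool valid;
    bool is_secs;
    struct epc_page *secs;  /* owning SECS for a regular page, else NULL */
};

struct epc {
    struct epc_page *pages;
    size_t nr;
};

/* Mock of ENCLS[EREMOVE]; removing an already invalid page succeeds. */
int mock_eremove(struct epc *epc, struct epc_page *pg)
{
    if (!pg->valid)
        return 0;
    if (pg->is_secs)
        for (size_t i = 0; i < epc->nr; i++)
            if (epc->pages[i].valid && epc->pages[i].secs == pg)
                return SGX_CHILD_PRESENT;
    pg->valid = false;
    return 0;
}

/* Invalidate every EPC page of a domain; returns 0 on success. */
int sgx_reset_epc(struct epc *epc)
{
    struct epc_page *deferred[MAX_EPC_PAGES];
    size_t nr_deferred = 0;
    int rc;

    assert(epc->nr <= MAX_EPC_PAGES);

    /* Pass 1: remove what we can, remember SECS pages with children. */
    for (size_t i = 0; i < epc->nr; i++) {
        rc = mock_eremove(epc, &epc->pages[i]);
        if (rc == SGX_CHILD_PRESENT)
            deferred[nr_deferred++] = &epc->pages[i];
        else if (rc)
            return rc;
    }

    /* Pass 2: all children are gone now, so the SECS pages can go too. */
    for (size_t i = 0; i < nr_deferred; i++) {
        rc = mock_eremove(epc, deferred[i]);
        if (rc)
            return rc;
    }
    return 0;
}
```

The deferred list only ever holds SECS pages, so the second pass is short even
for a large EPC.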

2.2.10 Destroying Domain

Normally Xen would just free all of a domain's EPC pages when it is destroyed.
But Xen will also run EREMOVE on all of the guest's EPC pages (as described in
2.2.9 above) before freeing them, as the guest may shut down unexpectedly
(e.g., the user kills the guest), in which case the guest's EPC may still be
valid.

2.3 Additional Point: Live Migration, Snapshot Support (?)

Actually, from the hardware's point of view, SGX is not migratable. There are
two reasons:

    - SGX key architecture cannot be virtualized.

    Some keys are bound to the CPU, for example the Sealing key, the EREPORT
    key, etc. If a VM is migrated to another machine, the same enclave will
    derive different keys. Taking the Sealing key as an example, it is
    typically used by an enclave (which obtains it via EGETKEY) to *seal* its
    secrets to the outside (e.g., persistent storage) for later use. If the
    Sealing key changes after VM migration, the enclave can never get the
    sealed secrets back, as the old Sealing key cannot be recovered.

    - There's no ENCLS leaf to evict an EPC page to normal memory while, at
    the same time, keeping its content in EPC. Currently, once an EPC page is
    evicted, the EPC page becomes invalid. So technically we are unable to
    implement live migration (or checkpointing, or snapshot) for enclaves.

But with some workarounds, and given some facts about the existing SGX
drivers, technically we are able to support live migration (or even
checkpointing and snapshot). This is because:

    - A changing key (one that is bound to the CPU) is not a problem in
    practice.

    Take the Sealing key as an example. Losing sealed data is not a problem,
    because the Sealing key is only supposed to encrypt secrets that can be
    provisioned again. The typical work model is that an enclave gets secrets
    provisioned from a remote party (the service provider) and uses the
    Sealing key to store them for later use. When the enclave tries to
    *unseal* them and the Sealing key has changed, the enclave will find the
    data corrupted (integrity check failure), so it will ask for the secrets
    to be provisioned again from the remote party. Another reason is that in a
    data center VMs typically share lots of data, and as the Sealing key is
    bound to the CPU, data encrypted by an enclave on one machine cannot be
    shared with an enclave on another machine. So from an SGX app writer's
    point of view, the developer should treat the Sealing key as a changeable
    key and should handle loss of sealed data anyway. The Sealing key should
    only be used to seal secrets that can be easily provisioned again.

    For other keys such as the EREPORT key and the provisioning key, which are
    used for local attestation and remote attestation, due to the second
    reason below, losing them is not a problem either.

    - Sudden loss of EPC is not a problem.

    On hardware, EPC will be lost if the system goes to S3-S5, resets, or
    shuts down, and the SGX driver needs to handle loss of EPC due to power
    transitions. This is done by cooperation between the SGX driver and the
    userspace SGX SDK/apps. During live migration, however, there may not be
    any power transition in the guest, so the guest does not expect EPC loss
    at that point. And technically we cannot *really* live migrate an enclave
    (explained above), so it looks infeasible. But the fact is that both the
    Linux SGX driver and the Windows SGX driver already support *sudden* loss
    of EPC (not just EPC loss during a power transition), which means both
    drivers are able to recover if EPC is lost at any time at runtime. With
    this, technically we are able to support live migration by simply ignoring
    EPC. After the VM is migrated, the destination VM will only suffer a
    *sudden* loss of EPC, which both the Windows SGX driver and the Linux SGX
    driver are already able to handle.

    But we must point out that such *sudden* loss of EPC is not hardware
    behavior, and SGX drivers for other OSes (such as FreeBSD) may not
    implement this, so for those guests the destination VM will behave in an
    unexpected manner. But I am not sure we need to care about other OSes.
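
"Simply ignoring EPC" during migration could look roughly like the sketch
below: when collecting the guest pages to send, pages of the new p2m_epc type
are skipped, so the destination starts with an empty (invalid) EPC. Only the
p2m_epc name comes from this series; the two-value enum, the function name and
the flat array of per-gfn types are simplifications for illustration and not
how the real save code walks guest memory.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced p2m type set; the real enum has many more members. */
typedef enum { p2m_ram_rw, p2m_epc } p2m_type_t;

/*
 * Fill out[] with the gfns whose contents should be sent to the
 * destination, skipping EPC pages. Returns the number of gfns stored.
 */
size_t collect_gfns_to_send(const p2m_type_t *types, size_t nr_gfns,
                            unsigned long *out, size_t out_len)
{
    size_t n = 0;

    for (size_t gfn = 0; gfn < nr_gfns; gfn++) {
        if (types[gfn] == p2m_epc)
            continue;  /* enclave contents cannot be read or restored */
        assert(n < out_len);
        out[n++] = gfn;
    }
    return n;
}
```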

For the same reason, we are able to support checkpointing for SGX guests (only
Linux and Windows).

For snapshot, we can support snapshotting an SGX guest by either:

    - Suspending the guest before the snapshot (S3-S5). This works for all
      guests but requires the user to manually suspend the guest.
    - Issuing a hypercall to destroy the guest's EPC in save_vm. This only
      works for Linux and Windows but doesn't require user intervention.

What are your comments?

3. Reference

    - Intel SGX Homepage
    https://software.intel.com/en-us/sgx

    - Linux SGX SDK
    https://01.org/intel-software-guard-extensions

    - Linux SGX driver for upstreaming
    https://github.com/01org/linux-sgx

    - Intel SGX Specification (SDM Vol 3D)
    https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf

    - Paper: Intel SGX Explained
    https://eprint.iacr.org/2016/086.pdf

    - ISCA 2015 tutorial slides for Intel® SGX - Intel® Software
    https://software.intel.com/sites/default/files/332680-002.pdf

Kai Huang (15):
  xen: x86: expose SGX to HVM domain in CPU featureset
  xen: vmx: detect ENCLS VMEXIT
  xen: x86: add early stage SGX feature detection
  xen: mm: add ioremap_cache
  xen: p2m: new 'p2m_epc' type for EPC mapping
  xen: x86: add SGX basic EPC management
  xen: x86: add functions to populate and destroy EPC for domain
  xen: x86: add SGX cpuid handling support.
  xen: vmx: handle SGX related MSRs
  xen: vmx: handle ENCLS VMEXIT
  xen: vmx: handle VMEXIT from SGX enclave
  xen: x86: reset EPC when guest got suspended.
  xen: tools: add new 'epc' parameter support
  xen: tools: add SGX to applying CPUID policy
  xen: tools: expose EPC in ACPI table

 tools/firmware/hvmloader/util.c             |  23 +
 tools/firmware/hvmloader/util.h             |   3 +
 tools/libacpi/build.c                       |   3 +
 tools/libacpi/dsdt.asl                      |  49 ++
 tools/libacpi/dsdt_acpi_info.asl            |   6 +-
 tools/libacpi/libacpi.h                     |   1 +
 tools/libxc/include/xc_dom.h                |   4 +
 tools/libxc/include/xenctrl.h               |  10 +
 tools/libxc/xc_cpuid_x86.c                  |  68 ++-
 tools/libxl/libxl.h                         |   3 +-
 tools/libxl/libxl_cpuid.c                   |  15 +-
 tools/libxl/libxl_create.c                  |   9 +
 tools/libxl/libxl_dom.c                     |  36 +-
 tools/libxl/libxl_internal.h                |   2 +
 tools/libxl/libxl_nocpuid.c                 |   4 +-
 tools/libxl/libxl_types.idl                 |   6 +
 tools/libxl/libxl_x86.c                     |  12 +
 tools/libxl/libxl_x86_acpi.c                |   3 +
 tools/ocaml/libs/xc/xenctrl_stubs.c         |  11 +-
 tools/python/xen/lowlevel/xc/xc.c           |  11 +-
 tools/xl/xl_parse.c                         |   5 +
 xen/arch/x86/cpuid.c                        |  87 ++-
 xen/arch/x86/domctl.c                       |  47 +-
 xen/arch/x86/hvm/hvm.c                      |   3 +
 xen/arch/x86/hvm/vmx/Makefile               |   1 +
 xen/arch/x86/hvm/vmx/sgx.c                  | 871 ++++++++++++++++++++++++++++
 xen/arch/x86/hvm/vmx/vmcs.c                 |  21 +
 xen/arch/x86/hvm/vmx/vmx.c                  |  73 +++
 xen/arch/x86/hvm/vmx/vvmx.c                 |  11 +
 xen/arch/x86/mm.c                           |  15 +-
 xen/arch/x86/mm/p2m-ept.c                   |   3 +
 xen/arch/x86/mm/p2m.c                       |  41 ++
 xen/include/asm-x86/cpufeature.h            |   4 +
 xen/include/asm-x86/cpuid.h                 |  26 +-
 xen/include/asm-x86/hvm/hvm.h               |   3 +
 xen/include/asm-x86/hvm/vmx/sgx.h           | 100 ++++
 xen/include/asm-x86/hvm/vmx/vmcs.h          |  10 +
 xen/include/asm-x86/hvm/vmx/vmx.h           |   3 +
 xen/include/asm-x86/msr-index.h             |   6 +
 xen/include/asm-x86/p2m.h                   |  12 +-
 xen/include/public/arch-x86/cpufeatureset.h |   3 +-
 xen/include/xen/vmap.h                      |   1 +
 xen/tools/gen-cpuid.py                      |   3 +
 43 files changed, 1607 insertions(+), 21 deletions(-)
 create mode 100644 xen/arch/x86/hvm/vmx/sgx.c
 create mode 100644 xen/include/asm-x86/hvm/vmx/sgx.h

-- 
2.11.0

