Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-12 Thread Dan Williams
On Sun, Apr 4, 2021 at 8:02 AM Dave Hansen  wrote:
>
> It occurred to me that I've been doing a lot of digging in the TDX spec
> lately.  I think we can all agree that the "Architecture Specification"
> is not the world's easiest, most digestible reading.  It's hard to
> figure out what Linux's relation to the spec is.
>
> One bit of Documentation we need for TDX is a description of the memory
> states.  For instance, it would be nice to spell out the different
> classes of memory, how they are selected, who selects them, and who
> enforces the selection.  What faults are generated on each type and who
> can induce those?
>
> For instance:
>
> TD-Private memory is selected by the Shared/Private bit in Present=1
> guest PTEs.  When the hardware page walker sees that bit, it walks the
> secure EPT.  The secure EPT entries can only be written by the TDX
> module, although they are written at the request of the VMM.  The TDX
> module enforces rules like ensuring that the memory mapped by secure EPT
> is not mapped multiple times.  The VMM can remove entries.  From the
> guest perspective, all private memory accesses are either successful, or
> result in a #VE.  Private memory access does not cause VMExits.
>
> Would that be useful to folks?

That paragraph was useful for me as someone coming in cold to TDX
patch review. +1 for more of that style of commentary.


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-04 Thread Dave Hansen
It occurred to me that I've been doing a lot of digging in the TDX spec
lately.  I think we can all agree that the "Architecture Specification"
is not the world's easiest, most digestible reading.  It's hard to
figure out what Linux's relation to the spec is.

One bit of Documentation we need for TDX is a description of the memory
states.  For instance, it would be nice to spell out the different
classes of memory, how they are selected, who selects them, and who
enforces the selection.  What faults are generated on each type and who
can induce those?

For instance:

TD-Private memory is selected by the Shared/Private bit in Present=1
guest PTEs.  When the hardware page walker sees that bit, it walks the
secure EPT.  The secure EPT entries can only be written by the TDX
module, although they are written at the request of the VMM.  The TDX
module enforces rules like ensuring that the memory mapped by secure EPT
is not mapped multiple times.  The VMM can remove entries.  From the
guest perspective, all private memory accesses are either successful, or
result in a #VE.  Private memory access does not cause VMExits.
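
To make the Shared/Private selection concrete, here is a minimal sketch
(the shared_mask parameter and the helper names are illustrative
placeholders, not actual kernel interfaces):

#include <linux/types.h>

/*
 * Illustrative only.  "shared_mask" stands for the single GPA bit that
 * marks a mapping as shared; its position is enumerated at runtime, so
 * it is passed in rather than hard-coded.
 */
static inline bool tdx_pte_is_private(u64 pte, u64 shared_mask)
{
	/* Present=1 with the Shared bit clear: walked via the secure EPT. */
	return (pte & 1) && !(pte & shared_mask);
}

static inline u64 tdx_pte_mk_shared(u64 pte, u64 shared_mask)
{
	/* Setting the Shared bit routes the access through the normal EPT. */
	return pte | shared_mask;
}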

Would that be useful to folks?


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-03 Thread Andi Kleen
On Sat, Apr 03, 2021 at 09:26:24AM -0700, Dave Hansen wrote:
> On 4/2/21 2:32 PM, Andi Kleen wrote:
> >> If we go this route, what are the rules and restrictions?  Do we have to
> >> say "no MMIO in #VE"?
> > 
> > All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments".
> > After that it can nest without problems.
> 
> Well, not exactly.  You still can't do things that could cause an
> unbounded recursive #VE.

> It doesn't seem *that* far-fetched to think that someone might try to
> defer some work or dump data to the console.

I believe the main console code has reentry protection.

I'm not sure about early_printk (with keep), but if that's the case
it probably should be fixed anyway. I can take a look at that.

Not sure why deferring something would cause another #VE?

 
> > If you nest before that, TDX will cause a triple fault.
> > 
> > The code that cannot do it is a few lines in the early handler which
> > runs with interrupts off.
> 
> >> Which brings up another related point: How do you debug TD guests?  Does
> >> earlyprintk work?
> > 
> > Today it actually works, because serial ports are allowed. But I expect it to
> > be closed eventually, because the serial code is a lot of code to audit.
> > But you can always disable the filtering with a command-line option and
> > then it will always work for debugging.
> 
> Do we need a TDX-specific earlyprintk?  I would imagine it's pretty easy
> to implement.

I don't see a need at this point; the existing mechanisms work.

Maybe if we ever have a problem that only happens in lockdown *and* happens
early, but that's not very likely since lockdown primarily changes code
behavior later.

There are also other debug mechanisms for such cases: if you configure the TD
for debug mode, TDX supports using the gdb stub on the hypervisor.

-Andi



Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-03 Thread Dave Hansen
On 4/2/21 2:32 PM, Andi Kleen wrote:
>> If we go this route, what are the rules and restrictions?  Do we have to
>> say "no MMIO in #VE"?
> 
> All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments".
> After that it can nest without problems.

Well, not exactly.  You still can't do things that could cause an
unbounded recursive #VE.

It doesn't seem *that* far-fetched to think that someone might try to
defer some work or dump data to the console.

> If you nest before that, TDX will cause a triple fault.
> 
> The code that cannot do it is a few lines in the early handler which
> runs with interrupts off.

>> Which brings up another related point: How do you debug TD guests?  Does
>> earlyprintk work?
> 
> Today it actually works, because serial ports are allowed. But I expect it to
> be closed eventually, because the serial code is a lot of code to audit.
> But you can always disable the filtering with a command-line option and
> then it will always work for debugging.

Do we need a TDX-specific earlyprintk?  I would imagine it's pretty easy
to implement.


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-02 Thread Andi Kleen
> If we go this route, what are the rules and restrictions?  Do we have to
> say "no MMIO in #VE"?

All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments".
After that it can nest without problems.

If you nest before that, TDX will cause a triple fault.

The code that cannot do it is a few lines in the early handler which
runs with interrupts off.

The TDX module also makes sure to not inject NMIs while we're in
that region, so NMIs are of no concern.

That was the whole point of avoiding the system call gap problem. We don't
need to make it IST, so it can nest.

I'm not aware of any other special rules.
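
In handler form, the ordering rule looks roughly like the sketch below;
struct ve_info, tdx_get_ve_info() and the dispatch call are placeholder
names, not the exact code in the series:

#include <asm/idtentry.h>

/* Placeholder for the data returned by the VE-info TDCALL. */
struct ve_info {
	u64 exit_reason;
	u64 exit_qual;
	u64 gla;
	u64 gpa;
	u32 instr_len;
	u32 instr_info;
};

/* Placeholder declarations for the sketch below. */
void tdx_get_ve_info(struct ve_info *ve);
void tdx_handle_virtualization_exception(struct pt_regs *regs, struct ve_info *ve);

DEFINE_IDTENTRY(exc_virtualization_exception)
{
	struct ve_info ve;

	/*
	 * Must run before anything that could itself #VE (MMIO, certain
	 * MSRs, CPUID): until the VE info has been read back from the
	 * TDX module, a nested #VE is fatal.  Once it has been read,
	 * further #VEs can nest safely, which is why the handler does
	 * not need an IST stack.
	 */
	tdx_get_ve_info(&ve);

	tdx_handle_virtualization_exception(regs, &ve);
}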

> Which brings up another related point: How do you debug TD guests?  Does
> earlyprintk work?

Today it actually works, because serial ports are allowed. But I expect it to
be closed eventually, because the serial code is a lot of code to audit.
But you can always disable the filtering with a command-line option and
then it will always work for debugging.

-Andi


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-02 Thread Dave Hansen
On 4/1/21 7:48 PM, Andi Kleen wrote:
>> I've heard things like "we need to harden the drivers" or "we need to do
>> audits" and that drivers might be "whitelisted".
> 
> The basic driver allow listing patches are already in the repository,
> but not currently posted or complete:
> 
> https://github.com/intel/tdx/commits/guest

That lists exactly 8 IDs:

>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1000 }, /* Virtio NET */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1001 }, /* Virtio block */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1003 }, /* Virtio console */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1009 }, /* Virtio FS */
> 
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041 }, /* Virtio 1.0 NET */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042 }, /* Virtio 1.0 block */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1043 }, /* Virtio 1.0 console */
>   { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1049 }, /* Virtio 1.0 FS */

How many places do those 8 drivers touch MMIO?

>> Are there any "real" hardware drivers
>> involved like how QEMU emulates an e1000 or rtl8139 device?  
> 
> Not currently (but some later hypervisor might rely on one of those)
> 
>> What about
>> the APIC or HPET?
> 
> No IO-APIC, but the local APIC. No HPET.

Sean seemed worried about other x86-specific oddities.  Are there any
more, or is the local APIC the only non-driver MMIO?

>> Without something concrete, it's really hard to figure out if we should
>> go full-blown paravirtualized MMIO, or do something like the #VE
>> trapping that's in this series currently.
> 
> As Sean says, the concern about MMIO is less about drivers (which should
> generally be OK if they work on other architectures that require MMIO
> magic) and more about other odd code that only ran on x86 before.
> 
> I really don't understand your crusade against #VE. It really
> isn't that bad if we can avoid the few corner cases.

The problem isn't with #VE per se.  It's with posting a series that
masquerades as a full solution while *NOT* covering or even enumerating
the corner cases.  That's exactly what happened with #VE to start with:
it was implemented in a way that exposed the kernel to #VE during the
syscall gap (and the SWAPGS gap for that matter).

So, I'm pushing for a design that won't have corner cases.  If MMIO
itself is disallowed, then we can scream about *any* detected MMIO.
Then, there's no worry about #VE nesting.  No #VE, no #VE nesting.  We
don't even have to consider if #VE needs NMI-like semantics.

> For me it would seem wrong to force all MMIO for all drivers into some
> complicated paravirt construct, blowing up code size everywhere
> and adding complicated self-modifying code, when it's only needed for very
> few drivers. But we also don't want to patch every MMIO to be special-cased
> even in those few drivers.
> 
> #VE-based MMIO avoids all that cleanly while being nicely non-intrusive.

But, we're not selling used cars here.  Using #VE has downsides.
Let's not pretend that it doesn't.

If we go this route, what are the rules and restrictions?  Do we have to
say "no MMIO in #VE"?

I'm really the most worried about the console.  Consoles and NMIs have
been a nightmare, IIRC.  Doesn't this just make it *WORSE* because now
the deepest reaches of the console driver are guaranteed to #VE?

Which brings up another related point: How do you debug TD guests?  Does
earlyprintk work?


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-01 Thread Andi Kleen
> I've heard things like "we need to harden the drivers" or "we need to do
> audits" and that drivers might be "whitelisted".

The basic driver allow listing patches are already in the repository,
but not currently posted or complete:

https://github.com/intel/tdx/commits/guest
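
For a sense of shape only, the allow list plus a probe-time check could look
something like the sketch below; the helper name and the hook point are
hypothetical, and the IDs mirror the ones quoted elsewhere in this thread:

#include <linux/pci.h>
#include <linux/pci_ids.h>

static const struct pci_device_id tdx_allowed_devices[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1000) }, /* Virtio NET */
	{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1001) }, /* Virtio block */
	{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) }, /* Virtio 1.0 NET */
	{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042) }, /* Virtio 1.0 block */
	{ }
};

/* Hypothetical check, called before binding a driver to @pdev. */
static bool tdx_guest_dev_allowed(struct pci_dev *pdev)
{
	return pci_match_id(tdx_allowed_devices, pdev) != NULL;
}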

> 
> What are we talking about specifically?  Which drivers?  How many
> approximately?  Just virtio?  

Right now just virtio, later other drivers that hypervisors need.

> Are there any "real" hardware drivers
> involved like how QEMU emulates an e1000 or rtl8139 device?  

Not currently (but some later hypervisor might rely on one of those)

> What about
> the APIC or HPET?

No IO-APIC, but the local APIC. No HPET.

> 
> How broadly across the kernel is this going to go?

Not very broadly for drivers.

> 
> Without something concrete, it's really hard to figure out if we should
> go full-blown paravirtualized MMIO, or do something like the #VE
> trapping that's in this series currently.

As Sean says, the concern about MMIO is less about drivers (which should
generally be OK if they work on other architectures that require MMIO
magic) and more about other odd code that only ran on x86 before.

I really don't understand your crusade against #VE. It really
isn't that bad if we can avoid the few corner cases.

For me it would seem wrong to force all MMIO for all drivers into some
complicated paravirt construct, blowing up code size everywhere
and adding complicated self-modifying code, when it's only needed for very
few drivers. But we also don't want to patch every MMIO to be special-cased
even in those few drivers.

#VE-based MMIO avoids all that cleanly while being nicely non-intrusive.
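
Roughly, the #VE path for a single MMIO access would look like the sketch
below; the decode helpers and the tdvmcall_mmio_*() wrappers are hypothetical
stand-ins, not what the series actually implements:

/* Handle an EPT-violation #VE caused by an MMIO access (sketch only). */
static int tdx_handle_mmio_ve(struct pt_regs *regs, struct ve_info *ve)
{
	/* Size and direction come from decoding the instruction at regs->ip. */
	int size = tdx_mmio_access_size(regs);

	if (tdx_mmio_is_write(regs))
		tdvmcall_mmio_write(ve->gpa, size, tdx_mmio_source_value(regs));
	else
		tdx_mmio_store_result(regs, tdvmcall_mmio_read(ve->gpa, size));

	/* Resume after the instruction the VMM just emulated for us. */
	regs->ip += ve->instr_len;
	return 0;
}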

-Andi



Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-01 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
> hosts and some physical attacks. This series adds the bare-minimum
> support to run a TDX guest. The host-side support will be submitted
> separately, as will support for advanced TD guest features like attestation
> and debug mode. At this point it is not yet secure: there are known holes in
> drivers, and it hasn't been fully audited and fuzzed yet.

I want to hear a lot more about this driver model.

I've heard things like "we need to harden the drivers" or "we need to do
audits" and that drivers might be "whitelisted".

What are we talking about specifically?  Which drivers?  How many
approximately?  Just virtio?  Are there any "real" hardware drivers
involved like how QEMU emulates an e1000 or rtl8139 device?  What about
the APIC or HPET?

How broadly across the kernel is this going to go?

Without something concrete, it's really hard to figure out if we should
go full-blown paravirtualized MMIO, or do something like the #VE
trapping that's in this series currently.


Re: [RFC v1 00/26] Add TDX Guest Support

2021-03-31 Thread Kuppuswamy, Sathyanarayanan

Hi All,

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:

Hi All,

NOTE: This series is not ready for wide public review. It is being
specifically posted so that Peter Z and other experts on the entry
code can look for problems with the new exception handler (#VE).
That's also why x86@ is not being spammed.


We are currently working on a solution to fix the issues raised in the
"Add #VE support for TDX guest" patch. While we fix that issue, I would
like to know if there are issues in other patches in this series. So if
possible, can you please review the other patches in the series and let us
know your comments?

If you want me to rebase the series on top of v5.12-rcX kernel and repost it,
please let me know.



Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. This series adds the bare-minimum
support to run a TDX guest. The host-side support will be submitted
separately, as will support for advanced TD guest features like attestation
and debug mode. At this point it is not yet secure: there are known holes in
drivers, and it hasn't been fully audited and fuzzed yet.

TDX has a lot of similarities to SEV. It enhances confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to effect changes in the guest
physical address space.

In contrast to the SEV code in the kernel, TDX guest memory is integrity
protected and isolated; the host is prevented from accessing guest
memory (even ciphertext).

The TDX architecture also includes a new CPU mode called
Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
mode arbitrates interactions between host and guest and implements many of
the guarantees of the TDX architecture.

Some of the key differences between a TD and a regular VM are:

1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE
   exception into the guest TD for instructions that need to be emulated,
   disallowed MSR accesses, a subset of CPUID leaves, etc.
3. By default, memory is marked as private, and the TD will selectively share
   it with the VMM as needed (see the sketch after this list).
4. Remote attestation is supported to enable a third party (either the owner
   of the workload or a user of the services provided by the workload) to
   establish that the workload is running on an Intel-TDX-enabled platform
   located within a TD prior to providing that workload data.
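
As a sketch of item 3 above, the private-to-shared conversion is roughly a
guest page-table update followed by the MapGPA hypercall; both helpers below
are placeholder names, not the series' actual interfaces:

#include <asm/page.h>	/* __pa(), PAGE_SIZE */

static int tdx_share_pages_with_vmm(unsigned long vaddr, int numpages)
{
	int ret;

	/* 1. Set the Shared bit in the guest PTEs covering this range. */
	ret = tdx_set_shared_bit(vaddr, numpages);
	if (ret)
		return ret;

	/* 2. Tell the VMM/TDX module about the change via the MapGPA TDVMCALL. */
	return tdx_map_gpa_shared(__pa((void *)vaddr), numpages * PAGE_SIZE);
}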

You can find TDX related documents in the following link.

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

This RFC series has been reviewed by Dave Hansen.

Kirill A. Shutemov (16):
   x86/paravirt: Introduce CONFIG_PARAVIRT_XL
   x86/tdx: Get TD execution environment information via TDINFO
   x86/traps: Add #VE support for TDX guest
   x86/tdx: Add HLT support for TDX guest
   x86/tdx: Wire up KVM hypercalls
   x86/tdx: Add MSR support for TDX guest
   x86/tdx: Handle CPUID via #VE
   x86/io: Allow to override inX() and outX() implementation
   x86/tdx: Handle port I/O
   x86/tdx: Handle in-kernel MMIO
   x86/mm: Move force_dma_unencrypted() to common code
   x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
   x86/tdx: Make pages shared in ioremap()
   x86/tdx: Add helper to do MapGPA TDVMCALL
   x86/tdx: Make DMA pages shared
   x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (6):
   x86/cpufeatures: Add TDX Guest CPU feature
   x86/cpufeatures: Add is_tdx_guest() interface
   x86/tdx: Handle MWAIT, MONITOR and WBINVD
   ACPI: tables: Add multiprocessor wake-up support
   x86/topology: Disable CPU hotplug support for TDX platforms.
   x86/tdx: Introduce INTEL_TDX_GUEST config option

Sean Christopherson (4):
   x86/boot: Add a trampoline for APs booting in 64-bit mode
   x86/boot: Avoid #VE during compressed boot for TDX platforms
   x86/boot: Avoid unnecessary #VE during boot process
   x86/tdx: Forcefully disable legacy PIC for TDX guests

  arch/x86/Kconfig |  28 +-
  arch/x86/boot/compressed/Makefile|   2 +
  arch/x86/boot/compressed/head_64.S   |  10 +-
  arch/x86/boot/compressed/misc.h  |   1 +
  arch/x86/boot/compressed/pgtable.h   |   2 +-
  arch/x86/boot/compressed/tdx.c   |  32 ++
  arch/x86/boot/compressed/tdx_io.S|   9 +
  arch/x86/include/asm/apic.h  |   3 +
  arch/x86/include/asm/asm-prototypes.h|   1 +
  arch/x86/include/asm/cpufeatures.h   |   1 +
  arch/x86/include/asm/idtentry.h  |   4 +
  arch/x86/include/asm/io.h|  25 +-
  arch/x86/include/asm/irqflags.h  |  42 +-
  arch/x86/include/asm/kvm_para.h  |  21 +
  arch/x86/include/asm/paravirt.h  |  22 +-
  arch/x86/include/asm/paravirt_types.h|   3 +-