Re: [RFC v1 00/26] Add TDX Guest Support
On Sun, Apr 4, 2021 at 8:02 AM Dave Hansen wrote:
>
> It occurred to me that I've been doing a lot of digging in the TDX spec lately. I think we can all agree that the "Architecture Specification" is not the world's easiest, most digestible reading. It's hard to figure out what the Linux relation to the spec is.
>
> One bit of documentation we need for TDX is a description of the memory states. For instance, it would be nice to spell out the different classes of memory, how they are selected, who selects them, and who enforces the selection. What faults are generated on each type and who can induce those?
>
> For instance:
>
> TD-Private memory is selected by the Shared/Private bit in Present=1 guest PTEs. When the hardware page walker sees that bit, it walks the secure EPT. The secure EPT entries can only be written by the TDX module, although they are written at the request of the VMM. The TDX module enforces rules like ensuring that the memory mapped by secure EPT is not mapped multiple times. The VMM can remove entries. From the guest perspective, all private memory accesses are either successful or result in a #VE. Private memory accesses do not cause VM exits.
>
> Would that be useful to folks?

That paragraph was useful for me as someone coming in cold to TDX patch review. +1 for more of that style of commentary.
Re: [RFC v1 00/26] Add TDX Guest Support
It occurred to me that I've been doing a lot of digging in the TDX spec lately. I think we can all agree that the "Architecture Specification" is not the world's easiest, most digestible reading. It's hard to figure out what the Linux relation to the spec is.

One bit of documentation we need for TDX is a description of the memory states. For instance, it would be nice to spell out the different classes of memory, how they are selected, who selects them, and who enforces the selection. What faults are generated on each type and who can induce those?

For instance:

TD-Private memory is selected by the Shared/Private bit in Present=1 guest PTEs. When the hardware page walker sees that bit, it walks the secure EPT. The secure EPT entries can only be written by the TDX module, although they are written at the request of the VMM. The TDX module enforces rules like ensuring that the memory mapped by secure EPT is not mapped multiple times. The VMM can remove entries. From the guest perspective, all private memory accesses are either successful or result in a #VE. Private memory accesses do not cause VM exits.

Would that be useful to folks?
Re: [RFC v1 00/26] Add TDX Guest Support
On Sat, Apr 03, 2021 at 09:26:24AM -0700, Dave Hansen wrote:
> On 4/2/21 2:32 PM, Andi Kleen wrote:
> >> If we go this route, what are the rules and restrictions? Do we have to
> >> say "no MMIO in #VE"?
> >
> > All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments". After that it can nest without problems.
>
> Well, not exactly. You still can't do things that could cause an unbounded recursive #VE.
>
> It doesn't seem *that* far fetched to think that someone might try to defer some work or dump data to the console.

I believe the main console code has reentry protection. I'm not sure about early_printk (with keep), but if that's the case it probably should be fixed anyway. I can take a look at that.

Not sure why deferring something would cause another #VE?

> > If you nest before that, TDX will cause a triple fault.
> >
> > The code that cannot do it is a few lines in the early handler which runs with interrupts off.
>
> >> Which brings up another related point: How do you debug TD guests? Does earlyprintk work?
> >
> > Today it works actually because serial ports are allowed. But I expect it to be closed eventually because serial code is a lot of code to audit. But you can always disable the filtering with a command line option and then it will always work for debugging.
>
> Do we need a TDX-specific earlyprintk? I would imagine it's pretty easy to implement.

I don't see a need at this point; the existing mechanisms work. Maybe if we ever have a problem that only happens in lockdown *and* happens early, but that's not very likely since lockdown primarily changes code behavior later.

There are also other debug mechanisms for such cases: in TDX, if you configure the TD for debug mode, it supports using the gdb stub on the hypervisor.

-Andi
Re: [RFC v1 00/26] Add TDX Guest Support
On 4/2/21 2:32 PM, Andi Kleen wrote:
>> If we go this route, what are the rules and restrictions? Do we have to
>> say "no MMIO in #VE"?
>
> All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments". After that it can nest without problems.

Well, not exactly. You still can't do things that could cause an unbounded recursive #VE.

It doesn't seem *that* far fetched to think that someone might try to defer some work or dump data to the console.

> If you nest before that, TDX will cause a triple fault.
>
> The code that cannot do it is a few lines in the early handler which runs with interrupts off.

>> Which brings up another related point: How do you debug TD guests? Does
>> earlyprintk work?
>
> Today it works actually because serial ports are allowed. But I expect it to be closed eventually because serial code is a lot of code to audit. But you can always disable the filtering with a command line option and then it will always work for debugging.

Do we need a TDX-specific earlyprintk? I would imagine it's pretty easy to implement.
Re: [RFC v1 00/26] Add TDX Guest Support
> If we go this route, what are the rules and restrictions? Do we have to
> say "no MMIO in #VE"?

All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments". After that it can nest without problems.

If you nest before that, TDX will cause a triple fault.

The code that cannot do it is a few lines in the early handler which runs with interrupts off. The TDX module also makes sure to not inject NMIs while we're in that region, so NMIs are of no concern. That was the whole point of avoiding the system call gap problem. We don't need to make it IST, so it can nest.

I'm not aware of any other special rules.

> Which brings up another related point: How do you debug TD guests? Does
> earlyprintk work?

Today it works actually because serial ports are allowed. But I expect it to be closed eventually because serial code is a lot of code to audit. But you can always disable the filtering with a command line option and then it will always work for debugging.

-Andi
Re: [RFC v1 00/26] Add TDX Guest Support
On 4/1/21 7:48 PM, Andi Kleen wrote:
>> I've heard things like "we need to harden the drivers" or "we need to do
>> audits" and that drivers might be "whitelisted".
>
> The basic driver allow-listing patches are already in the repository,
> but not currently posted or complete:
>
> https://github.com/intel/tdx/commits/guest

That lists exactly 8 IDs:

> { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1000 }, /* Virtio NET */
> { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1001 }, /* Virtio block */
> { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1003 }, /* Virtio console */
> { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1009 }, /* Virtio FS */
>
> { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041 }, /* Virtio 1.0 NET */
> { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042 }, /* Virtio 1.0 block */
> { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1043 }, /* Virtio 1.0 console */
> { PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1049 }, /* Virtio 1.0 FS */

How many places do those 8 drivers touch MMIO?

>> Are there any "real" hardware drivers involved like how QEMU emulates
>> an e1000 or rtl8139 device?
>
> Not currently (but some later hypervisor might rely on one of those)
>
>> What about the APIC or HPET?
>
> No IO-APIC, but the local APIC. No HPET.

Sean seemed worried about other x86-specific oddities. Are there any more, or is the local APIC the only non-driver MMIO?

>> Without something concrete, it's really hard to figure out if we should
>> go full-blown paravirtualized MMIO, or do something like the #VE
>> trapping that's in this series currently.
>
> As Sean says, the concern about MMIO is less drivers (which should be
> generally okay if they work on other architectures which require MMIO
> magic), but other odd code that only ran on x86 before.
>
> I really don't understand your crusade against #VE. It really isn't
> that bad if we can avoid the few corner cases.

The problem isn't with #VE per se. It's with posting a series that masquerades as a full solution while *NOT* covering or even enumerating the corner cases.
That's exactly what happened with #VE to start with: it was implemented in a way that exposed the kernel to #VE during the syscall gap (and the SWAPGS gap, for that matter).

So, I'm pushing for a design that won't have corner cases. If MMIO itself is disallowed, then we can scream about *any* detected MMIO. Then there's no worry about #VE nesting. No #VE, no #VE nesting. We don't even have to consider whether #VE needs NMI-like semantics.

> For me it would seem wrong to force all MMIO for all drivers to some
> complicated paravirt construct, blowing up code size everywhere and
> adding complicated self-modifying code, when it's only needed for very
> few drivers. But we also don't want to patch every MMIO to be special
> cased even for those few drivers.
>
> #VE-based MMIO avoids all that cleanly while being nicely non-intrusive.

But we're not selling used cars here. Using #VE has downsides. Let's not pretend that it doesn't.

If we go this route, what are the rules and restrictions? Do we have to say "no MMIO in #VE"?

I'm really most worried about the console. Consoles and NMIs have been a nightmare, IIRC. Doesn't this just make it *WORSE* because now the deepest reaches of the console driver are guaranteed to #VE?

Which brings up another related point: How do you debug TD guests? Does earlyprintk work?
Re: [RFC v1 00/26] Add TDX Guest Support
> I've heard things like "we need to harden the drivers" or "we need to do
> audits" and that drivers might be "whitelisted".

The basic driver allow-listing patches are already in the repository, but not currently posted or complete:

https://github.com/intel/tdx/commits/guest

> What are we talking about specifically? Which drivers? How many
> approximately? Just virtio?

Right now just virtio; later, other drivers that hypervisors need.

> Are there any "real" hardware drivers involved like how QEMU emulates
> an e1000 or rtl8139 device?

Not currently (but some later hypervisor might rely on one of those).

> What about the APIC or HPET?

No IO-APIC, but the local APIC. No HPET.

> How broadly across the kernel is this going to go?

Not very broadly for drivers.

> Without something concrete, it's really hard to figure out if we should
> go full-blown paravirtualized MMIO, or do something like the #VE
> trapping that's in this series currently.

As Sean says, the concern about MMIO is less drivers (which should be generally okay if they work on other architectures which require MMIO magic), but other odd code that only ran on x86 before.

I really don't understand your crusade against #VE. It really isn't that bad if we can avoid the few corner cases.

For me it would seem wrong to force all MMIO for all drivers to some complicated paravirt construct, blowing up code size everywhere and adding complicated self-modifying code, when it's only needed for very few drivers. But we also don't want to patch every MMIO to be special-cased even for those few drivers.

#VE-based MMIO avoids all that cleanly while being nicely non-intrusive.

-Andi
Re: [RFC v1 00/26] Add TDX Guest Support
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
> hosts and some physical attacks. This series adds the bare-minimum
> support to run a TDX guest. The host-side support will be submitted
> separately. Also support for advanced TD guest features like attestation
> or debug-mode will be submitted separately. Also, at this point it is not
> secure with some known holes in drivers, and also hasn't been fully audited
> and fuzzed yet.

I want to hear a lot more about this driver model. I've heard things like "we need to harden the drivers" or "we need to do audits" and that drivers might be "whitelisted".

What are we talking about specifically? Which drivers? How many approximately? Just virtio? Are there any "real" hardware drivers involved like how QEMU emulates an e1000 or rtl8139 device? What about the APIC or HPET?

How broadly across the kernel is this going to go?

Without something concrete, it's really hard to figure out if we should go full-blown paravirtualized MMIO, or do something like the #VE trapping that's in this series currently.
Re: [RFC v1 00/26] Add TDX Guest Support
Hi All,

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:

Hi All,

NOTE: This series is not ready for wide public review. It is being specifically posted so that Peter Z and other experts on the entry code can look for problems with the new exception handler (#VE). That's also why x86@ is not being spammed.

We are currently working on a solution to fix the issues raised in the "Add #VE support for TDX guest" patch. While we fix that issue, I would like to know if there are issues in other patches in this series. So if possible, can you please review the other patches in the series and let us know your comments?

If you want me to rebase the series on top of a v5.12-rcX kernel and repost it, please let me know.

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious hosts and some physical attacks. This series adds the bare-minimum support to run a TDX guest. The host-side support will be submitted separately. Support for advanced TD guest features like attestation or debug mode will also be submitted separately. Also, at this point it is not secure, with some known holes in drivers, and it hasn’t been fully audited and fuzzed yet.

TDX has a lot of similarities to SEV. It enhances confidentiality of guest memory and state (like registers) and includes a new exception (#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged yet), TDX limits the host's ability to effect changes in the guest physical address space. In contrast to the SEV code in the kernel, TDX guest memory is integrity protected and isolated; the host is prevented from accessing guest memory (even ciphertext).

The TDX architecture also includes a new CPU mode called Secure-Arbitration Mode (SEAM). The software (TDX module) running in this mode arbitrates interactions between host and guest and implements many of the guarantees of the TDX architecture.

Some of the key differences between a TD and a regular VM are:

1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects the #VE exception into the guest TD in cases of instructions that need to be emulated, disallowed MSR accesses, a subset of CPUID leaves, etc.
3. By default memory is marked as private, and the TD will selectively share it with the VMM based on need.
4. Remote attestation is supported to enable a third party (either the owner of the workload or a user of the services provided by the workload) to establish that the workload is running on an Intel-TDX-enabled platform located within a TD prior to providing that workload data.

You can find TDX related documents at the following link:

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

This RFC series has been reviewed by Dave Hansen.

Kirill A. Shutemov (16):
  x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  x86/tdx: Get TD execution environment information via TDINFO
  x86/traps: Add #VE support for TDX guest
  x86/tdx: Add HLT support for TDX guest
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Add MSR support for TDX guest
  x86/tdx: Handle CPUID via #VE
  x86/io: Allow to override inX() and outX() implementation
  x86/tdx: Handle port I/O
  x86/tdx: Handle in-kernel MMIO
  x86/mm: Move force_dma_unencrypted() to common code
  x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  x86/tdx: Make pages shared in ioremap()
  x86/tdx: Add helper to do MapGPA TDVMCALL
  x86/tdx: Make DMA pages shared
  x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (6):
  x86/cpufeatures: Add TDX Guest CPU feature
  x86/cpufeatures: Add is_tdx_guest() interface
  x86/tdx: Handle MWAIT, MONITOR and WBINVD
  ACPI: tables: Add multiprocessor wake-up support
  x86/topology: Disable CPU hotplug support for TDX platforms
  x86/tdx: Introduce INTEL_TDX_GUEST config option

Sean Christopherson (4):
  x86/boot: Add a trampoline for APs booting in 64-bit mode
  x86/boot: Avoid #VE during compressed boot for TDX platforms
  x86/boot: Avoid unnecessary #VE during boot process
  x86/tdx: Forcefully disable legacy PIC for TDX guests

 arch/x86/Kconfig                      | 28 +-
 arch/x86/boot/compressed/Makefile     |  2 +
 arch/x86/boot/compressed/head_64.S    | 10 +-
 arch/x86/boot/compressed/misc.h       |  1 +
 arch/x86/boot/compressed/pgtable.h    |  2 +-
 arch/x86/boot/compressed/tdx.c        | 32 ++
 arch/x86/boot/compressed/tdx_io.S     |  9 +
 arch/x86/include/asm/apic.h           |  3 +
 arch/x86/include/asm/asm-prototypes.h |  1 +
 arch/x86/include/asm/cpufeatures.h    |  1 +
 arch/x86/include/asm/idtentry.h       |  4 +
 arch/x86/include/asm/io.h             | 25 +-
 arch/x86/include/asm/irqflags.h       | 42 +-
 arch/x86/include/asm/kvm_para.h       | 21 +
 arch/x86/include/asm/paravirt.h       | 22 +-
 arch/x86/include/asm/paravirt_types.h |  3 +-