Re: [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO

2021-04-01 Thread Dave Hansen
On 4/1/21 3:26 PM, Sean Christopherson wrote:
> On Thu, Apr 01, 2021, Dave Hansen wrote:
>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" 
>>>
>>> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
>>> exit reason.
>>>
>>> For now we only handle a subset of the instructions that the kernel uses
>>> for MMIO operations. User-space access triggers SIGBUS.
>> ..
>>> +   case EXIT_REASON_EPT_VIOLATION:
>>> +   ve->instr_len = tdx_handle_mmio(regs, ve);
>>> +   break;
>>
>> Is MMIO literally the only thing that can cause an EPT violation for TDX
>> guests?
> 
> Any EPT Violation, or specifically EPT Violation #VE?  Any memory access can
> cause an EPT violation, but the VMM will get the ones that lead to VM-Exit.
> The guest will only get the ones that cause #VE.

I'll rephrase: Is MMIO literally the only thing that can cause us to get
into the EXIT_REASON_EPT_VIOLATION case of the switch() here?

> Assuming you're asking about #VE... No, any shared memory access can take a
> #VE since the VMM controls the shared EPT tables and can clear the
> SUPPRESS_VE bit at any time.  But, if the VMM is friendly, #VE should be
> limited to MMIO.

OK, but what are we doing in the case of unfriendly VMMs?  What does
*this* code do as-is, and where do we want to take it?

From the _looks_ of this patch, tdx_handle_mmio() is the be all end all
solution to all EXIT_REASON_EPT_VIOLATION events.

>> But for an OS where we have source for the *ENTIRE* thing, and where we
>> have a chokepoint for MMIO accesses (arch/x86/include/asm/io.h), it
>> seems like an *AWFUL* idea to:
>> 1. Have the kernel set up special mappings for I/O memory
>> 2. Kernel generates special instructions to access that memory
>> 3. Kernel faults on that memory
>> 4. Kernel cracks its own special instructions to see what they were
>>    doing
>> 5. Kernel calls up to host to do the MMIO
>>
>> Instead of doing 2/3/4, why not just have #2 call up to the host
>> directly?  This patch seems a very slow, roundabout way to do
>> paravirtualized MMIO.
>>
>> BTW, there's already some SEV special-casing in io.h.
> 
> I implemented #2 a while back for build_mmio_{read,write}(), I'm guessing the
> code is floating around somewhere.  The gotcha is that there are nasty little
> pieces of the kernel that don't use the helpers provided by io.h, e.g. the I/O
> APIC code likes to access MMIO via a struct overlay, so the compiler is free
> to use any instruction that satisfies the constraint.

So, there aren't an infinite number of these.  It's also 100% possible
to add some tooling to the kernel today to help you find these.  You
could also have added tooling to KVM hosts to help find these.

Folks are *also* saying that we'll need a driver audit just to trust
that drivers aren't vulnerable to attacks from devices or from the host.
This can quite easily be a part of that effort.

> The I/O APIC can and should be forced off, but dollars to donuts says there
> are more special snowflakes lying in wait.  If the kernel uses an allowlist
> for drivers, then in theory it should be possible to hunt down all offenders.
> But I think we'll want fallback logic to handle kernel MMIO #VEs, especially
> if the kernel needs ISA cracking logic for userspace.  Without fallback
> logic, any MMIO #VE from the kernel would be fatal, which is too harsh IMO
> since the behavior isn't so obviously wrong, e.g. versus the split lock #AC
> purge where there's no legitimate reason for the kernel to generate a split
> lock.

I'll buy that this patch is convenient for *debugging*.  It helped folks
bootstrap the TDX support and get it going.

IMNHO, if a driver causes a #VE, it's a bug.  Just like if it goes off
the rails and touches bad memory and #GP's or #PF's.

Are there any printk's in the #VE handler?  Guess what those do.  Print
to the console.  Guess what consoles do.  MMIO.  You can't get away from
doing audits of the console drivers.  Sure, you can go make #VE special,
like NMIs, but that's not going to be fun.  At least the guest doesn't
have to deal with the fatality of a nested #VE, but it's still fatal.

I just don't like us pretending that we're Windows and have no control
over the code we run and throwing up our hands.


Re: [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO

2021-04-01 Thread Sean Christopherson
On Thu, Apr 01, 2021, Dave Hansen wrote:
> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
> > exit reason.
> > 
> > For now we only handle a subset of the instructions that the kernel uses
> > for MMIO operations. User-space access triggers SIGBUS.
> ..
> > +   case EXIT_REASON_EPT_VIOLATION:
> > +   ve->instr_len = tdx_handle_mmio(regs, ve);
> > +   break;
> 
> Is MMIO literally the only thing that can cause an EPT violation for TDX
> guests?

Any EPT Violation, or specifically EPT Violation #VE?  Any memory access can
cause an EPT violation, but the VMM will get the ones that lead to VM-Exit.  The
guest will only get the ones that cause #VE.

Assuming you're asking about #VE... No, any shared memory access can take a #VE
since the VMM controls the shared EPT tables and can clear the SUPPRESS_VE bit 
at any time.  But, if the VMM is friendly, #VE should be limited to MMIO.

There's also the unaccepted private memory case, but if Linux gets an option to
opt out of that, then #VE is limited to shared memory.

> Forget userspace for a minute.  #VE's from userspace are annoying, but
> fine.  We can't control what userspace does.  If an action it takes
> causes a #VE in the TDX architecture, tough cookies, the kernel must
> handle it and try to recover or kill the app.
> 
> The kernel is very different.  We know in advance (must know,
> actually...) which instructions might cause exceptions of any kind.
> That's why we have exception tables and copy_to/from_user().  That's why
> we can handle kernel page faults on userspace, but not inside spinlocks.
> 
> Binary-dependent OSes are also very different.  It's going to be natural
> for them to want to take existing, signed drivers and use them in TDX
> guests.  They might want to do something like this.
> 
> But for an OS where we have source for the *ENTIRE* thing, and where we
> have a chokepoint for MMIO accesses (arch/x86/include/asm/io.h), it
> seems like an *AWFUL* idea to:
> 1. Have the kernel set up special mappings for I/O memory
> 2. Kernel generates special instructions to access that memory
> 3. Kernel faults on that memory
> 4. Kernel cracks its own special instructions to see what they were
>    doing
> 5. Kernel calls up to host to do the MMIO
> 
> Instead of doing 2/3/4, why not just have #2 call up to the host
> directly?  This patch seems a very slow, roundabout way to do
> paravirtualized MMIO.
> 
> BTW, there's already some SEV special-casing in io.h.

I implemented #2 a while back for build_mmio_{read,write}(), I'm guessing the
code is floating around somewhere.  The gotcha is that there are nasty little
pieces of the kernel that don't use the helpers provided by io.h, e.g. the I/O
APIC code likes to access MMIO via a struct overlay, so the compiler is free to
use any instruction that satisfies the constraint.

The I/O APIC can and should be forced off, but dollars to donuts says there are
more special snowflakes lying in wait.  If the kernel uses an allowlist for
drivers, then in theory it should be possible to hunt down all offenders.  But
I think we'll want fallback logic to handle kernel MMIO #VEs, especially if the
kernel needs ISA cracking logic for userspace.  Without fallback logic, any MMIO
#VE from the kernel would be fatal, which is too harsh IMO since the behavior
isn't so obviously wrong, e.g. versus the split lock #AC purge where there's no
legitimate reason for the kernel to generate a split lock.


Re: [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO

2021-04-01 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" 
> 
> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
> exit reason.
> 
> For now we only handle a subset of the instructions that the kernel uses for
> MMIO operations. User-space access triggers SIGBUS.
..
> + case EXIT_REASON_EPT_VIOLATION:
> + ve->instr_len = tdx_handle_mmio(regs, ve);
> + break;

Is MMIO literally the only thing that can cause an EPT violation for TDX
guests?

Forget userspace for a minute.  #VE's from userspace are annoying, but
fine.  We can't control what userspace does.  If an action it takes
causes a #VE in the TDX architecture, tough cookies, the kernel must
handle it and try to recover or kill the app.

The kernel is very different.  We know in advance (must know,
actually...) which instructions might cause exceptions of any kind.
That's why we have exception tables and copy_to/from_user().  That's why
we can handle kernel page faults on userspace, but not inside spinlocks.

Binary-dependent OSes are also very different.  It's going to be natural
for them to want to take existing, signed drivers and use them in TDX
guests.  They might want to do something like this.

But for an OS where we have source for the *ENTIRE* thing, and where we
have a chokepoint for MMIO accesses (arch/x86/include/asm/io.h), it
seems like an *AWFUL* idea to:
1. Have the kernel set up special mappings for I/O memory
2. Kernel generates special instructions to access that memory
3. Kernel faults on that memory
4. Kernel cracks its own special instructions to see what they were
   doing
5. Kernel calls up to host to do the MMIO

Instead of doing 2/3/4, why not just have #2 call up to the host
directly?  This patch seems a very slow, roundabout way to do
paravirtualized MMIO.

BTW, there's already some SEV special-casing in io.h.


[RFC v1 12/26] x86/tdx: Handle in-kernel MMIO

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
exit reason.

For now we only handle a subset of the instructions that the kernel uses for
MMIO operations. User-space access triggers SIGBUS.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/kernel/tdx.c | 120 ++
 1 file changed, 120 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3846d2807a7a..eff58329751e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -6,6 +6,8 @@
 #include 
 #include 
 #include 
+#include 
+#include  /* force_sig_fault() */
 
 #ifdef CONFIG_KVM_GUEST
 #include "tdx-kvm.c"
@@ -270,6 +272,121 @@ static void tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
}
 }
 
+static unsigned long tdx_mmio(int size, bool write, unsigned long addr,
+   unsigned long val)
+{
+   register long r10 asm("r10") = TDVMCALL_STANDARD;
+   register long r11 asm("r11") = EXIT_REASON_EPT_VIOLATION;
+   register long r12 asm("r12") = size;
+   register long r13 asm("r13") = write;
+   register long r14 asm("r14") = addr;
+   register long r15 asm("r15") = val;
+   register long rcx asm("rcx");
+   long ret;
+
+   /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
+   rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
+
+   asm volatile(TDCALL
+   : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
+ "=r"(r14), "=r"(r15)
+   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+ "r"(r13), "r"(r14), "r"(r15)
+   : );
+
+   WARN_ON(ret || r10);
+
+   return r11;
+}
+
+static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
+{
+   static const int regoff[] = {
+   offsetof(struct pt_regs, ax),
+   offsetof(struct pt_regs, cx),
+   offsetof(struct pt_regs, dx),
+   offsetof(struct pt_regs, bx),
+   offsetof(struct pt_regs, sp),
+   offsetof(struct pt_regs, bp),
+   offsetof(struct pt_regs, si),
+   offsetof(struct pt_regs, di),
+   offsetof(struct pt_regs, r8),
+   offsetof(struct pt_regs, r9),
+   offsetof(struct pt_regs, r10),
+   offsetof(struct pt_regs, r11),
+   offsetof(struct pt_regs, r12),
+   offsetof(struct pt_regs, r13),
+   offsetof(struct pt_regs, r14),
+   offsetof(struct pt_regs, r15),
+   };
+   int regno;
+
+   regno = X86_MODRM_REG(insn->modrm.value);
+   if (X86_REX_R(insn->rex_prefix.value))
+   regno += 8;
+
+   return (void *)regs + regoff[regno];
+}
+
+static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+   int size;
+   bool write;
+   unsigned long *reg;
+   struct insn insn;
+   unsigned long val = 0;
+
+   /*
+* User mode would mean the kernel exposed a device directly
+* to ring3, which shouldn't happen except for things like
+* DPDK.
+*/
+   if (user_mode(regs)) {
+   pr_err("Unexpected user-mode MMIO access.\n");
+   force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
+   return 0;
+   }
+
+   kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
+   insn_get_length(&insn);
+   insn_get_opcode(&insn);
+
+   write = ve->exit_qual & 0x2;
+
+   size = insn.opnd_bytes;
+   switch (insn.opcode.bytes[0]) {
+   /* MOV r/m8 r8  */
+   case 0x88:
+   /* MOV r8   r/m8*/
+   case 0x8A:
+   /* MOV r/m8 imm8*/
+   case 0xC6:
+   size = 1;
+   break;
+   }
+
+   if (inat_has_immediate(insn.attr)) {
+   BUG_ON(!write);
+   val = insn.immediate.value;
+   tdx_mmio(size, write, ve->gpa, val);
+   return insn.length;
+   }
+
+   BUG_ON(!inat_has_modrm(insn.attr));
+
+   reg = get_reg_ptr(regs, &insn);
+
+   if (write) {
+   memcpy(&val, reg, size);
+   tdx_mmio(size, write, ve->gpa, val);
+   } else {
+   val = tdx_mmio(size, write, ve->gpa, val);
+   memset(reg, 0, size);
+   memcpy(reg, &val, size);
+   }
+   return insn.length;
+}
+
 void __init tdx_early_init(void)
 {
if (!cpuid_has_tdx_guest())
@@ -331,6 +448,9 @@ int tdx_handle_virtualization_exception(struct pt_regs *regs,
case EXIT_REASON_IO_INSTRUCTION:
tdx_handle_io(regs, ve->exit_qual);
break;
+   case EXIT_REASON_EPT_VIOLATION:
+   ve->instr_len = tdx_handle_mmio(regs, ve);
+   break;
default: