Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-30 Thread Linus Torvalds
On Sun, Aug 30, 2020 at 8:37 AM Andy Lutomirski  wrote:
>
> There's no such thing as "just" using an IST.  Using IST opens a huge
> can of worms due to its recursion issues.

I absolutely despise all the x86 "indirect system structures". They
are horrible garbage. IST is only yet another example of that kind of
brokenness, and annoys me particularly because it (and swapgs) were
actually making x86 _worse_.

The old i386 exception model was actually better than what x86-64 did,
and IST is a big part of the problem. Just have a supervisor stack,
and push the state on it. Stop playing games with multiple stacks
depending on some magical indirect system state.

Other examples of stupid and bad indirection:

 - the GDT and LDT.

   The kernel should never have to use them. It would be much better
if the segment "shadow" state would stop being shadow state, and be
the REAL state that the kernel (and user space, for that matter)
accesses.

   Yeah, we got halfway there with MSR_FS/GS_BASE, but what a complete
garbage crock that was. So now we're forced to use the selector *and*
the base register, and they may be out of sync with each other, so
you have the worst of both worlds.

   Keep the GDT and LDT around for compatibility reasons, so that old
broken programs that want to load the segment state the old-fashioned
way can do so. But make it clear that that is purely for legacy, and
make the modern code just save and restore the actual true
non-indirect segment state.

   For new models, give us a way to load base/limit/permissions
directly, and reset them on kernel entry. No more descriptor table
indirection games.

 - the IDT and the TSS segment.

   Exact same arguments as above. Keep them around for legacy
programs, but let us just set "this is the entrypoint, this is the
kernel stack" as registers. Christ, we're probably better off with one
single entry-point for the whole kernel (ok, give us a separate one
for NMI/MCE/doublefault, since they are _so_ special, and maybe
separate "CPU exceptions" from "external interrupts"), together with
just a register that says what the exception was.

 - swapgs needs to die.

   The kernel GS/FS segments should just be separate segment registers
from user space. No "swapping" needed. In CPL0, "gs" just means
something different from user space. No save/restore code for it, no
swapping, no nothing.

Honestly, I think %rsp/%rip could work like that too. Just make "rsp"
and "rip" be a completely different register in kernel mode - rename
it in the front-end of the CPU or whatever.

Imagine not having to save/restore rsp/rip on kernel entry/exit at
all, because returning to user mode just implicitly starts using
ursp/urip. And a context switch uses (fast) MSRs to save/restore the
user state (or, since it's actually a real register in the register
file, just a new "mov" instruction to access the user registers).
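The banked-register idea above can be sketched as a toy model (purely illustrative; names like `kernel_entry` and the rename-by-CPL mechanism are made up to show the concept, not any real hardware interface):

```python
# Toy model: "rsp"/"rip" name a different physical register depending on
# privilege level, so kernel entry/exit never saves or restores the user
# values -- switching CPL is enough.  A real CPU would do this rename in
# the front-end; every name and address here is hypothetical.

class BankedCPU:
    def __init__(self):
        # separate physical registers for user and kernel mode
        self.regs = {"user": {"rsp": 0, "rip": 0},
                     "kernel": {"rsp": 0, "rip": 0}}
        self.cpl = 3  # start in user mode

    def _bank(self):
        return self.regs["kernel" if self.cpl == 0 else "user"]

    def write(self, name, value):
        self._bank()[name] = value

    def read(self, name):
        return self._bank()[name]

    def kernel_entry(self, entry_point, kernel_stack):
        # nothing pushed anywhere: the user rsp/rip simply stop being
        # the registers that the names "rsp"/"rip" resolve to
        self.cpl = 0
        self.write("rip", entry_point)
        self.write("rsp", kernel_stack)

    def return_to_user(self):
        # user rsp/rip were never touched; flipping CPL restores them
        self.cpl = 3


cpu = BankedCPU()
cpu.write("rsp", 0x7FFF_F000)   # user stack
cpu.write("rip", 0x40_1000)     # user code

cpu.kernel_entry(entry_point=0xFFFF_8000_0000_0000,
                 kernel_stack=0xFFFF_8800_0000_0000)
in_kernel = (cpu.read("rip"), cpu.read("rsp"))

cpu.return_to_user()
back_in_user = (cpu.read("rip"), cpu.read("rsp"))
```

The point of the toy: `return_to_user` contains no restore code at all, which is exactly the "no save/restore on entry/exit" property being argued for.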

 Linus


Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-30 Thread Andy Lutomirski
On Wed, Aug 26, 2020 at 12:16 PM Sean Christopherson
 wrote:
>
> On Tue, Aug 25, 2020 at 10:28:53AM -0700, Andy Lutomirski wrote:
> > On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson
> >  wrote:
> > > One thought would be to have the TDX module (thing that runs in SEAM and
> > > sits between the VMM and the guest) provide a TDCALL (hypercall from guest
> > > to TDX module) to the guest that would allow the guest to specify a very
> > > limited number of GPAs that must never generate a #VE, e.g. go straight to
> > > guest shutdown if a disallowed GPA would go pending.  That seems doable
> > > from a TDX perspective without incurring noticeable overhead (assuming the
> > > list of GPAs is very small) and should be easy to support in the guest,
> > > e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
> > > page and its scratch data.
> >
> > I guess you could do that, but this is getting gross.  The x86
> > architecture has really gone off the rails here.
>
> Does it suck less than using an IST?  Honest question.
>
> I will add my voice to the "fix SYSCALL" train, but the odds of that getting
> a proper fix in time to intercept TDX are not good.  On the other hand,
> "fixing" the SYSCALL issue in the TDX module is much more feasible, but only
> if we see real value in such an approach as opposed to just using an IST.  I
> personally like the idea of a TDX module solution as I think it would be
> simpler for the kernel to implement/support, and would mean we wouldn't need
> to roll back IST usage for #VE if the heavens should part and bestow upon us
> a sane SYSCALL.

There's no such thing as "just" using an IST.  Using IST opens a huge
can of worms due to its recursion issues.

The TDX module solution is utterly gross but may well suck less than
using an IST.
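The recursion issue can be illustrated with a small sketch (numbers and layout invented for illustration): IST delivery always loads RSP from a fixed per-vector slot, so a nested delivery of the same vector overwrites the first exception frame before the handler can adjust the pointer.

```python
# Why IST opens a can of worms: hardware ignores the current RSP and
# always starts the frame at the vector's fixed IST top.  If the same
# exception nests before the handler switches or adjusts the stack,
# frame 2 lands exactly on top of frame 1 and destroys the saved state.

IST_TOP = 0x6000          # fixed stack top for this vector (illustrative)
FRAME = 5 * 8             # iret frame: ss, rsp, rflags, cs, rip

stack = {}                # sparse model of memory

def deliver_ist_exception(saved_rip):
    # hardware ignores the current rsp and always starts at IST_TOP
    rsp = IST_TOP - FRAME
    stack[rsp] = saved_rip      # only modeling the saved-rip slot
    return rsp

rsp1 = deliver_ist_exception(saved_rip=0x1111)
first_frame_rip = stack[rsp1]

# a nested delivery of the same vector, before the handler has run
# far enough to adjust the IST pointer, reuses the same address...
rsp2 = deliver_ist_exception(saved_rip=0x2222)

# ...so the first frame's return state is gone
clobbered = stack[rsp1] != first_frame_rip
```

This is the failure mode the real entry code has to paper over with per-vector IST shift/adjust tricks, which is the complexity being objected to.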


Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-26 Thread Sean Christopherson
On Tue, Aug 25, 2020 at 10:28:53AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson
>  wrote:
> > One thought would be to have the TDX module (thing that runs in SEAM and
> > sits between the VMM and the guest) provide a TDCALL (hypercall from guest
> > to TDX module) to the guest that would allow the guest to specify a very
> > limited number of GPAs that must never generate a #VE, e.g. go straight to
> > guest shutdown if a disallowed GPA would go pending.  That seems doable
> > from a TDX perspective without incurring noticeable overhead (assuming the
> > list of GPAs is very small) and should be easy to support in the guest,
> > e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
> > page and its scratch data.
> 
> I guess you could do that, but this is getting gross.  The x86
> architecture has really gone off the rails here.

Does it suck less than using an IST?  Honest question.

I will add my voice to the "fix SYSCALL" train, but the odds of that getting
a proper fix in time to intercept TDX are not good.  On the other hand,
"fixing" the SYSCALL issue in the TDX module is much more feasible, but only
if we see real value in such an approach as opposed to just using an IST.  I
personally like the idea of a TDX module solution as I think it would be
simpler for the kernel to implement/support, and would mean we wouldn't need
to roll back IST usage for #VE if the heavens should part and bestow upon us
a sane SYSCALL.


RE: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Thomas Gleixner
On Tue, Aug 25 2020 at 17:35, Tony Luck wrote:
>> > Or malicious hypervisor action, and that's a problem.
>> >
>> > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
>> > actual SYSCALL text or the first memory it accesses -- I don't have a
>> > TDX spec so I don't know the details).
>
> Is it feasible to defend against a malicious (or buggy) hypervisor?
>
> Obviously, we can't leave holes that guests can exploit. But the hypervisor
> can crash the system no matter how clever TDX is.

If it crashes and burns reliably then fine, but is that guaranteed?

I have serious doubts about that given the history and fragility of all
of this and I really have zero interest in dealing with the fallout a
year from now.

Thanks,

tglx


Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Dave Hansen
On 8/25/20 10:59 AM, Andrew Cooper wrote:
> If I've read the TDX spec/whitepaper properly, the main hypervisor can
> write to all the encrypted pages.  This will destroy data, break the
> MAC, and yield #PF inside the SEAM hypervisor, or the TD when the cache
> line is next referenced.

I think you're talking about:

> Attempting to access a private KeyID by software outside the SEAM
> mode would cause a page-fault exception (#PF).

I don't think that ever results in a TD guest #PF.  "A MAC-verification
failure would be fatal to the TD and lead to its termination."  In this
context, I think that means that the TD stops running and can not be
reentered.


Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Andrew Cooper
On 25/08/2020 18:35, Luck, Tony wrote:
>>> Or malicious hypervisor action, and that's a problem.
>>>
>>> Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
>>> actual SYSCALL text or the first memory it accesses -- I don't have a
>>> TDX spec so I don't know the details).
> Is it feasible to defend against a malicious (or buggy) hypervisor?
>
> Obviously, we can't leave holes that guests can exploit. But the hypervisor
> can crash the system no matter how clever TDX is.

You have to be more specific about what you mean by "malicious" hypervisor.

Nothing can protect against a hypervisor which refuses to schedule the
Trusted Domain.  The guest cannot protect against availability
maliciousness.  However, you can use market forces to fix that problem. 
(I'll take my credit card elsewhere if you don't schedule my VM, etc)

Things are more complicated when it comes to integrity or
confidentiality of the TD, but the prevailing feeling seems to be
"crashing obviously and reliably if something goes wrong is ok".

If I've read the TDX spec/whitepaper properly, the main hypervisor can
write to all the encrypted pages.  This will destroy data, break the
MAC, and yield #PF inside the SEAM hypervisor, or the TD when the cache
line is next referenced.

Cunning timing on the part of a malicious hypervisor (hitting the SYSCALL
gap) will cause the guest's #PF handler to run on a user stack, opening
a privilege escalation hole.

Whatever you might want to say about the exact integrity/confidentiality
expectations, I think "the hypervisor can open a user=>kernel privilege
escalation hole inside the TD" is not what people would consider acceptable.

On AMD parts, this is why the #VC handler is IST, in an attempt to at
least notice this damage and crash.  There is no way TDX can get away
with requiring #PF to be IST as well.

~Andrew



Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Andy Lutomirski
On Tue, Aug 25, 2020 at 10:36 AM Luck, Tony  wrote:
>
> > > Or malicious hypervisor action, and that's a problem.
> > >
> > > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> > > actual SYSCALL text or the first memory it accesses -- I don't have a
> > > TDX spec so I don't know the details).
>
> Is it feasible to defend against a malicious (or buggy) hypervisor?
>
> Obviously, we can't leave holes that guests can exploit. But the hypervisor
> can crash the system no matter how clever TDX is.

Crashing the system is one thing.  Corrupting the system in a way that
could allow code execution is another thing entirely.  And the whole
point of TDX is to defend the guest against the hypervisor.


RE: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Luck, Tony
> > Or malicious hypervisor action, and that's a problem.
> >
> > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> > actual SYSCALL text or the first memory it accesses -- I don't have a
> > TDX spec so I don't know the details).

Is it feasible to defend against a malicious (or buggy) hypervisor?

Obviously, we can't leave holes that guests can exploit. But the hypervisor
can crash the system no matter how clever TDX is.

-Tony


Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Andy Lutomirski
On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson
 wrote:
>
> On Tue, Aug 25, 2020 at 09:49:05AM -0700, Andy Lutomirski wrote:
> > On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson
> >  wrote:
> > >
> > > +Andy
> > >
> > > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> > > > And to help with coordination, here is something prepared (slightly)
> > > > earlier.
> > > >
> > > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> > > >
> > > > This identifies the problems from software's perspective, along with
> > > > proposing behaviour which ought to resolve the issues.
> > > >
> > > > It is still a work-in-progress.  The #VE section still needs updating in
> > > > light of the publication of the recent TDX spec.
> > >
> > > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> > > something we (Linux) as the guest kernel actually want to handle 
> > > gracefully
> > > (where gracefully means not panicking)?  For TDX, a #VE in the SYSCALL gap
> > > would require one of two things:
> > >
> > >   a) The guest kernel to not accept/validate the GPA->HPA mapping for the
> > >  relevant pages, e.g. code or scratch data.
> > >
> > >   b) The host VMM to remap the GPA (making the GPA->HPA pending again).
> > >
> > > (a) is only possible if there's a fatally buggy guest kernel (or perhaps 
> > > vBIOS).
> > > (b) requires either a buggy or malicious host VMM.
> > >
> > > I ask because, if the answer is "no, panic at will", then we shouldn't 
> > > need
> > > to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> > > instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel 
> > > bug.
> >
> > Or malicious hypervisor action, and that's a problem.
> >
> > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> > actual SYSCALL text or the first memory it accesses -- I don't have a
> > TDX spec so I don't know the details).
>
> You can thank our legal department :-)
>
> > The user does SYSCALL, the kernel hits the funny GPA, and #VE is delivered.
> > The microcode will write the IRET frame, with mostly user-controlled 
> > contents,
> > wherever RSP points, and RSP is also user controlled.  Calling this a 
> > "panic"
> > is charitable -- it's really game over against an attacker who is moderately
> > clever.
> >
> > The kernel can't do anything about this because it's game over before
> > the kernel has had the chance to execute any instructions.
>
> Hrm, I was thinking that SMAP=1 would give the necessary protections, but
> in typing that out I realized userspace can throw in an RSP value that
> points at kernel memory.  Duh.
>
> One thought would be to have the TDX module (thing that runs in SEAM and
> sits between the VMM and the guest) provide a TDCALL (hypercall from guest
> to TDX module) to the guest that would allow the guest to specify a very
> limited number of GPAs that must never generate a #VE, e.g. go straight to
> guest shutdown if a disallowed GPA would go pending.  That seems doable
> from a TDX perspective without incurring noticeable overhead (assuming the
> > list of GPAs is very small) and should be easy to support in the guest,
> e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
> page and its scratch data.

I guess you could do that, but this is getting gross.  The x86
architecture has really gone off the rails here.


Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Sean Christopherson
On Tue, Aug 25, 2020 at 09:49:05AM -0700, Andy Lutomirski wrote:
> On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson
>  wrote:
> >
> > +Andy
> >
> > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> > > And to help with coordination, here is something prepared (slightly)
> > > earlier.
> > >
> > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> > >
> > > This identifies the problems from software's perspective, along with
> > > proposing behaviour which ought to resolve the issues.
> > >
> > > It is still a work-in-progress.  The #VE section still needs updating in
> > > light of the publication of the recent TDX spec.
> >
> > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> > something we (Linux) as the guest kernel actually want to handle gracefully
> > (where gracefully means not panicking)?  For TDX, a #VE in the SYSCALL gap
> > would require one of two things:
> >
> >   a) The guest kernel to not accept/validate the GPA->HPA mapping for the
> >  relevant pages, e.g. code or scratch data.
> >
> >   b) The host VMM to remap the GPA (making the GPA->HPA pending again).
> >
> > (a) is only possible if there's a fatally buggy guest kernel (or perhaps 
> > vBIOS).
> > (b) requires either a buggy or malicious host VMM.
> >
> > I ask because, if the answer is "no, panic at will", then we shouldn't need
> > to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> > instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
> 
> Or malicious hypervisor action, and that's a problem.
> 
> Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> actual SYSCALL text or the first memory it accesses -- I don't have a
> TDX spec so I don't know the details).

You can thank our legal department :-)

> The user does SYSCALL, the kernel hits the funny GPA, and #VE is delivered.
> The microcode will write the IRET frame, with mostly user-controlled contents,
> wherever RSP points, and RSP is also user controlled.  Calling this a "panic"
> is charitable -- it's really game over against an attacker who is moderately
> clever.
> 
> The kernel can't do anything about this because it's game over before
> the kernel has had the chance to execute any instructions.

Hrm, I was thinking that SMAP=1 would give the necessary protections, but
in typing that out I realized userspace can throw in an RSP value that
points at kernel memory.  Duh.

One thought would be to have the TDX module (thing that runs in SEAM and
sits between the VMM and the guest) provide a TDCALL (hypercall from guest
to TDX module) to the guest that would allow the guest to specify a very
limited number of GPAs that must never generate a #VE, e.g. go straight to
guest shutdown if a disallowed GPA would go pending.  That seems doable
from a TDX perspective without incurring noticeable overhead (assuming the
list of GPAs is very small) and should be easy to support in the guest,
e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
page and its scratch data.
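The proposed TDCALL can be sketched as a toy model of the TDX-module side. Everything here is invented for illustration (the name `tdcall_protect_gpa`, the return codes, the limit of four GPAs); the real TDCALL ABI, if it existed, would look nothing like Python:

```python
# Sketch of the proposal: the guest registers a handful of GPAs (the
# SYSCALL page and its scratch data) that must never go #VE-pending.
# A later host remap of one of those GPAs kills the TD outright instead
# of injecting #VE into the SYSCALL gap.  All names are hypothetical.

class ToyTdxModule:
    MAX_PROTECTED = 4            # "very limited number of GPAs"

    def __init__(self):
        self.protected = set()
        self.td_alive = True

    def tdcall_protect_gpa(self, gpa):
        # guest -> TDX-module hypercall, made once or twice during boot
        if len(self.protected) >= self.MAX_PROTECTED:
            return "TDCALL_ERR_FULL"
        self.protected.add(gpa)
        return "TDCALL_OK"

    def host_remap_gpa(self, gpa):
        # host VMM remaps a GPA; normally this makes the mapping pending
        # and the next guest access raises #VE
        if gpa in self.protected:
            self.td_alive = False    # go straight to guest shutdown
            return "TD_SHUTDOWN"
        return "GPA_PENDING_VE"


tdx = ToyTdxModule()
# guest boot: protect the SYSCALL page and its scratch page
assert tdx.tdcall_protect_gpa(0x1000) == "TDCALL_OK"
assert tdx.tdcall_protect_gpa(0x2000) == "TDCALL_OK"

benign = tdx.host_remap_gpa(0x9000)   # unprotected GPA: normal #VE path
attack = tdx.host_remap_gpa(0x1000)   # SYSCALL page: TD is terminated
```

The key property is that the check happens in the TDX module when the host remaps, so the guest never has to survive a #VE on a stack it does not control.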


Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Andy Lutomirski
On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson
 wrote:
>
> +Andy
>
> On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> > And to help with coordination, here is something prepared (slightly)
> > earlier.
> >
> > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> >
> > This identifies the problems from software's perspective, along with
> > proposing behaviour which ought to resolve the issues.
> >
> > It is still a work-in-progress.  The #VE section still needs updating in
> > light of the publication of the recent TDX spec.
>
> For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> something we (Linux) as the guest kernel actually want to handle gracefully
> (where gracefully means not panicking)?  For TDX, a #VE in the SYSCALL gap
> would require one of two things:
>
>   a) The guest kernel to not accept/validate the GPA->HPA mapping for the
>  relevant pages, e.g. code or scratch data.
>
>   b) The host VMM to remap the GPA (making the GPA->HPA pending again).
>
> (a) is only possible if there's a fatally buggy guest kernel (or perhaps vBIOS).
> (b) requires either a buggy or malicious host VMM.
>
> I ask because, if the answer is "no, panic at will", then we shouldn't need
> to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.

Or malicious hypervisor action, and that's a problem.

Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
actual SYSCALL text or the first memory it accesses -- I don't have a
TDX spec so I don't know the details).  The user does SYSCALL, the
kernel hits the funny GPA, and #VE is delivered.  The microcode will
write the IRET frame, with mostly user-controlled contents, wherever
RSP points, and RSP is also user controlled.  Calling this a "panic"
is charitable -- it's really game over against an attacker who is
moderately clever.

The kernel can't do anything about this because it's game over before
the kernel has had the chance to execute any instructions.
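The attack can be made concrete with a toy model of exception delivery (addresses and frame layout illustrative; real delivery involves more state, but the user-controlled-RSP problem is the same):

```python
# Why #VE in the SYSCALL gap is game over: on SYSCALL, RSP still holds
# the user value.  If an exception hits before the kernel switches
# stacks, the CPU pushes an IRET frame -- with largely user-chosen
# contents -- wherever RSP points, including kernel memory.

memory = {}
KERNEL_FUNC_PTR = 0xFFFF_8000_0000_1000          # kernel data the
memory[KERNEL_FUNC_PTR] = 0xFFFF_8000_DEAD_BEEF  # attacker targets

def deliver_exception(rsp, rip, rflags):
    # hardware pushes ss, rsp, rflags, cs, rip downward from rsp;
    # nothing validates that rsp points at a sane kernel stack
    frame = [0x2B, rsp, rflags, 0x33, rip]
    for i, word in enumerate(frame):
        memory[rsp - 8 * (i + 1)] = word

# attacker picks RSP so the rflags slot (rsp - 24) lands exactly on
# the kernel pointer, then triggers #VE in the gap between SYSCALL
# and the kernel's stack switch
evil_rsp = KERNEL_FUNC_PTR + 8 * 3
deliver_exception(rsp=evil_rsp, rip=0x40_1000, rflags=0x41414141)

# kernel memory now holds an attacker-controlled value
overwritten = memory[KERNEL_FUNC_PTR] == 0x41414141
```

Since the frame is written by hardware before the first kernel instruction of the handler runs, there is no point at which the guest could have defended itself, which is the argument being made above.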


Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Dave Hansen
On 8/24/20 9:39 PM, Sean Christopherson wrote:
> +Andy
> 
> On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
>> And to help with coordination, here is something prepared (slightly)
>> earlier.
>>
>> https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
>>
>> This identifies the problems from software's perspective, along with
>> proposing behaviour which ought to resolve the issues.
>>
>> It is still a work-in-progress.  The #VE section still needs updating in
>> light of the publication of the recent TDX spec.
> 
> For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> something we (Linux) as the guest kernel actually want to handle gracefully
> (where gracefully means not panicking)?  For TDX, a #VE in the SYSCALL gap
> would require one of two things:
> 
>   a) The guest kernel to not accept/validate the GPA->HPA mapping for the
>  relevant pages, e.g. code or scratch data.
> 
>   b) The host VMM to remap the GPA (making the GPA->HPA pending again).
> 
> (a) is only possible if there's a fatally buggy guest kernel (or perhaps vBIOS).
> (b) requires either a buggy or malicious host VMM.
> 
> I ask because, if the answer is "no, panic at will", then we shouldn't need
> to burn an IST for TDX #VE.  Exceptions won't morph to #VE and hitting an
> instruction based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
> Ditto for a #VE in NMI entry before it gets to a thread stack.
> 
> Am I missing anything?

No, that was my expectation as well.  My only concern is that someone
might unintentionally put a #VE'ing instruction in one of the tricky
entry paths, like if we decided we needed CPUID for serialization or
used a WRMSR that #VE's.

It's just something we need to look out for when mucking in the entry
paths.  But, it's not that hard given how few things actually #VE.
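The kind of audit being described could be sketched like this (the mnemonic list is drawn from the discussion -- CPUID, MSR accesses, port I/O, HLT -- and is illustrative, not an exhaustive list of what can #VE under TDX; the trace is hypothetical):

```python
# Sketch: walk the instructions in an entry path and flag anything
# that can raise #VE in a TDX guest, catching the "someone added CPUID
# for serialization" mistake before it ships.

VE_CAPABLE = {"cpuid", "rdmsr", "wrmsr", "in", "out", "hlt"}

def audit_entry_path(instructions):
    """Return (index, instruction) pairs that could #VE."""
    return [(i, insn) for i, insn in enumerate(instructions)
            if insn.split()[0] in VE_CAPABLE]

# hypothetical SYSCALL-entry instruction trace, still on the user stack
entry_path = [
    "swapgs",
    "mov %rsp, %gs:scratch",
    "cpuid",                   # oops: added for serialization
    "mov %gs:kernel_stack, %rsp",
]

offenders = audit_entry_path(entry_path)
```

A check of this shape (over the real entry code, with the real TDX-defined instruction set) is the "look out for it when mucking in the entry paths" step made mechanical.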