Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-14 Thread Andy Lutomirski
On Sun, Feb 14, 2021 at 11:33 AM Andi Kleen  wrote:
>
> On Fri, Feb 12, 2021 at 01:48:36PM -0800, Dave Hansen wrote:
> > On 2/12/21 1:47 PM, Andy Lutomirski wrote:
> > >> What about adding a property to the TD, e.g. via a flag set during TD creation,
> > >> that controls whether unaccepted accesses cause #VE or are, for all intents and
> > >> purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
> > >> GPAs as fatal, but would give us a safety net and not prevent others from utilizing
> > >> #VEs.
> > > That seems reasonable.
> >
> > Ditto.
> >
> > We first need to double check to see if the docs are right, though.
>
> I confirmed with the TDX module owners that #VE can only happen for:
> - unaccepted pages

Can the hypervisor cause an already-accepted secure-EPT page to
transition to the unaccepted state?  If so, NAK.  Sorry, upstream
Linux does not need yet more hacks to make it kind-of-sort-of work on
the broken x86 exception architecture, especially for a feature that
is marketed for security.

As I understand it, the entire point of the TDX modular design is to
make it possible to fix at least some amount of architectural error
without silicon revisions.  If it is indeed the case that an access to
an unaccepted secure-EPT page will cause #VE, then Intel needs to take
the following actions:

1. Update the documentation to make the behavior comprehensible to
mere mortals.  Right now, this information appears to exist in the
form of emails and is, as far as I can tell, not present in the
documentation in a way that we can understand.  Keep in mind that this
discussion includes a number of experts on the software aspects of the
x86 architecture, and the fact that none of us who don't work for
Intel can figure out, authoritatively, what the spec is trying to tell
us should be a huge red flag.

2. Fix the architecture.  Barring some unexpected discovery, some
highly compelling reason, or a design entailing a number of
compromises that will, frankly, be rather embarrassing, upstream Linux
will not advertise itself as a secure implementation of a TDX guest
with the architecture in its current state.  If you would like Linux
to print a giant message along the lines of "WARNING: The TDX
architecture is defective and, as a result, your system is vulnerable
to compromise by a malicious hypervisor that uses the
TDH.MEM.PAGE.REMOVE operation.  The advertised security properties of
the Intel TDX architecture are not available.  Use TDX at your own
risk.", we could consider that.  I think it would look pretty bad.

3. Engage with the ISV community, including Linux, to meaningfully
review new Intel designs for software usability.  Meaningful review
does not mean that you send us a spec, we tell you that it's broken,
and you ship it anyway.  Meaningful review also means that the
questions that the software people ask you need to be answered in a
public, authoritative location, preferably the primary spec publicly
available at Intel's website.  Emails don't count for this purpose.

There is no particular shortage of CVEs of varying degrees of severity
due to nonsensical warts in the x86 architecture causing CPL 0 kernels
to malfunction and become subject to privilege escalation.  We are
telling you loud and clear that the current TDX architecture appears
to be a minefield and that it is *specifically* vulnerable to an
attack in which a page accessed early in the SYSCALL path (or late in the
SYSRET path) causes #VE. You need to take this seriously.

--Andy


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-14 Thread Andi Kleen
On Fri, Feb 12, 2021 at 01:48:36PM -0800, Dave Hansen wrote:
> On 2/12/21 1:47 PM, Andy Lutomirski wrote:
> >> What about adding a property to the TD, e.g. via a flag set during TD creation,
> >> that controls whether unaccepted accesses cause #VE or are, for all intents and
> >> purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
> >> GPAs as fatal, but would give us a safety net and not prevent others from utilizing
> >> #VEs.
> > That seems reasonable.
> 
> Ditto.
> 
> We first need to double check to see if the docs are right, though.

I confirmed with the TDX module owners that #VE can only happen for:
- unaccepted pages
- instructions like MSR access or CPUID
- specific instructions that are not in the syscall gap

Also if there are future asynchronous #VEs they would only happen
with IF=1, which would also protect the gap.

So no need to make #VE an IST.

-Andi
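
For illustration, a minimal sketch of how a #VE handler could dispatch on the exit reasons listed above (CPUID, MSR accesses, and EPT violations on unaccepted pages, should those ever be enabled). This is not code from the series: tdx_handle_cpuid(), tdx_handle_msr() and tdx_accept_page() are hypothetical helper names; the exit-reason values are the standard VMX ones from arch/x86/include/uapi/asm/vmx.h.

#include <linux/errno.h>
#include <asm/ptrace.h>
#include <asm/vmx.h>		/* EXIT_REASON_* */
#include <asm/tdx.h>		/* struct ve_info */

/* Hypothetical helpers, standing in for the real emulation/accept code. */
int tdx_handle_cpuid(struct pt_regs *regs);
int tdx_handle_msr(struct pt_regs *regs, struct ve_info *ve);
int tdx_accept_page(phys_addr_t gpa);

int tdx_handle_virtualization_exception(struct pt_regs *regs,
					struct ve_info *ve)
{
	int ret;

	switch (ve->exit_reason) {
	case EXIT_REASON_CPUID:
		ret = tdx_handle_cpuid(regs);
		break;
	case EXIT_REASON_MSR_READ:
	case EXIT_REASON_MSR_WRITE:
		ret = tdx_handle_msr(regs, ve);
		break;
	case EXIT_REASON_EPT_VIOLATION:
		/* Only reachable if unaccepted pages are allowed to raise #VE. */
		return tdx_accept_page(ve->gpa & PAGE_MASK);	/* retry, no RIP skip */
	default:
		return -EFAULT;		/* unknown #VE: caller treats it as #GP(0) */
	}

	if (!ret)
		regs->ip += ve->instr_len;	/* skip the emulated instruction */
	return ret;
}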



Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/12/21 1:47 PM, Andy Lutomirski wrote:
>> What about adding a property to the TD, e.g. via a flag set during TD creation,
>> that controls whether unaccepted accesses cause #VE or are, for all intents and
>> purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
>> GPAs as fatal, but would give us a safety net and not prevent others from utilizing
>> #VEs.
> That seems reasonable.

Ditto.

We first need to double check to see if the docs are right, though.


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Andy Lutomirski
On Fri, Feb 12, 2021 at 1:37 PM Sean Christopherson  wrote:
>
> On Fri, Feb 12, 2021, Dave Hansen wrote:
> > On 2/12/21 12:54 PM, Sean Christopherson wrote:
> > > Ah, I see what you're thinking.
> > >
> > > Treating an EPT #VE as fatal was also considered as an option.  IIUC it was
> > > thought that finding every nook and cranny that could access a page, without
> > > forcing the kernel to pre-accept huge swaths of memory, would be very difficult.
> > > It'd be wonderful if that's not the case.
> >
> > We have to manually set up the page table entries for every physical
> > page of memory (except for the hard-coded early stuff below 8MB or
> > whatever).  We *KNOW*, 100% before physical memory is accessed.
> >
> > There aren't nooks and crannies where memory is accessed.  There are a
> > few, very well-defined choke points which must be crossed before memory
> > is accessed.  Page table creation, bootmem and the core page allocator
> > come to mind.
>
> Heh, for me, that's two places too many beyond my knowledge domain to feel
> comfortable putting a stake in the ground saying #VE isn't necessary.
>
> Joking aside, I agree that treating EPT #VEs as fatal would be ideal, but from a
> TDX architecture perspective, when considering all possible kernels, drivers,
> configurations, etc..., it's risky to say that there will _never_ be a scenario
> that "requires" #VE.
> 
> What about adding a property to the TD, e.g. via a flag set during TD creation,
> that controls whether unaccepted accesses cause #VE or are, for all intents and
> purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
> GPAs as fatal, but would give us a safety net and not prevent others from utilizing
> #VEs.

That seems reasonable.


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Sean Christopherson
On Fri, Feb 12, 2021, Dave Hansen wrote:
> On 2/12/21 12:54 PM, Sean Christopherson wrote:
> > Ah, I see what you're thinking.
> > 
> > Treating an EPT #VE as fatal was also considered as an option.  IIUC it was
> > thought that finding every nook and cranny that could access a page, without
> > forcing the kernel to pre-accept huge swaths of memory, would be very difficult.
> > It'd be wonderful if that's not the case.
> 
> We have to manually set up the page table entries for every physical
> page of memory (except for the hard-coded early stuff below 8MB or
> whatever).  We *KNOW*, 100% before physical memory is accessed.
> 
> There aren't nooks and crannies where memory is accessed.  There are a
> few, very well-defined choke points which must be crossed before memory
> is accessed.  Page table creation, bootmem and the core page allocator
> come to mind.

Heh, for me, that's two places too many beyond my knowledge domain to feel
comfortable putting a stake in the ground saying #VE isn't necessary.

Joking aside, I agree that treating EPT #VEs as fatal would be ideal, but from a
TDX architecture perspective, when considering all possible kernels, drivers,
configurations, etc..., it's risky to say that there will _never_ be a scenario
that "requires" #VE.

What about adding a property to the TD, e.g. via a flag set during TD creation,
that controls whether unaccepted accesses cause #VE or are, for all intents and
purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
GPAs as fatal, but would give us a safety net and not prevent others from utilizing
#VEs.

I suspect it would also be helpful for debug, e.g. if the kernel manages to do
something stupid and maps memory it hasn't accepted, in which case debugging a
#VE in the guest is likely easier than an opaque EPT violation in the host.

> If Linux doesn't have a really good handle on which physical pages are
> accessed when, we've got bigger problems on our hands.  Remember, we
> even have debugging mechanisms that unmap pages from the kernel when
> they're in the allocator.  We know so well that nobody is accessing
> those physical addresses that we even tell hypervisors they can toss the
> page contents and remove the physical backing (guest free page hinting).


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/12/21 12:54 PM, Sean Christopherson wrote:
> Ah, I see what you're thinking.
> 
> Treating an EPT #VE as fatal was also considered as an option.  IIUC it was
> thought that finding every nook and cranny that could access a page, without
> forcing the kernel to pre-accept huge swaths of memory, would be very difficult.
> It'd be wonderful if that's not the case.

We have to manually set up the page table entries for every physical
page of memory (except for the hard-coded early stuff below 8MB or
whatever).  We *KNOW*, 100% before physical memory is accessed.

There aren't nooks and crannies where memory is accessed.  There are a
few, very well-defined choke points which must be crossed before memory
is accessed.  Page table creation, bootmem and the core page allocator
come to mind.

If Linux doesn't have a really good handle on which physical pages are
accessed when, we've got bigger problems on our hands.  Remember, we
even have debugging mechanisms that unmap pages from the kernel when
they're in the allocator.  We know so well that nobody is accessing
those physical addresses that we even tell hypervisors they can toss the
page contents and remove the physical backing (guest free page hinting).
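
To make the choke-point idea concrete, a rough sketch (not from this series) of accepting a physical range once, before it is ever exposed to the allocator, so nothing downstream can touch unaccepted memory. tdx_accept_page() is a hypothetical wrapper around whatever TDCALL performs acceptance.

#include <linux/kernel.h>	/* ALIGN_DOWN */
#include <linux/mm.h>		/* PAGE_SIZE */
#include <linux/types.h>	/* phys_addr_t */

int tdx_accept_page(phys_addr_t gpa);	/* hypothetical wrapper around the accept TDCALL */

/*
 * Accept [start, end) at a well-defined choke point, e.g. right before
 * the range is handed to memblock or the core page allocator.
 */
static void __init tdx_accept_memory(phys_addr_t start, phys_addr_t end)
{
	phys_addr_t pa;

	for (pa = ALIGN_DOWN(start, PAGE_SIZE); pa < end; pa += PAGE_SIZE)
		tdx_accept_page(pa);
}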


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Sean Christopherson
On Fri, Feb 12, 2021, Dave Hansen wrote:
> On 2/12/21 12:37 PM, Sean Christopherson wrote:
> > There needs to be a mechanism for lazy/deferred/on-demand acceptance of pages.
> > E.g. pre-accepting every page in a VM with hundreds of GB of memory will be
> > ridiculously slow.
> > 
> > #VE is the best option to do that:
> > 
> >   - Relatively sane re-entrancy semantics.
> >   - Hardware accelerated.
> >   - Doesn't require stealing an IRQ from the guest.
> 
> TDX already provides a basic environment for the guest when it starts
> up.  The guest has some known, good memory.  The guest also has a very,
> very clear understanding of which physical pages it uses and when.  It's
> staged, of course, as decompression happens and the guest comes up.
> 
> But, the guest still knows which guest physical pages it accesses and
> when.  It doesn't need on-demand faulting in of non-accepted pages.  It
> can simply decline to expose non-accepted pages to the wider system
> before they've been accepted.
> 
> It would be nuts to merrily free non-accepted pages into the page
> allocator and handle the #VE fallout as they're touched from
> god-knows-where.
> 
> I don't see *ANY* case for #VE to occur inside the guest kernel, outside
> of *VERY* narrow places like copy_from_user().  Period.  #VE from ring-0
> is not OK.
> 
> So, no, #VE is not the best option.  No #VE's in the first place is the
> best option.

Ah, I see what you're thinking.

Treating an EPT #VE as fatal was also considered as an option.  IIUC it was
thought that finding every nook and cranny that could access a page, without
forcing the kernel to pre-accept huge swaths of memory, would be very difficult.
It'd be wonderful if that's not the case.


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/12/21 12:37 PM, Sean Christopherson wrote:
> There needs to be a mechanism for lazy/deferred/on-demand acceptance of pages.
> E.g. pre-accepting every page in a VM with hundreds of GB of memory will be
> ridiculously slow.
> 
> #VE is the best option to do that:
> 
>   - Relatively sane re-entrancy semantics.
>   - Hardware accelerated.
>   - Doesn't require stealing an IRQ from the guest.

TDX already provides a basic environment for the guest when it starts
up.  The guest has some known, good memory.  The guest also has a very,
very clear understanding of which physical pages it uses and when.  It's
staged, of course, as decompression happens and the guest comes up.

But, the guest still knows which guest physical pages it accesses and
when.  It doesn't need on-demand faulting in of non-accepted pages.  It
can simply decline to expose non-accepted pages to the wider system
before they've been accepted.

It would be nuts to merrily free non-accepted pages into the page
allocator and handle the #VE fallout as they're touched from
god-knows-where.

I don't see *ANY* case for #VE to occur inside the guest kernel, outside
of *VERY* narrow places like copy_from_user().  Period.  #VE from ring-0
is not OK.

So, no, #VE is not the best option.  No #VE's in the first place is the
best option.


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Sean Christopherson
On Fri, Feb 12, 2021, Andy Lutomirski wrote:
> 
> > On Feb 12, 2021, at 12:06 PM, Sean Christopherson  wrote:
> > 
> > On Fri, Feb 12, 2021, Andy Lutomirski wrote:
> >>> On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
> >>>  wrote:
> >>> 
> >>> From: "Kirill A. Shutemov" 
> >>> 
> >>> The TDX module injects a #VE exception into the guest TD in cases of
> >>> disallowed instructions, disallowed MSR accesses, and a subset of CPUID
> >>> leaves. Also, it's theoretically possible for the CPU to inject a #VE
> >>> exception on an EPT violation, but the TDX module makes sure this does
> >>> not happen, as long as all memory used is properly accepted using
> >>> TDCALLs.
> >> 
> >> By my very cursory reading of the TDX arch specification 9.8.2,
> >> "Secure" EPT violations don't send #VE.  But the docs are quite
> >> unclear, or at least the docs I found are.
> > 
> > The version I have also states that SUPPRESS_VE is always set.  So either there
> > was a change in direction, or the public docs need to be updated.  Lazy accept
> > requires a #VE, either from hardware or from the module.  The latter would
> > require walking the Secure EPT tables on every EPT violation...
> > 
> >> What happens if the guest attempts to access a secure GPA that is not
> >> ACCEPTed?  For example, suppose the VMM does TDH.MEM.PAGE.REMOVE on a secure
> >> address and the guest accesses it, via instruction fetch or data access.
> >> What happens?
> > 
> > Well, as currently written in the spec, it will generate an EPT violation and
> > the host will have no choice but to kill the guest.
> 
> Or page the page back in and try again?

The intended use isn't for swapping a page or migrating a page.  Those flows
have dedicated APIs, and do not _remove_ a page.

E.g. the KVM RFC patches already support zapping Secure EPT entries if NUMA
balancing kicks in.  But, in TDX terminology, that is a BLOCK/UNBLOCK operation.

Removal is for converting a private page to a shared page, and for paravirt
memory ballooning.

> In regular virt guests, if the host pages out a guest page, it’s the host’s
> job to put it back when needed. In paravirt, a well designed async page fault
> protocol can sometimes let the guest do useful work when this happens. If a
> guest (or bare metal) has its memory hot removed (via balloon or whatever)
> and the kernel messes up and accesses removed memory, the guest (or bare
> metal) is toast.
> 
> I don’t see why TDX needs to be any different.

The REMOVE API isn't intended for swap.  In fact, it can't be used for swap. If
a page is removed, its contents are lost.  Because the original contents are
lost, the guest is required to re-accept the page so that the host can't
silently get the guest to consume a zero page that the guest thinks has valid
data.

For swap, the contents are preserved, and so explicit re-acceptance is not
required.  From the guest's perspective, it's really just a high-latency memory
access.
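
To make the re-acceptance rule concrete, a hedged sketch of the guest side of a balloon-style flow. Every function name here is hypothetical; the point is only the ordering: once a page may be removed, its contents are gone, so it must be accepted again before reuse.

/* Hypothetical helpers. */
void notify_host_page_unused(phys_addr_t gpa);	/* lets the host remove this GPA */
int tdx_accept_page(phys_addr_t gpa);		/* wrapper around the accept TDCALL */

static void balloon_release_page(struct page *page)
{
	/* After this, the host may remove the page; its contents are forfeit. */
	notify_host_page_unused(page_to_phys(page));
}

static void balloon_reclaim_page(struct page *page)
{
	/*
	 * Re-accept before the first touch.  Skipping this either kills the
	 * guest (EPT violation) or, with #VE enabled, traps on first access.
	 */
	tdx_accept_page(page_to_phys(page));
	clear_page(page_address(page));		/* old contents were lost anyway */
}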


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Sean Christopherson
On Fri, Feb 12, 2021, Dave Hansen wrote:
> On 2/12/21 12:06 PM, Sean Christopherson wrote:
> >> What happens if the guest attempts to access a secure GPA that is not
> >> ACCEPTed?  For example, suppose the VMM does TDH.MEM.PAGE.REMOVE on a secure
> >> address and the guest accesses it, via instruction fetch or data access.
> >> What happens?
> > Well, as currently written in the spec, it will generate an EPT violation and
> > the host will have no choice but to kill the guest.
> 
> That's actually perfect behavior from my perspective.  Host does
> something stupid.  Host gets left holding the pieces.  No enabling to do
> in the guest.
> 
> This doesn't *preclude* the possibility that the VMM and guest could
> establish a protocol to remove guest pages.  It just means that the host
> can't go it alone and that if the guest and host get out of sync, the
> guest dies.
> 
> In other words, I think I'm rooting for the docs, as written. :)

I tentatively agree that the host should not be able to remove pages without
guest approval, but that's not the only use case for #VE on EPT violations.
It's not even really an intended use case.

There needs to be a mechanism for lazy/deferred/on-demand acceptance of pages.
E.g. pre-accepting every page in a VM with hundreds of GB of memory will be
ridiculously slow.

#VE is the best option to do that:

  - Relatively sane re-entrancy semantics.
  - Hardware accelerated.
  - Doesn't require stealing an IRQ from the guest.
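
A minimal sketch of the lazy-accept path this enables, assuming EPT-violation #VEs are delivered for unaccepted private GPAs. tdx_gpa_is_private() and tdx_accept_page() are hypothetical helpers, not code from the series.

/*
 * On an EPT-violation #VE for an unaccepted private GPA, accept the page
 * and return success without advancing RIP, so the faulting instruction
 * is simply retried.
 */
static int tdx_handle_ept_violation(struct ve_info *ve)
{
	if (!tdx_gpa_is_private(ve->gpa))	/* hypothetical: shared GPAs need MMIO handling */
		return -EFAULT;

	if (tdx_accept_page(ve->gpa & PAGE_MASK))
		return -EFAULT;			/* page was removed by the host: fatal */

	return 0;				/* retry the original access */
}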


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Andy Lutomirski


> On Feb 12, 2021, at 12:06 PM, Sean Christopherson  wrote:
> 
> On Fri, Feb 12, 2021, Andy Lutomirski wrote:
>>> On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
>>>  wrote:
>>> 
>>> From: "Kirill A. Shutemov" 
>>> 
>>> The TDX module injects a #VE exception into the guest TD in cases of
>>> disallowed instructions, disallowed MSR accesses, and a subset of CPUID
>>> leaves. Also, it's theoretically possible for the CPU to inject a #VE
>>> exception on an EPT violation, but the TDX module makes sure this does
>>> not happen, as long as all memory used is properly accepted using
>>> TDCALLs.
>> 
>> By my very cursory reading of the TDX arch specification 9.8.2,
>> "Secure" EPT violations don't send #VE.  But the docs are quite
>> unclear, or at least the docs I found are.
> 
> The version I have also states that SUPPRESS_VE is always set.  So either there
> was a change in direction, or the public docs need to be updated.  Lazy accept
> requires a #VE, either from hardware or from the module.  The latter would
> require walking the Secure EPT tables on every EPT violation...
> 
>> What happens if the guest attempts to access a secure GPA that is not
>> ACCEPTed?  For example, suppose the VMM does TDH.MEM.PAGE.REMOVE on a secure
>> address and the guest accesses it, via instruction fetch or data access.
>> What happens?
> 
> Well, as currently written in the spec, it will generate an EPT violation and
> the host will have no choice but to kill the guest.

Or page the page back in and try again?

In regular virt guests, if the host pages out a guest page, it’s the host’s job 
to put it back when needed. In paravirt, a well designed async page fault protocol can
sometimes let the guest do useful work when this happens. If a guest (or bare
metal) has its memory hot removed (via balloon or whatever) and the kernel 
messes up and accesses removed memory, the guest (or bare metal) is toast.

I don’t see why TDX needs to be any different.

Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/12/21 12:06 PM, Sean Christopherson wrote:
>> What happens if the guest attempts to access a secure GPA that is not
>> ACCEPTed?  For example, suppose the VMM does TDH.MEM.PAGE.REMOVE on a secure
>> address and the guest accesses it, via instruction fetch or data access.
>> What happens?
> Well, as currently written in the spec, it will generate an EPT violation and
> the host will have no choice but to kill the guest.

That's actually perfect behavior from my perspective.  Host does
something stupid.  Host gets left holding the pieces.  No enabling to do
in the guest.

This doesn't *preclude* the possibility that the VMM and guest could
establish a protocol to remove guest pages.  It just means that the host
can't go it alone and that if the guest and host get out of sync, the
guest dies.

In other words, I think I'm rooting for the docs, as written. :)


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Sean Christopherson
On Fri, Feb 12, 2021, Andy Lutomirski wrote:
> On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
>  wrote:
> >
> > From: "Kirill A. Shutemov" 
> >
> > The TDX module injects a #VE exception into the guest TD in cases of
> > disallowed instructions, disallowed MSR accesses, and a subset of CPUID
> > leaves. Also, it's theoretically possible for the CPU to inject a #VE
> > exception on an EPT violation, but the TDX module makes sure this does
> > not happen, as long as all memory used is properly accepted using
> > TDCALLs.
> 
> By my very cursory reading of the TDX arch specification 9.8.2,
> "Secure" EPT violations don't send #VE.  But the docs are quite
> unclear, or at least the docs I found are.

The version I have also states that SUPPRESS_VE is always set.  So either there
was a change in direction, or the public docs need to be updated.  Lazy accept
requires a #VE, either from hardware or from the module.  The latter would
require walking the Secure EPT tables on every EPT violation...

> What happens if the guest attempts to access a secure GPA that is not
> ACCEPTed?  For example, suppose the VMM does TDH.MEM.PAGE.REMOVE on a secure
> address and the guest accesses it, via instruction fetch or data access.
> What happens?

Well, as currently written in the spec, it will generate an EPT violation and
the host will have no choice but to kill the guest.


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Andy Lutomirski
On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
 wrote:
>
> From: "Kirill A. Shutemov" 
>
> The TDX module injects a #VE exception into the guest TD in cases of
> disallowed instructions, disallowed MSR accesses, and a subset of CPUID
> leaves. Also, it's theoretically possible for the CPU to inject a #VE
> exception on an EPT violation, but the TDX module makes sure this does
> not happen, as long as all memory used is properly accepted using
> TDCALLs.

By my very cursory reading of the TDX arch specification 9.8.2,
"Secure" EPT violations don't send #VE.  But the docs are quite
unclear, or at least the docs I found are.  What happens if the guest
attempts to access a secure GPA that is not ACCEPTed?  For example,
suppose the VMM does TDH.MEM.PAGE.REMOVE on a secure address and the
guest accesses it, via instruction fetch or data access.  What
happens?


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> More details on cases where #VE exceptions are allowed/not-allowed:
> 
> The #VE exception does not occur in the paranoid entry paths, like NMIs.
> While other operations during an NMI might cause #VE, these are in the
> NMI code that can handle nesting, so there is no concern about
> reentrancy. This is similar to how #PF is handled in NMIs.
> 
> The #VE exception also cannot happen in entry/exit code with the
> wrong gs, such as the SWAPGS code, so its entry point does not
> need "paranoid" handling.

Considering:

https://lore.kernel.org/lkml/20200825171903.GA20660@sjchrist-ice/

I would suggest revisiting this part of the changelog.


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Kuppuswamy, Sathyanarayanan




On 2/8/21 8:59 AM, Peter Zijlstra wrote:

'cute', might be useful to have that mentioned somewhere.

We will add a note for it in the comments.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Peter Zijlstra
On Mon, Feb 08, 2021 at 08:46:23AM -0800, Sean Christopherson wrote:
> On Mon, Feb 08, 2021, Peter Zijlstra wrote:
> > On Mon, Feb 08, 2021 at 08:23:01AM -0800, Andi Kleen wrote:
> > > > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > > > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > > > > +{
> > > > > + struct ve_info ve;
> > > > > + int ret;
> > > > > +
> > > > > + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > > > > +
> > > > > + /* Consume #VE info before re-enabling interrupts */
> > > > 
> > > > So what happens if NMI happens here, and triggers a nested #VE ?
> > > 
> > > Yes that's a gap. We should probably bail out and reexecute the original
> > > instruction. The VE handler would need to set a flag for that.
> 
> No, NMI cannot happen here.  The TDX-Module "blocks" NMIs until the #VE info is
> consumed by the guest.

'cute', might be useful to have that mentioned somewhere.

> > > Or alternatively the NMI always gets the VE information and puts
> > > it on some internal stack, but that would seem clunkier.
> > 
> > The same is possible with MCE and #DB I imagine.
> 
> The MCE "architecture" for a TDX guest is rather stupid.  The guest is required
> to keep CR4.MCE=1, but at least for TDX 1.0 the VMM is not allowed to inject #MC.
> So, for better or worse, #MC is a non-issue.
> 
> #VE->#DB->#VE would be an issue, presumably this needs to be noinstr (or whatever
> it is that prevents #DBs on functions).

Ah, it is that already of course, so yeah #DB can't happen here.
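
For reference, DEFINE_IDTENTRY (simplified here from arch/x86/include/asm/idtentry.h of roughly this vintage) generates a noinstr entry function around the handler body; noinstr functions are blacklisted for kprobes, which is what keeps #DB off the entry point Peter refers to.

/* Simplified sketch of DEFINE_IDTENTRY; see arch/x86/include/asm/idtentry.h. */
#define DEFINE_IDTENTRY(func)						\
static __always_inline void __##func(struct pt_regs *regs);		\
									\
__visible noinstr void func(struct pt_regs *regs)			\
{									\
	irqentry_state_t state = irqentry_enter(regs);			\
									\
	instrumentation_begin();					\
	__##func(regs);							\
	instrumentation_end();						\
	irqentry_exit(regs, state);					\
}									\
									\
static __always_inline void __##func(struct pt_regs *regs)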


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Sean Christopherson
On Mon, Feb 08, 2021, Peter Zijlstra wrote:
> On Mon, Feb 08, 2021 at 08:23:01AM -0800, Andi Kleen wrote:
> > > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > > > +{
> > > > +   struct ve_info ve;
> > > > +   int ret;
> > > > +
> > > > +   RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > > > +
> > > > +   /* Consume #VE info before re-enabling interrupts */
> > > 
> > > So what happens if NMI happens here, and triggers a nested #VE ?
> > 
> > Yes that's a gap. We should probably bail out and reexecute the original
> > instruction. The VE handler would need to set a flag for that.

No, NMI cannot happen here.  The TDX-Module "blocks" NMIs until the #VE info is
consumed by the guest.

> > Or alternatively the NMI always gets the VE information and puts
> > it on some internal stack, but that would seem clunkier.
> 
> The same is possible with MCE and #DB I imagine.

The MCE "architecture" for a TDX guest is rather stupid.  The guest is required
to keep CR4.MCE=1, but at least for TDX 1.0 the VMM is not allowed to inject #MC.
So, for better or worse, #MC is a non-issue.

#VE->#DB->#VE would be an issue, presumably this needs to be noinstr (or whatever
it is that prevents #DBs on functions).


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Andi Kleen
> > > So what happens if NMI happens here, and triggers a nested #VE ?
> > 
> > Yes that's a gap. We should probably bail out and reexecute the original
> > instruction. The VE handler would need to set a flag for that.
> > 
> > Or alternatively the NMI always gets the VE information and puts
> > it on some internal stack, but that would seem clunkier.
> 
> The same is possible with MCE and #DB I imagine.

I don't think there are currently any plans to inject #MC into TDX guests. It's
doubtful this could be done securely.

#DB is trickier because it will happen every time, so simply reexecuting
won't work. I guess it would need the ve info stack, or some care in the
kprobes/kernel debugger code to ensure it cannot happen. I think I would prefer the latter.

-Andi



Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Peter Zijlstra
On Mon, Feb 08, 2021 at 08:23:01AM -0800, Andi Kleen wrote:
> > Which is supposedly then set up to avoid #VE during the syscall gap,
> > yes? Which then results in #VE not having to be IST.
> 
> Yes that is currently true because all memory is pre-accepted.
> 
> If we ever do lazy accept we would need to make sure the memory accessed in
> the syscall gap is already accepted, or move over to an IST.

I think we're going to mandate the entry text/data will have to be
pre-accepted to avoid IST. ISTs really are crap.

> > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > > +{
> > > + struct ve_info ve;
> > > + int ret;
> > > +
> > > + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > > +
> > > + /* Consume #VE info before re-enabling interrupts */
> > 
> > So what happens if NMI happens here, and triggers a nested #VE ?
> 
> Yes that's a gap. We should probably bail out and reexecute the original
> instruction. The VE handler would need to set a flag for that.
> 
> Or alternatively the NMI always gets the VE information and puts
> it on some internal stack, but that would seem clunkier.

The same is possible with MCE and #DB I imagine.


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Andi Kleen
> Which is supposedly then set up to avoid #VE during the syscall gap,
> yes? Which then results in #VE not having to be IST.

Yes that is currently true because all memory is pre-accepted.

If we ever do lazy accept we would need to make sure the memory accessed in
the syscall gap is already accepted, or move over to an IST.

> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > +{
> > +   struct ve_info ve;
> > +   int ret;
> > +
> > +   RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > +
> > +   /* Consume #VE info before re-enabling interrupts */
> 
> So what happens if NMI happens here, and triggers a nested #VE ?

Yes that's a gap. We should probably bail out and reexecute the original
instruction. The VE handler would need to set a flag for that.

Or alternatively the NMI always gets the VE information and puts
it on some internal stack, but that would seem clunkier.


-Andi


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Peter Zijlstra
On Fri, Feb 05, 2021 at 03:38:22PM -0800, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" 
> 
> The TDX module injects a #VE exception into the guest TD in cases of
> disallowed instructions, disallowed MSR accesses, and a subset of CPUID
> leaves. Also, it's theoretically possible for the CPU to inject a #VE
> exception on an EPT violation, but the TDX module makes sure this does
> not happen, as long as all memory used is properly accepted using
> TDCALLs. You can find more details about it in the Guest-Host-Communication
> Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX)
> specification, sec 2.3.
> 
> Add basic infrastructure to handle #VE. If there is no handler for a
> given #VE, treat it as a general protection fault, since it is an
> unexpected event (fault case), and handle it via do_general_protection().
> 
> TDCALL[TDGETVEINFO] provides information about the #VE, such as the exit reason.
> 
> More details on cases where #VE exceptions are allowed/not-allowed:
> 
> The #VE exception does not occur in the paranoid entry paths, like NMIs.
> While other operations during an NMI might cause #VE, these are in the
> NMI code that can handle nesting, so there is no concern about
> reentrancy. This is similar to how #PF is handled in NMIs.
> 
> The #VE exception also cannot happen in entry/exit code with the
> wrong gs, such as the SWAPGS code, so its entry point does not
> need "paranoid" handling.

All of the above are arranged by using the below secure EPT for init
text and data?

> Any memory access can cause a #VE if it causes an EPT
> violation.  However, the VMM is only in direct control of some of the
> EPT tables.  The Secure EPT tables are controlled by the TDX module
> which guarantees no EPT violations will result in #VE for the guest,
> once the memory has been accepted.

Which is supposedly then set up to avoid #VE during the syscall gap,
yes? Which then results in #VE not having to be IST.

> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> + struct ve_info ve;
> + int ret;
> +
> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> +
> + /* Consume #VE info before re-enabling interrupts */

So what happens if NMI happens here, and triggers a nested #VE ?

> + ret = tdx_get_ve_info(&ve);
> + cond_local_irq_enable(regs);
> + if (!ret)
> + ret = tdx_handle_virtualization_exception(regs, &ve);
> + /*
> +  * If #VE exception handler could not handle it successfully, treat
> +  * it as #GP(0) and handle it.
> +  */
> + if (ret)
> + do_general_protection(regs, 0);
> + cond_local_irq_disable(regs);
> +}
> +#endif


[RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

The TDX module injects a #VE exception into the guest TD in cases of
disallowed instructions, disallowed MSR accesses, and a subset of CPUID
leaves. Also, it's theoretically possible for the CPU to inject a #VE
exception on an EPT violation, but the TDX module makes sure this does
not happen, as long as all memory used is properly accepted using
TDCALLs. You can find more details about it in the Guest-Host-Communication
Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX)
specification, sec 2.3.

Add basic infrastructure to handle #VE. If there is no handler for a
given #VE, treat it as a general protection fault, since it is an
unexpected event (fault case), and handle it via do_general_protection().

TDCALL[TDGETVEINFO] provides information about the #VE, such as the exit reason.

More details on cases where #VE exceptions are allowed/not-allowed:

The #VE exception does not occur in the paranoid entry paths, like NMIs.
While other operations during an NMI might cause #VE, these are in the
NMI code that can handle nesting, so there is no concern about
reentrancy. This is similar to how #PF is handled in NMIs.

The #VE exception also cannot happen in entry/exit code with the
wrong gs, such as the SWAPGS code, so its entry point does not
need "paranoid" handling.

Any memory access can cause a #VE if it causes an EPT
violation.  However, the VMM is only in direct control of some of the
EPT tables.  The Secure EPT tables are controlled by the TDX module
which guarantees no EPT violations will result in #VE for the guest,
once the memory has been accepted.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/idtentry.h |  4 ++
 arch/x86/include/asm/tdx.h  | 14 +++
 arch/x86/kernel/idt.c   |  6 +++
 arch/x86/kernel/tdx.c   | 31 ++
 arch/x86/kernel/traps.c | 73 ++---
 5 files changed, 105 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..a2cbb68f9ae8 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -615,6 +615,10 @@ DECLARE_IDTENTRY_VC(X86_TRAP_VC,   exc_vmm_communication);
 DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,  exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,   common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f8cdc8eb1046..90eb61b07d1f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
 #define TDCALL ".byte 0x66,0x0f,0x01,0xcc"
 
 #define TDINFO 1
+#define TDGETVEINFO 3
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
@@ -32,4 +33,17 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+struct ve_info {
+   unsigned int exit_reason;
+   unsigned long exit_qual;
+   unsigned long gla;
+   unsigned long gpa;
+   unsigned int instr_len;
+   unsigned int instr_info;
+};
+
+unsigned long tdx_get_ve_info(struct ve_info *ve);
+int tdx_handle_virtualization_exception(struct pt_regs *regs,
+   struct ve_info *ve);
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
 */
INTG(X86_TRAP_PF,   asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+   INTG(X86_TRAP_VE,   asm_exc_virtualization_exception),
+#endif
 };
 
 /*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
INTG(X86_TRAP_MF,   asm_exc_coprocessor_error),
INTG(X86_TRAP_AC,   asm_exc_alignment_check),
INTG(X86_TRAP_XF,   asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+   INTG(X86_TRAP_VE,   asm_exc_virtualization_exception),
+#endif
 
 #ifdef CONFIG_X86_32
TSKG(X86_TRAP_DF,   GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 13303bfdfdd1..ae2d5c847700 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -62,3 +62,34 @@ void __init tdx_early_init(void)
 
pr_info("TDX guest is initialized\n");
 }
+
+unsigned long tdx_get_ve_info(struct ve_info *ve)
+{
+   register long r8 asm("r8");
+   register long r9 asm("r9");
+   register long r10 asm("r10");
+   unsigned long ret;
+
+   asm volatile(TDCALL
+
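
The diff is truncated at this point in the archive. For orientation only, a hedged sketch of how a TDGETVEINFO wrapper of this shape typically continues; the register convention shown (leaf in RAX, #VE details returned in RCX/RDX/R8/R9/R10) is an assumption to be checked against the GHCI specification and the actual patch, not a quote of it.

unsigned long tdx_get_ve_info(struct ve_info *ve)
{
	register long r8 asm("r8");
	register long r9 asm("r9");
	register long r10 asm("r10");
	unsigned long ret;

	/* Assumed convention: leaf in RAX, #VE details in RCX/RDX/R8/R9/R10. */
	asm volatile(TDCALL
		     : "=a"(ret), "=c"(ve->exit_reason), "=d"(ve->exit_qual),
		       "=r"(r8), "=r"(r9), "=r"(r10)
		     : "a"(TDGETVEINFO)
		     : "memory");

	ve->gla = r8;
	ve->gpa = r9;
	ve->instr_len = (u32)r10;		/* assumed: low 32 bits */
	ve->instr_info = (u64)r10 >> 32;	/* assumed: high 32 bits */

	return ret;
}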