Re: [edk2-devel] GuestPhysAddrSize questions

2024-03-06 Thread Paolo Bonzini
On Mon, Mar 4, 2024 at 6:24 PM Tom Lendacky  wrote:
>
> On 3/4/24 07:09, Gerd Hoffmann wrote:
> >Hi,
> >
> >>> 23:16 GuestPhysAddrSize Maximum guest physical address size in bits.
> >>> This number applies only to guests using 
> >>> nested
> >>> paging. When this field is zero, refer to the
> >>> PhysAddrSize field for the maximum guest
> >>> physical address size. See “Secure Virtual
> >>> Machine” in APM Volume 2.
> >
> >> I believe the main purpose of GuestPhysAddrSize was for software use (for
> >> nested virtualization) and that the hardware itself has always returned 
> >> zero
> >> for that value. So you should be able to use that field. Adding @Paolo for
> >> his thoughts.
> >
> > Reviewers mentioned this is meant for nested guests, i.e. (if I
> > understand this correctly) the l0 hypervisor can use that to tell
> > the l1 hypervisor what the l2 guest phys-bits should be.
> >
> > Is this nested virtualization use documented somewhere?  Tried to
> > search for GuestPhysAddrSize or Fn8000_0008_EAX in APM Volume 2,
> > found nothing.
>
> Right, and I don't think you'll see anything added to the APM that will
> state how it can be used by software. The APM is an architectural
> definition and won't talk about hypervisors and using nested paging, etc.

I don't think that's a problem. The problem is that the definition in
the APM is ambiguous and can mean one of three things:

1) it can be a suggested value for PhysAddrSize (bits 7:0) of guests
that use nested paging. This would imply that in a nested page guest,
with GuestPhysAddrSize=48, setting bits 51:48 would cause a
#PF(reserved) exception.

2) it can be equivalent to LinAddrSize (bits 15:8) but for nested page
tables. This would imply that, with GuestPhysAddrSize=48, VMRUN would
fail if hCR4.LA57=1.

3) it can be what I propose above: the architecture defined a
situation that can only happen when using nested paging (on AMD:
host_CR4.LA57=0, PhysAddrSize=52), and GuestPhysAddrSize is an
architectural way to explain the situation to guests.

The message above suggests that the intended meaning is (1). That is
because "the l0 hypervisor can use that to tell the l1 hypervisor what
the l2 guest phys-bits should be" is exactly the same as "the
processor can use that to tell the hypervisor what the guest phys-bits
should be" (just shifted one level down).

However, there are no processors that implement (1) or (2), so my
suggestion is to clarify that the intended meaning is (3). Do you
agree that the above proposal is a plausible interpretation of what is
already in the APM, but clearer? And do you think there is a way for
the clarification to make it into the APM?

Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#116465): https://edk2.groups.io/g/devel/message/116465
Mute This Topic: https://groups.io/mt/104510523/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts

2024-03-01 Thread Paolo Bonzini
Il ven 1 mar 2024, 12:10 Michael Brown  ha scritto:

> It feels as though this should be able to be cleanly modelled with a
> single global state array
>
>BOOLEAN mSavedInterruptState[TPL_HIGH_LEVEL]
>

Pretty much, yes. But you only have to write it when the interrupts are
already disabled, so the bitmask works and makes it easier to clear "all
values at NewTpl and above" with just an AND.

>
> (or possibly a bitmask, though using the array avoids having to disable
> interrupts just to write a value).
>
> I still need to think through the subtleties, to make sure it could cope
> with pathological edge cases such as
>
>OldTpl = gBS->RaiseTPL (TPL_HIGH_LEVEL);
>
>...
>
>gBS->RestoreTPL (OldTpl);
>gBS->RestoreTPL (OldTpl);
>
> or
>
>OldTpl = gBS->RaiseTPL (TPL_HIGH_LEVEL - 1);
>gBS->RaiseTPL (TPL_HIGH_LEVEL);
>
>..
>
>gBS->RestoreTPL (OldTpl);
>
> I think that at least one of the above pathological usage patterns would
> break the existing mInterruptedTplMask patches, since they currently
> clear state in RestoreTPL() and so will not correctly handle a duplicate
> call to RestoreTPL().
>

I think my patch works (modulo the 1 vs. 1U issue) for the second.
Declaring the first to be invalid is a good idea IMO. Also it would only
break in interrupt handlers and would revert to just causing a stack
overflow in the interrupt storm scenario, so it wouldn't be too bad...

Paolo


> I'll try to get a patch put together over the weekend.
>
> Thanks,
>
> Michael
>
>


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#116238): https://edk2.groups.io/g/devel/message/116238
Mute This Topic: https://groups.io/mt/104642317/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts

2024-03-01 Thread Paolo Bonzini
On Fri, Mar 1, 2024 at 10:27 AM Michael Brown  wrote:
> > My version is attached, feel free to reuse it (either entirely or
> > partially) for a hypothetical v2. Apologies to you and Mike K for the
> > confusion!
>
> I much prefer this version of the patch, because the explanations are
> easier to follow and to reason about.

Thanks! I find the new logic to be quite appealing... now that I
understand it. Hopefully this provides a way to find the best of both
worlds.

> There is an implicit assumption that if interrupts are disabled when
> RaiseTPL() is called then we must have been called from an interrupt
> handler.  How sure are we that this assumption is correct?
>
> It's possible that it doesn't matter.  The new logic will effectively
> mean that RestoreTPL() will restore not only the TPL but also the
> interrupts-enabled state to whatever existed at the time of the
> corresponding RaiseTPL().

Right: that's what my comment says

+  // However, when the handler calls RestoreTPL
+  // before returning, we want to keep interrupts disabled.  This
+  // restores the exact state at the beginning of the handler,
+  // before the call to RaiseTPL(): low TPL and interrupts disabled.

but indeed it applies beyond interrupt handlers. It might even be a bugfix.

> Maybe this is what we want?  If so, then we
> should probably phrase the comments in these terms instead of in terms
> of being called from an interrupt handler.

I think phrasing the comments with reference to interrupt handlers is
fine, but it may be worth adding a comment to either
mInterruptedTplMask or CoreRestoreTpl(), like

+/// NOTE: Strictly speaking, this applies to anything that
+/// calls RaiseTPL() with interrupts disabled, not just
+/// interrupt handlers.  Interrupt handlers are just the case
+/// that we care the most about, because of the potential
+/// for stack overflow.

Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#116228): https://edk2.groups.io/g/devel/message/116228
Mute This Topic: https://groups.io/mt/104642317/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts

2024-03-01 Thread Paolo Bonzini
One fix is needed in the code.

On Thu, Feb 29, 2024 at 2:04 PM Ray Ni  wrote:
> +  //
> +  // Save the "Interrupted TPL" (TPL that was interrupted).
> +  //
> +  mInterruptedTplMask |= (UINTN)(1 << gEfiCurrentTpl);
> +}
>}

> +  //
> +  // Clear interrupted TPL level mask, but do not re-enable interrupts 
> here
> +  // This will return to CoreTimerTick() and interrupts will be 
> re-enabled
> +  // when the timer interrupt handlers returns from interrupt context.
> +  //
> +  ASSERT ((INTN)gEfiCurrentTpl == HighBitSet64 (mInterruptedTplMask));
> +  mInterruptedTplMask &= ~(UINTN)(1 << gEfiCurrentTpl);
> +}
>}

Both of these need to use "1U" to avoid sign extending bit 31 into bits 31..63.

The same issue is (in three places) present in my own version of the patch. :(

Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#116225): https://edk2.groups.io/g/devel/message/116225
Mute This Topic: https://groups.io/mt/104642317/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts

2024-03-01 Thread Paolo Bonzini
On Fri, Mar 1, 2024 at 4:08 AM Ni, Ray  wrote:
> @@ -161,5 +191,46 @@ CoreRestoreTpl (
>IN EFI_TPL NewTpl
>)
>  {
> +  BOOLEAN InInterruptHandler = FALSE;
> +
> +  //
> +  // Unwind the nested interrupt handlers up to the required
> +  // TPL, paying attention not to overflow the stack.  While
> +  // not strictly necessary according to the specification,
> +  // accept the possibility that multiple RaiseTPL calls are
> +  // undone by a single RestoreTPL
> +  //
> +  while ((INTN)NewTpl <= HighBitSet64 (mInterruptedTplMask)) {
> 1. why "<="? I thought when RestoreTPL() is called there are only two cases:
>a. NewTpl == HighBitSet64 (...)
>b. NewTpl > HighBitSet64 (...)
>   1.a is the case when TimerInterruptHandler() or CoreTimerTick() restores
>   TPL from HIGH to non-HIGH.
>   1.b is the case when the pending event backs call RaiseTPL/RestoreTPL().
>   Because only pending events whose TPL > "Interrupted TPL" can run, the
>   RestoreTPL() call from the event callbacks cannot change the TPL to a value
>   less than or equal to "Interrupted TPL".
>   So, I think "<=" can be "==".
>
> 2. can you explain a bit more about the reason of "while"?

Both are just for extra safety. The required invariant is that all
bits at or below current TPL are cleared, and using "while (... <=
...)" makes it more robust to incorrect usage of gBS->RestoreTPL().

Indeed, the patch at the top of thread also uses "(INTN)gEfiCurrentTpl
> HighBitSet64 (mInterruptedTplMask)", which is <= when you reverse
the condition.  It then asserts inside the conditional that "==" would
be enough.

So I am starting to see more and more similarities between the two
approaches.  I went a step further with fresh mind, removing the while
loop... and basically reinvented your and Michael's patch. :) The only
difference in the logic is a slightly different handling of
mInterruptedTplMask in CoreRestoreTpl(), which is a bit safer in my
case.

However, my roundabout way of getting to the same patch resulted in
very different comments. Personally, I found the large text at the
head of mInterruptedTplMask a bit too much, and the ones inside the
function too focused on "how" and not "why". Maybe it's my exposure to
NestedInterruptTplLib, but I find that a much smaller text can achieve
the same purpose, by explaining the logic instead of the individual
steps.

My version is attached, feel free to reuse it (either entirely or
partially) for a hypothetical v2. Apologies to you and Mike K for the
confusion!

> +
> +if (InterruptedTpl == NewTpl) {
> +  break;
> 3. "break" or "return"? I think we should exit from this function.

Indeed, this should have been a return.

Paolo


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#116224): https://edk2.groups.io/g/devel/message/116224
Mute This Topic: https://groups.io/mt/104642317/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-


From b9f0bc3ef83b40c29e093acda1d0741b8f5610e5 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini 
Date: Fri, 1 Mar 2024 09:11:48 +0100
Subject: [PATCH] MdeModulePkg: fix stack overflow issue due to nested
 interrupts
Content-Type: text/plain; charset=UTF-8

This is a heavily simplified version of the fix in the OvmfPkg
NestedInterruptTplLib.  Putting it in DXE core allows CoreRestoreTpl()
to lower the TPL while keeping interrupts disabled, removing the need
for either DisableInterruptsOnIret() or the complex deferred execution
mechanism.

Instead, CoreRaiseTpl() uses the current state of the interrupt flag to
second guess whether it's being called from an interrupt handler; when
restoring the outer TPL at the end of the handler, interrupts remain
disabled until IRET.  This eliminates the possibility that a nested
invocation of the interrupt handler has the same TPL as the outer one.

Signed-off-by: Michael D Kinney 
Signed-off-by: Ray Ni 
[Rewrote the CoreRestoreTpl part and the commit message. - Paolo]
Signed-off-by: Paolo Bonzini 
---
 MdeModulePkg/Core/Dxe/Event/Tpl.c | 85 ++-
 1 file changed, 72 insertions(+), 13 deletions(-)

diff --git a/MdeModulePkg/Core/Dxe/Event/Tpl.c b/MdeModulePkg/Core/Dxe/Event/Tpl.c
index b33f80573c..0a4f99521c 100644
--- a/MdeModulePkg/Core/Dxe/Event/Tpl.c
+++ b/MdeModulePkg/Core/Dxe/Event/Tpl.c
@@ -9,6 +9,16 @@ SPDX-License-Identifier: BSD-2-Clause-Patent
 #include "DxeMain.h"
 #include "Event.h"
 
+///
+/// Bit mask of TPLs that were interrupted (typically during RestoreTPL's
+/// event dispatching, though there are reports that the Windows boot loader
+/// executes stray STIs at TPL_HI

Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts

2024-02-29 Thread Paolo Bonzini
On Thu, Feb 29, 2024 at 2:04 PM Ray Ni  wrote:
> @@ -134,9 +262,9 @@ CoreRestoreTpl (
>}
>
>//
> -  // Set the new value
> +  // Set the new TPL with interrupt disabled.
>//
> -
> +  CoreSetInterruptState (FALSE);
>gEfiCurrentTpl = NewTpl;
>
>//
> @@ -144,7 +272,22 @@ CoreRestoreTpl (
>// interrupts are enabled
>//
>if (gEfiCurrentTpl < TPL_HIGH_LEVEL) {
> -CoreSetInterruptState (TRUE);
> +if ((INTN)gEfiCurrentTpl > HighBitSet64 (mInterruptedTplMask)) {
> +  //
> +  // Only enable interrupts if restoring to a level above the highest
> +  // interrupted TPL level.  This allows interrupt nesting, but only for
> +  // events at higher TPL level than the current TPL level.
> +  //
> +  CoreSetInterruptState (TRUE);
> +} else {
> +  //
> +  // Clear interrupted TPL level mask, but do not re-enable interrupts 
> here
> +  // This will return to CoreTimerTick() and interrupts will be 
> re-enabled
> +  // when the timer interrupt handlers returns from interrupt context.
> +  //
> +  ASSERT ((INTN)gEfiCurrentTpl == HighBitSet64 (mInterruptedTplMask));
> +  mInterruptedTplMask &= ~(UINTN)(1 << gEfiCurrentTpl);
> +}
>}

Ok, now I understand what's going on and it's indeed the same logic as
NestedInterruptTplLib, with DisableInterruptsOnIret() replaced by
skipping CoreSetInterruptState(TRUE). It's similar to what I proposed
elsewhere in the thread, just written differenty.

I agree with Michael Brown that the spec is unclear on the state of
the interrupt flag on exit from gBS->RestoreTPL(), but perhaps this
change is feasible if the interrupt handlers just raise the TPL first
and restore it last.

Just as an exercise for me to understand the code better, I tried
rewriting the code in terms of the CoreRestoreTplInternal() function
that I proposed. I find it easier to read, but I guess that's a bit in
the eye of the beholder, and it is a little more defensively coded. I
attach it (untested beyond compilation) for reference.

Paolo


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#116192): https://edk2.groups.io/g/devel/message/116192
Mute This Topic: https://groups.io/mt/104642317/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-


From 23c4f60cf79f29ab5eff55a02c72bb504804d02a Mon Sep 17 00:00:00 2001
From: Paolo Bonzini 
Date: Thu, 29 Feb 2024 23:54:50 +0100
Subject: [PATCH 1/2] MdeModulePkg: introduce CoreRestoreTplInternal()
Content-Type: text/plain; charset=UTF-8

Introduce a function that restores the TPL just like gBS->RestoreTPL(),
but can optionally return with interrupts disabled.  This can be used
from interrupt handlers, so that any nested interrupt handler will only
see an elevated TPL and not the value on entry to the interrupt handler.

Signed-off-by: Paolo Bonzini 
---
 MdeModulePkg/Core/Dxe/Event/Tpl.c | 46 +--
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/MdeModulePkg/Core/Dxe/Event/Tpl.c b/MdeModulePkg/Core/Dxe/Event/Tpl.c
index b33f80573c..fe95ea3896 100644
--- a/MdeModulePkg/Core/Dxe/Event/Tpl.c
+++ b/MdeModulePkg/Core/Dxe/Event/Tpl.c
@@ -90,10 +90,11 @@ CoreRaiseTpl (
   @param  NewTpl  New, lower, task priority
 
 **/
-VOID
+static VOID
 EFIAPI
-CoreRestoreTpl (
-  IN EFI_TPL  NewTpl
+CoreRestoreTplInternal (
+  IN EFI_TPL  NewTpl,
+  IN BOOLEAN  DesiredInterruptState
   )
 {
   EFI_TPL  OldTpl;
@@ -107,11 +108,6 @@ CoreRestoreTpl (
 
   ASSERT (VALID_TPL (NewTpl));
 
-  //
-  // If lowering below HIGH_LEVEL, make sure
-  // interrupts are enabled
-  //
-
   if ((OldTpl >= TPL_HIGH_LEVEL) &&  (NewTpl < TPL_HIGH_LEVEL)) {
 gEfiCurrentTpl = TPL_HIGH_LEVEL;
   }
@@ -126,6 +122,11 @@ CoreRestoreTpl (
 }
 
 gEfiCurrentTpl = PendingTpl;
+
+//
+// If lowering below HIGH_LEVEL, make sure
+// interrupts are enabled
+//
 if (gEfiCurrentTpl < TPL_HIGH_LEVEL) {
   CoreSetInterruptState (TRUE);
 }
@@ -134,16 +135,31 @@ CoreRestoreTpl (
   }
 
   //
-  // Set the new value
+  // Set the new TPL with interrupt disabled.  If DesiredInterruptState
+  // is FALSE, this ensures that any nested interrupt handler will only
+  // see an elevated TPL and not NewTpl.
   //
-
+  CoreSetInterruptState (FALSE);
   gEfiCurrentTpl = NewTpl;
 
-  //
-  // If lowering below HIGH_LEVEL, make sure
-  // interrupts are enabled
-  //
-  if (gEfiCurrentTpl < TPL_HIGH_LEVEL) {
+  if (DesiredInterruptState) {
+ASSERT(gEfiCurrentTpl < TPL_HIGH_LEVEL);
 CoreSetInterruptState (TRUE);
   }
 }
+
+/**
+  Lowers the task priority to the previous value.   If the new
+  priority unmasks events at a higher priority, they are dispatched.
+
+

Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts

2024-02-29 Thread Paolo Bonzini

On 2/29/24 20:22, Michael Brown wrote:

The design of NestedInterruptTplLib is that each nested interrupt must increase 
the TPL, but if I understand correctly there is a hole here:

   //
   // Call RestoreTPL() to allow event notifications to be
   // dispatched.  This will implicitly re-enable interrupts.
   //
   gBS->RestoreTPL (InterruptedTPL);

   //
   // Re-disable interrupts after the call to RestoreTPL() to ensure
   // that we have exclusive access to the shared state.
   //
   DisableInterrupts ();

because gBS->RestoreTPL will unconditionally enable interrupts if InterruptedTPL 
< TPL_HIGH_LEVEL.


There's no hole there.


Yes, what I meant is that the whole of NestedInterruptTplLib is designed 
around plugging that hole.  But if you can avoid re-enabling interrupts 
at the end of CoreRestoreTpl(), you don't need DisableInterruptsOnIret() 
anymore.


See the comments in the code for further details - I made them fairly 
extensive.  


Yup, I remember. :)

If possible, the easiest solution would be to merge 
NestedInterruptTplLib into Core DXE.


The question with that approach would be how to cleanly violate the 
abstraction layer that separates the timer interrupt handler (existing 
in a separate DXE driver executable) from the implementation of 
CoreRestoreTplInternal() (existing in core DXE and not exposed via the 
boot services table).


See outline in the other mail I have sent.

Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#116190): https://edk2.groups.io/g/devel/message/116190
Mute This Topic: https://groups.io/mt/104642317/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts

2024-02-29 Thread Paolo Bonzini

On 2/29/24 20:16, Kinney, Michael D wrote:




-Original Message-
From: Paolo Bonzini 
Sent: Thursday, February 29, 2024 11:04 AM
To: Ni, Ray ; devel@edk2.groups.io
Cc: Kinney, Michael D ; Liming Gao
; Laszlo Ersek ; Michael
Brown 
Subject: Re: [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue
due to nested interrupts

On 2/29/24 14:02, Ray Ni wrote:

In the end, it will lower the TPL to TPL_APPLICATION with interrupt

enabled.


However, it's possible that another timer interrupt happens just in

the end

of RestoreTPL() function when TPL is TPL_APPLICATION.


How do non-OVMF platforms solve the issue?  Do they just have the same
bug as in https://bugzilla.tianocore.org/show_bug.cgi?id=4162 ?


Yes.  This same issue can be reproduced on non-OVMF platforms.

This proposal here is an attempt to integrate a common fix into the DXE Core.

I would agree conceptually that integrating the NestedInterruptTplLib work
into the DXE Core is another option.

I believe the root cause of all of these scenarios is enabling interrupts
in RestoreTPL() when processing a timer interrupt between the last processed
event and the return from the interrupt handler. Ther are some instances
of the Timer Arch Protocol implementation that call Raise/Restore TPL, so
we want a DXE Core change that is compatible with the DXE Core doing 
Raise/Restore
when processing a timer interrupt and the Timer Arch Protocol implementation
also doing the Raise/Restore TPL.


Ok, now I understand better.

The reason why the NestedInterruptTplLib was introduced (as opposed to 
doing it in core DXE) was to enable returning with disabled interrupts 
from the nested interrupt handler, but I think it can be done with a 
function like the CoreRestoreTplInternal() I outlined in the previous 
email, which is the same as current CoreRestoreTpl() but finishes with


  if (!DesiredInterruptState) {
CoreSetInterruptState (FALSE);
  }
  gEfiCurrentTpl = NewTpl;
  if (DesiredInterruptState) {
ASSERT (gEfiCurrentTpl < TPL_HIGH_LEVEL);
CoreSetInterruptState (TRUE);
  }

The new CoreRaiseTpl would be the same as in Ray and your patch, while 
the CoreRestoreTpl would be something like this:


if (NewTpl == HighBitSet64 (mInterruptedTplMask)) {
  static NESTED_INTERRUPT_STATE NestedInterruptState;
  mInterruptedTplMask &= ~(UINTN)(1 << NewTpl);
  //
  // Use the deferred invocation logic that is currently
  // in NestedInterruptTplLib.
  //
  // But unlike current NestedInterruptRestoreTPL(), if the logic
  // is part of core DXE, the
  //
  //gBS->RestoreTPL (InterruptedTPL);
  //DisableInterrupts ();
  //
  // pair that requires "disable interrupts on IRET" logic can
  // be done without ever enabling interrupts,  with
  // CoreRestoreTplInternal(InterruptedTPL, FALSE)
  //
  // As an aside, NestedInterruptState might as well become a
  // pair of globals.
  //
  NestedInterruptRestoreTPL (NewTpl, );
} else {
  CoreRestoreTplInternal(NewTpl, NewTpl < TPL_HIGH_LEVEL);
}

Requiring matching raise/restore pairs is a bit scary.  It can be 
avoided by changing the "if" to a


  while (NewTpl >= HighBitSet64 (mInterruptedTplMask))
mInterruptedTplMask &=
  ~(UINTN)(1 << HighBitSet64 (mInterruptedTplMask));

Then, if inlining NestedInterruptRestoreTPL() allows simplifications, 
they can be done on top after the merge of NestedInterruptTplLib.  In 
particular, I suspect that the while loop above can be unified with the 
loop in NestedInterruptRestoreTPL().  But again, that would be best 
reviewed as a separate change.


All this, as Michael said, is however conditional on being able to deal 
with the TPL_HIGH_LEVEL+STI shenanigans that Windows does.


Paolo



The design of NestedInterruptTplLib is that each nested interrupt must
increase the TPL, but if I understand correctly there is a hole here:

//
// Call RestoreTPL() to allow event notifications to be
// dispatched.  This will implicitly re-enable interrupts.
//
gBS->RestoreTPL (InterruptedTPL);

//
// Re-disable interrupts after the call to RestoreTPL() to ensure
// that we have exclusive access to the shared state.
//
DisableInterrupts ();

because gBS->RestoreTPL will unconditionally enable interrupts if
InterruptedTPL < TPL_HIGH_LEVEL.


If possible, the easiest solution would be to merge
NestedInterruptTplLib into Core DXE.  This way, instead of calling
gBS->RestoreTPL, NestedInterruptTplLib can call a custom version of
CoreRestoreTpl that exits with interrupts disabled.  That is, something
like

VOID EFIAPI CoreRestoreTplInternal(IN EFI_TPL NewTpl,
 IN BOOLEAN InterruptState)
{
//
// The caller can request disabled interrupts to access shared
// state, but TPL_HIGH_LEVEL must *not* have them enabled

Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts

2024-02-29 Thread Paolo Bonzini

On 2/29/24 14:02, Ray Ni wrote:

In the end, it will lower the TPL to TPL_APPLICATION with interrupt enabled.

However, it's possible that another timer interrupt happens just in the end
of RestoreTPL() function when TPL is TPL_APPLICATION.


How do non-OVMF platforms solve the issue?  Do they just have the same 
bug as in https://bugzilla.tianocore.org/show_bug.cgi?id=4162 ?


The design of NestedInterruptTplLib is that each nested interrupt must 
increase the TPL, but if I understand correctly there is a hole here:


  //
  // Call RestoreTPL() to allow event notifications to be
  // dispatched.  This will implicitly re-enable interrupts.
  //
  gBS->RestoreTPL (InterruptedTPL);

  //
  // Re-disable interrupts after the call to RestoreTPL() to ensure
  // that we have exclusive access to the shared state.
  //
  DisableInterrupts ();

because gBS->RestoreTPL will unconditionally enable interrupts if 
InterruptedTPL < TPL_HIGH_LEVEL.



If possible, the easiest solution would be to merge 
NestedInterruptTplLib into Core DXE.  This way, instead of calling 
gBS->RestoreTPL, NestedInterruptTplLib can call a custom version of 
CoreRestoreTpl that exits with interrupts disabled.  That is, something like


VOID EFIAPI CoreRestoreTplInternal(IN EFI_TPL NewTpl,
   IN BOOLEAN InterruptState)
{
  //
  // The caller can request disabled interrupts to access shared
  // state, but TPL_HIGH_LEVEL must *not* have them enabled.
  //
  ASSERT(!(NewTpl == TPL_HIGH_LEVEL && InterruptState));

  // ...

  gEfiCurrentTpl = NewTpl;
  CoreSetInterruptState (InterruptState);
}

Now, CoreRestoreTpl is just

  //
  // If lowering below HIGH_LEVEL, make sure
  // interrupts are enabled
  //
  CoreRestoreTplInternal(NewTpl, NewTpl < TPL_HIGH_LEVEL);

whereas NestedInterruptRestoreTPL can do

  //
  // Call RestoreTPL() to allow event notifications to be
  // dispatched.  This will implicitly re-enable interrupts,
  // but only if events have to be dispatched.
  //
  CoreRestoreTplInternal(InterruptedTPL, FALSE);

  //
  // Interrupts are now disabled, so we can access shared state.
  //

This avoids the unlimited nesting of interrupts because each stack frame 
will indeed have a higher TPL than the outer version.


Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#116181): https://edk2.groups.io/g/devel/message/116181
Mute This Topic: https://groups.io/mt/104642317/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack overflow issue due to nested interrupts

2024-02-29 Thread Paolo Bonzini
Il gio 29 feb 2024, 17:45 Kinney, Michael D  ha
scritto:

> Hi Michael,
>
> Can you provide a pointer to the UEFI Spec statement this breaks?
>

The spec does say that interrupts are disabled for TPL_HIGH_LEVEL, but
indeed it doesn't say they are always enabled at lower levels. However, if
the interrupts aren't always enabled whenever you're below TPL_HIGH_LEVEL,
you get priority inversions (and deadlocks).

For example, if you end up running with interrupts disabled at
TPL_CALLBACK, you are disabling the dispatching of timers at TPL_NOTIFY.

I guess this can be deduced from these two passages:

- "The functions in these queues are invoked in FIFO order, starting with
the highest priority level queue and proceeding to the lowest priority
queue that is unmasked by the current TPL"

- "If Type is TimerRelative and TriggerTime is 0, then the timer event will
be signaled on the next timer tick" (in the description of gBS->SetTimer)

Paolo


> Thanks,
>
> Mike
>
> > -Original Message-
> > From: Michael Brown 
> > Sent: Thursday, February 29, 2024 5:23 AM
> > To: devel@edk2.groups.io; Ni, Ray 
> > Cc: Kinney, Michael D ; Liming Gao
> > ; Laszlo Ersek ; Paolo
> > Bonzini 
> > Subject: Re: [edk2-devel] [PATCH 2/2] MdeModulePkg/DxeCore: Fix stack
> > overflow issue due to nested interrupts
> >
> > On 29/02/2024 13:02, Ni, Ray wrote:
> > > A ideal solution is to not keep the interrupt disabled when
> > > RestoreTPL(TPL_HIGH -> not TPL_HIGH) is executed in the timer
> > interrupt
> > > context because the interrupt handler will re-enable the interrupt
> > with
> > > arch specific instructions (e.g.: IRET for x86).
> > >
> > > The patch introduces mInterruptedTplMask which tells RestoreTPL() if
> > > it's called in the interrupt context and whether it should defer
> > enabling
> > > the interrupt.
> >
> > NACK.  This breaks the specification-defined behaviour for
> > RestoreTPL().
> >
> > What guarantees do we have that there is no code anywhere in the world
> > that relies upon RestoreTPL() unconditionally re-enabling interrupts.
> >
> > I also find this code substantially harder to follow than
> > NestedInterruptTplLib (which does not break any specified behaviour).
> >
> > Thanks,
> >
> > Michael
>
>


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#116179): https://edk2.groups.io/g/devel/message/116179
Mute This Topic: https://groups.io/mt/104642317/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] GuestPhysAddrSize questions

2024-02-22 Thread Paolo Bonzini
On Thu, Feb 22, 2024 at 5:13 PM Paolo Bonzini  wrote:
> Also, to clarify the hardware behavior, if hCR4.LA57=0 and host
> PhysAddrSize==52, then will guest physical addresses above 2^48
>
> 1) cause a reserved #PF in the guest, or
>
> 2) cause a non-present NPF exit in the hypervisor?
>
> I remember that several years ago we had a discussion on hCR4.LA57=0
> reducing the address space compared to MAXPHYADDR, but I cannot find the
> emails and also at the time I didn't notice GuestPhysAddrSize.

Found them! They say that "according to the design document, CPU will
report #NPF if the guest references a PA that is greater than 48 bits
while the hypervisor is in 4-level nested paging mode". That's nice,
because it's the same behavior as the affected Intel processors.

Paolo

> Anyhow, basically we would like GuestPhysAddrSize to be "redefined" as
>
> Maximum usable physical address size in bits.  Physical addresses
> above this size should not be used, but will not produce a "reserved"
> page fault.  When this field is zero, all bits up to PhysAddrSize are
> usable.  This field is expected to be nonzero only on guests where
> the hypervisor is using nested paging.



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#115859): https://edk2.groups.io/g/devel/message/115859
Mute This Topic: https://groups.io/mt/104510523/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] GuestPhysAddrSize questions

2024-02-22 Thread Paolo Bonzini

On 2/22/24 16:44, Tom Lendacky wrote:

On 2/22/24 05:24, Gerd Hoffmann wrote:

   Hi,


+    if (Cr4.Bits.LA57) {
+  if (PhysBits > 48) {
+    /*
+ * Some Intel CPUs support 5-level paging, have more than 48
+ * phys-bits but support only 4-level EPT, which effectively
+ * limits guest phys-bits to 48.
+ *
+ * AMD Processors have a different but somewhat related
+ * problem: They can handle guest phys-bits larger than 48
+ * only in case the host runs in 5-level paging mode.
+ *
+ * Until we have some way to communicate that kind of
+ * limitations from hypervisor to guest, limit phys-bits
+ * to 48 unconditionally.
+ */


So I'm looking for some communication path.  One option would be to use
some bits in the KVM cpuid leaves.  Another possible candidate is cpuid
leaf 0x8008.

 From the AMD APM (revision 3.35):

   CPUID Fn8000_0008_EAX Long Mode Size Identifiers
   

   The value returned in EAX provides information about the maximum host
   and guest physical and linear address width (in bits) supported by the
   processor.

   Bits   FieldName    Description

   31:24  —    Reserved

   23:16 GuestPhysAddrSize Maximum guest physical address size in bits.
   This number applies only to guests using 
nested

   paging. When this field is zero, refer to the
   PhysAddrSize field for the maximum guest
   physical address size. See “Secure Virtual
   Machine” in APM Volume 2.

   15:8  LinAddrSize   Maximum linear address size in bits.

   7:0   PhysAddrSize  Maximum physical address size in bits. When
   GuestPhysAddrSize is zero, this field also
   indicates the maximum guest physical address
   size.

The description of the GuestPhysAddrSize is somewhat vague.  Is this a
value the hypervisor should use to figure how much address space it can
give to guests?  Or is this a value the hypervisor can set to inform the
guest about the available address space (which would be a solution to
the problem outlined in the comment above)?  Or both?


I believe the main purpose of GuestPhysAddrSize was for software use 
(for nested virtualization) and that the hardware itself has always 
returned zero for that value. So you should be able to use that field. 
Adding @Paolo for his thoughts.


I have already discussed this with Gerd, so I don't really have many 
thoughts to add.


Anyhow, basically we would like GuestPhysAddrSize to be "redefined" as

   Maximum usable physical address size in bits.  Physical addresses
   above this size should not be used, but will not produce a "reserved"
   page fault.  When this field is zero, all bits up to PhysAddrSize are
   usable.  This field is expected to be nonzero only on guests where
   the hypervisor is using nested paging.

Also, to clarify the hardware behavior, if hCR4.LA57=0 and host 
PhysAddrSize==52, then will guest physical addresses above 2^48


1) cause a reserved #PF in the guest, or

2) cause a non-present NPF exit in the hypervisor?

I remember that several years ago we had a discussion on hCR4.LA57=0 
reducing the address space compared to MAXPHYADDR, but I cannot find the 
emails and also at the time I didn't notice GuestPhysAddrSize.


Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#115832): https://edk2.groups.io/g/devel/message/115832
Mute This Topic: https://groups.io/mt/104510523/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH v2] UefiCpuPkg/MmSaveStateLib: Remove checking Smm Rev ID in AMD MmSaveStateLib

2023-10-31 Thread Paolo Bonzini
On Tue, Oct 31, 2023 at 10:16 AM Attar, AbdulLateef (Abdul Lateef)
 wrote:
>
> [Public]
>
> +Laszlo, +Gerd, +Paolo
> PR:  https://github.com/tianocore/edk2/pull/4982

I left a comment in the PR.

The patch is only correct if this code is only ever used on 64-bit
processors. Note that KVM uses the legacy 32-bit format of the SMRAM
state save area, if the virtual machine is configured to clear the LM
bit of CPUID.

Second, the commit message does not explain why you are doing this.
Without such an explanation, it is impossible to provide more
constructive feedback.

Paolo

> -Original Message-
> From: Lin, Jacque 
> Sent: Tuesday, October 31, 2023 11:07 AM
> To: devel@edk2.groups.io
> Cc: Lin, Jacque ; Attar, AbdulLateef (Abdul Lateef) 
> ; Chang, Abner 
> Subject: [PATCH v2] UefiCpuPkg/MmSaveStateLib: Remove checking Smm Rev ID in 
> AMD MmSaveStateLib
>
> Remove checking SMM Rev ID in AMD Save State lib when reading Save State 
> Register EFI_MM_SAVE_STATE_REGISTER_IO.
> For AMD, it is not necessary to check SmmRevId when reading Save State 
> Register EFI_MM_SAVE_STATE_REGISTER_IO.
>
> Cc: Abdul Lateef Attar 
> Cc: Abner Chang 
> Signed-off-by: Jacque Lin 
> ---
>  UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c | 13 -
>  1 file changed, 13 deletions(-)
>
> diff --git a/UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c 
> b/UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c
> index 3315a6cc44..c4bf6ad4bb 100644
> --- a/UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c
> +++ b/UefiCpuPkg/Library/MmSaveStateLib/AmdMmSaveState.c
> @@ -102,7 +102,6 @@ MmSaveStateReadRegister (
>OUT VOID*Buffer   ) {-  UINT32 
> SmmRevId;   EFI_MM_SAVE_STATE_IO_INFO  *IoInfo;   AMD_SMRAM_SAVE_STATE_MAP   
> *CpuSaveState;   UINT8  DataWidth;@@ -124,18 +123,6 @@ 
> MmSaveStateReadRegister (
> // Check for special EFI_MM_SAVE_STATE_REGISTER_IO   if (Register == 
> EFI_MM_SAVE_STATE_REGISTER_IO) {-//-// Get SMM Revision ID-//-
> MmSaveStateReadRegisterByIndex (CpuIndex, 
> AMD_MM_SAVE_STATE_REGISTER_SMMREVID_INDEX, sizeof (SmmRevId), );--   
>  //-// See if the CPU supports the IOMisc register in the save state-
> //-if (SmmRevId < AMD_SMM_MIN_REV_ID_X64) {-  return EFI_NOT_FOUND;-  
>   }- // Check if IO Restart Dword [IO Trap] is valid or not using bit 1.  
>if (!(CpuSaveState->x64.IO_DWord & 0x02u)) {   return EFI_NOT_FOUND;--
> 2.36.1.windows.1
>



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#110402): https://edk2.groups.io/g/devel/message/110402
Mute This Topic: https://groups.io/mt/102292190/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-rfc] [edk2-devel] RFC: design review for TDVF in OVMF

2021-06-09 Thread Paolo Bonzini

On 09/06/21 16:28, James Bottomley wrote:

That would cut across the ApEntrypoint and the guidedStructureEnd.
However, nothing says anything in the reset vector guided structure has
to be data ... so it could equally well be code.  That means we can do
guid based entries that contain the 32 bit real and 64 bit entry
points.  This would also come with the added advantage that we can scan
the OVMF binary to see what entry points it supports.


Isn't the initial state included in the save area just like for SEV-ES? 
 So it's not even QEMU, but rather some external tool that builds the 
encrypted image, that needs to understand that GUIDed structure.


The GUIDed structure can either include the entry point code; or it 
could have room for a couple 8-byte pointers since any fixed-size area 
in the GUIDed structure would be just a jump anyway.


Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#76282): https://edk2.groups.io/g/devel/message/76282
Mute This Topic: https://groups.io/mt/83283616/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] separate OVMF binary for TDX? [was: OvmfPkg: Reserve the Secrets and Cpuid page for the SEV-SNP guest]

2021-04-15 Thread Paolo Bonzini

On 15/04/21 01:34, Erdem Aktas wrote:

We do not want to generate different binaries for AMD, Intel, Intel
with TDX, AMD with SEV/SNP etc


My question is why the user would want a single binary for VMs with and 
without TDX/SNP.  I know there is attestation, but why would you even 
want the _possibility_ that your guest starts running without TDX or SNP 
protection, and only find out later via attestation?


For a similar reason, OVMF already supports shipping a binary that fails 
to boot if SMM is not available to the firmware, because then secure 
boot would be trivially circumvented.


I can understand having a single binary for both TDX or SNP.  That's not 
a problem since you can set up the SEV startup VMSA to 32-bit protected 
mode just like TDX wants.



therefore we were expecting the TDX
changes to be part of the upstream code.


Having 1 or more binaries should be unrelated to the changes being 
upstream (or more likely, I am misunderstanding you).


Thanks,

Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#74118): https://edk2.groups.io/g/devel/message/74118
Mute This Topic: https://groups.io/mt/81969494/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [RFC PATCH 00/14] Firmware Support for Fast Live Migration for AMD SEV

2021-03-05 Thread Paolo Bonzini

On 04/03/21 21:45, Laszlo Ersek wrote:

On 03/04/21 10:21, Paolo Bonzini wrote:

Hi Tobin,

as mentioned in the reply to the QEMU patches posted by Tobin, I
think the firmware helper approach is very good, but there are some
disadvantages in the idea of auxiliary vCPUs. These are especially
true in the VMM, where it's much nicer to have a separate VM that
goes through a specialized run loop; however, even in the firmware
level there are some complications (as you pointed out) in letting
MpService workers run after ExitBootServices.

My idea would be that the firmware would start the VM as usual using
the same launch data; then, the firmware would detect it was running
as a migration helper VM during the SEC or PEI phases (for example
via the GHCB or some other unencrypted communication area), and
divert execution to the migration helper instead of proceeding to the
next boot phase. This would be somewhat similar in spirit to how edk2
performs S3 resume, if my memory serves correctly.


Very cool. You'd basically warm-reboot the virtual machine into a new
boot mode (cf. BOOT_WITH_FULL_CONFIGURATION vs. BOOT_ON_S3_RESUME in
OvmfPkg/PlatformPei).

To me that's much more attractive than a "background job".

The S3 parallel is great. What I'm missing is:

- Is it possible to warm-reboot an SEV VM? (I vaguely recall that it's
not possible for SEV-ES at least.) Because, that's how we'd transfer
control to the early parts of the firmware again, IIUC your idea, while
preserving the memory contents.


It's not exactly a warm reboot.  It's two VMs booted at the same time, 
with exactly the same contents as far as encrypted RAM goes, but 
different unencrypted RAM.  The difference makes one VM boot regularly 
and the other end up in the migration helper.  The migration helper can 
be entirely contained in PEI, or it can even be its own OS, stored as a 
flat binary in the firmware.  Whatever is easier.


The divergence would happen much earlier than S3 though.  It would have 
to happen before the APs are brought up, for example, and essentially 
before the first fw_cfg access if (as is likely) the migration helper VM 
does not have fw_cfg at all.  That's why I brought up the possibility of 
diverging as soon as SEC.



- Who would initiate this process? S3 suspend is guest-initiated. (Not
that we couldn't use the guest agent, if needed.)

(In case the idea is really about a separate VM, and not about rebooting
the already running VM, then I don't understand -- how would a separate
VM access the guest RAM that needs to be migrated?)


Answering the other message:


(For some unsolicited personal information, now I feel less bad about
this idea never occurring to me -- I never knew about the KVM patch set
that would enable encryption context sharing. (TBH I thought that was
prevented, by design, in the SEV hardware...))


As far as the SEV hardware is concerned, a "VM" is defined by the ASID.

The VM would be separate at the KVM level, but it would share the ASID 
(and thus the guest RAM) with the primary VM.  So as far as the SEV 
hardware and the processor are concerned, the separate VM would be just 
one more VMCB that runs with that ASID.  Only KVM knows that they are 
backed by different file descriptors etc.


In fact, another advantage is that it would be much easier to scale the 
migration helper to multiple vCPUs.  This is probably also a case for 
diverging much earlier than PEI, because a multi-processor migration 
helper running in PEI or DXE would require ACPI tables and a lot of 
infrastructure that is probably undesirable.


Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#72486): https://edk2.groups.io/g/devel/message/72486
Mute This Topic: https://groups.io/mt/81036365/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [RFC PATCH 00/14] Firmware Support for Fast Live Migration for AMD SEV

2021-03-04 Thread Paolo Bonzini
Hi Tobin,

as mentioned in the reply to the QEMU patches posted by Tobin, I think the 
firmware helper approach is very good, but there are some disadvantages in the 
idea of auxiliary vCPUs. These are especially true in the VMM, where it's much 
nicer to have a separate VM that goes through a specialized run loop; however, 
even in the firmware level there are some complications (as you pointed out) in 
letting MpService workers run after ExitBootServices.

My idea would be that the firmware would start the VM as usual using the same 
launch data; then, the firmware would detect it was running as a migration 
helper VM during the SEC or PEI phases (for example via the GHCB or some other 
unencrypted communication area), and divert execution to the migration helper 
instead of proceeding to the next boot phase. This would be somewhat similar in 
spirit to how edk2 performs S3 resume, if my memory serves correctly.

What do you think?

Thanks,

Paolo


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#72431): https://edk2.groups.io/g/devel/message/72431
Mute This Topic: https://groups.io/mt/81036365/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH v8 07/10] OvmfPkg/SmmCpuFeaturesLib: call CPU hot-eject handler

2021-02-23 Thread Paolo Bonzini

On 23/02/21 18:06, Laszlo Ersek wrote:

On 02/23/21 08:45, Paolo Bonzini wrote:

On 22/02/21 15:53, Laszlo Ersek wrote:

+
+  if (mCpuHotEjectData != NULL) {
+    CPU_HOT_EJECT_HANDLER Handler;
+
+    Handler = mCpuHotEjectData->Handler;

This patch looks otherwise OK to me, but:

In patch v8 08/10, we have a ReleaseMemoryFence(). (For now, it is only
expressed as a MemoryFence() call; we'll make that more precise later.)

(1) I think that should be paired with an AcquireMemoryFence() call,
just before loading "mCpuHotEjectData->Handler" above -- for now, also
expressed as a MemoryFence() call only.


In Linux terms, there is a control dependency here.  However, it should
at least be a separate statement to load mCpuHotEjectData (which from my
EDK2 reminiscences should be a global) into a local variable.  So

   EjectData = mCPUHotEjectData;
   // Optional AcquireMemoryFence here
   if (EjectData != NULL) {
     CPU_HOT_EJECT_HANDLER Handler;

     Handler = EjectData->Handler;
     if (Handler != NULL) {
   Handler (CpuIndex);
     }
   }


Yes, "mCPUHotEjectData" is a global.

"mCpuHotEjectData" itself is set up on the BSP (from the entry point
function of the PiSmmCpuSmmDxe driver), before any other APs have a
chance to execute any SMM-related code at all. Furthermore, once set up,
mCpuHotEjectData never changes -- it remains set to a particular
non-NULL value forever, or it remains NULL forever. (The latter case
applies when the possible CPU count is 1; IOW, then there is no AP at all.)


Ok, that's what I was missing.  However, your code below has *two* loads 
of mCpuHotEjectData and the fence would have to go after the second 
(between the load of mCpuHotEjectData and the load of the Handler 
field).  Therefore I would still use a local variable even if you decide 
to put the fence inside the "if", which I agree is the clearest.


Paolo


Because of that, I thought that the first comparison (mCpuHotEjectData
!= NULL) would not need any fence -- I thought it was similar to a
userspace program that (a) set a global variable in the "main" thread,
before calling pthread_create(), (b) treated the global variable as a
constant, ever after (meaning all threads).

However, mCpuHotEjectData->Handler is changed regularly (modified by the
BSP, and read "later" by all processors). That's why I thought the
acquire fence was needed in the following location:

   if (mCpuHotEjectData != NULL) {
 CPU_HOT_EJECT_HANDLER Handler;

 //
 // HERE -- AcquireMemoryFence()
 //
 Handler = mCpuHotEjectData->Handler;
 if (Handler != NULL) {
   Handler (CpuIndex);
 }
   }

Thanks!
Laszlo





-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#72094): https://edk2.groups.io/g/devel/message/72094
Mute This Topic: https://groups.io/mt/80819862/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH v8 07/10] OvmfPkg/SmmCpuFeaturesLib: call CPU hot-eject handler

2021-02-22 Thread Paolo Bonzini

On 22/02/21 15:53, Laszlo Ersek wrote:

+
+  if (mCpuHotEjectData != NULL) {
+CPU_HOT_EJECT_HANDLER Handler;
+
+Handler = mCpuHotEjectData->Handler;

This patch looks otherwise OK to me, but:

In patch v8 08/10, we have a ReleaseMemoryFence(). (For now, it is only
expressed as a MemoryFence() call; we'll make that more precise later.)

(1) I think that should be paired with an AcquireMemoryFence() call,
just before loading "mCpuHotEjectData->Handler" above -- for now, also
expressed as a MemoryFence() call only.


In Linux terms, there is a control dependency here.  However, it should 
at least be a separate statement to load mCpuHotEjectData (which from my 
EDK2 reminiscences should be a global) into a local variable.  So


  EjectData = mCPUHotEjectData;
  // Optional AcquireMemoryFence here
  if (EjectData != NULL) {
CPU_HOT_EJECT_HANDLER Handler;

Handler = EjectData->Handler;
if (Handler != NULL) {
  Handler (CpuIndex);
}
  }

Thanks,

Paolo



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#72038): https://edk2.groups.io/g/devel/message/72038
Mute This Topic: https://groups.io/mt/80819862/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-




Re: [edk2-devel] [PATCH] OvmfPkg: End timer interrupt later to avoid stack overflow under load

2020-06-17 Thread Paolo Bonzini
On 17/06/20 17:46, Laszlo Ersek wrote:
>> That said, Igor's patch seems correct to me.  In fact, I'd even move
>> DisableInterrupts before gBS->RestoreTPL unless there's a good reason
>> not to do so.
> OK, thank you!
> 
> Igor, please confirm if you'd like to submit v2 with the update
> suggested by Paolo, or if you prefer the current version. We're at the
> beginning of the current development cycle, so I guess we can apply the
> patch and see how it works in practice. If it ends up wreaking havoc on
> some platforms, we can always revert the patch in time for the next
> stable tag (edk2-stable202008).

For what it's worth "correct" means that I don't see anything that could
break and in fact I find it good policy hygienic not to allow recursive
interrupts.

v1 is okay for me too, so:

Reviewed-by: Paolo Bonzini 

Paolo


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#61432): https://edk2.groups.io/g/devel/message/61432
Mute This Topic: https://groups.io/mt/74913405/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] [PATCH] OvmfPkg: End timer interrupt later to avoid stack overflow under load

2020-06-17 Thread Paolo Bonzini
On 16/06/20 20:42, Laszlo Ersek wrote:
> (Hmmm... maybe the hypervisor *has* to queue the timer interrupts,
> otherwise some of them would simply be lost, and the guest would lose
> track of time.)

Yes, there are various kinds of coalescing of interrupts that
hypervisors perform to help the guest keep track of time.  This is
especially true of the PIT and RTC; newer OSes track time directly from
the TSC, the HPET or the APIC timer so they tolerate lost ticks much better.

That said, Igor's patch seems correct to me.  In fact, I'd even move
DisableInterrupts before gBS->RestoreTPL unless there's a good reason
not to do so.

Paolo


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#61416): https://edk2.groups.io/g/devel/message/61416
Mute This Topic: https://groups.io/mt/74913405/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] privileged entropy sources in QEMU/KVM guests

2019-11-07 Thread Paolo Bonzini
On 07/11/19 14:27, Laszlo Ersek wrote:
> The VirtioRngDxe driver is a UEFI driver that follows the UEFI driver
> model. Meaning (in this context), it is connected to the virtio-rng
> device in the BDS phase, by platform BDS code.
> 
> Put differently, the non-privileged driver that's the source of the
> sensitive data would have to be a "platform DXE driver". The virtio
> drivers are not such drivers however.

Yes, it would have to be a platform DXE driver.  What is it that limits
virtio to the BDS phase?

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#50229): https://edk2.groups.io/g/devel/message/50229
Mute This Topic: https://groups.io/mt/45640732/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] privileged entropy sources in QEMU/KVM guests

2019-11-07 Thread Paolo Bonzini
On 07/11/19 12:55, Daniel P. Berrangé wrote:
>> Yes, I would make SMM use a cryptographic pseudo-random number generator 
>> and seed it from virtio-rng from DXE, way before the OS starts and can 
>> "attack" it.
>>
>> Once you've gotten a seed, you can create a CSPRNG with a stream cipher 
>> such as ChaCha20, which is literally 30 lines of code.
> If all we need is a one-time seed then virtio-rng is possibly overkill as
> that provides a continuous stream. Instead could QEMU read a few bytes
> from the host's /dev/urandom and pass it to EDK via fw_cfg, which can
> use it for the CSPRNG seed. EDK would have to erase the fw_cfg field
> to prevent the seed value leaking to the guest OS, but other than that
> its quite straightforward.

That would need anyway a change to the emulated hardware.  If the guest
is able to use virtio-rng after the firmware exits (which is the case is
all the firmware needs is a one-time seed), then using virtio-rng is the
simplest alternative as it needs no change at all outside the firmware.

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#50211): https://edk2.groups.io/g/devel/message/50211
Mute This Topic: https://groups.io/mt/45640732/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] privileged entropy sources in QEMU/KVM guests

2019-11-07 Thread Paolo Bonzini
On 07/11/19 12:52, Daniel P. Berrangé wrote:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bb5530e4082446aac3a3d69780cd4dbfa4520013
> 
> Is it practical to provide a jitter entropy source for EDK2
> too ?

The hard part is not collecting jitter (though the firmware might be too
deterministic for that), but rather turning it into a random number seed
(mixing data from various sources, crediting entropy, etc.).

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#50210): https://edk2.groups.io/g/devel/message/50210
Mute This Topic: https://groups.io/mt/45640732/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] privileged entropy sources in QEMU/KVM guests

2019-11-07 Thread Paolo Bonzini
On 07/11/19 11:25, Ard Biesheuvel wrote:
>> This looks problematic on QEMU. Entropy is a valuable resource, and
>> whatever resource SMM drivers depend on, should not be possible for e.g.
>> a 3rd party UEFI driver (or even for the runtime OS) to exhaust.
>> Therefore, it's not *only* the case that SMM drivers must not consume
>> EFI_RNG_PROTOCOL (which exists at a less critical privilege level, i.e.
>> outside of SMM/SMRAM), but also that SMM drivers must not depend on the
>> same piece of *hardware* that feeds EFI_RNG_PROTOCOL.
>>
> The typical model is to seed a DRBG [deterministic pseudorandom
> sequence generator] using a sufficient amount of high quality entropy.
> Once you have done that, it is rather hard to exhaust a DRBG - it is a
> mathematical construction that is designed to last for a long time (<=
> 2^48 invocations [not bytes] according to the NIST spec), after which
> it does not degrade although it may have generated so much output that
> its internal state may be inferred if you have captured enough of it
> (which is a rather theoretical issue IMHO)
> 
> The problem is that using the output of a DRBG as a seed is
> non-trivial - the spec describes ways to do this, but wiring
> virtio-rng to a DRBG in the host and using its output to seed a DRBG
> in the guest is slighly problematic.
> 
> So it seems to me that the correct way to model this is to make the
> host's true entropy source a shared resource like any other.
> 

Yes, I would make SMM use a cryptographic pseudo-random number generator 
and seed it from virtio-rng from DXE, way before the OS starts and can 
"attack" it.

Once you've gotten a seed, you can create a CSPRNG with a stream cipher 
such as ChaCha20, which is literally 30 lines of code.

Paolo

#define ROTL(a,b) (((a) << (b)) | ((a) >> (32 - (b
#define QR(a, b, c, d) (\
a += b,  d ^= a,  d = ROTL(d,16),   \
c += d,  b ^= c,  b = ROTL(b,12),   \
a += b,  d ^= a,  d = ROTL(d, 8),   \
c += d,  b ^= c,  b = ROTL(b, 7))
#define ROUNDS 20

// initial state:
// in[0] = 0x65787061
// in[1] = 0x6e642033
// in[2] = 0x322d6279
// in[3] = 0x7465206b
// in[4]..in[11] = key (seed)
// in[12]..in[13] = 0
// in[14]..in[15] = nonce, can probably use RDTSC?
static uint32_t in[16];

uint32_t chacha_rng(void)
{
int i;
static uint32_t x[16], p;
if (p < 16)
return in[p++] + x[p++];

if (++in[12] == 0)
++in[13];

for (i = 0; i < 16; ++i)
x[i] = in[i];

// 10 loops × 2 rounds/loop = 20 rounds
for (i = 0; i < ROUNDS; i += 2) {
// Odd round
QR(x[0], x[4], x[ 8], x[12]); // column 0
QR(x[1], x[5], x[ 9], x[13]); // column 1
QR(x[2], x[6], x[10], x[14]); // column 2
QR(x[3], x[7], x[11], x[15]); // column 3
// Even round
QR(x[0], x[5], x[10], x[15]); // diagonal 1 (main diagonal)
QR(x[1], x[6], x[11], x[12]); // diagonal 2
QR(x[2], x[7], x[ 8], x[13]); // diagonal 3
QR(x[3], x[4], x[ 9], x[14]); // diagonal 4
}
p = 1;
return in[0] + x[0];
}

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#50203): https://edk2.groups.io/g/devel/message/50203
Mute This Topic: https://groups.io/mt/45640732/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] [Qemu-devel] [PATCH 1/2] q35: implement 128K SMRAM at default SMBASE address

2019-09-24 Thread Paolo Bonzini
On 20/09/19 11:28, Laszlo Ersek wrote:
>> On QEMU side,  we can drop black-hole approach and allocate
>> dedicated SMRAM region, which explicitly gets mapped into
>> RAM address space and after SMI hanlder initialization, gets
>> unmapped (locked). So that SMRAM would be accessible only
>> from SMM context. That way RAM at 0x3 could be used as
>> normal when SMRAM is unmapped.
>
> I prefer the black-hole approach, introduced in your current patch
> series, if it can work. Way less opportunity for confusion.

Another possibility would be to alias the 0xA..0xB SMRAM to
0x3..0x4 (only when in SMM).

I'm not super enthusiastic about adding this kind of QEMU-only feature.
 The alternative would be to implement VT-d range locking through the
intel-iommu device's PCI configuration space (which includes _adding_
the configuration space, i.e. making the IOMMU a PCI device in the first
place, and the support to the firmware for configuring the VT-d BAR at
0xfed9).  This would be the right way to do it, but it would entail
a lot of work throughout the stack. :(  So I guess some variant of this
would be okay, as long as it's peppered with "this is not how real
hardware does it" comments in both QEMU and EDK2.

Thanks,

Paolo

> I've started work on the counterpart OVMF patches; I'll report back.


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#47939): https://edk2.groups.io/g/devel/message/47939
Mute This Topic: https://groups.io/mt/34201782/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-23 Thread Paolo Bonzini
On 22/08/19 22:06, Kinney, Michael D wrote:
> The SMBASE register is internal and cannot be directly accessed 
> by any CPU.  There is an SMBASE field that is member of the SMM Save
> State area and can only be modified from SMM and requires the
> execution of an RSM instruction from SMM for the SMBASE register to
> be updated from the current SMBASE field value.  The new SMBASE
> register value is only used on the next SMI.

Actually there is also an SMBASE MSR, even though in current silicon
it's read-only and its use is theoretically limited to SMM-transfer
monitors.  If that MSR could be made accessible somehow outside SMM,
that would be great.

> Once all the CPUs have been initialized for SMM, the CPUs that are not needed
> can be hot removed.  As noted above, the SMBASE value does not change on
> an INIT.  So as long as the hot add operation does not do a RESET, the
> SMBASE value must be preserved.

IIRC, hot-remove + hot-add will unplugs/plugs a completely different CPU.

> Another idea is to emulate this behavior.  If the hot plug controller
> provide registers (only accessible from SMM) to assign the SMBASE address
> for every CPU.  When a CPU is hot added, QEMU can set the internal SMBASE
> register value from the hot plug controller register value.  If the SMM
> Monarch sends an INIT or an SMI from the Local APIC to the hot added CPU,
> then the SMBASE register should not be modified and the CPU starts execution
> within TSEG the first time it receives an SMI.

Yes, this would work.  But again---if the issue is real on current
hardware too, I'd rather have a matching solution for virtual platforms.

If the current hardware for example remembers INIT-preserved across
hot-remove/hot-add, we could emulate that.

I guess the fundamental question is: how do bare metal platforms avoid
this issue, or plan to avoid this issue?  Once we know that, we can use
that information to find a way to implement it in KVM.  Only if it is
impossible we'll have a different strategy that is specific to our platform.

Paolo

> Jiewen and I can collect specific questions on this topic and continue
> the discussion here.  For example, I do not think there is any method
> other than what I referenced above to program the SMBASE register, but
> I can ask if there are any other methods.


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#46307): https://edk2.groups.io/g/devel/message/46307
Mute This Topic: https://groups.io/mt/32979681/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-23 Thread Paolo Bonzini
On 23/08/19 00:32, Kinney, Michael D wrote:
> Paolo,
> 
> It is my understanding that real HW hot plug uses the SDM defined
> methods.  Meaning the initial SMI is to 3000:8000 and they rebase
> to TSEG in the first SMI.  They must have chipset specific methods
> to protect 3000:8000 from DMA.

It would be great if you could check.

> Can we add a chipset feature to prevent DMA to 64KB range from
> 0x3-0x3 and the UEFI Memory Map and ACPI content can be
> updated so the Guest OS knows to not use that range for DMA?

If real hardware does it at the chipset level, we will probably use
Igor's suggestion of aliasing A-seg to 3000:.  Before starting the
new CPU, the SMI handler can prepare the SMBASE relocation trampoline at
A000:8000 and the hot-plugged CPU will find it at 3000:8000 when it
receives the initial SMI.  Because this is backed by RAM at
0xA-0xA, DMA cannot access it and would still go through to RAM
at 0x3.

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#46306): https://edk2.groups.io/g/devel/message/46306
Mute This Topic: https://groups.io/mt/32979681/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-22 Thread Paolo Bonzini
On 22/08/19 20:29, Laszlo Ersek wrote:
> On 08/22/19 08:18, Paolo Bonzini wrote:
>> On 21/08/19 22:17, Kinney, Michael D wrote:
>>> DMA protection of memory ranges is a chipset feature. For the current
>>> QEMU implementation, what ranges of memory are guaranteed to be
>>> protected from DMA?  Is it only A/B seg and TSEG?
>>
>> Yes.
> 
> This thread (esp. Jiewen's and Mike's messages) are the first time that
> I've heard about the *existence* of such RAM ranges / the chipset
> feature. :)
> 
> Out of interest (independently of virtualization), how is a general
> purpose OS informed by the firmware, "never try to set up DMA to this
> RAM area"? Is this communicated through ACPI _CRS perhaps?
> 
> ... Ah, almost: ACPI 6.2 specifies _DMA, in "6.2.4 _DMA (Direct Memory
> Access)". It writes,
> 
> For example, if a platform implements a PCI bus that cannot access
> all of physical memory, it has a _DMA object under that PCI bus that
> describes the ranges of physical memory that can be accessed by
> devices on that bus.
> 
> Sorry about the digression, and also about being late to this thread,
> continually -- I'm primarily following and learning.

It's much simpler: these ranges are not in e820, for example

kernel: BIOS-e820: [mem 0x00059000-0x0008bfff] usable
kernel: BIOS-e820: [mem 0x0008c000-0x000f] reserved

The ranges are not special-cased in any way by QEMU.  Simply, AB-segs
and TSEG RAM are not part of the address space except when in SMM.
Therefore, DMA to those ranges ends up respectively to low VGA RAM[1]
and to the bit bucket.  When AB-segs are open, for example, DMA to that
area becomes possible.

Paolo

[1] old timers may remember DEF SEG=: BLOAD "foo.img",0.  It still
works with some disk device models.

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#46233): https://edk2.groups.io/g/devel/message/46233
Mute This Topic: https://groups.io/mt/32979681/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-22 Thread Paolo Bonzini
On 22/08/19 19:59, Laszlo Ersek wrote:
> The firmware and QEMU could agree on a formula, which would compute the
> CPU-specific SMBASE from a value pre-programmed by the firmware, and the
> initial APIC ID of the hot-added CPU.
> 
> Yes, it would duplicate code -- the calculation -- between QEMU and
> edk2. While that's not optimal, it wouldn't be a first.

No, that would be unmaintainable.  The best solution to me seems to be
to make SMBASE programmable from non-SMM code if some special conditions
hold.  Michael, would it be possible to get in contact with the Intel
architects?

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#46232): https://edk2.groups.io/g/devel/message/46232
Mute This Topic: https://groups.io/mt/32979681/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-22 Thread Paolo Bonzini
On 21/08/19 22:17, Kinney, Michael D wrote:
> Paolo,
> 
> It makes sense to match real HW.

Note that it'd also be fine to match some kind of official Intel
specification even if no processor (currently?) supports it.

> That puts us back to
> the reset vector and handling the initial SMI at
> 3000:8000.  That is all workable from a FW implementation
> perspective.  It look like the only issue left is DMA.
> 
> DMA protection of memory ranges is a chipset feature.
> For the current QEMU implementation, what ranges of
> memory are guaranteed to be protected from DMA?  Is
> it only A/B seg and TSEG?

Yes.

Paolo

>> Yes, all of these would work.  Again, I'm interested in
>> having something that has a hope of being implemented in
>> real hardware.
>>
>> Another, far easier to implement possibility could be a
>> lockable MSR (could be the existing
>> MSR_SMM_FEATURE_CONTROL) that allows programming the
>> SMBASE outside SMM.  It would be nice if such a bit
>> could be defined by Intel.
>>
>> Paolo


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#46204): https://edk2.groups.io/g/devel/message/46204
Mute This Topic: https://groups.io/mt/32979681/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-22 Thread Paolo Bonzini
On 21/08/19 19:25, Kinney, Michael D wrote:
> Could we have an initial SMBASE that is within TSEG.
> 
> If we bring in hot plug CPUs one at a time, then initial
> SMBASE in TSEG can reprogram the SMBASE to the correct 
> value for that CPU.
> 
> Can we add a register to the hot plug controller that
> allows the BSP to set the initial SMBASE value for 
> a hot added CPU?  The default can be 3000:8000 for
> compatibility.
> 
> Another idea is when the SMI handler runs for a hot add
> CPU event, the SMM monarch programs the hot plug controller
> register with the SMBASE to use for the CPU that is being
> added.  As each CPU is added, a different SMBASE value can
> be programmed by the SMM Monarch.

Yes, all of these would work.  Again, I'm interested in having something
that has a hope of being implemented in real hardware.

Another, far easier to implement possibility could be a lockable MSR
(could be the existing MSR_SMM_FEATURE_CONTROL) that allows programming
the SMBASE outside SMM.  It would be nice if such a bit could be defined
by Intel.

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#46203): https://edk2.groups.io/g/devel/message/46203
Mute This Topic: https://groups.io/mt/32979681/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-rfc] [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-21 Thread Paolo Bonzini
On 21/08/19 17:48, Kinney, Michael D wrote:
> Perhaps there is a way to avoid the 3000:8000 startup
> vector.
> 
> If a CPU is added after a cold reset, it is already in a
> different state because one of the active CPUs needs to
> release it by interacting with the hot plug controller.
> 
> Can the SMRR for CPUs in that state be pre-programmed to
> match the SMRR in the rest of the active CPUs?
> 
> For OVMF we expect all the active CPUs to use the same
> SMRR value, so a check can be made to verify that all 
> the active CPUs have the same SMRR value.  If they do,
> then any CPU released through the hot plug controller 
> can have its SMRR pre-programmed and the initial SMI
> will start within TSEG.
> 
> We just need to decide what to do in the unexpected 
> case where all the active CPUs do not have the same
> SMRR value.
> 
> This should also reduce the total number of steps.

The problem is not the SMRR but the SMBASE.  If the SMBASE area is
outside TSEG, it is vulnerable to DMA attacks independent of the SMRR.
SMBASE is also different for all CPUs, so it cannot be preprogrammed.

(As an aside, virt platforms are also immune to cache poisoning so they
don't have SMRR yet - we could use them for SMM_CODE_CHK_EN and block
execution outside SMRR but we never got round to it).

An even simpler alternative would be to make Ah the initial SMBASE.
 However, I would like to understand what hardware platforms plan to do,
if anything.

Paolo

> Mike
> 
>> -Original Message-
>> From: r...@edk2.groups.io [mailto:r...@edk2.groups.io] On
>> Behalf Of Yao, Jiewen
>> Sent: Sunday, August 18, 2019 4:01 PM
>> To: Paolo Bonzini 
>> Cc: Alex Williamson ; Laszlo
>> Ersek ; devel@edk2.groups.io; edk2-
>> rfc-groups-io ; qemu devel list
>> ; Igor Mammedov
>> ; Chen, Yingwen
>> ; Nakajima, Jun
>> ; Boris Ostrovsky
>> ; Joao Marcal Lemos Martins
>> ; Phillip Goerl
>> 
>> Subject: Re: [edk2-rfc] [edk2-devel] CPU hotplug using
>> SMM with QEMU+OVMF
>>
>> in real world, we deprecate AB-seg usage because they
>> are vulnerable to smm cache poison attack.
>> I assume cache poison is out of scope in the virtual
>> world, or there is a way to prevent ABseg cache poison.
>>
>> thank you!
>> Yao, Jiewen
>>
>>
>>> 在 2019年8月19日,上午3:50,Paolo Bonzini
>>  写道:
>>>
>>>> On 17/08/19 02:20, Yao, Jiewen wrote:
>>>> [Jiewen] That is OK. Then we MUST add the third
>> adversary.
>>>> -- Adversary: Simple hardware attacker, who can use
>> device to perform DMA attack in the virtual world.
>>>> NOTE: The DMA attack in the real world is out of
>> scope. That is be handled by IOMMU in the real world,
>> such as VTd. -- Please do clarify if this is TRUE.
>>>>
>>>> In the real world:
>>>> #1: the SMM MUST be non-DMA capable region.
>>>> #2: the MMIO MUST be non-DMA capable region.
>>>> #3: the stolen memory MIGHT be DMA capable region or
>> non-DMA capable
>>>> region. It depends upon the silicon design.
>>>> #4: the normal OS accessible memory - including ACPI
>> reclaim, ACPI
>>>> NVS, and reserved memory not included by #3 - MUST be
>> DMA capable region.
>>>> As such, IOMMU protection is NOT required for #1 and
>> #2. IOMMU
>>>> protection MIGHT be required for #3 and MUST be
>> required for #4.
>>>> I assume the virtual environment is designed in the
>> same way. Please
>>>> correct me if I am wrong.
>>>>
>>>
>>> Correct.  The 0x3...0x3 area is the only
>> problematic one;
>>> Igor's idea (or a variant, for example optionally
>> remapping
>>> 0xa..0xa SMRAM to 0x3) is becoming more
>> and more attractive.
>>>
>>> Paolo
>>
>> 
> 


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#46171): https://edk2.groups.io/g/devel/message/46171
Mute This Topic: https://groups.io/mt/32979681/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-19 Thread Paolo Bonzini
On 19/08/19 01:00, Yao, Jiewen wrote:
> in real world, we deprecate AB-seg usage because they are vulnerable
> to smm cache poison attack. I assume cache poison is out of scope in
> the virtual world, or there is a way to prevent ABseg cache poison.

Indeed the SMRR would not cover the A-seg on real hardware.  However, if
the chipset allowed aliasing A-seg SMRAM to 0x3, it would only be
used for SMBASE relocation of hotplugged CPU.  The firmware would still
keep low SMRAM disabled, *except around SMBASE relocation of hotplugged
CPUs*.  To avoid cache poisoning attacks, you only have to issue a
WBINVD before enabling low SMRAM and before disabling it.  Hotplug SMI
is not a performance-sensitive path, so it's not a big deal.

So I guess you agree that PCI DMA attacks are a potential vector also on
real hardware.  As Alex pointed out, VT-d is not a solution because
there could be legitimate DMA happening during CPU hotplug.  For OVMF
we'll probably go with Igor's idea, it would be nice if Intel chipsets
supported it too. :)

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#46043): https://edk2.groups.io/g/devel/message/46043
Mute This Topic: https://groups.io/mt/32852911/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-19 Thread Paolo Bonzini
On 17/08/19 02:20, Yao, Jiewen wrote:
> [Jiewen] That is OK. Then we MUST add the third adversary.
> -- Adversary: Simple hardware attacker, who can use device to perform DMA 
> attack in the virtual world.
> NOTE: The DMA attack in the real world is out of scope. That is be handled by 
> IOMMU in the real world, such as VTd. -- Please do clarify if this is TRUE.
> 
> In the real world:
> #1: the SMM MUST be non-DMA capable region.
> #2: the MMIO MUST be non-DMA capable region.
> #3: the stolen memory MIGHT be DMA capable region or non-DMA capable
> region. It depends upon the silicon design.
> #4: the normal OS accessible memory - including ACPI reclaim, ACPI
> NVS, and reserved memory not included by #3 - MUST be DMA capable region.
> As such, IOMMU protection is NOT required for #1 and #2. IOMMU
> protection MIGHT be required for #3 and MUST be required for #4.
> I assume the virtual environment is designed in the same way. Please
> correct me if I am wrong.
> 

Correct.  The 0x3...0x3 area is the only problematic one;
Igor's idea (or a variant, for example optionally remapping
0xa..0xa SMRAM to 0x3) is becoming more and more attractive.

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#46039): https://edk2.groups.io/g/devel/message/46039
Mute This Topic: https://groups.io/mt/32852911/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-16 Thread Paolo Bonzini
On 15/08/19 18:07, Igor Mammedov wrote:
> Looking at Q35 code and Seabios SMM relocation as example, if I see it
> right QEMU has:
> - SMRAM is aliased from DRAM at 0xa
> - and TSEG steals from the top of low RAM when configured
> 
> Now problem is that default SMBASE at 0x3 isn't backed by anything
> in SMRAM address space and default SMI entry falls-through to the same
> location in System address space.
> 
> The later is not trusted and entry into SMM mode will corrupt area + might
> jump to 'random' SMI handler (hence save/restore code in Seabios).
> 
> Here is an idea, can we map a memory region at 0x3 in SMRAM address
> space with relocation space/code reserved. It could be a part of TSEG
> (so we don't have to invent ABI to configure that)?

No, there could be real mode code using it.  What we _could_ do is
initialize SMBASE to 0xa, but I think it's better to not deviate too
much from processor behavior (even if it's admittedly a 20-years legacy
that doesn't make any sense).

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#45808): https://edk2.groups.io/g/devel/message/45808
Mute This Topic: https://groups.io/mt/32852911/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-16 Thread Paolo Bonzini
On 15/08/19 17:00, Laszlo Ersek wrote:
> On 08/14/19 16:04, Paolo Bonzini wrote:
>> On 14/08/19 15:20, Yao, Jiewen wrote:
>>>> - Does this part require a new branch somewhere in the OVMF SEC code?
>>>>   How do we determine whether the CPU executing SEC is BSP or
>>>>   hot-plugged AP?
>>> [Jiewen] I think this is blocked from hardware perspective, since the first 
>>> instruction.
>>> There are some hardware specific registers can be used to determine if the 
>>> CPU is new added.
>>> I don’t think this must be same as the real hardware.
>>> You are free to invent some registers in device model to be used in OVMF 
>>> hot plug driver.
>>
>> Yes, this would be a new operation mode for QEMU, that only applies to
>> hot-plugged CPUs.  In this mode the AP doesn't reply to INIT or SMI, in
>> fact it doesn't reply to anything at all.
>>
>>>> - How do we tell the hot-plugged AP where to start execution? (I.e. that
>>>>   it should execute code at a particular pflash location.)
>>> [Jiewen] Same real mode reset vector at :FFF0.
>>
>> You do not need a reset vector or INIT/SIPI/SIPI sequence at all in
>> QEMU.  The AP does not start execution at all when it is unplugged, so
>> no cache-as-RAM etc.
>>
>> We only need to modify QEMU so that hot-plugged APIs do not reply to
>> INIT/SIPI/SMI.
>>
>>> I don’t think there is problem for real hardware, who always has CAR.
>>> Can QEMU provide some CPU specific space, such as MMIO region?
>>
>> Why is a CPU-specific region needed if every other processor is in SMM
>> and thus trusted.
> 
> I was going through the steps Jiewen and Yingwen recommended.
> 
> In step (02), the new CPU is expected to set up RAM access. In step
> (03), the new CPU, executing code from flash, is expected to "send board
> message to tell host CPU (GPIO->SCI) -- I am waiting for hot-add
> message." For that action, the new CPU may need a stack (minimally if we
> want to use C function calls).
> 
> Until step (03), there had been no word about any other (= pre-plugged)
> CPUs (more precisely, Jiewen even confirmed "No impact to other
> processors"), so I didn't assume that other CPUs had entered SMM.
> 
> Paolo, I've attempted to read Jiewen's response, and yours, as carefully
> as I can. I'm still very confused. If you have a better understanding,
> could you please write up the 15-step process from the thread starter
> again, with all QEMU customizations applied? Such as, unnecessary steps
> removed, and platform specifics filled in.

Sure.

(01a) QEMU: create new CPU.  The CPU already exists, but it does not
 start running code until unparked by the CPU hotplug controller.

(01b) QEMU: trigger SCI

(02-03) no equivalent

(04) Host CPU: (OS) execute GPE handler from DSDT

(05) Host CPU: (OS) Port 0xB2 write, all CPUs enter SMM (NOTE: New CPU
 will not enter CPU because SMI is disabled)

(06) Host CPU: (SMM) Save 38000, Update 38000 -- fill simple SMM
 rebase code.

(07a) Host CPU: (SMM) Write to CPU hotplug controller to enable
 new CPU

(07b) Host CPU: (SMM) Send INIT/SIPI/SIPI to new CPU.

(08a) New CPU: (Low RAM) Enter protected mode.

(08b) New CPU: (Flash) Signals host CPU to proceed and enter cli;hlt loop.

(09) Host CPU: (SMM) Send SMI to the new CPU only.

(10) New CPU: (SMM) Run SMM code at 38000, and rebase SMBASE to
 TSEG.

(11) Host CPU: (SMM) Restore 38000.

(12) Host CPU: (SMM) Update located data structure to add the new CPU
 information. (This step will involve CPU_SERVICE protocol)

(13) New CPU: (Flash) do whatever other initialization is needed

(14) New CPU: (Flash) Deadloop, and wait for INIT-SIPI-SIPI.

(15) Host CPU: (OS) Send INIT-SIPI-SIPI to pull new CPU in..


In other words, the cache-as-RAM phase of 02-03 is replaced by the
INIT-SIPI-SIPI sequence of 07b-08a-08b.


>> The QEMU DSDT could be modified (when secure boot is in effect) to OUT
>> to 0xB2 when hotplug happens.  It could write a well-known value to
>> 0xB2, to be read by an SMI handler in edk2.
> 
> I dislike involving QEMU's generated DSDT in anything SMM (even
> injecting the SMI), because the AML interpreter runs in the OS.
> 
> If a malicious OS kernel is a bit too enlightened about the DSDT, it
> could willfully diverge from the process that we design. If QEMU
> broadcast the SMI internally, the guest OS could not interfere with that.
> 
> If the purpose of the SMI is specifically to force all CPUs into SMM
> (and thereby force them into trusted state), then the OS would be
> explicitly counter-interested in carrying out the AML operations from
> QEMU's DSDT.

But since the hotplug controller 

Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-16 Thread Paolo Bonzini
On 16/08/19 04:46, Yao, Jiewen wrote:
> Comment below:
> 
> 
>> -Original Message-
>> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
>> Sent: Friday, August 16, 2019 12:21 AM
>> To: Laszlo Ersek ; devel@edk2.groups.io; Yao, Jiewen
>> 
>> Cc: edk2-rfc-groups-io ; qemu devel list
>> ; Igor Mammedov ;
>> Chen, Yingwen ; Nakajima, Jun
>> ; Boris Ostrovsky ;
>> Joao Marcal Lemos Martins ; Phillip Goerl
>> 
>> Subject: Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF
>>
>> On 15/08/19 17:00, Laszlo Ersek wrote:
>>> On 08/14/19 16:04, Paolo Bonzini wrote:
>>>> On 14/08/19 15:20, Yao, Jiewen wrote:
>>>>>> - Does this part require a new branch somewhere in the OVMF SEC
>> code?
>>>>>>   How do we determine whether the CPU executing SEC is BSP or
>>>>>>   hot-plugged AP?
>>>>> [Jiewen] I think this is blocked from hardware perspective, since the 
>>>>> first
>> instruction.
>>>>> There are some hardware specific registers can be used to determine if
>> the CPU is new added.
>>>>> I don’t think this must be same as the real hardware.
>>>>> You are free to invent some registers in device model to be used in
>> OVMF hot plug driver.
>>>>
>>>> Yes, this would be a new operation mode for QEMU, that only applies to
>>>> hot-plugged CPUs.  In this mode the AP doesn't reply to INIT or SMI, in
>>>> fact it doesn't reply to anything at all.
>>>>
>>>>>> - How do we tell the hot-plugged AP where to start execution? (I.e.
>> that
>>>>>>   it should execute code at a particular pflash location.)
>>>>> [Jiewen] Same real mode reset vector at :FFF0.
>>>>
>>>> You do not need a reset vector or INIT/SIPI/SIPI sequence at all in
>>>> QEMU.  The AP does not start execution at all when it is unplugged, so
>>>> no cache-as-RAM etc.
>>>>
>>>> We only need to modify QEMU so that hot-plugged APIs do not reply to
>>>> INIT/SIPI/SMI.
>>>>
>>>>> I don’t think there is problem for real hardware, who always has CAR.
>>>>> Can QEMU provide some CPU specific space, such as MMIO region?
>>>>
>>>> Why is a CPU-specific region needed if every other processor is in SMM
>>>> and thus trusted.
>>>
>>> I was going through the steps Jiewen and Yingwen recommended.
>>>
>>> In step (02), the new CPU is expected to set up RAM access. In step
>>> (03), the new CPU, executing code from flash, is expected to "send board
>>> message to tell host CPU (GPIO->SCI) -- I am waiting for hot-add
>>> message." For that action, the new CPU may need a stack (minimally if we
>>> want to use C function calls).
>>>
>>> Until step (03), there had been no word about any other (= pre-plugged)
>>> CPUs (more precisely, Jiewen even confirmed "No impact to other
>>> processors"), so I didn't assume that other CPUs had entered SMM.
>>>
>>> Paolo, I've attempted to read Jiewen's response, and yours, as carefully
>>> as I can. I'm still very confused. If you have a better understanding,
>>> could you please write up the 15-step process from the thread starter
>>> again, with all QEMU customizations applied? Such as, unnecessary steps
>>> removed, and platform specifics filled in.
>>
>> Sure.
>>
>> (01a) QEMU: create new CPU.  The CPU already exists, but it does not
>>  start running code until unparked by the CPU hotplug controller.
>>
>> (01b) QEMU: trigger SCI
>>
>> (02-03) no equivalent
>>
>> (04) Host CPU: (OS) execute GPE handler from DSDT
>>
>> (05) Host CPU: (OS) Port 0xB2 write, all CPUs enter SMM (NOTE: New CPU
>>  will not enter CPU because SMI is disabled)
>>
>> (06) Host CPU: (SMM) Save 38000, Update 38000 -- fill simple SMM
>>  rebase code.
>>
>> (07a) Host CPU: (SMM) Write to CPU hotplug controller to enable
>>  new CPU
>>
>> (07b) Host CPU: (SMM) Send INIT/SIPI/SIPI to new CPU.
> [Jiewen] NOTE: INIT/SIPI/SIPI can be sent by a malicious CPU. There is no
> restriction that INIT/SIPI/SIPI can only be sent in SMM.

All of the CPUs are now in SMM, and INIT/SIPI/SIPI will be discarded
before 07a, so this is okay.

However I do see a problem, because a PCI device's DMA could overwrite
0x38000 between (06) and (10) and hijack the code that is executed in
SMM.  How is this avoided on re

Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-15 Thread Paolo Bonzini
On 15/08/19 11:55, Yao, Jiewen wrote:
> Hi Paolo
> I am not sure what do you mean - "You do not need a reset vector ...".
> If so, where is the first instruction of the new CPU in the virtualization 
> environment?
> Please help me understand that at first. Then we can continue the discussion.

The BSP starts running from 0xFFF0.  APs do not start running at all
and just sit waiting for an INIT-SIPI-SIPI sequence.  Please see my
proposal in the reply to Laszlo.

Paolo

-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.

View/Reply Online (#45741): https://edk2.groups.io/g/devel/message/45741
Mute This Topic: https://groups.io/mt/32852911/21656
Group Owner: devel+ow...@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: [edk2-devel] CPU hotplug using SMM with QEMU+OVMF

2019-08-14 Thread Paolo Bonzini
On 14/08/19 15:20, Yao, Jiewen wrote:
>> - Does this part require a new branch somewhere in the OVMF SEC code?
>>   How do we determine whether the CPU executing SEC is BSP or
>>   hot-plugged AP?
> [Jiewen] I think this is blocked from hardware perspective, since the first 
> instruction.
> There are some hardware specific registers can be used to determine if the 
> CPU is new added.
> I don’t think this must be same as the real hardware.
> You are free to invent some registers in device model to be used in OVMF hot 
> plug driver.

Yes, this would be a new operation mode for QEMU, that only applies to
hot-plugged CPUs.  In this mode the AP doesn't reply to INIT or SMI, in
fact it doesn't reply to anything at all.

>> - How do we tell the hot-plugged AP where to start execution? (I.e. that
>>   it should execute code at a particular pflash location.)
> [Jiewen] Same real mode reset vector at :FFF0.

You do not need a reset vector or INIT/SIPI/SIPI sequence at all in
QEMU.  The AP does not start execution at all when it is unplugged, so
no cache-as-RAM etc.

We only need to modify QEMU so that hot-plugged APIs do not reply to
INIT/SIPI/SMI.

> I don’t think there is problem for real hardware, who always has CAR.
> Can QEMU provide some CPU specific space, such as MMIO region?

Why is a CPU-specific region needed if every other processor is in SMM
and thus trusted.

>>   Does CPU hotplug apply only at the socket level? If the CPU is
>>   multi-core, what is responsible for hot-plugging all cores present in
>>   the socket?

I can answer this: the SMM handler would interact with the hotplug
controller in the same way that ACPI DSDT does normally.  This supports
multiple hotplugs already.

Writes to the hotplug controller from outside SMM would be ignored.

>>> (03) New CPU: (Flash) send board message to tell host CPU (GPIO->SCI)
>>>  -- I am waiting for hot-add message.
>>
>> Maybe we can simplify this in QEMU by broadcasting an SMI to existent
>> processors immediately upon plugging the new CPU.

The QEMU DSDT could be modified (when secure boot is in effect) to OUT
to 0xB2 when hotplug happens.  It could write a well-known value to
0xB2, to be read by an SMI handler in edk2.


>>
>>>(NOTE: Host CPU can only
>> send
>>>  instruction in SMM mode. -- The register is SMM only)
>>
>> Sorry, I don't follow -- what register are we talking about here, and
>> why is the BSP needed to send anything at all? What "instruction" do you
>> have in mind?
> [Jiewen] The new CPU does not enable SMI at reset.
> At some point of time later, the CPU need enable SMI, right?
> The "instruction" here means, the host CPUs need tell to CPU to enable SMI.

Right, this would be a write to the CPU hotplug controller

>>> (04) Host CPU: (OS) get message from board that a new CPU is added.
>>>  (GPIO -> SCI)
>>>
>>> (05) Host CPU: (OS) All CPUs enter SMM (SCI->SWSMI) (NOTE: New CPU
>>>  will not enter CPU because SMI is disabled)
>>
>> I don't understand the OS involvement here. But, again, perhaps QEMU can
>> force all existent CPUs into SMM immediately upon adding the new CPU.
> [Jiewen] OS here means the Host CPU running code in OS environment, not in 
> SMM environment.

See above.

>>> (06) Host CPU: (SMM) Save 38000, Update 38000 -- fill simple SMM
>>>  rebase code.
>>>
>>> (07) Host CPU: (SMM) Send message to New CPU to Enable SMI.
>>
>> Aha, so this is the SMM-only register you mention in step (03). Is the
>> register specified in the Intel SDM?
> [Jiewen] Right. That is the register to let host CPU tell new CPU to enable 
> SMI.
> It is platform specific register. Not defined in SDM.
> You may invent one in device model.

See above.

>>> (10) New CPU: (SMM) Response first SMI at 38000, and rebase SMBASE to
>>>  TSEG.
>>
>> What code does the new CPU execute after it completes step (10)? Does it
>> halt?
>
> [Jiewen] The new CPU exits SMM and return to original place - where it is
> interrupted to enter SMM - running code on the flash.

So in our case we'd need an INIT/SIPI/SIPI sequence between (06) and (07).

>>> (11) Host CPU: (SMM) Restore 38000.
>>
>> These steps (i.e., (06) through (11)) don't appear RAS-specific. The
>> only platform-specific feature seems to be SMI masking register, which
>> could be extracted into a new SmmCpuFeaturesLib API.
>>
>> Thus, would you please consider open sourcing firmware code for steps
>> (06) through (11)?
>>
>> Alternatively -- and in particular because the stack for step (01)
>> concerns me --, we could approach this from a high-level, functional
>> perspective. The states that really matter are the relocated SMBASE for
>> the new CPU, and the state of the full system, right at the end of step
>> (11).
>>
>> When the SMM setup quiesces during normal firmware boot, OVMF could
>> use
>> existent (finalized) SMBASE infomation to *pre-program* some virtual
>> QEMU hardware, with such state that would be expected, as