Re: [PATCH 2/2] tests: add testing of parameter=1 for SMP topology

2024-05-15 Thread Xiaoyao Li

On 5/13/2024 8:33 PM, Daniel P. Berrangé wrote:

Validate that it is possible to pass 'parameter=1' for any SMP topology
parameter, since unsupported parameters are implicitly considered to
always have a value of 1.

Signed-off-by: Daniel P. Berrangé 
---
  tests/unit/test-smp-parse.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/tests/unit/test-smp-parse.c b/tests/unit/test-smp-parse.c
index 56165e6644..56ce5128f1 100644
--- a/tests/unit/test-smp-parse.c
+++ b/tests/unit/test-smp-parse.c
@@ -330,6 +330,14 @@ static const struct SMPTestData data_generic_valid[] = {
  .config = SMP_CONFIG_GENERIC(T, 8, T, 2, T, 4, T, 2, T, 16),
  .expect_prefer_sockets = CPU_TOPOLOGY_GENERIC(8, 2, 4, 2, 16),
  .expect_prefer_cores   = CPU_TOPOLOGY_GENERIC(8, 2, 4, 2, 16),
+}, {
+/*
+ * Unsupported parameters are always allowed to be set to '1'
+ * config: -smp 
8,books=1,drawers=1,sockets=2,modules=1,dies=1,cores=4,threads=2,maxcpus=8


cores=2 not 4.


+ * expect: cpus=8,sockets=2,cores=2,threads=2,maxcpus=8 */
+.config = SMP_CONFIG_WITH_FULL_TOPO(8, 1, 1, 2, 1, 1, 2, 2, 8),
+.expect_prefer_sockets = CPU_TOPOLOGY_GENERIC(8, 2, 2, 2, 8),
+.expect_prefer_cores   = CPU_TOPOLOGY_GENERIC(8, 2, 2, 2, 8),
  },
  };
  





Re: [PATCH 6/6] target/i386/confidential-guest: Fix comment of x86_confidential_guest_kvm_type()

2024-04-26 Thread Xiaoyao Li

On 4/26/2024 6:07 PM, Zhao Liu wrote:

Update the comment to match the X86ConfidentialGuestClass
implementation.

Suggested-by: Xiaoyao Li 


I think it should be "Reported-by"


Signed-off-by: Zhao Liu 
---
  target/i386/confidential-guest.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/confidential-guest.h b/target/i386/confidential-guest.h
index 532e172a60b6..06d54a120227 100644
--- a/target/i386/confidential-guest.h
+++ b/target/i386/confidential-guest.h
@@ -44,7 +44,7 @@ struct X86ConfidentialGuestClass {
  /**
   * x86_confidential_guest_kvm_type:
   *
- * Calls #X86ConfidentialGuestClass.unplug callback of @plug_handler.
+ * Calls #X86ConfidentialGuestClass.kvm_type() callback.
   */
  static inline int x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg)
  {





Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name

2024-04-25 Thread Xiaoyao Li

On 4/25/2024 6:29 PM, Zhao Liu wrote:

On Thu, Apr 25, 2024 at 04:40:10PM +0800, Xiaoyao Li wrote:

Date: Thu, 25 Apr 2024 16:40:10 +0800
From: Xiaoyao Li 
Subject: Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock
  feature name

On 4/25/2024 3:17 PM, Zhao Liu wrote:

Hi Xiaoyao,

On Wed, Apr 24, 2024 at 11:57:11PM +0800, Xiaoyao Li wrote:

Date: Wed, 24 Apr 2024 23:57:11 +0800
From: Xiaoyao Li 
Subject: Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock
   feature name

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

Hi list,

This series is based on Paolo's guest_phys_bits patchset [1].

Currently, the old and new kvmclocks have the same feature name
"kvmclock" in FeatureWordInfo[FEAT_KVM].

When I tried to dig into the history of this unusual naming and fix it,
I realized that Tim was already trying to rename it, so I picked up his
renaming patch [2] (with a new commit message and other minor changes).

13 years age, the same name was introduced in [3], and its main purpose
is to make it easy for users to enable/disable 2 kvmclocks. Then, in
2012, Don tried to rename the new kvmclock, but the follow-up did not
address Igor and Eduardo's comments about compatibility.

Tim [2], not long ago, and I just now, were both puzzled by the naming
one after the other.


The commit message of [3] said the reason clearly:

When we tweak flags involving this value - specially when we use "-",
we have to act on both.

So you are trying to change it to "when people want to disable kvmclock,
they need to use '-kvmclock,-kvmclock2' instead of '-kvmclock'"

IMHO, I prefer existing code and I don't see much value of differentiating
them. If the current code puzzles you, then we can add comment to explain.


It's enough to just enable kvmclock2 for Guest; kvmclock (old) is
redundant in the presence of kvmclock2.

So operating both feature bits at the same time is not a reasonable
choice, we should only keep kvmclock2 for Guest. It's possible because
the oldest linux (v4.5) which QEMU i386 supports has kvmclock2.


who said the oldest Linux QEMU supports is from 4.5? what about 2.x kernel?


For Host (docs/system/target-i386.rst).


Besides, not only the Linux guest, whatever guest OS is, it will be broken
if the guest is using kvmclock and QEMU starts to drop support of kvmclock.


I'm not aware of any minimum version requirements for Guest supported
by KVM, but there are no commitment.


the common commitment is at least keeping backwards compatibility.


So, again, hard NAK to drop the support of kvmclock. It breaks existing
guests that use kvmclock. You cannot force people to upgrade their existing
VMs to use kvmclock2 instead of kvmclock.


I agree, legacy kvmclock can be left out, if the old kernel does not
support kvmclock2 and strongly requires kvmclock, it can be enabled
using 9.1 and earlier machines or legacy-kvmclock, as long as Host still
supports it.

What's the gap in handling it this way? especially considering that
kvmclock2 was introduced in v2.6.35, and earlier kernels are no longer
maintained. The availability of the PV feature requires compatibility
for both Host and Guest.

Anyway, the above discussion here is about future plans, and this series
does not prevent any Guest from ignoring kvmclock2 in favor of kvmclock.

What this series is doing, i.e. separating the current two kvmclock and
ensuring CPUID compatibility via legacy-kvmclock, could balance the
compatibility requirements of the ancient (unmaintained kernel) with
the need for future feature changes.


You introduce a user-visible change that people need to use 
"-kvmclock,-kvmclock2" to totally disable kvmclock.


The only difference between kvmclock and kvmclock2 is the MSR index. And 
from users' perspective, they don't care this difference. The existing 
usage is simple. When I want kvmclock, use "+kvmclock". When I don't 
want it, use "-kvmclock".


You are complicating things and I don't see a strong reason that we have 
to do it.



Thanks,
Zhao






Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name

2024-04-25 Thread Xiaoyao Li

On 4/25/2024 3:17 PM, Zhao Liu wrote:

Hi Xiaoyao,

On Wed, Apr 24, 2024 at 11:57:11PM +0800, Xiaoyao Li wrote:

Date: Wed, 24 Apr 2024 23:57:11 +0800
From: Xiaoyao Li 
Subject: Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock
  feature name

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

Hi list,

This series is based on Paolo's guest_phys_bits patchset [1].

Currently, the old and new kvmclocks have the same feature name
"kvmclock" in FeatureWordInfo[FEAT_KVM].

When I tried to dig into the history of this unusual naming and fix it,
I realized that Tim was already trying to rename it, so I picked up his
renaming patch [2] (with a new commit message and other minor changes).

13 years age, the same name was introduced in [3], and its main purpose
is to make it easy for users to enable/disable 2 kvmclocks. Then, in
2012, Don tried to rename the new kvmclock, but the follow-up did not
address Igor and Eduardo's comments about compatibility.

Tim [2], not long ago, and I just now, were both puzzled by the naming
one after the other.


The commit message of [3] said the reason clearly:

   When we tweak flags involving this value - specially when we use "-",
   we have to act on both.

So you are trying to change it to "when people want to disable kvmclock,
they need to use '-kvmclock,-kvmclock2' instead of '-kvmclock'"

IMHO, I prefer existing code and I don't see much value of differentiating
them. If the current code puzzles you, then we can add comment to explain.


It's enough to just enable kvmclock2 for Guest; kvmclock (old) is
redundant in the presence of kvmclock2.

So operating both feature bits at the same time is not a reasonable
choice, we should only keep kvmclock2 for Guest. It's possible because
the oldest linux (v4.5) which QEMU i386 supports has kvmclock2.


who said the oldest Linux QEMU supports is from 4.5? what about 2.x kernel?

Besides, not only the Linux guest, whatever guest OS is, it will be 
broken if the guest is using kvmclock and QEMU starts to drop support of 
kvmclock.


So, again, hard NAK to drop the support of kvmclock. It breaks existing 
guests that use kvmclock. You cannot force people to upgrade their 
existing VMs to use kvmclock2 instead of kvmclock.



Pls see:
https://elixir.bootlin.com/linux/v4.5/source/arch/x86/include/uapi/asm/kvm_para.h#L22

With this in mind, I'm trying to implement a silky smooth transition to
"only kcmclock2 support".

This series is now separating kvmclock and kvmclock2, and then I plan to
go to the KVM side and deprecate kvmclock2, and then finally remove
kvmclock (old) in QEMU. Separating the two features makes the process
clearer.

Continuing to use the same name equally would have silently caused the
CPUID to change on the Guest side, which would have hurt compatibility
as well.


So, this series is to push for renaming the new kvmclock feature to
"kvmclock2" and adding compatibility support for older machines (PC 9.0
and older).

Finally, let's put an end to decades of doubt about this name.


Next Step
=

This series just separates the two kvmclocks from the naming, and in
subsequent patches I plan to stop setting kvmclock(old kcmclock) by
default as long as KVM supports kvmclock2 (new kvmclock).


No. It will break existing guests that use KVM_FEATURE_CLOCKSOURCE.


Please refer to my elaboration on v4.5 above, where kvmclock2 is
available, it should be used in priority.


Also, try to deprecate the old kvmclock in KVM side.

[1]: 
https://lore.kernel.org/qemu-devel/20240325141422.1380087-1-pbonz...@redhat.com/
[2]: 
https://lore.kernel.org/qemu-devel/20230908124534.25027-4-twied...@redhat.com/
[3]: 
https://lore.kernel.org/qemu-devel/1300401727-5235-3-git-send-email-glom...@redhat.com/
[4]: 
https://lore.kernel.org/qemu-devel/1348171412-23669-3-git-send-email-...@cloudswitch.com/


Thanks,
Zhao







Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name

2024-04-24 Thread Xiaoyao Li

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

Hi list,

This series is based on Paolo's guest_phys_bits patchset [1].

Currently, the old and new kvmclocks have the same feature name
"kvmclock" in FeatureWordInfo[FEAT_KVM].

When I tried to dig into the history of this unusual naming and fix it,
I realized that Tim was already trying to rename it, so I picked up his
renaming patch [2] (with a new commit message and other minor changes).

13 years age, the same name was introduced in [3], and its main purpose
is to make it easy for users to enable/disable 2 kvmclocks. Then, in
2012, Don tried to rename the new kvmclock, but the follow-up did not
address Igor and Eduardo's comments about compatibility.

Tim [2], not long ago, and I just now, were both puzzled by the naming
one after the other.


The commit message of [3] said the reason clearly:

  When we tweak flags involving this value - specially when we use "-",
  we have to act on both.

So you are trying to change it to "when people want to disable kvmclock, 
they need to use '-kvmclock,-kvmclock2' instead of '-kvmclock'"


IMHO, I prefer existing code and I don't see much value of 
differentiating them. If the current code puzzles you, then we can add 
comment to explain.



So, this series is to push for renaming the new kvmclock feature to
"kvmclock2" and adding compatibility support for older machines (PC 9.0
and older).

Finally, let's put an end to decades of doubt about this name.


Next Step
=

This series just separates the two kvmclocks from the naming, and in
subsequent patches I plan to stop setting kvmclock(old kcmclock) by
default as long as KVM supports kvmclock2 (new kvmclock).


No. It will break existing guests that use KVM_FEATURE_CLOCKSOURCE.


Also, try to deprecate the old kvmclock in KVM side.

[1]: 
https://lore.kernel.org/qemu-devel/20240325141422.1380087-1-pbonz...@redhat.com/
[2]: 
https://lore.kernel.org/qemu-devel/20230908124534.25027-4-twied...@redhat.com/
[3]: 
https://lore.kernel.org/qemu-devel/1300401727-5235-3-git-send-email-glom...@redhat.com/
[4]: 
https://lore.kernel.org/qemu-devel/1348171412-23669-3-git-send-email-...@cloudswitch.com/

Thanks and Best Regards,
Zhao

---
Tim Wiederhake (1):
   target/i386: Fix duplicated kvmclock name in FEAT_KVM

Zhao Liu (6):
   target/i386/kvm: Add feature bit definitions for KVM CPUID
   target/i386/kvm: Remove local MSR_KVM_WALL_CLOCK and
 MSR_KVM_SYSTEM_TIME definitions
   target/i386/kvm: Only Save/load kvmclock MSRs when kvmclock enabled
   target/i386/kvm: Save/load MSRs of new kvmclock
 (KVM_FEATURE_CLOCKSOURCE2)
   target/i386/kvm: Add legacy_kvmclock cpu property
   target/i386/kvm: Update comment in kvm_cpu_realizefn()

  hw/i386/kvm/clock.c   |  5 ++--
  hw/i386/pc.c  |  1 +
  target/i386/cpu.c |  3 +-
  target/i386/cpu.h | 32 +
  target/i386/kvm/kvm-cpu.c | 25 -
  target/i386/kvm/kvm.c | 59 +--
  6 files changed, 99 insertions(+), 26 deletions(-)






Re: [PATCH for-9.1 2/7] target/i386/kvm: Remove local MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME definitions

2024-04-24 Thread Xiaoyao Li

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

These 2 MSRs have been already defined in the kvm_para header
(standard-headers/asm-x86/kvm_para.h).

Remove QEMU local definitions to avoid duplication.

Signed-off-by: Zhao Liu 


Reviewed-by: Xiaoyao Li 


---
  target/i386/kvm/kvm.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 2f3c8bc3a4ed..be88339fb8bd 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -78,9 +78,6 @@
  #define KVM_APIC_BUS_CYCLE_NS   1
  #define KVM_APIC_BUS_FREQUENCY  (10ULL / KVM_APIC_BUS_CYCLE_NS)
  
-#define MSR_KVM_WALL_CLOCK  0x11

-#define MSR_KVM_SYSTEM_TIME 0x12
-
  /* A 4096-byte buffer can hold the 8-byte kvm_msrs header, plus
   * 255 kvm_msr_entry structs */
  #define MSR_BUF_SIZE 4096





Re: [PATCH for-9.1 1/7] target/i386/kvm: Add feature bit definitions for KVM CPUID

2024-04-24 Thread Xiaoyao Li

On 3/29/2024 6:19 PM, Zhao Liu wrote:

From: Zhao Liu 

Add feature definiations for KVM_CPUID_FEATURES in CPUID (
CPUID[4000_0001].EAX and CPUID[4000_0001].EDX), to get rid of lots of
offset calculations.

Signed-off-by: Zhao Liu 
---
  hw/i386/kvm/clock.c   |  5 ++---
  target/i386/cpu.h | 23 +++
  target/i386/kvm/kvm.c | 28 ++--
  3 files changed, 39 insertions(+), 17 deletions(-)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index 40aa9a32c32c..7c9752d5036f 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -27,7 +27,6 @@
  #include "qapi/error.h"
  
  #include 

-#include "standard-headers/asm-x86/kvm_para.h"
  #include "qom/object.h"
  
  #define TYPE_KVM_CLOCK "kvmclock"

@@ -334,8 +333,8 @@ void kvmclock_create(bool create_always)
  
  assert(kvm_enabled());

  if (create_always ||
-cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
-   (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
+cpu->env.features[FEAT_KVM] & (CPUID_FEAT_KVM_CLOCK |
+   CPUID_FEAT_KVM_CLOCK2)) {
  sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
  }
  }
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 83e473584517..b1b8d11cb0fe 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -27,6 +27,7 @@
  #include "qapi/qapi-types-common.h"
  #include "qemu/cpu-float.h"
  #include "qemu/timer.h"
+#include "standard-headers/asm-x86/kvm_para.h"
  
  #define XEN_NR_VIRQS 24
  
@@ -951,6 +952,28 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,

  /* Packets which contain IP payload have LIP values */
  #define CPUID_14_0_ECX_LIP  (1U << 31)
  
+/* (Old) KVM paravirtualized clocksource */

+#define CPUID_FEAT_KVM_CLOCK(1U << KVM_FEATURE_CLOCKSOURCE)


we can drop the _FEAT_, just name it as

CPUID_KVM_CLOCK


+/* (New) KVM specific paravirtualized clocksource */
+#define CPUID_FEAT_KVM_CLOCK2   (1U << KVM_FEATURE_CLOCKSOURCE2)
+/* KVM asynchronous page fault */
+#define CPUID_FEAT_KVM_ASYNCPF  (1U << KVM_FEATURE_ASYNC_PF)
+/* KVM stolen (when guest vCPU is not running) time accounting */
+#define CPUID_FEAT_KVM_STEAL_TIME   (1U << KVM_FEATURE_STEAL_TIME)
+/* KVM paravirtualized end-of-interrupt signaling */
+#define CPUID_FEAT_KVM_PV_EOI   (1U << KVM_FEATURE_PV_EOI)
+/* KVM paravirtualized spinlocks support */
+#define CPUID_FEAT_KVM_PV_UNHALT(1U << KVM_FEATURE_PV_UNHALT)
+/* KVM host-side polling on HLT control from the guest */
+#define CPUID_FEAT_KVM_POLL_CONTROL (1U << KVM_FEATURE_POLL_CONTROL)
+/* KVM interrupt based asynchronous page fault*/
+#define CPUID_FEAT_KVM_ASYNCPF_INT  (1U << KVM_FEATURE_ASYNC_PF_INT)
+/* KVM 'Extended Destination ID' support for external interrupts */
+#define CPUID_FEAT_KVM_MSI_EXT_DEST_ID  (1U << KVM_FEATURE_MSI_EXT_DEST_ID)
+
+/* Hint to KVM that vCPUs expect never preempted for an unlimited time */
+#define CPUID_FEAT_KVM_HINTS_REALTIME(1U << KVM_HINTS_REALTIME)
+
  /* CLZERO instruction */
  #define CPUID_8000_0008_EBX_CLZERO  (1U << 0)
  /* Always save/restore FP error pointers */
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index e68cbe929302..2f3c8bc3a4ed 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -481,13 +481,13 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, 
uint32_t function,
   * be enabled without the in-kernel irqchip
   */
  if (!kvm_irqchip_in_kernel()) {
-ret &= ~(1U << KVM_FEATURE_PV_UNHALT);
+ret &= ~CPUID_FEAT_KVM_PV_UNHALT;
  }
  if (kvm_irqchip_is_split()) {
-ret |= 1U << KVM_FEATURE_MSI_EXT_DEST_ID;
+ret |= CPUID_FEAT_KVM_MSI_EXT_DEST_ID;
  }
  } else if (function == KVM_CPUID_FEATURES && reg == R_EDX) {
-ret |= 1U << KVM_HINTS_REALTIME;
+ret |= CPUID_FEAT_KVM_HINTS_REALTIME;
  }
  
  return ret;

@@ -3324,20 +3324,20 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  kvm_msr_entry_add(cpu, MSR_IA32_TSC, env->tsc);
  kvm_msr_entry_add(cpu, MSR_KVM_SYSTEM_TIME, env->system_time_msr);
  kvm_msr_entry_add(cpu, MSR_KVM_WALL_CLOCK, env->wall_clock_msr);
-if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF_INT)) {
+if (env->features[FEAT_KVM] & CPUID_FEAT_KVM_ASYNCPF_INT) {
  kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_INT, 
env->async_pf_int_msr);
  }
-if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF)) {
+if (env->features[FEAT_KVM] & CPUID_FEAT_KVM_ASYNCPF) {
  kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_EN, env->async_pf_en_msr);
  }
-if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_PV_EOI)) {
+if (env->features[FEAT_KVM] & CPUID_FEAT_KVM_PV_EOI) {
  kvm_msr_entry_add(cpu, MSR_KVM_PV_EOI_EN, env->pv_eoi_en_msr);
   

Re: [PULL 43/63] target/i386: Implement mc->kvm_type() to get VM type

2024-04-24 Thread Xiaoyao Li

On 4/23/2024 11:09 PM, Paolo Bonzini wrote:

+
+/**
+ * x86_confidential_guest_kvm_type:
+ *
+ * Calls #X86ConfidentialGuestClass.unplug callback of @plug_handler.


the comment needs to be updated:

Calls #X86ConfidentialGuestClass.kvm_type() callback


+ */
+static inline int x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg)
+{
+X86ConfidentialGuestClass *klass = X86_CONFIDENTIAL_GUEST_GET_CLASS(cg);
+
+if (klass->kvm_type) {
+return klass->kvm_type(cg);
+} else {
+return 0;
+}
+}





Re: [PULL 25/63] i386/kvm: Move architectural CPUID leaf generation to separate helper

2024-04-23 Thread Xiaoyao Li

On 4/23/2024 11:09 PM, Paolo Bonzini wrote:

From: Sean Christopherson 

Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.

For now this is just a cleanup, so keep the function static.

Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
Message-ID: <20240229063726.610065-23-xiaoyao...@intel.com>
Reviewed-by: Xiaoyao Li 
Signed-off-by: Paolo Bonzini 
---
  target/i386/kvm/kvm.c | 449 +-
  1 file changed, 227 insertions(+), 222 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index e68cbe92930..fcf9603d3e6 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1706,6 +1706,231 @@ static void kvm_init_nested_state(CPUX86State *env)
  }
  }
  
+static uint32_t kvm_x86_build_cpuid(CPUX86State *env,

+struct kvm_cpuid_entry2 *entries,
+uint32_t cpuid_i)
+{
+uint32_t limit, i, j;
+uint32_t unused;
+struct kvm_cpuid_entry2 *c;
+
+cpu_x86_cpuid(env, 0, 0, , , , );
+
+for (i = 0; i <= limit; i++) {
+j = 0;
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+switch (i) {
+case 2: {
+/* Keep reading function 2 till all the input is received */
+int times;
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC |
+   KVM_CPUID_FLAG_STATE_READ_NEXT;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax & 0xff;
+
+for (j = 1; j < times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+case 0x1f:
+if (env->nr_dies < 2) {
+cpuid_i--;
+break;
+}
+/* fallthrough */
+case 4:
+case 0xb:
+case 0xd:
+for (j = 0; ; j++) {
+if (i == 0xd && j == 64) {
+break;
+}
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (i == 4 && c->eax == 0) {
+break;
+}
+if (i == 0xb && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0x1f && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0xd && c->eax == 0) {
+continue;
+}
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+}
+break;
+case 0x12:
+for (j = 0; ; j++) {
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (j > 1 && (c->eax & 0xf) != 1) {
+break;
+}
+
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+}
+break;
+case 0x7:
+case 0x14:
+case 0x1d:
+case 0x1e: {
+uint32_t times;
+
+c->function = i;
+c->index = 0;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax;
+
+for (j = 1; j <= times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+c->function = i;
+c->index = j;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+default:
+c->function = i;
+c->flags = 0;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+if (!c->eax && !c->ebx && !c->ecx &a

Re: [PATCH v5 28/65] i386/tdx: Disable pmu for TD guest

2024-04-16 Thread Xiaoyao Li

On 4/16/2024 4:32 PM, Chenyi Qiang wrote:



On 2/29/2024 2:36 PM, Xiaoyao Li wrote:

Current KVM doesn't support PMU for TD guest. It returns error if TD is
created with PMU bit being set in attributes.

Disable PMU for TD guest on QEMU side.

Signed-off-by: Xiaoyao Li 
---
  target/i386/kvm/tdx.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 262e86fd2c67..1c12cda002b8 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -496,6 +496,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
  int r = 0;
  
+object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);


Is it necessary to output some prompt if the user wants to enable pmu by
"-cpu host,pmu=on"? As in patch 27, it mentions PMU is configured by
x86cpu->enable_pmu, but PMU is actually not support in this series and
will be disabled silently.


We do this in QEMU is mainly for KVM, because KVM will fail to init TD 
if ATTRIBUTE.PERFMON is set.


It's expected that KVM reports PERFMON in attributes.fixed0 when KVM 
cannot provide support of it. Then QEMU will print error message 
automatically when validate the attributes.


For QEMU part, next version is going to set the default value of "pmu" 
to false in kvm_cpu_max_instance_init(), so that "-cpu host" will not 
enable pmu for TDX VMs by default.


I suppose both the KVM and QEMU change will show up in the next version.


+
  QEMU_LOCK_GUARD(_guest->lock);
  if (tdx_guest->initialized) {
  return r;





Re: [PATCH v2] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-03 Thread Xiaoyao Li

On 4/3/2024 11:12 PM, Igor Mammedov wrote:

On Wed,  3 Apr 2024 10:59:53 -0400
Xiaoyao Li  wrote:


A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system
also has a PC-AT-compatible dual-8259 setup, i.e., the PIC.

When PIC is not enabled (pic=off) for x86 machine, the PCAT_COMPAT bit
needs to be cleared. Otherwise, the guest thinks there is a present PIC.


Can you add to commit message reproducer (aka qemu CLI and relevant
logs/symptoms observed on guest side)?


When booting a VM with "-machine xx,pic=off", there is supposed to be no 
PIC for the guest. When guest probes PIC, it should find nothing and log 
of below should be printed:


  [0.155970] Using NULL legacy PIC

However, the fact is that no such log printed in guest kernel, with the 
VM created with "pic=off". This is because guest think there is a 
present due to pcat_compat is reporte as 1 in MADT. See Linux code of 
probe_8259A() in arch/x86/kernel/i8259.c 





   
Signed-off-by: Xiaoyao Li   
---

changes in v2:
- Clarify more in commit message;
---
  hw/i386/acpi-common.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
  acpi_table_begin(, table_data);
  /* Local APIC Address */
  build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
  
  for (i = 0; i < apic_ids->len; i++) {

  pc_madt_cpu_entry(i, apic_ids, table_data, false);







[PATCH v2] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-03 Thread Xiaoyao Li
A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system
also has a PC-AT-compatible dual-8259 setup, i.e., the PIC.

When PIC is not enabled (pic=off) for x86 machine, the PCAT_COMPAT bit
needs to be cleared. Otherwise, the guest thinks there is a present PIC.

Signed-off-by: Xiaoyao Li 
---
changes in v2:
- Clarify more in commit message;
---
 hw/i386/acpi-common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
 acpi_table_begin(, table_data);
 /* Local APIC Address */
 build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
 
 for (i = 0; i < apic_ids->len; i++) {
 pc_madt_cpu_entry(i, apic_ids, table_data, false);
-- 
2.34.1




Re: [PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-02 Thread Xiaoyao Li

On 4/2/2024 10:31 PM, Michael S. Tsirkin wrote:

On Tue, Apr 02, 2024 at 09:18:44PM +0800, Xiaoyao Li wrote:

On 4/2/2024 6:02 PM, Michael S. Tsirkin wrote:

On Tue, Apr 02, 2024 at 04:25:16AM -0400, Xiaoyao Li wrote:

Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic.

Signed-off-by: Xiaoyao Li 


Please include more info in the commit log:
what is the behaviour you observe, why it is wrong,
how does the patch fix it, what is guest behaviour
before and after.


Sorry, I thought it was straightforward.

A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system
also has a PC-AT-compatible dual-8259 setup, i.e., the PIC.

When PIC is not enabled for x86 machine, the PCAT_COMPAT bit needs to be
cleared. Otherwise, the guest thinks there is a present PIC even it is
booted with pic=off on QEMU.

(I haven't seen real issue from Linux guest. The user of PIC inside guest
seems only the pit calibration. Whether pit calibration is triggered depends
on other things. But logically, current code is wrong, we need to fix it
anyway.

@Isaku, please share more info if you have)



+ Kirill,

It seems to have issue with legacy irqs with PCAT_COMPAT set 1 while no 
PIC on QEMU side. Kirill, could you elaborate it?




That's sufficient, thanks! Pls put this in commit log and resubmit.


The commit log and the subject should not repeat
what the diff already states.


---
   hw/i386/acpi-common.c | 4 +++-
   1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
   acpi_table_begin(, table_data);
   /* Local APIC Address */
   build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
   for (i = 0; i < apic_ids->len; i++) {
   pc_madt_cpu_entry(i, apic_ids, table_data, false);
--
2.34.1









Re: [PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-02 Thread Xiaoyao Li

On 4/2/2024 6:02 PM, Michael S. Tsirkin wrote:

On Tue, Apr 02, 2024 at 04:25:16AM -0400, Xiaoyao Li wrote:

Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic.

Signed-off-by: Xiaoyao Li 


Please include more info in the commit log:
what is the behaviour you observe, why it is wrong,
how does the patch fix it, what is guest behaviour
before and after.


Sorry, I thought it was straightforward.

A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system 
also has a PC-AT-compatible dual-8259 setup, i.e., the PIC.


When PIC is not enabled for x86 machine, the PCAT_COMPAT bit needs to be 
cleared. Otherwise, the guest thinks there is a present PIC even it is 
booted with pic=off on QEMU.


(I haven't seen real issue from Linux guest. The user of PIC inside 
guest seems only the pit calibration. Whether pit calibration is 
triggered depends on other things. But logically, current code is wrong, 
we need to fix it anyway.


@Isaku, please share more info if you have)



The commit log and the subject should not repeat
what the diff already states.


---
  hw/i386/acpi-common.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
  acpi_table_begin(, table_data);
  /* Local APIC Address */
  build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
  
  for (i = 0; i < apic_ids->len; i++) {

  pc_madt_cpu_entry(i, apic_ids, table_data, false);
--
2.34.1







[PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled

2024-04-02 Thread Xiaoyao Li
Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic.

Signed-off-by: Xiaoyao Li 
---
 hw/i386/acpi-common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 20f19269da40..0cc2919bb851 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
 acpi_table_begin(, table_data);
 /* Local APIC Address */
 build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4);
-build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */
+/* Flags. bit 0: PCAT_COMPAT */
+build_append_int_noprefix(table_data,
+  x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4);
 
 for (i = 0; i < apic_ids->len; i++) {
 pc_madt_cpu_entry(i, apic_ids, table_data, false);
-- 
2.34.1




Re: [PATCH 26/26] i386/kvm: Move architectural CPUID leaf generation to separate helper

2024-04-01 Thread Xiaoyao Li

On 3/23/2024 2:11 AM, Paolo Bonzini wrote:

From: Sean Christopherson 

Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.

For now this is just a cleanup, so keep the function static.

Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
Message-ID: <20240229063726.610065-23-xiaoyao...@intel.com>
[Unify error reporting, rename function. - Paolo]
Signed-off-by: Paolo Bonzini 
---
  target/i386/kvm/kvm.c | 446 +-
  1 file changed, 224 insertions(+), 222 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 2577e345502..eab6261e1f5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1752,6 +1752,228 @@ static void kvm_init_nested_state(CPUX86State *env)
  }
  }
  
+static uint32_t kvm_x86_build_cpuid(CPUX86State *env,

+struct kvm_cpuid_entry2 *entries,
+uint32_t cpuid_i)
+{
+uint32_t limit, i, j;
+uint32_t unused;
+struct kvm_cpuid_entry2 *c;
+
+cpu_x86_cpuid(env, 0, 0, , , , );
+
+for (i = 0; i <= limit; i++) {
+j = 0;
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+switch (i) {
+case 2: {
+/* Keep reading function 2 till all the input is received */
+int times;
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC |
+   KVM_CPUID_FLAG_STATE_READ_NEXT;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax & 0xff;
+
+for (j = 1; j < times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+case 0x1f:
+if (env->nr_dies < 2) {
+cpuid_i--;
+break;
+}
+/* fallthrough */
+case 4:
+case 0xb:
+case 0xd:
+for (j = 0; ; j++) {
+if (i == 0xd && j == 64) {
+break;
+}
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (i == 4 && c->eax == 0) {
+break;
+}
+if (i == 0xb && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0x1f && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0xd && c->eax == 0) {
+continue;
+}
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+}
+break;
+case 0x12:
+for (j = 0; ; j++) {
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (j > 1 && (c->eax & 0xf) != 1) {
+break;
+}
+
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+}
+break;
+case 0x7:
+case 0x14:
+case 0x1d:
+case 0x1e: {
+uint32_t times;
+
+c->function = i;
+c->index = 0;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax;
+
+for (j = 1; j <= times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = [cpuid_i++];
+c->function = i;
+c->index = j;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+default:
+c->function = i;
+c->flags = 0;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+if (!c->eax &&

Re: [PATCH v3 48/49] hw/i386/sev: Use guest_memfd for legacy ROMs

2024-03-27 Thread Xiaoyao Li

On 3/21/2024 2:12 AM, Isaku Yamahata wrote:

On Wed, Mar 20, 2024 at 03:39:44AM -0500,
Michael Roth  wrote:


TODO: make this SNP-specific if TDX disables legacy ROMs in general


TDX disables pc.rom, not disable isa-bios. IIRC, TDX doesn't need pc pflash.


Not TDX doesn't need pc pflash, but TDX cannot support pflash.

Can SNP support the behavior of pflash? That what's got changed will be 
synced back to OVMF file?



Xiaoyao can chime in.

Thanks,



Current SNP guest kernels will attempt to access these regions with
with C-bit set, so guest_memfd is needed to handle that. Otherwise,
kvm_convert_memory() will fail when the guest kernel tries to access it
and QEMU attempts to call KVM_SET_MEMORY_ATTRIBUTES to set these ranges
to private.

Whether guests should actually try to access ROM regions in this way (or
need to deal with legacy ROM regions at all), is a separate issue to be
addressed on kernel side, but current SNP guest kernels will exhibit
this behavior and so this handling is needed to allow QEMU to continue
running existing SNP guest kernels.

Signed-off-by: Michael Roth 
---
  hw/i386/pc.c   | 13 +
  hw/i386/pc_sysfw.c | 13 ++---
  2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index feb7a93083..5feaeb43ee 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1011,10 +1011,15 @@ void pc_memory_init(PCMachineState *pcms,
  pc_system_firmware_init(pcms, rom_memory);
  
  option_rom_mr = g_malloc(sizeof(*option_rom_mr));

-memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
-   _fatal);
-if (pcmc->pci_enabled) {
-memory_region_set_readonly(option_rom_mr, true);
+if (machine_require_guest_memfd(machine)) {
+memory_region_init_ram_guest_memfd(option_rom_mr, NULL, "pc.rom",
+   PC_ROM_SIZE, _fatal);
+} else {
+memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
+   _fatal);
+if (pcmc->pci_enabled) {
+memory_region_set_readonly(option_rom_mr, true);
+}
  }
  memory_region_add_subregion_overlap(rom_memory,
  PC_ROM_MIN_VGA,
diff --git a/hw/i386/pc_sysfw.c b/hw/i386/pc_sysfw.c
index 9dbb3f7337..850f86edd4 100644
--- a/hw/i386/pc_sysfw.c
+++ b/hw/i386/pc_sysfw.c
@@ -54,8 +54,13 @@ static void pc_isa_bios_init(MemoryRegion *rom_memory,
  /* map the last 128KB of the BIOS in ISA space */
  isa_bios_size = MIN(flash_size, 128 * KiB);
  isa_bios = g_malloc(sizeof(*isa_bios));
-memory_region_init_ram(isa_bios, NULL, "isa-bios", isa_bios_size,
-   _fatal);
+if (machine_require_guest_memfd(current_machine)) {
+memory_region_init_ram_guest_memfd(isa_bios, NULL, "isa-bios",
+   isa_bios_size, _fatal);
+} else {
+memory_region_init_ram(isa_bios, NULL, "isa-bios", isa_bios_size,
+   _fatal);
+}
  memory_region_add_subregion_overlap(rom_memory,
  0x10 - isa_bios_size,
  isa_bios,
@@ -68,7 +73,9 @@ static void pc_isa_bios_init(MemoryRegion *rom_memory,
 ((uint8_t*)flash_ptr) + (flash_size - isa_bios_size),
 isa_bios_size);
  
-memory_region_set_readonly(isa_bios, true);

+if (!machine_require_guest_memfd(current_machine)) {
+memory_region_set_readonly(isa_bios, true);
+}
  }
  
  static PFlashCFI01 *pc_pflash_create(PCMachineState *pcms,

--
2.25.1









Re: [PATCH for-9.1 v5 2/3] target/i386: add guest-phys-bits cpu property

2024-03-26 Thread Xiaoyao Li

On 3/25/2024 10:14 PM, Paolo Bonzini wrote:

From: Gerd Hoffmann 

Allows to set guest-phys-bits (cpuid leaf 8008, eax[23:16])
via -cpu $model,guest-phys-bits=$nr.

Signed-off-by: Gerd Hoffmann 
Message-ID: <20240318155336.156197-3-kra...@redhat.com>
Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 


---
v4->v5:
- move here all non-KVM parts
- add compat property and support for special value "-1" (accelerator
   defines value)

  target/i386/cpu.h |  1 +
  hw/i386/pc.c  |  4 +++-
  target/i386/cpu.c | 22 ++
  3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 6b057380791..83e47358451 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 461fcaa1b48..9c4b3969cc8 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -78,7 +78,9 @@
  { "qemu64-" TYPE_X86_CPU, "model-id", "QEMU Virtual CPU version " v, },\
  { "athlon-" TYPE_X86_CPU, "model-id", "QEMU Virtual CPU version " v, },
  
-GlobalProperty pc_compat_9_0[] = {};

+GlobalProperty pc_compat_9_0[] = {
+{ TYPE_X86_CPU, "guest-phys-bits", "0" },
+};
  const size_t pc_compat_9_0_len = G_N_ELEMENTS(pc_compat_9_0);
  
  GlobalProperty pc_compat_8_2[] = {};

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 33760a2ee16..eef3d08473e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);
  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
@@ -7329,6 +7330,14 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  goto out;
  }
  
+if (cpu->guest_phys_bits == -1) {

+/*
+ * If it was not set by the user, or by the accelerator via
+ * cpu_exec_realizefn, clear.
+ */
+cpu->guest_phys_bits = 0;
+}
+
  if (cpu->ucode_rev == 0) {
  /*
   * The default is the same as KVM's. Note that this check
@@ -7379,6 +7388,14 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  if (cpu->phys_bits == 0) {
  cpu->phys_bits = TCG_PHYS_ADDR_BITS;
  }
+if (cpu->guest_phys_bits &&
+(cpu->guest_phys_bits > cpu->phys_bits ||
+cpu->guest_phys_bits < 32)) {
+error_setg(errp, "guest-phys-bits should be between 32 and %u "
+ " (but is %u)",
+ cpu->phys_bits, cpu->guest_phys_bits);
+return;
+}
  } else {
  /* For 32 bit systems don't use the user set value, but keep
   * phys_bits consistent with what we tell the guest.
@@ -7387,6 +7404,10 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  error_setg(errp, "phys-bits is not user-configurable in 32 bit");
  return;
  }
+if (cpu->guest_phys_bits != 0) {
+error_setg(errp, "guest-phys-bits is not user-configurable in 32 
bit");
+return;
+}
  
  if (env->features[FEAT_1_EDX] & (CPUID_PSE36 | CPUID_PAE)) {

  cpu->phys_bits = 36;
@@ -7887,6 +7908,7 @@ static Property x86_cpu_properties[] = {
  DEFINE_PROP_BOOL("x-force-features", X86CPU, force_features, false),
  DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
  DEFINE_PROP_UINT32("phys-bits", X86CPU, phys_bits, 0),
+DEFINE_PROP_UINT32("guest-phys-bits", X86CPU, guest_phys_bits, -1),
  DEFINE_PROP_BOOL("host-phys-bits", X86CPU, host_phys_bits, false),
  DEFINE_PROP_UINT8("host-phys-bits-limit", X86CPU, host_phys_bits_limit, 
0),
  DEFINE_PROP_BOOL("fill-mtrr-mask", X86CPU, fill_mtrr_mask, true),





Re: [PATCH 12/26] KVM: track whether guest state is encrypted

2024-03-26 Thread Xiaoyao Li

On 3/23/2024 2:11 AM, Paolo Bonzini wrote:

So far, KVM has allowed KVM_GET/SET_* ioctls to execute even if the
guest state is encrypted, in which case they do nothing.  For the new
API using VM types, instead, the ioctls will fail which is a safer and
more robust approach.

The new API will be the only one available for SEV-SNP and TDX, but it
is also usable for SEV and SEV-ES.  In preparation for that, require
architecture-specific KVM code to communicate the point at which guest
state is protected (which must be after kvm_cpu_synchronize_post_init(),
though that might change in the future in order to suppor migration).
 From that point, skip reading registers so that cpu->vcpu_dirty is
never true: if it ever becomes true, kvm_arch_put_registers() will
fail miserably.

Signed-off-by: Paolo Bonzini 
---
  include/sysemu/kvm.h |  2 ++
  include/sysemu/kvm_int.h |  1 +
  accel/kvm/kvm-all.c  | 14 --
  target/i386/sev.c|  1 +
  4 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index fad9a7e8ff3..302e8f6f1e5 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -539,6 +539,8 @@ bool kvm_dirty_ring_enabled(void);
  
  uint32_t kvm_dirty_ring_size(void);
  
+void kvm_mark_guest_state_protected(void);

+
  /**
   * kvm_hwpoisoned_mem - indicate if there is any hwpoisoned page
   * reported for the VM.
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 882e37e12c5..3496be7997a 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -87,6 +87,7 @@ struct KVMState
  bool kernel_irqchip_required;
  OnOffAuto kernel_irqchip_split;
  bool sync_mmu;
+bool guest_state_protected;
  uint64_t manual_dirty_log_protect;
  /* The man page (and posix) say ioctl numbers are signed int, but
   * they're not.  Linux, glibc and *BSD all treat ioctl numbers as
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a8cecd040eb..05fa3533c66 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2698,7 +2698,7 @@ bool kvm_cpu_check_are_resettable(void)
  
  static void do_kvm_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)

  {
-if (!cpu->vcpu_dirty) {
+if (!cpu->vcpu_dirty && !kvm_state->guest_state_protected) {
  int ret = kvm_arch_get_registers(cpu);
  if (ret) {
  error_report("Failed to get registers: %s", strerror(-ret));
@@ -2712,7 +2712,7 @@ static void do_kvm_cpu_synchronize_state(CPUState *cpu, 
run_on_cpu_data arg)
  
  void kvm_cpu_synchronize_state(CPUState *cpu)

  {
-if (!cpu->vcpu_dirty) {
+if (!cpu->vcpu_dirty && !kvm_state->guest_state_protected) {
  run_on_cpu(cpu, do_kvm_cpu_synchronize_state, RUN_ON_CPU_NULL);
  }
  }
@@ -2747,6 +2747,11 @@ static void do_kvm_cpu_synchronize_post_init(CPUState 
*cpu, run_on_cpu_data arg)
  
  void kvm_cpu_synchronize_post_init(CPUState *cpu)

  {
+/*
+ * This runs before the machine_init_done notifiers, and is the last
+ * opportunity to synchronize the state of confidential guests.
+ */ > +assert(!kvm_state->guest_state_protected);


So, this requires confidential guests to call 
kvm_mark_guest_state_protected() in its machine_init_done notifier callback?


But for TDX, the guest_state is protected at the beginning, not some 
time later when machine_init_done.



  run_on_cpu(cpu, do_kvm_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
  }
  
@@ -4094,3 +4099,8 @@ void query_stats_schemas_cb(StatsSchemaList **result, Error **errp)

  query_stats_schema_vcpu(first_cpu, _args);
  }
  }
+
+void kvm_mark_guest_state_protected(void)
+{
+kvm_state->guest_state_protected = true;
+}
diff --git a/target/i386/sev.c b/target/i386/sev.c
index b8f79d34d19..c49a8fd55eb 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -755,6 +755,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused)
  if (ret) {
  exit(1);
  }
+kvm_mark_guest_state_protected();
  }
  
  /* query the measurement blob length */





Re: [PATCH 21/26] kvm/memory: Make memory type private by default if it has guest memfd backend

2024-03-26 Thread Xiaoyao Li

On 3/23/2024 2:11 AM, Paolo Bonzini wrote:

From: Xiaoyao Li 

KVM side leaves the memory to shared by default, while may incur the


/s/while/which/

fix typo from myself.


overhead of paging conversion on the first visit of each page. Because
the expectation is that page is likely to private for the VMs that
require private memory (has guest memfd).

Explicitly set the memory to private when memory region has valid
guest memfd backend.

Signed-off-by: Xiaoyao Li 
Signed-off-by: Michael Roth 
Message-ID: <20240320083945.991426-16-michael.r...@amd.com>
Signed-off-by: Paolo Bonzini 
---
  accel/kvm/kvm-all.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 7fbaf31cbaf..56b17cbd8aa 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1430,6 +1430,16 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
  strerror(-err));
  abort();
  }
+
+if (memory_region_has_guest_memfd(mr)) {
+err = kvm_set_memory_attributes_private(start_addr, slot_size);
+if (err) {
+error_report("%s: failed to set memory attribute private: 
%s\n",
+ __func__, strerror(-err));
+exit(1);
+}
+}
+
  start_addr += slot_size;
  ram_start_offset += slot_size;
  ram += slot_size;





Re: [PATCH 25/26] kvm: handle KVM_EXIT_MEMORY_FAULT

2024-03-26 Thread Xiaoyao Li

On 3/23/2024 2:11 AM, Paolo Bonzini wrote:

From: Chao Peng 

When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to


typo: /s/geeting/getting


do the memory conversion on the RAMBlock to turn the memory into desired
attribute, i.e., private/shared.

Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
KVM_EXIT_MEMORY_FAULT happens.

Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has
guest_memfd memory backend.

Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is
added.

When page is converted from shared to private, the original shared
memory can be discarded via ram_block_discard_range(). Note, shared
memory can be discarded only when it's not back'ed by hugetlb because
hugetlb is supposed to be pre-allocated and no need for discarding.

Signed-off-by: Chao Peng 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

Message-ID: <20240320083945.991426-13-michael.r...@amd.com>
Signed-off-by: Paolo Bonzini 
---
  include/sysemu/kvm.h   |  2 +
  accel/kvm/kvm-all.c| 99 +-
  accel/kvm/trace-events |  2 +
  3 files changed, 93 insertions(+), 10 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 2cb31925091..698f1640fe2 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -541,4 +541,6 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, 
Error **errp);
  
  int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);

  int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
+
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private);
  #endif
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 56b17cbd8aa..afd7f992e39 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2893,6 +2893,70 @@ static void kvm_eat_signals(CPUState *cpu)
  } while (sigismember(, SIG_IPI));
  }
  
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)

+{
+MemoryRegionSection section;
+ram_addr_t offset;
+MemoryRegion *mr;
+RAMBlock *rb;
+void *addr;
+int ret = -1;
+
+trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : 
"private_to_shared");
+
+if (!QEMU_PTR_IS_ALIGNED(start, qemu_real_host_page_size()) ||
+!QEMU_PTR_IS_ALIGNED(size, qemu_real_host_page_size())) {
+return -1;
+}
+
+if (!size) {
+return -1;
+}
+
+section = memory_region_find(get_system_memory(), start, size);
+mr = section.mr;
+if (!mr) {
+return -1;
+}
+
+if (!memory_region_has_guest_memfd(mr)) {
+error_report("Converting non guest_memfd backed memory region "
+ "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
+ start, size, to_private ? "private" : "shared");
+ret = -1;


No need for it. ret is initialized as -1 at the function start.


+goto out_unref;
+}
+
+if (to_private) {
+ret = kvm_set_memory_attributes_private(start, size);
+} else {
+ret = kvm_set_memory_attributes_shared(start, size);
+}
+if (ret) {
+goto out_unref;
+}
+
+addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
+rb = qemu_ram_block_from_host(addr, false, );
+
+if (to_private) {
+if (rb->page_size == qemu_real_host_page_size()) {
+/*
+* shared memory is back'ed by  hugetlb, which is supposed to be


Please fix the bad comment indentation for me, as well as the extra 
space before 'hugetlb'



+* pre-allocated and doesn't need to be discarded
+*/
+goto out_unref;
+}
+ret = ram_block_discard_range(rb, offset, size);
+} else {
+ret = ram_block_discard_guest_memfd_range(rb, offset, size);
+}
+
+out_unref:
+memory_region_unref(section.mr);
+return ret;
+}
+
  int kvm_cpu_exec(CPUState *cpu)
  {
  struct kvm_run *run = cpu->kvm_run;
@@ -2960,18 +3024,20 @@ int kvm_cpu_exec(CPUState *cpu)
  ret = EXCP_INTERRUPT;
  break;
  }
-fprintf(stderr, "error: kvm run failed %s\n",
-strerror(-run_ret));
+if (!(run_ret == -EFAULT && run->exit_reason == 
KVM_EXIT_MEMORY_FAULT)) {
+fprintf(stderr, "error: kvm run failed %s\n",
+strerror(-run_ret));
  #ifdef TARGET_PPC
-if (run_ret == -EBUSY) {
-fprintf(stderr,
-"This is probably because your SMT is enabled.\n"
-"VCPU can only run on primary threads with all "
-"secondary threads offline.\n");
-}
+if (run_ret == -EBUSY) {
+fprintf(stderr,
+

Re: [PATCH 3/7] KVM: track whether guest state is encrypted

2024-03-22 Thread Xiaoyao Li

On 3/19/2024 9:59 PM, Paolo Bonzini wrote:

So far, KVM has allowed KVM_GET/SET_* ioctls to execute even if the
guest state is encrypted, in which case they do nothing.  For the new
API using VM types, instead, the ioctls will fail which is a safer and
more robust approach.

The new API will be the only one available for SEV-SNP and TDX, but it
is also usable for SEV and SEV-ES.  In preparation for that, require
architecture-specific KVM code to communicate the point at which guest
state is protected (which must be after kvm_cpu_synchronize_post_init(),
though that might change in the future in order to suppor migration).
 From that point, skip reading registers so that cpu->vcpu_dirty is
never true: if it ever becomes true, kvm_arch_put_registers() will
fail miserably.

Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 




Re: [PATCH 4/7] KVM: remove kvm_arch_cpu_check_are_resettable

2024-03-22 Thread Xiaoyao Li

On 3/19/2024 9:59 PM, Paolo Bonzini wrote:

Board reset requires writing a fresh CPU state.  As far as KVM is
concerned, the only thing that blocks reset is that CPU state is
encrypted; therefore, kvm_cpus_are_resettable() can simply check
if that is the case.

Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 





Re: [PATCH 5/7] target/i386: introduce x86-confidential-guest

2024-03-22 Thread Xiaoyao Li

On 3/19/2024 9:59 PM, Paolo Bonzini wrote:

Introduce a common superclass for x86 confidential guest implementations.
It will extend ConfidentialGuestSupportClass with a method that provides
the VM type to be passed to KVM_CREATE_VM.

Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 





Re: [PATCH 6/7] target/i386: Implement mc->kvm_type() to get VM type

2024-03-22 Thread Xiaoyao Li

On 3/19/2024 9:59 PM, Paolo Bonzini wrote:

From: Xiaoyao Li 

KVM is introducing a new API to create confidential guests, which
will be used by TDX and SEV-SNP but is also available for SEV and
SEV-ES.  The API uses the VM type argument to KVM_CREATE_VM to
identify which confidential computing technology to use.

Since there are no other expected uses of VM types, delegate
mc->kvm_type() for x86 boards to the confidential-guest-support
object pointed to by ms->cgs.

For example, if a sev-guest object is specified to confidential-guest-support,
like,

   qemu -machine ...,confidential-guest-support=sev0 \
-object sev-guest,id=sev0,...

it will check if a VM type KVM_X86_SEV_VM or KVM_X86_SEV_ES_VM
is supported, and if so use them together with the KVM_SEV_INIT2
function of the KVM_MEMORY_ENCRYPT_OP ioctl. If not, it will fall back to
KVM_SEV_INIT and KVM_SEV_ES_INIT.

This is a preparatory work towards TDX and SEV-SNP support, but it
will also enable support for VMSA features such as DebugSwap, which
are only available via KVM_SEV_INIT2.

Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 
Signed-off-by: Paolo Bonzini 


Reviewed-by: Xiaoyao Li 

some nits below.


---
  target/i386/confidential-guest.h | 19 ++
  target/i386/kvm/kvm_i386.h   |  2 ++
  hw/i386/x86.c|  6 +
  target/i386/kvm/kvm.c| 44 
  4 files changed, 71 insertions(+)

diff --git a/target/i386/confidential-guest.h b/target/i386/confidential-guest.h
index ca12d5a8fba..532e172a60b 100644
--- a/target/i386/confidential-guest.h
+++ b/target/i386/confidential-guest.h
@@ -36,5 +36,24 @@ struct X86ConfidentialGuest {
  struct X86ConfidentialGuestClass {
  /*  */
  ConfidentialGuestSupportClass parent;
+
+/*  */
+int (*kvm_type)(X86ConfidentialGuest *cg);
  };
+
+/**
+ * x86_confidential_guest_kvm_type:
+ *
+ * Calls #X86ConfidentialGuestClass.unplug callback of @plug_handler.


ah, forgot to change the callback name after copy+paste.


+ */
+static inline int x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg)
+{
+X86ConfidentialGuestClass *klass = X86_CONFIDENTIAL_GUEST_GET_CLASS(cg);
+
+if (klass->kvm_type) {
+return klass->kvm_type(cg);
+} else {
+return 0;
+}
+}
  #endif
diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
index 30fedcffea3..02168122787 100644
--- a/target/i386/kvm/kvm_i386.h
+++ b/target/i386/kvm/kvm_i386.h
@@ -37,6 +37,7 @@ bool kvm_hv_vpindex_settable(void);
  bool kvm_enable_sgx_provisioning(KVMState *s);
  bool kvm_hyperv_expand_features(X86CPU *cpu, Error **errp);
  
+int kvm_get_vm_type(MachineState *ms, const char *vm_type);

  void kvm_arch_reset_vcpu(X86CPU *cs);
  void kvm_arch_after_reset_vcpu(X86CPU *cpu);
  void kvm_arch_do_init_vcpu(X86CPU *cs);
@@ -49,6 +50,7 @@ void kvm_request_xsave_components(X86CPU *cpu, uint64_t mask);
  
  #ifdef CONFIG_KVM
  
+bool kvm_is_vm_type_supported(int type);

  bool kvm_has_adjust_clock_stable(void);
  bool kvm_has_exception_payload(void);
  void kvm_synchronize_all_tsc(void);
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index ffbda48917f..2d4b148cd25 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1389,6 +1389,11 @@ static void machine_set_sgx_epc(Object *obj, Visitor *v, 
const char *name,
  qapi_free_SgxEPCList(list);
  }
  
+static int x86_kvm_type(MachineState *ms, const char *vm_type)

+{
+return kvm_enabled() ? kvm_get_vm_type(ms, vm_type) : 0;
+}
+
  static void x86_machine_initfn(Object *obj)
  {
  X86MachineState *x86ms = X86_MACHINE(obj);
@@ -1413,6 +1418,7 @@ static void x86_machine_class_init(ObjectClass *oc, void 
*data)
  mc->cpu_index_to_instance_props = x86_cpu_index_to_props;
  mc->get_default_cpu_node_id = x86_get_default_cpu_node_id;
  mc->possible_cpu_arch_ids = x86_possible_cpu_arch_ids;
+mc->kvm_type = x86_kvm_type;
  x86mc->save_tsc_khz = true;
  x86mc->fwcfg_dma_enabled = true;
  nc->nmi_monitor_handler = x86_nmi;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 0ec69109a2b..e109648f260 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -31,6 +31,7 @@
  #include "sysemu/kvm_int.h"
  #include "sysemu/runstate.h"
  #include "kvm_i386.h"
+#include "../confidential-guest.h"
  #include "sev.h"
  #include "xen-emu.h"
  #include "hyperv.h"
@@ -161,6 +162,49 @@ static KVMMSRHandlers 
msr_handlers[KVM_MSR_FILTER_MAX_RANGES];
  static RateLimit bus_lock_ratelimit_ctrl;
  static int kvm_get_one_msr(X86CPU *cpu, int index, uint64_t *value);
  
+static const char *vm_type_name[] = {

+[KVM_X86_DEFAULT_VM] = "default",
+};
+
+bool kvm_is_vm_type_supported(int type)
+{
+uint32_t machine_types;


The name of machine_types confuses me a lot. why not supported_vm_types?


+
+/*
+ * old KVM doesn't support

Re: [PATCH RFC v3 00/49] Add AMD Secure Nested Paging (SEV-SNP) support

2024-03-20 Thread Xiaoyao Li

On 3/21/2024 1:08 AM, Paolo Bonzini wrote:

On Wed, Mar 20, 2024 at 10:59 AM Paolo Bonzini  wrote:

I will now focus on reviewing patches 6-20.  This way we can prepare a
common tree for SEV_INIT2/SNP/TDX, for both vendors to build upon.


Ok, the attachment is the delta that I have. The only major change is
requiring discard (thus effectively blocking VFIO support for
SEV-SNP/TDX, at least for now).

I will push it shortly to the same sevinit2 branch, and will post the
patches sometime soon.

Xiaoyao, you can use that branch too (it's on
https://gitlab.com/bonzini/qemu) as the basis for your TDX work.


Sure, it's really a good news for us.

BTW, there are some minor comments on guest_memfd patches of my v5 
post[*]. Could you please resolve them it your branch?


[*] 
https://lore.kernel.org/qemu-devel/20240229063726.610065-1-xiaoyao...@intel.com/



Paolo





Re: [PATCH v5 08/65] kvm: handle KVM_EXIT_MEMORY_FAULT

2024-03-20 Thread Xiaoyao Li

On 3/19/2024 10:14 AM, Wang, Lei wrote:

On 2/29/2024 14:36, Xiaoyao Li wrote:

From: Chao Peng 

When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to
do the memory conversion on the RAMBlock to turn the memory into desired
attribute, i.e., private/shared.

Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
KVM_EXIT_MEMORY_FAULT happens.

Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has
guest_memfd memory backend.

Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is
added.

When page is converted from shared to private, the original shared
memory can be discarded via ram_block_discard_range(). Note, shared
memory can be discarded only when it's not back'ed by hugetlb because
hugetlb is supposed to be pre-allocated and no need for discarding.

Signed-off-by: Chao Peng 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

---
Changes in v4:
- open-coded ram_block_discard logic;
- change warn_report() to error_report(); (Daniel)
---
  accel/kvm/kvm-all.c | 94 -
  1 file changed, 84 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 70d482a2c936..87e4275932a7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2903,6 +2903,68 @@ static void kvm_eat_signals(CPUState *cpu)
  } while (sigismember(, SIG_IPI));
  }
  
+static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)

+{
+MemoryRegionSection section;
+ram_addr_t offset;
+MemoryRegion *mr;
+RAMBlock *rb;
+void *addr;
+int ret = -1;
+
+if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||
+!QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) {
+return -1;
+}
+
+if (!size) {
+return -1;
+}
+
+section = memory_region_find(get_system_memory(), start, size);
+mr = section.mr;
+if (!mr) {
+return -1;
+}
+
+if (memory_region_has_guest_memfd(mr)) {
+if (to_private) {
+ret = kvm_set_memory_attributes_private(start, size);
+} else {
+ret = kvm_set_memory_attributes_shared(start, size);
+}
+
+if (ret) {
+memory_region_unref(section.mr);
+return ret;
+}
+
+addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
+rb = qemu_ram_block_from_host(addr, false, );
+
+if (to_private) {
+if (rb->page_size != qemu_host_page_size) {
+/*
+* shared memory is back'ed by  hugetlb, which is supposed to be
+* pre-allocated and doesn't need to be discarded
+*/


Nit: comment indentation is broken here.


+return 0;
+} else {
+ret = ram_block_discard_range(rb, offset, size);
+}
+} else {
+ret = ram_block_discard_guest_memfd_range(rb, offset, size);
+}
+} else {
+error_report("Convert non guest_memfd backed memory region "
+"(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",


Same as above.



Fixed.

thanks!




Re: [PATCH v3 13/49] [FIXUP] "kvm: handle KVM_EXIT_MEMORY_FAULT": drop qemu_host_page_size

2024-03-20 Thread Xiaoyao Li

On 3/20/2024 4:39 PM, Michael Roth wrote:

TODO: squash into "kvm: handle KVM_EXIT_MEMORY_FAULT"

qemu_host_page_size has been superseded by qemu_real_host_page_size()
in newer QEMU, so update the patch accordingly.


I found it today as well when rebase to qemu v9.0.0-rc0.

Fix it locally, will show up on my next post of TDX-QEMU patches. :)


Signed-off-by: Michael Roth 
---
  accel/kvm/kvm-all.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2fdc07a472..a9c19ab9a1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2912,8 +2912,8 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, 
bool to_private)
  void *addr;
  int ret = -1;
  
-if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||

-!QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) {
+if (!QEMU_PTR_IS_ALIGNED(start, qemu_real_host_page_size()) ||
+!QEMU_PTR_IS_ALIGNED(size, qemu_real_host_page_size())) {
  return -1;
  }
  
@@ -2943,7 +2943,7 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)

  rb = qemu_ram_block_from_host(addr, false, );
  
  if (to_private) {

-if (rb->page_size != qemu_host_page_size) {
+if (rb->page_size != qemu_real_host_page_size()) {
  /*
  * shared memory is back'ed by  hugetlb, which is supposed to 
be
  * pre-allocated and doesn't need to be discarded





Re: [PATCH v5 06/65] kvm: Introduce support for memory_attributes

2024-03-20 Thread Xiaoyao Li

On 3/19/2024 10:03 AM, Wang, Lei wrote:

On 2/29/2024 14:36, Xiaoyao Li wrote:> Introduce the helper functions to set 
the attributes of a range of

memory to private or shared.

This is necessary to notify KVM the private/shared attribute of each gpa
range. KVM needs the information to decide the GPA needs to be mapped at
hva-based shared memory or guest_memfd based private memory.

Signed-off-by: Xiaoyao Li 
---
Changes in v4:
- move the check of kvm_supported_memory_attributes to the common
   kvm_set_memory_attributes(); (Wang Wei)
- change warn_report() to error_report() in kvm_set_memory_attributes()
   and drop the __func__; (Daniel)
---
  accel/kvm/kvm-all.c  | 44 
  include/sysemu/kvm.h |  3 +++
  2 files changed, 47 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index cd0aa7545a1f..70d482a2c936 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -92,6 +92,7 @@ static bool kvm_has_guest_debug;
  static int kvm_sstep_flags;
  static bool kvm_immediate_exit;
  static bool kvm_guest_memfd_supported;
+static uint64_t kvm_supported_memory_attributes;
  static hwaddr kvm_max_slot_size = ~0;
  
  static const KVMCapabilityInfo kvm_required_capabilites[] = {

@@ -1304,6 +1305,46 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
  kvm_max_slot_size = max_slot_size;
  }
  
+static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)

+{
+struct kvm_memory_attributes attrs;
+int r;
+
+if (kvm_supported_memory_attributes == 0) {
+error_report("No memory attribute supported by KVM\n");
+return -EINVAL;
+}
+
+if ((attr & kvm_supported_memory_attributes) != attr) {
+error_report("memory attribute 0x%lx not supported by KVM,"
+ " supported bits are 0x%lx\n",
+ attr, kvm_supported_memory_attributes);
+return -EINVAL;
+}
+
+attrs.attributes = attr;
+attrs.address = start;
+attrs.size = size;
+attrs.flags = 0;
+
+r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, );
+if (r) {
+error_report("failed to set memory (0x%lx+%#zx) with attr 0x%lx error 
'%s'",
+ start, size, attr, strerror(errno));
+}
+return r;
+}
+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
+{
+return kvm_set_memory_attributes(start, size, 
KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
+{
+return kvm_set_memory_attributes(start, size, 0);
+}
+
  /* Called with KVMMemoryListener.slots_lock held */
  static void kvm_set_phys_mem(KVMMemoryListener *kml,
   MemoryRegionSection *section, bool add)
@@ -2439,6 +2480,9 @@ static int kvm_init(MachineState *ms)
  
  kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
  
+ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);

+kvm_supported_memory_attributes = ret > 0 ? ret : 0;


kvm_check_extension() only returns non-negative value, so we can just

kvm_supported_memory_attributes = kvm_check_extension(s, 
KVM_CAP_MEMORY_ATTRIBUTES);


Good catch, I will update it in next version.


+
  if (object_property_find(OBJECT(current_machine), "kvm-type")) {
  g_autofree char *kvm_type = 
object_property_get_str(OBJECT(current_machine),
  "kvm-type",
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 6cdf82de8372..8e83adfbbd19 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -546,4 +546,7 @@ uint32_t kvm_dirty_ring_size(void);
  bool kvm_hwpoisoned_mem(void);
  
  int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);

+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
  #endif





Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()

2024-03-20 Thread Xiaoyao Li

On 3/20/2024 5:37 PM, David Hildenbrand wrote:

On 20.03.24 09:39, Michael Roth wrote:

From: Xiaoyao Li 

When memory page is converted from private to shared, the original
private memory is back'ed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.

Originally-from: Isaku Yamahata 
Codeveloped-by: Xiaoyao Li 


"Co-developed-by"


Michael is using the patch from my TDX-QEMU v5 series[1]. I need to fix it.


Signed-off-by: Xiaoyao Li 
Reviewed-by: David Hildenbrand 


Your SOB should go here.


---
Changes in v5:
- Collect Reviewed-by from David;

Changes in in v4:
- Drop ram_block_convert_range() and open code its implementation in the
   next Patch.

Signed-off-by: Michael Roth 


I only received 3 patches from this series, and now I am confused: 
changelog talks about v5 and this is "PATCH v3"


As above, because the guest_memfd patches in my TDX-QEMU v5[1] were 
directly picked for this series, so the change history says v5. They are 
needed by SEV-SNP as well.


I want to raise the question, how do we want to proceed with the guest 
memfd patches (patch 2 to 10 in [1])? Can they be merged separately 
before TDX/SNP patches?


Please make sure to send at least the cover letter along (I might not 
need the other 46 patches :D ).





[1] 
https://lore.kernel.org/qemu-devel/20240229063726.610065-1-xiaoyao...@intel.com/




Re: [PATCH v4 1/2] kvm: add support for guest physical bits

2024-03-19 Thread Xiaoyao Li

On 3/18/2024 11:53 PM, Gerd Hoffmann wrote:

Query kvm for supported guest physical address bits, in cpuid
function 8008, eax[23:16].  Usually this is identical to host
physical address bits.  With NPT or EPT being used this might be
restricted to 48 (max 4-level paging address space size) even if
the host cpu supports more physical address bits.

When set pass this to the guest, using cpuid too.  Guest firmware
can use this to figure how big the usable guest physical address
space is, so PCI bar mapping are actually reachable.

Signed-off-by: Gerd Hoffmann 
---
  target/i386/cpu.h |  1 +
  target/i386/cpu.c |  1 +
  target/i386/kvm/kvm-cpu.c | 31 ++-
  3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 9a210d8d9290..c88c895a5b3e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);
  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index 9c791b7b0520..5132bb96abd5 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -18,10 +18,33 @@
  #include "kvm_i386.h"
  #include "hw/core/accel-cpu.h"
  
+static void kvm_set_guest_phys_bits(CPUState *cs)

+{
+X86CPU *cpu = X86_CPU(cs);
+uint32_t eax, guest_phys_bits;
+
+eax = kvm_arch_get_supported_cpuid(cs->kvm_state, 0x8008, 0, R_EAX);
+guest_phys_bits = (eax >> 16) & 0xff;
+if (!guest_phys_bits) {
+return;
+}
+
+if (cpu->guest_phys_bits == 0 ||
+cpu->guest_phys_bits > guest_phys_bits) {
+cpu->guest_phys_bits = guest_phys_bits;
+}
+
+if (cpu->host_phys_bits_limit &&
+cpu->guest_phys_bits > cpu->host_phys_bits_limit) {
+cpu->guest_phys_bits = cpu->host_phys_bits_limit;


host_phys_bits_limit takes effect only when cpu->host_phys_bits is set.

If users pass configuration like "-cpu 
qemu64,phys-bits=52,host-phys-bits-limit=45", the cpu->guest_phys_bits 
will be set to 45. I think this is not what we want, though the usage 
seems insane.


We can guard it as

 if (cpu->host_phys_bits && cpu->host_phys_bits_limit &&
 cpu->guest_phys_bits > cpu->host_phys_bits_limt)
{
}

Simpler, we can guard with cpu->phys_bits like below, because 
cpu->host_phys_bits_limit is used to guard cpu->phys_bits in 
host_cpu_realizefn()


 if (cpu->guest_phys_bits > cpu->phys_bits) {
cpu->guest_phys_bits = cpu->phys_bits;
}


with this resolved,

Reviewed-by: Xiaoyao Li 


+}
+}
+
  static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
  {
  X86CPU *cpu = X86_CPU(cs);
  CPUX86State *env = >env;
+bool ret;
  
  /*

   * The realize order is important, since x86_cpu_realize() checks if
@@ -50,7 +73,13 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
 MSR_IA32_UCODE_REV);
  }
  }
-return host_cpu_realizefn(cs, errp);
+ret = host_cpu_realizefn(cs, errp);
+if (!ret) {
+return ret;
+}
+
+kvm_set_guest_phys_bits(cs);
+return true;
  }
  
  static bool lmce_supported(void)





Re: [PATCH] target/i386: Export RFDS bit to guests

2024-03-19 Thread Xiaoyao Li

On 3/19/2024 11:08 PM, Pawan Gupta wrote:

On Tue, Mar 19, 2024 at 12:22:08PM +0800, Xiaoyao Li wrote:

On 3/13/2024 10:53 PM, Pawan Gupta wrote:

Register File Data Sampling (RFDS) is a CPU side-channel vulnerability
that may expose stale register value. CPUs that set RFDS_NO bit in MSR
IA32_ARCH_CAPABILITIES indicate that they are not vulnerable to RFDS.
Similarly, RFDS_CLEAR indicates that CPU is affected by RFDS, and has
the microcode to help mitigate RFDS.

Make RFDS_CLEAR and RFDS_NO bits available to guests.


What's the status of KVM part?


KVM part is already upstreamed and backported:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.8.1=50d33b98b1e23d1cd8743b3cac7a0ae5718b8b00


I see. It was not sent to kvm maillist and not merged through KVM tree.

With KVM part in palce,

Reviewed-by: Xiaoyao Li 



Re: [PATCH] target/i386: Export RFDS bit to guests

2024-03-18 Thread Xiaoyao Li

On 3/13/2024 10:53 PM, Pawan Gupta wrote:

Register File Data Sampling (RFDS) is a CPU side-channel vulnerability
that may expose stale register value. CPUs that set RFDS_NO bit in MSR
IA32_ARCH_CAPABILITIES indicate that they are not vulnerable to RFDS.
Similarly, RFDS_CLEAR indicates that CPU is affected by RFDS, and has
the microcode to help mitigate RFDS.

Make RFDS_CLEAR and RFDS_NO bits available to guests.


What's the status of KVM part?


Signed-off-by: Pawan Gupta 
---
  target/i386/cpu.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 9a210d8d9290..693a5e0fb2ce 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1158,8 +1158,8 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
  NULL, "sbdr-ssdp-no", "fbsdp-no", "psdp-no",
  NULL, "fb-clear", NULL, NULL,
  NULL, NULL, NULL, NULL,
-"pbrsb-no", NULL, "gds-no", NULL,
-NULL, NULL, NULL, NULL,
+"pbrsb-no", NULL, "gds-no", "rfds-no",
+"rfds-clear", NULL, NULL, NULL,
  },
  .msr = {
  .index = MSR_IA32_ARCH_CAPABILITIES,

base-commit: a1932d7cd6507d4d9db2044a54731fff3e749bac





Re: [PATCH 2/4] i386/sev: Switch to use confidential_guest_kvm_init()

2024-03-18 Thread Xiaoyao Li

On 3/19/2024 5:51 AM, Paolo Bonzini wrote:

On Thu, Feb 29, 2024 at 7:01 AM Xiaoyao Li  wrote:


Use confidential_guest_kvm_init() instead of calling SEV specific
sev_kvm_init(). As a bouns, it fits to future TDX when TDX implements
its own confidential_guest_support and .kvm_init().

Move the "TypeInfo sev_guest_info" definition and related functions to
the end of the file, to avoid declaring the sev_kvm_init() ahead.

Delete the sve-stub.c since it's not needed anymore.

Signed-off-by: Xiaoyao Li 
---
Changes from rfc v1:
- check ms->cgs not NULL before calling confidential_guest_kvm_init();
- delete the sev-stub.c;


Queued, with just one small simplification that can be done on top:


thank you, Paolo!


diff --git a/target/i386/sev.c b/target/i386/sev.c
index e89d64fa52..b8f79d34d1 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -851,18 +851,13 @@ sev_vm_state_change(void *opaque, bool running,
RunState state)

  static int sev_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
  {
-SevGuestState *sev
-= (SevGuestState *)object_dynamic_cast(OBJECT(cgs), TYPE_SEV_GUEST);
+SevGuestState *sev = SEV_GUEST(cgs);
  char *devname;
  int ret, fw_error, cmd;
  uint32_t ebx;
  uint32_t host_cbitpos;
  struct sev_user_data_status status = {};

-if (!sev) {
-return 0;
-}
-
  ret = ram_block_discard_disable(true);
  if (ret) {
  error_report("%s: cannot disable RAM discard", __func__);


It looks good.


Thanks!

Paolo





Re: [PATCH v3 2/3] kvm: add support for guest physical bits

2024-03-17 Thread Xiaoyao Li

On 3/13/2024 9:27 PM, Gerd Hoffmann wrote:

Query kvm for supported guest physical address bits, in cpuid
function 8008, eax[23:16].  Usually this is identical to host
physical address bits.  With NPT or EPT being used this might be
restricted to 48 (max 4-level paging address space size) even if
the host cpu supports more physical address bits.

When set pass this to the guest, using cpuid too.  Guest firmware
can use this to figure how big the usable guest physical address
space is, so PCI bar mapping are actually reachable.

Signed-off-by: Gerd Hoffmann 
---
  target/i386/cpu.h |  1 +
  target/i386/cpu.c |  1 +
  target/i386/kvm/kvm-cpu.c | 32 +++-
  3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 9a210d8d9290..c88c895a5b3e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);
  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index 9c791b7b0520..a2b7bfaeadf8 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -18,10 +18,36 @@
  #include "kvm_i386.h"
  #include "hw/core/accel-cpu.h"
  
+static void kvm_set_guest_phys_bits(CPUState *cs)

+{
+X86CPU *cpu = X86_CPU(cs);
+uint32_t eax, guest_phys_bits;
+
+if (!cpu->host_phys_bits) {
+return;
+}


This needs explanation of why. What if users set the phys-bits to 
exactly host's value, via "-cpu xxx,phys-bits=host's value"?




+eax = kvm_arch_get_supported_cpuid(cs->kvm_state, 0x8008, 0, R_EAX);
+guest_phys_bits = (eax >> 16) & 0xff;
+if (!guest_phys_bits) {
+return;
+}
+
+if (cpu->guest_phys_bits == 0 ||
+cpu->guest_phys_bits > guest_phys_bits) {
+cpu->guest_phys_bits = guest_phys_bits;
+}
+
+if (cpu->guest_phys_bits > cpu->host_phys_bits_limit) {
+cpu->guest_phys_bits = cpu->host_phys_bits_limit;
+}
+}
+
  static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
  {
  X86CPU *cpu = X86_CPU(cs);
  CPUX86State *env = >env;
+bool ret;
  
  /*

   * The realize order is important, since x86_cpu_realize() checks if
@@ -50,7 +76,11 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
 MSR_IA32_UCODE_REV);
  }
  }
-return host_cpu_realizefn(cs, errp);
+ret = host_cpu_realizefn(cs, errp);


We need to check ret and return if !ret;


+kvm_set_guest_phys_bits(cs);
+
+return ret;
  }
  
  static bool lmce_supported(void)





Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL

2024-03-15 Thread Xiaoyao Li

On 3/13/2024 11:31 PM, Daniel P. Berrangé wrote:

On Tue, Mar 12, 2024 at 03:44:32PM +0800, Xiaoyao Li wrote:

On 3/11/2024 5:27 PM, Daniel P. Berrangé wrote:

On Thu, Feb 29, 2024 at 01:37:10AM -0500, Xiaoyao Li wrote:

From: Isaku Yamahata 

Add property "quote-generation-socket" to tdx-guest, which is a property
of type SocketAddress to specify Quote Generation Service(QGS).

On request of GetQuote, it connects to the QGS socket, read request
data from shared guest memory, send the request data to the QGS,
and store the response into shared guest memory, at last notify
TD guest by interrupt.

command line example:
qemu-system-x86_64 \
  -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", 
"cid":"1","port":"1234"}}' \


Can you illustrate this with 'unix' sockets, not 'vsock'.


Are you suggesting only updating the commit message to an example of unix
socket? Or you want the code to test with some unix socket QGS?

(It seems the QGS I got for testing, only supports vsock socket. Because at
the time when it got developed, it was supposed to communicate with drivers
inside TD guest directly not via VMM (KVM+QEMU). Anyway, I will talk to
internal folks to see if any plan to support unix socket.)


The QGS provided as part of DCAP supports running with both
UNIX sockets and VSOCK, and I would expect QEMU to be made
to work with this, since its is Intel's OSS reference impl.


After synced with internal folks, yes, the QGS I used does support unix 
socket. I tested it and it worked.


-object 
'{"qom-type":"tdx-guest","id":"tdx","quote-generation-socket":{"type":"unix", 
"path":"/var/run/tdx-qgs/qgs.socket"}}'



Exposing QGS to the guest when we only intend for it to be
used by the host QEMU is needlessly expanding the attack
surface.

With regards,
Daniel





Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL

2024-03-12 Thread Xiaoyao Li

On 3/11/2024 5:27 PM, Daniel P. Berrangé wrote:

On Thu, Feb 29, 2024 at 01:37:10AM -0500, Xiaoyao Li wrote:

From: Isaku Yamahata 

Add property "quote-generation-socket" to tdx-guest, which is a property
of type SocketAddress to specify Quote Generation Service(QGS).

On request of GetQuote, it connects to the QGS socket, read request
data from shared guest memory, send the request data to the QGS,
and store the response into shared guest memory, at last notify
TD guest by interrupt.

command line example:
   qemu-system-x86_64 \
 -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", 
"cid":"1","port":"1234"}}' \


Can you illustrate this with 'unix' sockets, not 'vsock'.


Are you suggesting only updating the commit message to an example of 
unix socket? Or you want the code to test with some unix socket QGS?


(It seems the QGS I got for testing, only supports vsock socket. Because 
at the time when it got developed, it was supposed to communicate with 
drivers inside TD guest directly not via VMM (KVM+QEMU). Anyway, I will 
talk to internal folks to see if any plan to support unix socket.)



It makes no conceptual sense to be using vsock for two
processes on the host to be using vsock to talk to
each other. vsock is only needed for the guest to talk
to the host.


 -machine confidential-guest-support=tdx0

Note, above example uses vsock type socket because the QGS we used
implements the vsock socket. It can be other types, like UNIX socket,
which depends on the implementation of QGS.

To avoid no response from QGS server, setup a timer for the transaction.
If timeout, make it an error and interrupt guest. Define the threshold of
time to 30s at present, maybe change to other value if not appropriate.

Signed-off-by: Isaku Yamahata 
Codeveloped-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Codeveloped-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- add more decription of quote-generation-socket property;

Changes in v4:
- merge next patch "i386/tdx: setup a timer for the qio channel";

Changes in v3:
- rename property "quote-generation-service" to "quote-generation-socket";
- change the type of "quote-generation-socket" from str to
   SocketAddress;
- squash next patch into this one;
---
  qapi/qom.json |   8 +-
  target/i386/kvm/meson.build   |   2 +-
  target/i386/kvm/tdx-quote-generator.c | 170 
  target/i386/kvm/tdx-quote-generator.h |  95 +++
  target/i386/kvm/tdx.c | 216 ++
  target/i386/kvm/tdx.h |   6 +
  6 files changed, 495 insertions(+), 2 deletions(-)
  create mode 100644 target/i386/kvm/tdx-quote-generator.c
  create mode 100644 target/i386/kvm/tdx-quote-generator.h



With regards,
Daniel





Re: [PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility

2024-03-12 Thread Xiaoyao Li

On 3/11/2024 3:29 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 3/7/2024 9:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- mention additional error information in gpa when it presents;
- refine the documentation; (Markus)

Changes in v4:
- refine the documentation; (Markus)

Changes in v3:
- Add docmentation of new type and struct; (Daniel)
- refine the error message handling; (Daniel)
---
qapi/run-state.json   | 31 +--
system/runstate.c | 58 +++
target/i386/kvm/tdx.c | 24 +-
3 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index dd0770b379e5..b71dd1884eb6 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json


[...]


@@ -564,6 +567,30 @@
  'psw-addr': 'uint64',
  'reason': 'S390CrashReason'}}
+##
+# @GuestPanicInformationTdx:
+#
+# TDX Guest panic information specific to TDX, as specified in the
+# "Guest-Hypervisor Communication Interface (GHCI) Specification",
+# section TDG.VP.VMCALL.
+#
+# @error-code: TD-specific error code
+#
+# @message: Human-readable error message provided by the guest. Not
+# to be trusted.
+#
+# @gpa: guest-physical address of a page that contains more verbose
+# error information, as zero-terminated string.  Present when the
+# "GPA valid" bit (bit 63) is set in @error-code.


Uh, peeking at GHCI Spec section 3.4 TDG.VP.VMCALL, I
see operand R12 consists of

   bitsnamedescription
   31:0TD-specific error code  TD-specific error code
   Panic – 0x0.
   Values – 0x1 to 0x
   reserved.
   62:32   TD-specific extendedTD-specific extended error code.
   error code  TD software defined.
   63  GPA Valid   Set if the TD specified additional
   information in the GPA parameter
   (R13).
Is @error-code all of R12, or just bits 31:0?
If it's all of R12, description of @error-code as "TD-specific error
code" is misleading.


We pass all of R12 to @error_code.

Here it wants to use "error_code" as generic as the whole R12. Do you have any 
better description of it ?


Sadly, the spec is of no help: it doesn't name the entire thing, only
the three sub-fields TD-specific error code, TD-specific extended error
code, GPA valid.

We could take the hint, and provide the sub-fields instead:

* @error-code contains the TD-specific error code (bits 31:0)

* @extended-error-code contains the TD-specific extended error code
(bits 62:32)

* we don't need @gpa-valid, because it's the same as "@gpa is present"

If we decide to keep the single member, we do need another name for it.
@error-codes (plural) doesn't exactly feel wonderful, but it gives at
least a subtle hint that it's not just *the* error code.


The reason we only defined one single member, is that the
extended-error-code is not used now, and I believe it won't be used in
the near future.


Aha!  Then I recommend

* @error-code contains the TD-specific error code (bits 31:0)

* Omit bits 62:32 from the reply; if we later find an actual use for
   them, we can add a suitable member

* Omit bit 63, because it's the same as "@gpa is present"


If no objection from others, I will use @error-codes (plural) in the
next version.


I recommend to keep the @error-code name, but narrow its value to the
actual error code, i.e. bits 31:0.


It works for me. I will got this direction in the next version.


If it's just bits 31:0, then 'Present when the "GPA valid" bit (bit 63)
is set in @error-code' is wrong.  Could go with 'Only present when the
guest provides this information'.


[...]






Re: [PATCH v9 06/21] i386/cpu: Use APIC ID info to encode cache topo in CPUID[4]

2024-03-11 Thread Xiaoyao Li

On 3/10/2024 9:38 PM, Zhao Liu wrote:

Hi Xiaoyao,


   case 3: /* L3 cache info */
-die_offset = apicid_die_offset(_info);
   if (cpu->enable_l3_cache) {
+addressable_threads_width = apicid_die_offset(_info);


Please get rid of the local variable @addressable_threads_width.

It is truly confusing.


There're several reasons for this:

1. This commit is trying to use APIC ID topology layout to decode 2
cache topology fields in CPUID[4], CPUID.04H:EAX[bits 25:14] and
CPUID.04H:EAX[bits 31:26]. When there's a addressable_cores_width to map
to CPUID.04H:EAX[bits 31:26], it's more clear to also map
CPUID.04H:EAX[bits 25:14] to another variable.


I don't dislike using a variable. I dislike the name of that variable 
since it's misleading



2. All these 2 variables are temporary in this commit, and they will be
replaed by 2 helpers in follow-up cleanup of this series.


you mean patch 20?

I don't see how removing the local variable @addressable_threads_width 
conflicts with patch 20. As a con, it introduces code churn.



3. Similarly, to make it easier to clean up later with the helper and
for more people to review, it's neater to explicitly indicate the
CPUID.04H:EAX[bits 25:14] with a variable here.


If you do want keeping the variable. Please add a comment above it to 
explain the meaning.



4. I call this field as addressable_threads_width since it's "Maximum
number of addressable IDs for logical processors sharing this cache".

Its own name is long, but given the length, only individual words could
be selected as names.

TBH, "addressable" deserves more emphasis than "sharing". The former
emphasizes the fact that the number of threads chosen here is based on
the APIC ID layout and does not necessarily correspond to actual threads.

Therefore, this variable is needed here.

Thanks,
Zhao






Re: [PATCH v9 11/21] i386/cpu: Decouple CPUID[0x1F] subleaf with specific topology level

2024-03-11 Thread Xiaoyao Li

On 2/27/2024 6:32 PM, Zhao Liu wrote:

From: Zhao Liu 

At present, the subleaf 0x02 of CPUID[0x1F] is bound to the "die" level.

In fact, the specific topology level exposed in 0x1F depends on the
platform's support for extension levels (module, tile and die).

To help expose "module" level in 0x1F, decouple CPUID[0x1F] subleaf
with specific topology level.

Tested-by: Yongwei Ma 
Signed-off-by: Zhao Liu 


Reviewed-by: Xiaoyao Li 

Besides, some nits below.


---
Changes since v7:
  * Refactored the encode_topo_cpuid1f() to use traversal to search the
encoded level and avoid using static variables. (Xiaoyao)
- Since the total number of levels in the bitmap is not too large,
  the overhead of traversing is supposed to be acceptable.
  * Renamed the variable num_cpus_next_level to num_threads_next_level.
(Xiaoyao)
  * Renamed the helper num_cpus_by_topo_level() to
num_threads_by_topo_level(). (Xiaoyao)
  * Dropped Michael/Babu's Acked/Tested tags since the code change.
  * Re-added Yongwei's Tested tag For his re-testing.

Changes since v3:
  * New patch to prepare to expose module level in 0x1F.
  * Moved the CPUTopoLevel enumeration definition from "i386: Add cache
topology info in CPUCacheInfo" to this patch. Note, to align with
topology types in SDM, revert the name of CPU_TOPO_LEVEL_UNKNOW to
CPU_TOPO_LEVEL_INVALID.
---
  target/i386/cpu.c | 138 +-
  1 file changed, 113 insertions(+), 25 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 88dffd2b52e3..b0f171c6a465 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -269,6 +269,118 @@ static void encode_cache_cpuid4(CPUCacheInfo *cache,
 (cache->complex_indexing ? CACHE_COMPLEX_IDX : 0);
  }
  
+static uint32_t num_threads_by_topo_level(X86CPUTopoInfo *topo_info,

+  enum CPUTopoLevel topo_level)
+{
+switch (topo_level) {
+case CPU_TOPO_LEVEL_SMT:
+return 1;
+case CPU_TOPO_LEVEL_CORE:
+return topo_info->threads_per_core;
+case CPU_TOPO_LEVEL_DIE:
+return topo_info->threads_per_core * topo_info->cores_per_die;
+case CPU_TOPO_LEVEL_PACKAGE:
+return topo_info->threads_per_core * topo_info->cores_per_die *
+   topo_info->dies_per_pkg;
+default:
+g_assert_not_reached();
+}
+return 0;
+}
+
+static uint32_t apicid_offset_by_topo_level(X86CPUTopoInfo *topo_info,
+enum CPUTopoLevel topo_level)
+{
+switch (topo_level) {
+case CPU_TOPO_LEVEL_SMT:
+return 0;
+case CPU_TOPO_LEVEL_CORE:
+return apicid_core_offset(topo_info);
+case CPU_TOPO_LEVEL_DIE:
+return apicid_die_offset(topo_info);
+case CPU_TOPO_LEVEL_PACKAGE:
+return apicid_pkg_offset(topo_info);
+default:
+g_assert_not_reached();
+}
+return 0;
+}
+
+static uint32_t cpuid1f_topo_type(enum CPUTopoLevel topo_level)
+{
+switch (topo_level) {
+case CPU_TOPO_LEVEL_INVALID:
+return CPUID_1F_ECX_TOPO_LEVEL_INVALID;
+case CPU_TOPO_LEVEL_SMT:
+return CPUID_1F_ECX_TOPO_LEVEL_SMT;
+case CPU_TOPO_LEVEL_CORE:
+return CPUID_1F_ECX_TOPO_LEVEL_CORE;
+case CPU_TOPO_LEVEL_DIE:
+return CPUID_1F_ECX_TOPO_LEVEL_DIE;
+default:
+/* Other types are not supported in QEMU. */
+g_assert_not_reached();
+}
+return 0;
+}
+
+static void encode_topo_cpuid1f(CPUX86State *env, uint32_t count,
+X86CPUTopoInfo *topo_info,
+uint32_t *eax, uint32_t *ebx,
+uint32_t *ecx, uint32_t *edx)
+{
+X86CPU *cpu = env_archcpu(env);
+unsigned long level;
+uint32_t num_threads_next_level, offset_next_level;
+
+assert(count + 1 < CPU_TOPO_LEVEL_MAX);
+
+/*
+ * Find the No.count topology levels in avail_cpu_topo bitmap.
+ * Start from bit 0 (CPU_TOPO_LEVEL_INVALID).


AFAICS, it starts from bit 1 (CPU_TOPO_LEVEL_SMT). Because the initial 
value of level is CPU_TOPO_LEVEL_INVALID, but the first round of the 
loop is to find the valid bit starting from (level + 1).



+ */
+level = CPU_TOPO_LEVEL_INVALID;
+for (int i = 0; i <= count; i++) {
+level = find_next_bit(env->avail_cpu_topo,
+  CPU_TOPO_LEVEL_PACKAGE,
+  level + 1);
+
+/*
+ * CPUID[0x1f] doesn't explicitly encode the package level,
+ * and it just encode the invalid level (all fields are 0)
+ * into the last subleaf of 0x1f.
+ */


QEMU will never set bit CPU_TOPO_LEVEL_PACKAGE in env->avail_cpu_topo.

So I think we should assert() it instead of fixing it silently.


+if (level == CPU_TOPO_LEVEL_PACKAGE) {
+level = CPU_TOPO_LEVEL_INVALID;
+brea

Re: [PATCH v9 09/21] i386/cpu: Introduce bitmap to cache available CPU topology levels

2024-03-11 Thread Xiaoyao Li

On 2/27/2024 6:32 PM, Zhao Liu wrote:

From: Zhao Liu 

Currently, QEMU checks the specify number of topology domains to detect
if there's extended topology levels (e.g., checking nr_dies).

With this bitmap, the extended CPU topology (the levels other than SMT,
core and package) could be easier to detect without touching the
topology details.

This is also in preparation for the follow-up to decouple CPUID[0x1F]
subleaf with specific topology level.

Tested-by: Yongwei Ma 
Signed-off-by: Zhao Liu 



Reviewed-by: Xiaoyao Li 



Re: [PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility

2024-03-10 Thread Xiaoyao Li

On 3/7/2024 9:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- mention additional error information in gpa when it presents;
- refine the documentation; (Markus)

Changes in v4:
- refine the documentation; (Markus)

Changes in v3:
- Add docmentation of new type and struct; (Daniel)
- refine the error message handling; (Daniel)
---
   qapi/run-state.json   | 31 +--
   system/runstate.c | 58 +++
   target/i386/kvm/tdx.c | 24 +-
   3 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index dd0770b379e5..b71dd1884eb6 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -483,10 +483,12 @@
  #
  # @s390: s390 guest panic information type (Since: 2.12)
  #
+# @tdx: tdx guest panic information type (Since: 9.0)
+#
  # Since: 2.9
  ##
  { 'enum': 'GuestPanicInformationType',
-  'data': [ 'hyper-v', 's390' ] }
+  'data': [ 'hyper-v', 's390', 'tdx' ] }
 ##
# @GuestPanicInformation:
@@ -501,7 +503,8 @@
'base': {'type': 'GuestPanicInformationType'},
'discriminator': 'type',
'data': {'hyper-v': 'GuestPanicInformationHyperV',
-  's390': 'GuestPanicInformationS390'}}
+  's390': 'GuestPanicInformationS390',
+  'tdx' : 'GuestPanicInformationTdx'}}
  ##
  # @GuestPanicInformationHyperV:
@@ -564,6 +567,30 @@
 'psw-addr': 'uint64',
 'reason': 'S390CrashReason'}}
+##
+# @GuestPanicInformationTdx:
+#
+# TDX Guest panic information specific to TDX, as specified in the
+# "Guest-Hypervisor Communication Interface (GHCI) Specification",
+# section TDG.VP.VMCALL.
+#
+# @error-code: TD-specific error code
+#
+# @message: Human-readable error message provided by the guest. Not
+# to be trusted.
+#
+# @gpa: guest-physical address of a page that contains more verbose
+# error information, as zero-terminated string.  Present when the
+# "GPA valid" bit (bit 63) is set in @error-code.


Uh, peeking at GHCI Spec section 3.4 TDG.VP.VMCALL, I
see operand R12 consists of

  bitsnamedescription
  31:0TD-specific error code  TD-specific error code
  Panic – 0x0.
  Values – 0x1 to 0x
  reserved.
  62:32   TD-specific extendedTD-specific extended error code.
  error code  TD software defined.
  63  GPA Valid   Set if the TD specified additional
  information in the GPA parameter
  (R13).
Is @error-code all of R12, or just bits 31:0?
If it's all of R12, description of @error-code as "TD-specific error
code" is misleading.


We pass all of R12 to @error_code.

Here it wants to use "error_code" as generic as the whole R12. Do you have any 
better description of it ?


Sadly, the spec is of no help: it doesn't name the entire thing, only
the three sub-fields TD-specific error code, TD-specific extended error
code, GPA valid.

We could take the hint, and provide the sub-fields instead:

* @error-code contains the TD-specific error code (bits 31:0)

* @extended-error-code contains the TD-specific extended error code
   (bits 62:32)

* we don't need @gpa-valid, because it's the same as "@gpa is present"

If we decide to keep the single member, we do need another name for it.
@error-codes (plural) doesn't exactly feel wonderful, but it gives at
least a subtle hint that it's not just *the* error code.


The reason we only defined one single member, is that the 
extended-error-code is not used now, and I believe it won't be used in 
the near future.


If no objection from others, I will use @error-codes (plural) in the 
next version.



If it's just bits 31:0, then 'Present when the "GPA valid" bit (bit 63)
is set in @error-code' is wrong.  Could go with 'Only present when the
guest provides this information'.


+#
+#


Drop one of these two lines, please.


+# Since: 9.0
+##
+{'struct': 'GuestPanicInformationTdx',
+ 'data': {'error-code': 'uint64',
+  'message': 'str',
+  '*gpa': 'uint64'}}
+
   ##
   # @MEMORY_FAILURE:
   #









Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-03-10 Thread Xiaoyao Li

On 3/7/2024 9:56 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 3/7/2024 4:39 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 9:25 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:37 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \
  
mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  
mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
# pages.  Some guest OS (e.g., Linux TD guest) may require this to
# be set, otherwise they refuse to boot.
#
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).


Suggest to drop the parenthesis in the last sentence.

@mrconfigid is a string, so the default value can't be 0.  Actually,
it's not just any string, but a base64 encoded SHA384 digest, which
means it must be exactly 96 hex digits.  So it can't be "0", either.  It
could be
"".


I thought value 0 of SHA384 just means it.

That's my fault and my poor english.


"Fault" is too harsh :)  It's not as precise as I want our interface
documentation to be.  We work together to get there.


More on this below.


+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
# Since: 9.0
##
{ 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
##
# @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
#include "qemu/osdep.h"
#include "qemu/error-report.h"
+#include "qemu/base64.h"
#include "qapi/error.h"
#include "qom/object_interfaces.h"
#include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
X86CPU *x86cpu = X86_CPU(cpu);
CPUX86State *env = >env;
g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
int r = 0;
object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
+#define SHA384_DIGEST_SIZE  48
+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}


When @mrconfigid is absent, the property remains null, and this
conditional is not executed.  init_vm->mrconfigid[], an array of 6
__u64, remains all zero.  How does the kernel treat that?


A all-zero SHA384 value is still a valid value, isn't it?

KVM treats it with no difference.


Can you point me to the spot in the kernel where mrconfigid is used?


https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2322

KVM just copy what QEMU provides into its own data structure @td_params. The 
format @of td_params is defined by TDX spec, and @td_params needs to be passed 
to TDX module when initialize the context of TD via SEAMCALL(TDH.MNG.INIT): 
https://github

Re: [PATCH v9 08/21] i386/cpu: Consolidate the use of topo_info in cpu_x86_cpuid()

2024-03-09 Thread Xiaoyao Li

On 2/27/2024 6:32 PM, Zhao Liu wrote:

From: Zhao Liu

In cpu_x86_cpuid(), there are many variables in representing the cpu
topology, e.g., topo_info, cs->nr_cores and cs->nr_threads.

Since the names of cs->nr_cores/cs->nr_threads does not accurately


Again as in v7, please changes to "cs->nr_cores and cs->nr_threads",

"cs->nr_cores/cs->nr_threads" looks like a single variable of division


represent its meaning, the use of cs->nr_cores/cs->nr_threads is prone
to confusion and mistakes.

And the structure X86CPUTopoInfo names its members clearly, thus the
variable "topo_info" should be preferred.

In addition, in cpu_x86_cpuid(), to uniformly use the topology variable,
replace env->dies with topo_info.dies_per_pkg as well.





Re: [PATCH v9 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]

2024-03-09 Thread Xiaoyao Li

On 2/27/2024 6:32 PM, Zhao Liu wrote:

From: Zhao Liu 

The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x801D") adds the cache topology for AMD CPU by encoding
the number of sharing threads directly.

 From AMD's APM, NumSharingCache (CPUID[0x801D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

 From the description above, the calculation of this field should be same
as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
APIC ID to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
  Information

Cc: Babu Moger 
Tested-by: Yongwei Ma 
Signed-off-by: Zhao Liu 


Reviewed-by: Xiaoyao Li 


---
Changes since v7:
  * Moved this patch after CPUID[4]'s similar change ("i386/cpu: Use APIC
ID offset to encode cache topo in CPUID[4]"). (Xiaoyao)
  * Dropped Michael/Babu's Acked/Reviewed/Tested tags since the code
change due to the rebase.
  * Re-added Yongwei's Tested tag For his re-testing (compilation on
Intel platforms).

Changes since v3:
  * Rewrote the subject. (Babu)
  * Deleted the original "comment/help" expression, as this behavior is
confirmed for AMD CPUs. (Babu)
  * Renamed "num_apic_ids" (v3) to "num_sharing_cache" to match spec
definition. (Babu)

Changes since v1:
  * Renamed "l3_threads" to "num_apic_ids" in
encode_cache_cpuid801d(). (Yanan)
  * Added the description of the original commit and add Cc.
---
  target/i386/cpu.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index c77bcbc44d59..df56c7a449c8 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -331,7 +331,7 @@ static void encode_cache_cpuid801d(CPUCacheInfo *cache,
 uint32_t *eax, uint32_t *ebx,
 uint32_t *ecx, uint32_t *edx)
  {
-uint32_t l3_threads;
+uint32_t num_sharing_cache;
  assert(cache->size == cache->line_size * cache->associativity *
cache->partitions * cache->sets);
  
@@ -340,11 +340,11 @@ static void encode_cache_cpuid801d(CPUCacheInfo *cache,
  
  /* L3 is shared among multiple cores */

  if (cache->level == 3) {
-l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
-*eax |= (l3_threads - 1) << 14;
+num_sharing_cache = 1 << apicid_die_offset(topo_info);
  } else {
-*eax |= ((topo_info->threads_per_core - 1) << 14);
+num_sharing_cache = 1 << apicid_core_offset(topo_info);
  }
+*eax |= (num_sharing_cache - 1) << 14;
  
  assert(cache->line_size > 0);

  assert(cache->partitions > 0);





Re: [PATCH v9 06/21] i386/cpu: Use APIC ID info to encode cache topo in CPUID[4]

2024-03-09 Thread Xiaoyao Li
picid_pkg_offset(_info);
+
  *eax &= ~0x3FFC000;
-*eax |= (pow2ceil(vcpus_per_socket) - 1) << 14;
+*eax |= ((1 << addressable_threads_width) - 1) << 14;
  }
  }
  } else if (cpu->vendor_cpuid_only && IS_AMD_CPU(env)) {
  *eax = *ebx = *ecx = *edx = 0;
  } else {
  *eax = 0;
+addressable_cores_width = apicid_pkg_offset(_info) -
+  apicid_core_offset(_info);
+
  switch (count) {
  case 0: /* L1 dcache info */
+addressable_threads_width = apicid_core_offset(_info);
  encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache,
-cs->nr_threads, cs->nr_cores,
+(1 << addressable_threads_width),
+(1 << addressable_cores_width),
  eax, ebx, ecx, edx);
  break;
  case 1: /* L1 icache info */
+addressable_threads_width = apicid_core_offset(_info);
  encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache,
-cs->nr_threads, cs->nr_cores,
+(1 << addressable_threads_width),
+(1 << addressable_cores_width),
  eax, ebx, ecx, edx);
  break;
  case 2: /* L2 cache info */
+addressable_threads_width = apicid_core_offset(_info);
  encode_cache_cpuid4(env->cache_info_cpuid4.l2_cache,
-cs->nr_threads, cs->nr_cores,
+(1 << addressable_threads_width),
+(1 << addressable_cores_width),
  eax, ebx, ecx, edx);
  break;
  case 3: /* L3 cache info */
-die_offset = apicid_die_offset(_info);
  if (cpu->enable_l3_cache) {
+addressable_threads_width = apicid_die_offset(_info);


Please get rid of the local variable @addressable_threads_width.

It is truly confusing. In this patch, it is assigned to
 - apicid_pkg_offset(_info);
 - apicid_core_offset(_info);
 - apicid_die_offset(_info);

in different cases.

I only suggested the name of addressable_core_width in v7, which is 
apicid_pkg_offset(_info) - apicid_core_offset(_info); 

And it is straightforward that it means the number of bits in x2APICID 
to encode different addressable cores.


But it is not similar to addressable_threads_width, the semantic changes 
per different cache level. In fact, you want something like 
bit_width_of_addressable_threads_sharing_this_level_of_cache.


So I suggest stop using the variable of "address_therads_width". Instead 
just apicid_*_offset() directly.



With above fixed, feel free to add my

Reviewed-by: Xiaoyao Li 


  encode_cache_cpuid4(env->cache_info_cpuid4.l3_cache,
-(1 << die_offset), cs->nr_cores,
+(1 << addressable_threads_width),
+(1 << addressable_cores_width),
  eax, ebx, ecx, edx);
  break;
  }
@@ -6141,6 +6159,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  }
  }
  break;
+}
  case 5:
  /* MONITOR/MWAIT Leaf */
  *eax = cpu->mwait.eax; /* Smallest monitor-line size in bytes */





Re: [PATCH v2 2/2] kvm: add support for guest physical bits

2024-03-07 Thread Xiaoyao Li

On 3/5/2024 6:52 PM, Gerd Hoffmann wrote:

Query kvm for supported guest physical address bits, in cpuid
function 8008, eax[23:16].  Usually this is identical to host
physical address bits.  With NPT or EPT being used this might be
restricted to 48 (max 4-level paging address space size) even if
the host cpu supports more physical address bits.

When set pass this to the guest, using cpuid too.  Guest firmware
can use this to figure how big the usable guest physical address
space is, so PCI bar mapping are actually reachable.

Signed-off-by: Gerd Hoffmann 
---
  target/i386/cpu.h |  1 +
  target/i386/cpu.c |  1 +
  target/i386/kvm/kvm.c | 17 +
  3 files changed, 19 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 2666ef380891..1a6cfc75951e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);
  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 7298822cb511..ce22dfcaa661 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -238,6 +238,15 @@ static int kvm_get_tsc(CPUState *cs)
  return 0;
  }
  
+/* return cpuid fn 8000_0008 eax[23:16] aka GuestPhysBits */

+static int kvm_get_guest_phys_bits(KVMState *s)
+{
+uint32_t eax;
+
+eax = kvm_arch_get_supported_cpuid(s, 0x8008, 0, R_EAX);
+return (eax >> 16) & 0xff;
+}
+
  static inline void do_kvm_synchronize_tsc(CPUState *cpu, run_on_cpu_data arg)
  {
  kvm_get_tsc(cpu);
@@ -1730,6 +1739,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
  X86CPU *cpu = X86_CPU(cs);
  CPUX86State *env = >env;
  uint32_t limit, i, j, cpuid_i;
+uint32_t guest_phys_bits;
  uint32_t unused;
  struct kvm_cpuid_entry2 *c;
  uint32_t signature[3];
@@ -1765,6 +1775,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
  
  env->apic_bus_freq = KVM_APIC_BUS_FREQUENCY;
  
+guest_phys_bits = kvm_get_guest_phys_bits(cs->kvm_state);

+if (guest_phys_bits &&
+(cpu->guest_phys_bits == 0 ||
+ cpu->guest_phys_bits > guest_phys_bits)) {
+cpu->guest_phys_bits = guest_phys_bits;
+}


suggest to move this added code block to kvm_cpu_realizefn().

Main code in kvm_arch_init_vcpu() is to consume the data in cpu or 
env->features to configure KVM specific configuration via KVM ioctls.


The preparation work of initializing the required data usually not 
happen in kvm_arch_init_vcpu();



  /*
   * kvm_hyperv_expand_features() is called here for the second time in case
   * KVM_CAP_SYS_HYPERV_CPUID is not supported. While we can't possibly 
handle





Re: [PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility

2024-03-07 Thread Xiaoyao Li

On 2/29/2024 4:51 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- mention additional error information in gpa when it presents;
- refine the documentation; (Markus)

Changes in v4:
- refine the documentation; (Markus)

Changes in v3:
- Add docmentation of new type and struct; (Daniel)
- refine the error message handling; (Daniel)
---
  qapi/run-state.json   | 31 +--
  system/runstate.c | 58 +++
  target/i386/kvm/tdx.c | 24 +-
  3 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index dd0770b379e5..b71dd1884eb6 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -483,10 +483,12 @@
  #
  # @s390: s390 guest panic information type (Since: 2.12)
  #
+# @tdx: tdx guest panic information type (Since: 9.0)
+#
  # Since: 2.9
  ##
  { 'enum': 'GuestPanicInformationType',
-  'data': [ 'hyper-v', 's390' ] }
+  'data': [ 'hyper-v', 's390', 'tdx' ] }
  
  ##

  # @GuestPanicInformation:
@@ -501,7 +503,8 @@
   'base': {'type': 'GuestPanicInformationType'},
   'discriminator': 'type',
   'data': {'hyper-v': 'GuestPanicInformationHyperV',
-  's390': 'GuestPanicInformationS390'}}
+  's390': 'GuestPanicInformationS390',
+  'tdx' : 'GuestPanicInformationTdx'}}
  
  ##

  # @GuestPanicInformationHyperV:
@@ -564,6 +567,30 @@
'psw-addr': 'uint64',
'reason': 'S390CrashReason'}}
  
+##

+# @GuestPanicInformationTdx:
+#
+# TDX Guest panic information specific to TDX, as specified in the
+# "Guest-Hypervisor Communication Interface (GHCI) Specification",
+# section TDG.VP.VMCALL.
+#
+# @error-code: TD-specific error code
+#
+# @message: Human-readable error message provided by the guest. Not
+# to be trusted.
+#
+# @gpa: guest-physical address of a page that contains more verbose
+# error information, as zero-terminated string.  Present when the
+# "GPA valid" bit (bit 63) is set in @error-code.


Uh, peeking at GHCI Spec section 3.4 TDG.VP.VMCALL, I
see operand R12 consists of

 bitsnamedescription
 31:0TD-specific error code  TD-specific error code
 Panic – 0x0.
 Values – 0x1 to 0x
 reserved.
 62:32   TD-specific extendedTD-specific extended error code.
 error code  TD software defined.
 63  GPA Valid   Set if the TD specified additional
 information in the GPA parameter
 (R13).

Is @error-code all of R12, or just bits 31:0?

If it's all of R12, description of @error-code as "TD-specific error
code" is misleading.


We pass all of R12 to @error_code.

Here it wants to use "error_code" as generic as the whole R12. Do you 
have any better description of it ?



If it's just bits 31:0, then 'Present when the "GPA valid" bit (bit 63)
is set in @error-code' is wrong.  Could go with 'Only present when the
guest provides this information'.


+#
+#


Drop one of these two lines, please.


+# Since: 9.0
+##
+{'struct': 'GuestPanicInformationTdx',
+ 'data': {'error-code': 'uint64',
+  'message': 'str',
+  '*gpa': 'uint64'}}
+
  ##
  # @MEMORY_FAILURE:
  #







Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL

2024-03-07 Thread Xiaoyao Li

On 2/29/2024 9:28 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:40 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Add property "quote-generation-socket" to tdx-guest, which is a property
of type SocketAddress to specify Quote Generation Service(QGS).

On request of GetQuote, it connects to the QGS socket, read request
data from shared guest memory, send the request data to the QGS,
and store the response into shared guest memory, at last notify
TD guest by interrupt.

command line example:
qemu-system-x86_64 \
  -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", 
"cid":"1","port":"1234"}}' \
  -machine confidential-guest-support=tdx0

Note, above example uses vsock type socket because the QGS we used
implements the vsock socket. It can be other types, like UNIX socket,
which depends on the implementation of QGS.

To avoid no response from QGS server, setup a timer for the transaction.
If timeout, make it an error and interrupt guest. Define the threshold of
time to 30s at present, maybe change to other value if not appropriate.

Signed-off-by: Isaku Yamahata 
Codeveloped-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Codeveloped-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 


[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index cac875349a3a..7b26b0a0d3aa 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -917,13 +917,19 @@
  # (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
  # used when absent).
  #
+# @quote-generation-socket: socket address for Quote Generation
+# Service (QGS).  QGS is a daemon running on the host.  User in
+# TD guest cannot get TD quoting for attestation if QGS is not
+# provided.  So admin should always provide it.


This makes me wonder why it's optional.  Can you describe a use case for
*not* specifying @quote-generation-socket?


Maybe at last when all the TDX support lands on all the components, attestation 
will become a must for a TD guest to be usable.

However, at least for today, booting and running a TD guest don't require 
attestation. So not provide it, doesn't affect anything excepting cannot get a 
Quote.


Maybe

   # @quote-generation-socket: Socket address for Quote Generation
   # Service (QGS).  QGS is a daemon running on the host.  Without
   # it, the guest will not be able to get a TD quote for
   # attestation.


Thanks! will update to it.


[...]






Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-03-07 Thread Xiaoyao Li

On 3/7/2024 4:39 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 9:25 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:37 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \
 
mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
 mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
 
mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
   # pages.  Some guest OS (e.g., Linux TD guest) may require this to
   # be set, otherwise they refuse to boot.
   #
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).


Suggest to drop the parenthesis in the last sentence.

@mrconfigid is a string, so the default value can't be 0.  Actually,
it's not just any string, but a base64 encoded SHA384 digest, which
means it must be exactly 96 hex digits.  So it can't be "0", either.  It
could be
"".


I thought value 0 of SHA384 just means it.

That's my fault and my poor english.


"Fault" is too harsh :)  It's not as precise as I want our interface
documentation to be.  We work together to get there.


More on this below.


+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
   # Since: 9.0
   ##
   { 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
   ##
   # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
   #include "qemu/osdep.h"
   #include "qemu/error-report.h"
+#include "qemu/base64.h"
   #include "qapi/error.h"
   #include "qom/object_interfaces.h"
   #include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
   X86CPU *x86cpu = X86_CPU(cpu);
   CPUX86State *env = >env;
   g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
   int r = 0;
   object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
   init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
   sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
+#define SHA384_DIGEST_SIZE  48
+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}


When @mrconfigid is absent, the property remains null, and this
conditional is not executed.  init_vm->mrconfigid[], an array of 6
__u64, remains all zero.  How does the kernel treat that?


A all-zero SHA384 value is still a valid value, isn't it?

KVM treats it with no difference.


Can you point me to the spot in the kernel where mrconfigid is used?


https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2322

KVM just copy what QEMU provides into its own data structure @td_params. The 
format @of td_params is defined by TDX spec, and @td_params needs to be passed 
to TDX module when initialize the context of TD via SEAMCALL(TDH.MNG.INIT): 
https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2450


In fact, a

Re: [PATCH v5 08/65] kvm: handle KVM_EXIT_MEMORY_FAULT

2024-03-06 Thread Xiaoyao Li

On 3/5/2024 5:10 PM, Isaku Yamahata wrote:

On Thu, Feb 29, 2024 at 01:36:29AM -0500,
Xiaoyao Li  wrote:


From: Chao Peng 

When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to
do the memory conversion on the RAMBlock to turn the memory into desired
attribute, i.e., private/shared.

Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
KVM_EXIT_MEMORY_FAULT happens.

Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has
guest_memfd memory backend.

Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is
added.

When page is converted from shared to private, the original shared
memory can be discarded via ram_block_discard_range(). Note, shared
memory can be discarded only when it's not back'ed by hugetlb because
hugetlb is supposed to be pre-allocated and no need for discarding.

Signed-off-by: Chao Peng 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

---
Changes in v4:
- open-coded ram_block_discard logic;
- change warn_report() to error_report(); (Daniel)
---
  accel/kvm/kvm-all.c | 94 -
  1 file changed, 84 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 70d482a2c936..87e4275932a7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2903,6 +2903,68 @@ static void kvm_eat_signals(CPUState *cpu)
  } while (sigismember(, SIG_IPI));
  }
  
+static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)

+{
+MemoryRegionSection section;
+ram_addr_t offset;
+MemoryRegion *mr;
+RAMBlock *rb;
+void *addr;
+int ret = -1;
+
+if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||
+!QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) {
+return -1;
+}
+
+if (!size) {
+return -1;
+}
+
+section = memory_region_find(get_system_memory(), start, size);
+mr = section.mr;
+if (!mr) {
+return -1;
+}
+
+if (memory_region_has_guest_memfd(mr)) {
+if (to_private) {
+ret = kvm_set_memory_attributes_private(start, size);
+} else {
+ret = kvm_set_memory_attributes_shared(start, size);
+}
+
+if (ret) {
+memory_region_unref(section.mr);
+return ret;
+}
+
+addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
+rb = qemu_ram_block_from_host(addr, false, );
+
+if (to_private) {
+if (rb->page_size != qemu_host_page_size) {
+/*
+* shared memory is back'ed by  hugetlb, which is supposed to be
+* pre-allocated and doesn't need to be discarded
+*/
+return 0;


The reference count leaks. Add memory_region_unref() is needed.


thanks for catching it. Will fix it in next version.


Otherwise looks good to me.
Reviewed-by: Isaku Yamahata 





Re: [PATCH 1/1] kvm: add support for guest physical bits

2024-03-04 Thread Xiaoyao Li

On 3/4/2024 10:58 PM, Gerd Hoffmann wrote:

On Mon, Mar 04, 2024 at 09:54:40AM +0800, Xiaoyao Li wrote:

On 3/1/2024 6:17 PM, Gerd Hoffmann wrote:

query kvm for supported guest physical address bits using
KVM_CAP_VM_GPA_BITS.  Expose the value to the guest via cpuid
(leaf 0x8008, eax, bits 16-23).

Signed-off-by: Gerd Hoffmann 
---
   target/i386/cpu.h | 1 +
   target/i386/cpu.c | 1 +
   target/i386/kvm/kvm.c | 8 
   3 files changed, 10 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
   /* Number of physical address bits supported */
   uint32_t phys_bits;
+uint32_t guest_phys_bits;
   /* in order to simplify APIC support, we leave this pointer to the
  user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 2666ef380891..1a6cfc75951e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
   if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
   /* 64 bit processor */
*eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);


I think you misunderstand this field.

If you expose this field to guest, it's the information for nested guest.
i.e., the guest itself runs as a hypervisor will know its nested guest can
have guest_phys_bits for physical addr.


I think those limits (l1 + l2 guest phys-bits) are identical, no?


Sorry, I didn't know this patch was based on the off-list proposal made 
by Paolo that changing the definition of CPUID.0x8008:EAX[23:16] to 
advertise the "maximum usable physical address bits".


If you call out this in the change log, it can avoid the misunderstanding.

As I replied to KVM series, I think the info is better to setup by KVM 
and reported by GET_SUPPORTED_CPUID.



The problem this tries to solve is that making the guest phys-bits
smaller than the host phys-bits is problematic (which why we have
allow_smaller_maxphyaddr), but nevertheless there are cases where
the usable guest physical address space is smaller than the host
physical address space.  One case is intel processors with phys-bits
larger than 48 and 4-level EPT.  Another case is amd processors with
phys-bits larger than 48 and the l0 hypervisor using 4-level paging.

The guest needs to know that limit, specifically the guest firmware
so it knows where it can map PCI bars.

take care,
   Gerd






Re: [PATCH 1/1] kvm: add support for guest physical bits

2024-03-03 Thread Xiaoyao Li

On 3/1/2024 6:17 PM, Gerd Hoffmann wrote:

query kvm for supported guest physical address bits using
KVM_CAP_VM_GPA_BITS.  Expose the value to the guest via cpuid
(leaf 0x8008, eax, bits 16-23).

Signed-off-by: Gerd Hoffmann 
---
  target/i386/cpu.h | 1 +
  target/i386/cpu.c | 1 +
  target/i386/kvm/kvm.c | 8 
  3 files changed, 10 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..d427218827f6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2026,6 +2026,7 @@ struct ArchCPU {
  
  /* Number of physical address bits supported */

  uint32_t phys_bits;
+uint32_t guest_phys_bits;
  
  /* in order to simplify APIC support, we leave this pointer to the

 user */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 2666ef380891..1a6cfc75951e 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
  /* 64 bit processor */
   *eax |= (cpu_x86_virtual_addr_width(env) << 8);
+ *eax |= (cpu->guest_phys_bits << 16);


I think you misunderstand this field.

If you expose this field to guest, it's the information for nested 
guest. i.e., the guest itself runs as a hypervisor will know its nested 
guest can have guest_phys_bits for physical addr.



  }
  *ebx = env->features[FEAT_8000_0008_EBX];
  if (cs->nr_cores * cs->nr_threads > 1) {
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 42970ab046fa..e06c9d66bb01 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1716,6 +1716,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
  X86CPU *cpu = X86_CPU(cs);
  CPUX86State *env = >env;
  uint32_t limit, i, j, cpuid_i;
+uint32_t guest_phys_bits;
  uint32_t unused;
  struct kvm_cpuid_entry2 *c;
  uint32_t signature[3];
@@ -1751,6 +1752,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
  
  env->apic_bus_freq = KVM_APIC_BUS_FREQUENCY;
  
+guest_phys_bits = kvm_check_extension(cs->kvm_state, KVM_CAP_VM_GPA_BITS);

+if (guest_phys_bits &&
+(cpu->guest_phys_bits == 0 ||
+ cpu->guest_phys_bits > guest_phys_bits)) {
+cpu->guest_phys_bits = guest_phys_bits;
+}
+
  /*
   * kvm_hyperv_expand_features() is called here for the second time in case
   * KVM_CAP_SYS_HYPERV_CPUID is not supported. While we can't possibly 
handle





Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-02-29 Thread Xiaoyao Li

On 2/29/2024 9:25 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


On 2/29/2024 4:37 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \

mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\

mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 


[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
  # pages.  Some guest OS (e.g., Linux TD guest) may require this to
  # be set, otherwise they refuse to boot.
  #
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).


Suggest to drop the parenthesis in the last sentence.

@mrconfigid is a string, so the default value can't be 0.  Actually,
it's not just any string, but a base64 encoded SHA384 digest, which
means it must be exactly 96 hex digits.  So it can't be "0", either.  It
could be
"".


I thought value 0 of SHA384 just means it.

That's my fault and my poor english.


"Fault" is too harsh :)  It's not as precise as I want our interface
documentation to be.  We work together to get there.


More on this below.


+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
  # Since: 9.0
  ##
  { 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
  ##
  # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
  #include "qemu/osdep.h"
  #include "qemu/error-report.h"
+#include "qemu/base64.h"
  #include "qapi/error.h"
  #include "qom/object_interfaces.h"
  #include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  X86CPU *x86cpu = X86_CPU(cpu);
  CPUX86State *env = >env;
  g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
  int r = 0;
  object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
  sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
+#define SHA384_DIGEST_SIZE  48
+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}


When @mrconfigid is absent, the property remains null, and this
conditional is not executed.  init_vm->mrconfigid[], an array of 6
__u64, remains all zero.  How does the kernel treat that?


A all-zero SHA384 value is still a valid value, isn't it?

KVM treats it with no difference.


Can you point me to the spot in the kernel where mrconfigid is used?



https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2322

KVM just copy what QEMU provides into its own data structure @td_params. 
The format @of td_params is defined by TDX spec, and @td_params needs to 
be passed to TDX module when initialize the context of TD via 
SEAMCALL(TDH.MNG.INIT): 
https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2450



In fact, all the three SHA384 fields, will be hashed together with some 
other fields (in td_params and other content of TD) to compromise the 
initial measurement of TD.


TDX module doesn't care the value of td_params->mrconfigid.




Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL

2024-02-29 Thread Xiaoyao Li

On 2/29/2024 4:40 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Add property "quote-generation-socket" to tdx-guest, which is a property
of type SocketAddress to specify Quote Generation Service(QGS).

On request of GetQuote, it connects to the QGS socket, read request
data from shared guest memory, send the request data to the QGS,
and store the response into shared guest memory, at last notify
TD guest by interrupt.

command line example:
   qemu-system-x86_64 \
 -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", 
"cid":"1","port":"1234"}}' \
 -machine confidential-guest-support=tdx0

Note, above example uses vsock type socket because the QGS we used
implements the vsock socket. It can be other types, like UNIX socket,
which depends on the implementation of QGS.

To avoid no response from QGS server, setup a timer for the transaction.
If timeout, make it an error and interrupt guest. Define the threshold of
time to 30s at present, maybe change to other value if not appropriate.

Signed-off-by: Isaku Yamahata 
Codeveloped-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Codeveloped-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 


[...]


diff --git a/qapi/qom.json b/qapi/qom.json
index cac875349a3a..7b26b0a0d3aa 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -917,13 +917,19 @@
  # (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
  # used when absent).
  #
+# @quote-generation-socket: socket address for Quote Generation
+# Service (QGS).  QGS is a daemon running on the host.  User in
+# TD guest cannot get TD quoting for attestation if QGS is not
+# provided.  So admin should always provide it.


This makes me wonder why it's optional.  Can you describe a use case for
*not* specifying @quote-generation-socket?


Maybe at last when all the TDX support lands on all the components, 
attestation will become a must for a TD guest to be usable.


However, at least for today, booting and running a TD guest don't 
require attestation. So not provide it, doesn't affect anything 
excepting cannot get a Quote.



+#
  # Since: 9.0
  ##
  { 'struct': 'TdxGuestProperties',
'data': { '*sept-ve-disable': 'bool',
  '*mrconfigid': 'str',
  '*mrowner': 'str',
-'*mrownerconfig': 'str' } }
+'*mrownerconfig': 'str',
+'*quote-generation-socket': 'SocketAddress' } }
  
  ##

  # @ThreadContextProperties:


[...]






Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-02-29 Thread Xiaoyao Li

On 2/29/2024 4:37 PM, Markus Armbruster wrote:

Xiaoyao Li  writes:


From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \
   mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
   mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
   
mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

---
Changes in v5:
  - refine the description of QAPI properties and add description of
default value when not specified;

Changes in v4:
  - describe more of there fields in qom.json
  - free the old value before set new value to avoid memory leak in
_setter(); (Daniel)

Changes in v3:
  - use base64 encoding instread of hex-string;
---
  qapi/qom.json | 17 -
  target/i386/kvm/tdx.c | 87 +++
  target/i386/kvm/tdx.h |  3 ++
  3 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
  # pages.  Some guest OS (e.g., Linux TD guest) may require this to
  # be set, otherwise they refuse to boot.
  #
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).


Suggest to drop the parenthesis in the last sentence.

@mrconfigid is a string, so the default value can't be 0.  Actually,
it's not just any string, but a base64 encoded SHA384 digest, which
means it must be exactly 96 hex digits.  So it can't be "0", either.  It
could be
"".


I thought value 0 of SHA384 just means it.

That's my fault and my poor english.


More on this below.


+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
  # Since: 9.0
  ##
  { 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
  
  ##

  # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
  
  #include "qemu/osdep.h"

  #include "qemu/error-report.h"
+#include "qemu/base64.h"
  #include "qapi/error.h"
  #include "qom/object_interfaces.h"
  #include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  X86CPU *x86cpu = X86_CPU(cpu);
  CPUX86State *env = >env;
  g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
  int r = 0;
  
  object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);

@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
  init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
  sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
  
+#define SHA384_DIGEST_SIZE  48

+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}


When @mrconfigid is absent, the property remains null, and this
conditional is not executed.  init_vm->mrconfigid[], an array of 6
__u64, remains all zero.  How does the kernel treat that?


A all-zero SHA384 value is still a valid value, isn't it?

KVM treats it with no difference.


+
+if (tdx_guest->mrowner) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrowner,
+  strlen(tdx_guest->mrowner), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_s

[PATCH v5 39/65] i386/tdx: Skip BIOS shadowing setup

2024-02-28 Thread Xiaoyao Li
TDX doesn't support map different GPAs to same private memory. Thus,
aliasing top 128KB of BIOS as isa-bios is not supported.

On the other hand, TDX guest cannot go to real mode, it can work fine
without isa-bios.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
Changes in v1:
 - update commit message and comment to clarify
---
 hw/i386/x86.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 5a0cadc88c4f..61c45dfc14dd 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1190,17 +1190,20 @@ void x86_bios_rom_init(MachineState *ms, const char 
*default_firmware,
 }
 g_free(filename);
 
-/* map the last 128KB of the BIOS in ISA space */
-isa_bios_size = MIN(bios_size, 128 * KiB);
-isa_bios = g_malloc(sizeof(*isa_bios));
-memory_region_init_alias(isa_bios, NULL, "isa-bios", bios,
- bios_size - isa_bios_size, isa_bios_size);
-memory_region_add_subregion_overlap(rom_memory,
-0x10 - isa_bios_size,
-isa_bios,
-1);
-if (!isapc_ram_fw) {
-memory_region_set_readonly(isa_bios, true);
+/* For TDX, alias different GPAs to same private memory is not supported */
+if (!is_tdx_vm()) {
+/* map the last 128KB of the BIOS in ISA space */
+isa_bios_size = MIN(bios_size, 128 * KiB);
+isa_bios = g_malloc(sizeof(*isa_bios));
+memory_region_init_alias(isa_bios, NULL, "isa-bios", bios,
+bios_size - isa_bios_size, isa_bios_size);
+memory_region_add_subregion_overlap(rom_memory,
+0x10 - isa_bios_size,
+isa_bios,
+1);
+if (!isapc_ram_fw) {
+memory_region_set_readonly(isa_bios, true);
+}
 }
 
 /* map all the bios at the top of memory */
-- 
2.34.1




[PATCH v5 40/65] i386/tdx: Don't initialize pc.rom for TDX VMs

2024-02-28 Thread Xiaoyao Li
For TDX, the address below 1MB are entirely general RAM. No need to
initialize pc.rom memory region for TDs.

Signed-off-by: Xiaoyao Li 
---
This is more as a workaround of the issue that for q35 machine type, the
real memslot update (which requires memslot deletion )for pc.rom happens
after tdx_init_memory_region. It leads to the private memory ADD'ed
before get lost. I haven't work out a good solution to resolve the
order issue. So just skip the pc.rom setup to avoid memslot deletion.
---
 hw/i386/pc.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index f5ff970acfa0..3f8dd218eb08 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -43,6 +43,7 @@
 #include "sysemu/xen.h"
 #include "sysemu/reset.h"
 #include "kvm/kvm_i386.h"
+#include "kvm/tdx.h"
 #include "hw/xen/xen.h"
 #include "qapi/qmp/qlist.h"
 #include "qemu/error-report.h"
@@ -1028,16 +1029,18 @@ void pc_memory_init(PCMachineState *pcms,
 /* Initialize PC system firmware */
 pc_system_firmware_init(pcms, rom_memory);
 
-option_rom_mr = g_malloc(sizeof(*option_rom_mr));
-memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
-   _fatal);
-if (pcmc->pci_enabled) {
-memory_region_set_readonly(option_rom_mr, true);
+if (!is_tdx_vm()) {
+option_rom_mr = g_malloc(sizeof(*option_rom_mr));
+memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
+_fatal);
+if (pcmc->pci_enabled) {
+memory_region_set_readonly(option_rom_mr, true);
+}
+memory_region_add_subregion_overlap(rom_memory,
+PC_ROM_MIN_VGA,
+option_rom_mr,
+1);
 }
-memory_region_add_subregion_overlap(rom_memory,
-PC_ROM_MIN_VGA,
-option_rom_mr,
-1);
 
 fw_cfg = fw_cfg_arch_create(machine,
 x86ms->boot_cpus, x86ms->apic_id_limit);
-- 
2.34.1




[PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility

2024-02-28 Thread Xiaoyao Li
Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- mention additional error information in gpa when it presents;
- refine the documentation; (Markus)

Changes in v4:
- refine the documentation; (Markus)

Changes in v3:
- Add docmentation of new type and struct; (Daniel)
- refine the error message handling; (Daniel)
---
 qapi/run-state.json   | 31 +--
 system/runstate.c | 58 +++
 target/i386/kvm/tdx.c | 24 +-
 3 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index dd0770b379e5..b71dd1884eb6 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -483,10 +483,12 @@
 #
 # @s390: s390 guest panic information type (Since: 2.12)
 #
+# @tdx: tdx guest panic information type (Since: 9.0)
+#
 # Since: 2.9
 ##
 { 'enum': 'GuestPanicInformationType',
-  'data': [ 'hyper-v', 's390' ] }
+  'data': [ 'hyper-v', 's390', 'tdx' ] }
 
 ##
 # @GuestPanicInformation:
@@ -501,7 +503,8 @@
  'base': {'type': 'GuestPanicInformationType'},
  'discriminator': 'type',
  'data': {'hyper-v': 'GuestPanicInformationHyperV',
-  's390': 'GuestPanicInformationS390'}}
+  's390': 'GuestPanicInformationS390',
+  'tdx' : 'GuestPanicInformationTdx'}}
 
 ##
 # @GuestPanicInformationHyperV:
@@ -564,6 +567,30 @@
   'psw-addr': 'uint64',
   'reason': 'S390CrashReason'}}
 
+##
+# @GuestPanicInformationTdx:
+#
+# TDX Guest panic information specific to TDX, as specified in the
+# "Guest-Hypervisor Communication Interface (GHCI) Specification",
+# section TDG.VP.VMCALL.
+#
+# @error-code: TD-specific error code
+#
+# @message: Human-readable error message provided by the guest. Not
+# to be trusted.
+#
+# @gpa: guest-physical address of a page that contains more verbose
+# error information, as zero-terminated string.  Present when the
+# "GPA valid" bit (bit 63) is set in @error-code.
+#
+#
+# Since: 9.0
+##
+{'struct': 'GuestPanicInformationTdx',
+ 'data': {'error-code': 'uint64',
+  'message': 'str',
+  '*gpa': 'uint64'}}
+
 ##
 # @MEMORY_FAILURE:
 #
diff --git a/system/runstate.c b/system/runstate.c
index d6ab860ecaa7..3b628c38739d 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -519,6 +519,52 @@ static void qemu_system_wakeup(void)
 }
 }
 
+static char* tdx_parse_panic_message(char *message)
+{
+bool printable = false;
+char *buf = NULL;
+int len = 0, i;
+
+/*
+ * Although message is defined as a json string, we shouldn't
+ * unconditionally treat it as is because the guest generated it and
+ * it's not necessarily trustable.
+ */
+if (message) {
+/* The caller guarantees the NUL-terminated string. */
+len = strlen(message);
+
+printable = len > 0;
+for (i = 0; i < len; i++) {
+if (!(0x20 <= message[i] && message[i] <= 0x7e)) {
+printable = false;
+break;
+}
+}
+}
+
+if (!printable && len) {
+/* 3 = length of "%02x " */
+buf = g_malloc(len * 3);
+for (i = 0; i < len; i++) {
+if (message[i] == '\0') {
+break;
+} else {
+sprintf(buf + 3 * i, "%02x ", message[i]);
+}
+}
+if (i > 0)
+/* replace the last ' '(space) to NUL */
+buf[i * 3 - 1] = '\0';
+else
+buf[0] = '\0';
+
+return buf;
+}
+
+return message;
+}
+
 void qemu_system_guest_panicked(GuestPanicInformation *info)
 {
 qemu_log_mask(LOG_GUEST_ERROR, "Guest crashed");
@@ -560,7 +606,19 @@ void qemu_system_guest_panicked(GuestPanicInformation 
*info)
   S390CrashReason_str(info->u.s390.reason),
   info->u.s390.psw_mask,
   info->u.s390.psw_addr);
+} else if (info->type == GUEST_PANIC_INFORMATION_TYPE_TDX) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  " TDX guest reports fatal error:"
+  " error code: 0x%#" PRIx64 " error message:\"%s\"\n",
+  info->u.tdx.error_code,
+  tdx_parse_panic_message(info->u.tdx.message));
+if (info->u.tdx.error_code & (1ull << 63)) {
+qemu_log_mask(LOG_GUEST_ERROR, "Additional error information "
+  "can be found at gpa page: 0x%#" PRIx64 "\n",
+  info->u.tdx.gpa);
+}
 }
+
 qapi_free_GuestPanicInformation(info);
 }
 }
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm

[PATCH v5 32/65] i386/tdx: Set kvm_readonly_mem_enabled to false for TDX VM

2024-02-28 Thread Xiaoyao Li
TDX only supports readonly for shared memory but not for private memory.

In the view of QEMU, it has no idea whether a memslot is used as shared
memory of private. Thus just mark kvm_readonly_mem_enabled to false to
TDX VM for simplicity.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/tdx.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 42dbb5ce5c15..13f069171db7 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -480,6 +480,15 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 
 update_tdx_cpuid_lookup_by_tdx_caps();
 
+/*
+ * Set kvm_readonly_mem_allowed to false, because TDX only supports 
readonly
+ * memory for shared memory but not for private memory. Besides, whether a
+ * memslot is private or shared is not determined by QEMU.
+ *
+ * Thus, just mark readonly memory not supported for simplicity.
+ */
+kvm_readonly_mem_allowed = false;
+
 tdx_guest = tdx;
 return 0;
 }
-- 
2.34.1




[PATCH v5 65/65] docs: Add TDX documentation

2024-02-28 Thread Xiaoyao Li
Add docs/system/i386/tdx.rst for TDX support, and add tdx in
confidential-guest-support.rst

Signed-off-by: Xiaoyao Li 
---
Changes in v5:
 - Add TD attestation section and update the QEMU parameter;

Changes since v1:
 - Add prerequisite of private gmem;
 - update example command to launch TD;

Changes since RFC v4:
 - add the restriction that kernel-irqchip must be split
---
 docs/system/confidential-guest-support.rst |   1 +
 docs/system/i386/tdx.rst   | 143 +
 docs/system/target-i386.rst|   1 +
 3 files changed, 145 insertions(+)
 create mode 100644 docs/system/i386/tdx.rst

diff --git a/docs/system/confidential-guest-support.rst 
b/docs/system/confidential-guest-support.rst
index 0c490dbda2b7..66129fbab64c 100644
--- a/docs/system/confidential-guest-support.rst
+++ b/docs/system/confidential-guest-support.rst
@@ -38,6 +38,7 @@ Supported mechanisms
 Currently supported confidential guest mechanisms are:
 
 * AMD Secure Encrypted Virtualization (SEV) (see 
:doc:`i386/amd-memory-encryption`)
+* Intel Trust Domain Extension (TDX) (see :doc:`i386/tdx`)
 * POWER Protected Execution Facility (PEF) (see 
:ref:`power-papr-protected-execution-facility-pef`)
 * s390x Protected Virtualization (PV) (see :doc:`s390x/protvirt`)
 
diff --git a/docs/system/i386/tdx.rst b/docs/system/i386/tdx.rst
new file mode 100644
index ..8491cdcfa163
--- /dev/null
+++ b/docs/system/i386/tdx.rst
@@ -0,0 +1,143 @@
+Intel Trusted Domain eXtension (TDX)
+
+
+Intel Trusted Domain eXtensions (TDX) refers to an Intel technology that 
extends
+Virtual Machine Extensions (VMX) and Multi-Key Total Memory Encryption (MKTME)
+with a new kind of virtual machine guest called a Trust Domain (TD). A TD runs
+in a CPU mode that is designed to protect the confidentiality of its memory
+contents and its CPU state from any other software, including the hosting
+Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
+
+Prerequisites
+-
+
+To run TD, the physical machine needs to have TDX module loaded and initialized
+while KVM hypervisor has TDX support and has TDX enabled. If those requirements
+are met, the ``KVM_CAP_VM_TYPES`` will report the support of 
``KVM_X86_TDX_VM``.
+
+Trust Domain Virtual Firmware (TDVF)
+
+
+Trust Domain Virtual Firmware (TDVF) is required to provide TD services to boot
+TD Guest OS. TDVF needs to be copied to guest private memory and measured 
before
+the TD boots.
+
+KVM vcpu ioctl ``KVM_MEMORY_MAPPING`` can be used to populates the TDVF content
+into its private memory.
+
+Since TDX doesn't support readonly memslot, TDVF cannot be mapped as pflash
+device and it actually works as RAM. "-bios" option is chosen to load TDVF.
+
+OVMF is the opensource firmware that implements the TDVF support. Thus the
+command line to specify and load TDVF is ``-bios OVMF.fd``
+
+KVM private memory
+
+
+TD's memory (RAM) needs to be able to be transformed between private and 
shared.
+Its BIOS (OVMF/TDVF) needs to be mapped as private as well. Thus QEMU needs to
+allocate private guest memfd for them via KVM's IOCTL (KVM_CREATE_GUEST_MEMFD),
+which requires KVM is newer enough that reports KVM_CAP_GUEST_MEMFD.
+
+Feature Control
+---
+
+Unlike non-TDX VM, the CPU features (enumerated by CPU or MSR) of a TD is not
+under full control of VMM. VMM can only configure part of features of a TD on
+``KVM_TDX_INIT_VM`` command of VM scope ``MEMORY_ENCRYPT_OP`` ioctl.
+
+The configurable features have three types:
+
+- Attributes:
+  - PKS (bit 30) controls whether Supervisor Protection Keys is exposed to TD,
+  which determines related CPUID bit and CR4 bit;
+  - PERFMON (bit 63) controls whether PMU is exposed to TD.
+
+- XSAVE related features (XFAM):
+  XFAM is a 64b mask, which has the same format as XCR0 or IA32_XSS MSR. It
+  determines the set of extended features available for use by the guest TD.
+
+- CPUID features:
+  Only some bits of some CPUID leaves are directly configurable by VMM.
+
+What features can be configured is reported via TDX capabilities.
+
+TDX capabilities
+
+
+The VM scope ``MEMORY_ENCRYPT_OP`` ioctl provides command 
``KVM_TDX_CAPABILITIES``
+to get the TDX capabilities from KVM. It returns a data structure of
+``struct kvm_tdx_capabilites``, which tells the supported configuration of
+attributes, XFAM and CPUIDs.
+
+TD attestation
+--
+
+In TD guest, the attestation process is used to verify the TDX guest
+trustworthiness to other entities before provisioning secrets to the guest.
+
+TD attestation is initiated first by calling TDG.MR.REPORT inside TD to get the
+REPORT. Then the REPORT data needs to be converted into a remotely verifiable
+Quote by SGX Quoting Enclave (QE).
+
+A host daemon, Quote Generation Service (QGS), provides the functionality of
+SGX GE. I

[PATCH v5 28/65] i386/tdx: Disable pmu for TD guest

2024-02-28 Thread Xiaoyao Li
Current KVM doesn't support PMU for TD guest. It returns error if TD is
created with PMU bit being set in attributes.

Disable PMU for TD guest on QEMU side.

Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/tdx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 262e86fd2c67..1c12cda002b8 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -496,6 +496,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
 int r = 0;
 
+object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
+
 QEMU_LOCK_GUARD(_guest->lock);
 if (tdx_guest->initialized) {
 return r;
-- 
2.34.1




[PATCH v5 34/65] kvm/tdx: Ignore memory conversion to shared of unassigned region

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

TDX requires vMMIO region to be shared.  For KVM, MMIO region is the region
which kvm memslot isn't assigned to (except in-kernel emulation).
qemu has the memory region for vMMIO at each device level.

While OVMF issues MapGPA(to-shared) conservatively on 32bit PCI MMIO
region, qemu doesn't find corresponding vMMIO region because it's before
PCI device allocation and memory_region_find() finds the device region, not
PCI bus region.  It's safe to ignore MapGPA(to-shared) because when guest
accesses those region they use GPA with shared bit set for vMMIO.  Ignore
memory conversion request of non-assigned region to shared and return
success.  Otherwise OVMF is confused and panics there.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
 accel/kvm/kvm-all.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index d533e2611ad8..9dc17a1b5f43 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2953,6 +2953,18 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, 
bool to_private)
 section = memory_region_find(get_system_memory(), start, size);
 mr = section.mr;
 if (!mr) {
+/*
+ * Ignore converting non-assigned region to shared.
+ *
+ * TDX requires vMMIO region to be shared to inject #VE to guest.
+ * OVMF issues conservatively MapGPA(shared) on 32bit PCI MMIO region,
+ * and vIO-APIC 0xFEC0 4K page.
+ * OVMF assigns 32bit PCI MMIO region to
+ * [top of low memory: typically 2GB=0xC00,  0xFC0)
+ */
+if (!to_private) {
+return 0;
+}
 return -1;
 }
 
-- 
2.34.1




[PATCH v5 29/65] i386/tdx: Validate TD attributes

2024-02-28 Thread Xiaoyao Li
Validate TD attributes with tdx_caps that fixed-0 bits must be zero and
fixed-1 bits must be set.

Besides, sanity check the attribute bits that have not been supported by
QEMU yet. e.g., debug bit, it will be allowed in the future when debug
TD support lands in QEMU.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 

---
Changes in v3:
- using error_setg() for error report; (Daniel)
---
 target/i386/kvm/tdx.c | 29 +++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 1c12cda002b8..d0ad4f57b5d0 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -32,6 +32,7 @@
  (1U << KVM_FEATURE_PV_SCHED_YIELD) | \
  (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
+#define TDX_TD_ATTRIBUTES_DEBUG BIT_ULL(0)
 #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE   BIT_ULL(28)
 #define TDX_TD_ATTRIBUTES_PKS   BIT_ULL(30)
 #define TDX_TD_ATTRIBUTES_PERFMON   BIT_ULL(63)
@@ -479,13 +480,34 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 return 0;
 }
 
-static void setup_td_guest_attributes(X86CPU *x86cpu)
+static int tdx_validate_attributes(TdxGuest *tdx, Error **errp)
+{
+if (((tdx->attributes & tdx_caps->attrs_fixed0) | tdx_caps->attrs_fixed1) 
!=
+tdx->attributes) {
+error_setg(errp, "Invalid attributes 0x%lx for TDX VM "
+   "(fixed0 0x%llx, fixed1 0x%llx)",
+   tdx->attributes, tdx_caps->attrs_fixed0,
+   tdx_caps->attrs_fixed1);
+return -1;
+}
+
+if (tdx->attributes & TDX_TD_ATTRIBUTES_DEBUG) {
+error_setg(errp, "Current QEMU doesn't support attributes.debug[bit 0] 
for TDX VM");
+return -1;
+}
+
+return 0;
+}
+
+static int setup_td_guest_attributes(X86CPU *x86cpu, Error **errp)
 {
 CPUX86State *env = >env;
 
 tdx_guest->attributes |= (env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_PKS) 
?
  TDX_TD_ATTRIBUTES_PKS : 0;
 tdx_guest->attributes |= x86cpu->enable_pmu ? TDX_TD_ATTRIBUTES_PERFMON : 
0;
+
+return tdx_validate_attributes(tdx_guest, errp);
 }
 
 int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
@@ -512,7 +534,10 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 return r;
 }
 
-setup_td_guest_attributes(x86cpu);
+r = setup_td_guest_attributes(x86cpu, errp);
+if (r) {
+return r;
+}
 
 init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0);
 
-- 
2.34.1




[PATCH v5 35/65] memory: Introduce memory_region_init_ram_guest_memfd()

2024-02-28 Thread Xiaoyao Li
Introduce memory_region_init_ram_guest_memfd() to allocate private
guset memfd on the MemoryRegion initialization. It's for the use case of
TDVF, which must be private on TDX case.

Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- drop memory_region_set_default_private() because this function is
  dropped in this v5 series;
---
 include/exec/memory.h |  6 ++
 system/memory.c   | 25 +
 2 files changed, 31 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 679a8476852e..1e351f6fc875 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1603,6 +1603,12 @@ bool memory_region_init_ram(MemoryRegion *mr,
 uint64_t size,
 Error **errp);
 
+bool memory_region_init_ram_guest_memfd(MemoryRegion *mr,
+Object *owner,
+const char *name,
+uint64_t size,
+Error **errp);
+
 /**
  * memory_region_init_rom: Initialize a ROM memory region.
  *
diff --git a/system/memory.c b/system/memory.c
index c756950c0c0f..85a22408e9a4 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3606,6 +3606,31 @@ bool memory_region_init_ram(MemoryRegion *mr,
 return true;
 }
 
+bool memory_region_init_ram_guest_memfd(MemoryRegion *mr,
+Object *owner,
+const char *name,
+uint64_t size,
+Error **errp)
+{
+DeviceState *owner_dev;
+
+if (!memory_region_init_ram_flags_nomigrate(mr, owner, name, size,
+RAM_GUEST_MEMFD, errp)) {
+return false;
+}
+
+/* This will assert if owner is neither NULL nor a DeviceState.
+ * We only want the owner here for the purposes of defining a
+ * unique name for migration. TODO: Ideally we should implement
+ * a naming scheme for Objects which are not DeviceStates, in
+ * which case we can relax this restriction.
+ */
+owner_dev = DEVICE(owner);
+vmstate_register_ram(mr, owner_dev);
+
+return true;
+}
+
 bool memory_region_init_rom(MemoryRegion *mr,
 Object *owner,
 const char *name,
-- 
2.34.1




[PATCH v5 06/65] kvm: Introduce support for memory_attributes

2024-02-28 Thread Xiaoyao Li
Introduce the helper functions to set the attributes of a range of
memory to private or shared.

This is necessary to notify KVM the private/shared attribute of each gpa
range. KVM needs the information to decide the GPA needs to be mapped at
hva-based shared memory or guest_memfd based private memory.

Signed-off-by: Xiaoyao Li 
---
Changes in v4:
- move the check of kvm_supported_memory_attributes to the common
  kvm_set_memory_attributes(); (Wang Wei)
- change warn_report() to error_report() in kvm_set_memory_attributes()
  and drop the __func__; (Daniel)
---
 accel/kvm/kvm-all.c  | 44 
 include/sysemu/kvm.h |  3 +++
 2 files changed, 47 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index cd0aa7545a1f..70d482a2c936 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -92,6 +92,7 @@ static bool kvm_has_guest_debug;
 static int kvm_sstep_flags;
 static bool kvm_immediate_exit;
 static bool kvm_guest_memfd_supported;
+static uint64_t kvm_supported_memory_attributes;
 static hwaddr kvm_max_slot_size = ~0;
 
 static const KVMCapabilityInfo kvm_required_capabilites[] = {
@@ -1304,6 +1305,46 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
 kvm_max_slot_size = max_slot_size;
 }
 
+static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
+{
+struct kvm_memory_attributes attrs;
+int r;
+
+if (kvm_supported_memory_attributes == 0) {
+error_report("No memory attribute supported by KVM\n");
+return -EINVAL;
+}
+
+if ((attr & kvm_supported_memory_attributes) != attr) {
+error_report("memory attribute 0x%lx not supported by KVM,"
+ " supported bits are 0x%lx\n",
+ attr, kvm_supported_memory_attributes);
+return -EINVAL;
+}
+
+attrs.attributes = attr;
+attrs.address = start;
+attrs.size = size;
+attrs.flags = 0;
+
+r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, );
+if (r) {
+error_report("failed to set memory (0x%lx+%#zx) with attr 0x%lx error 
'%s'",
+ start, size, attr, strerror(errno));
+}
+return r;
+}
+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
+{
+return kvm_set_memory_attributes(start, size, 
KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
+{
+return kvm_set_memory_attributes(start, size, 0);
+}
+
 /* Called with KVMMemoryListener.slots_lock held */
 static void kvm_set_phys_mem(KVMMemoryListener *kml,
  MemoryRegionSection *section, bool add)
@@ -2439,6 +2480,9 @@ static int kvm_init(MachineState *ms)
 
 kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
 
+ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);
+kvm_supported_memory_attributes = ret > 0 ? ret : 0;
+
 if (object_property_find(OBJECT(current_machine), "kvm-type")) {
 g_autofree char *kvm_type = 
object_property_get_str(OBJECT(current_machine),
 "kvm-type",
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 6cdf82de8372..8e83adfbbd19 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -546,4 +546,7 @@ uint32_t kvm_dirty_ring_size(void);
 bool kvm_hwpoisoned_mem(void);
 
 int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
 #endif
-- 
2.34.1




[PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
can be provided for TDX attestation. Detailed meaning of them can be
found: 
https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/

Allow user to specify those values via property mrconfigid, mrowner and
mrownerconfig. They are all in base64 format.

example
-object tdx-guest, \
  mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 

---
Changes in v5:
 - refine the description of QAPI properties and add description of
   default value when not specified;

Changes in v4:
 - describe more of there fields in qom.json
 - free the old value before set new value to avoid memory leak in
   _setter(); (Daniel)

Changes in v3:
 - use base64 encoding instread of hex-string;
---
 qapi/qom.json | 17 -
 target/i386/kvm/tdx.c | 87 +++
 target/i386/kvm/tdx.h |  3 ++
 3 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 89ed89b9b46e..cac875349a3a 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -905,10 +905,25 @@
 # pages.  Some guest OS (e.g., Linux TD guest) may require this to
 # be set, otherwise they refuse to boot.
 #
+# @mrconfigid: ID for non-owner-defined configuration of the guest TD,
+# e.g., run-time or OS configuration (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest).
+# (A default value 0 of SHA384 is used when absent).
+#
+# @mrownerconfig: ID for owner-defined configuration of the guest TD,
+# e.g., specific to the workload rather than the run-time or OS
+# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
+# used when absent).
+#
 # Since: 9.0
 ##
 { 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+'*mrconfigid': 'str',
+'*mrowner': 'str',
+'*mrownerconfig': 'str' } }
 
 ##
 # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d0ad4f57b5d0..4ce2f1d082ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "qemu/base64.h"
 #include "qapi/error.h"
 #include "qom/object_interfaces.h"
 #include "standard-headers/asm-x86/kvm_para.h"
@@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 X86CPU *x86cpu = X86_CPU(cpu);
 CPUX86State *env = >env;
 g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+size_t data_len;
 int r = 0;
 
 object_property_set_bool(OBJECT(cpu), "pmu", false, _abort);
@@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
 sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
 
+#define SHA384_DIGEST_SIZE  48
+
+if (tdx_guest->mrconfigid) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid,
+  strlen(tdx_guest->mrconfigid), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrconfigid");
+return -1;
+}
+memcpy(init_vm->mrconfigid, data, data_len);
+}
+
+if (tdx_guest->mrowner) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrowner,
+  strlen(tdx_guest->mrowner), _len, errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrowner");
+return -1;
+}
+memcpy(init_vm->mrowner, data, data_len);
+}
+
+if (tdx_guest->mrownerconfig) {
+g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrownerconfig,
+  strlen(tdx_guest->mrownerconfig), _len, 
errp);
+if (!data || data_len != SHA384_DIGEST_SIZE) {
+error_setg(errp, "TDX: failed to decode mrownerconfig");
+return -1;
+}
+memcpy(init_vm->mrownerconfig, data, data_len);
+}
+
 r = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPUS, 0, ms->smp.cpus);
 if (r < 0) {
 error_setg(errp, "Unable to set MAX VCPUS to %d", ms->smp.cpus);
@@ -574,6 +608,51 @@ static void tdx_guest_set_sept_ve_disable(Object *obj, 
bool value, Error **errp)
 }
 }
 
+static

[PATCH v5 51/65] i386/tdx: Handle TDG.VP.VMCALL

2024-02-28 Thread Xiaoyao Li
TD guest can use TDG.VP.VMCALL to request termination
with error message encoded in GPRs.

Parse and print the error message, and terminate the TD guest in the
handler.

Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/tdx.c | 39 +++
 target/i386/kvm/tdx.h |  1 +
 2 files changed, 40 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 88d245577594..6cd72a26b970 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -1092,6 +1092,43 @@ static int tdx_handle_get_quote(X86CPU *cpu, struct 
kvm_tdx_vmcall *vmcall)
 return 0;
 }
 
+static int tdx_handle_report_fatal_error(X86CPU *cpu,
+ struct kvm_tdx_vmcall *vmcall)
+{
+uint64_t error_code = vmcall->in_r12;
+char *message = NULL;
+
+if (error_code & 0x) {
+error_report("TDX: REPORT_FATAL_ERROR: invalid error code: "
+ "0x%lx\n", error_code);
+return -1;
+}
+
+/* it has optional message */
+if (vmcall->in_r14) {
+uint64_t * tmp;
+
+#define GUEST_PANIC_INFO_TDX_MESSAGE_MAX64
+message = g_malloc0(GUEST_PANIC_INFO_TDX_MESSAGE_MAX + 1);
+
+tmp = (uint64_t *)message;
+/* The order is defined in TDX GHCI spec */
+*(tmp++) = cpu_to_le64(vmcall->in_r14);
+*(tmp++) = cpu_to_le64(vmcall->in_r15);
+*(tmp++) = cpu_to_le64(vmcall->in_rbx);
+*(tmp++) = cpu_to_le64(vmcall->in_rdi);
+*(tmp++) = cpu_to_le64(vmcall->in_rsi);
+*(tmp++) = cpu_to_le64(vmcall->in_r8);
+*(tmp++) = cpu_to_le64(vmcall->in_r9);
+*(tmp++) = cpu_to_le64(vmcall->in_rdx);
+message[GUEST_PANIC_INFO_TDX_MESSAGE_MAX] = '\0';
+assert((char *)tmp == message + GUEST_PANIC_INFO_TDX_MESSAGE_MAX);
+}
+
+error_report("TD guest reports fatal error. %s\n", message ? : "");
+return -1;
+}
+
 static int tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
struct kvm_tdx_vmcall 
*vmcall)
 {
@@ -1126,6 +1163,8 @@ static int tdx_handle_vmcall(X86CPU *cpu, struct 
kvm_tdx_vmcall *vmcall)
 return tdx_handle_map_gpa(cpu, vmcall);
 case TDG_VP_VMCALL_GET_QUOTE:
 return tdx_handle_get_quote(cpu, vmcall);
+case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
+return  tdx_handle_report_fatal_error(cpu, vmcall);
 case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
 return tdx_handle_setup_event_notify_interrupt(cpu, vmcall);
 default:
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index b6b8742e79f7..86b1f80d6f49 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -20,6 +20,7 @@ typedef struct TdxGuestClass {
 
 #define TDG_VP_VMCALL_MAP_GPA   0x10001ULL
 #define TDG_VP_VMCALL_GET_QUOTE 0x10002ULL
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR0x10003ULL
 #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT  0x10004ULL
 
 #define TDG_VP_VMCALL_SUCCESS   0xULL
-- 
2.34.1




[PATCH v5 59/65] hw/i386: add eoi_intercept_unsupported member to X86MachineState

2024-02-28 Thread Xiaoyao Li
Add a new bool member, eoi_intercept_unsupported, to X86MachineState
with default value false. Set true for TDX VM.

Inability to intercept eoi causes impossibility to emulate level
triggered interrupt to be re-injected when level is still kept active.
which affects interrupt controller emulation.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 hw/i386/x86.c | 1 +
 include/hw/i386/x86.h | 1 +
 target/i386/kvm/tdx.c | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 61c45dfc14dd..6ff2475535bc 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1425,6 +1425,7 @@ static void x86_machine_initfn(Object *obj)
 x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
 x86ms->bus_lock_ratelimit = 0;
 x86ms->above_4g_mem_start = 4 * GiB;
+x86ms->eoi_intercept_unsupported = false;
 }
 
 static void x86_machine_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
index d28e79cc484a..033f0a34891b 100644
--- a/include/hw/i386/x86.h
+++ b/include/hw/i386/x86.h
@@ -60,6 +60,7 @@ struct X86MachineState {
 uint64_t above_4g_mem_start;
 
 /* CPU and apic information: */
+bool eoi_intercept_unsupported;
 unsigned pci_irq_mask;
 unsigned apic_id_limit;
 uint16_t boot_cpus;
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 0225a9b79b36..b1fb326bd395 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -727,6 +727,8 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 return -EINVAL;
 }
 
+x86ms->eoi_intercept_unsupported = true;
+
 if (!tdx_caps) {
 r = get_tdx_capabilities(errp);
 if (r) {
-- 
2.34.1




[PATCH v5 18/65] i386/tdx: Make Intel-PT unsupported for TD guest

2024-02-28 Thread Xiaoyao Li
Due to the fact that Intel-PT virtualization support has been broken in
QEMU since Sapphire Rapids generation[1], below warning is triggered when
luanching TD guest:

  warning: host doesn't support requested feature: CPUID.07H:EBX.intel-pt [bit 
25]

Before Intel-pt is fixed in QEMU, just make Intel-PT unsupported for TD
guest, to avoid the confusing warning.

[1] 
https://lore.kernel.org/qemu-devel/20230531084311.3807277-1-xiaoyao...@intel.com/

Signed-off-by: Xiaoyao Li 
---
Changes in v4:
 - newly added patch;
---
 target/i386/kvm/tdx.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 85d96140b450..239170142e4f 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -292,6 +292,11 @@ void tdx_get_supported_cpuid(uint32_t function, uint32_t 
index, int reg,
 if (function == 1 && reg == R_ECX && !enable_cpu_pm) {
 *ret &= ~CPUID_EXT_MONITOR;
 }
+
+/* QEMU Intel-pt support is broken, don't advertise Intel-PT */
+if (function == 7 && reg == R_EBX) {
+*ret &= ~CPUID_7_0_EBX_INTEL_PT;
+}
 }
 
 enum tdx_ioctl_level{
-- 
2.34.1




[PATCH v5 17/65] i386/tdx: Adjust the supported CPUID based on TDX restrictions

2024-02-28 Thread Xiaoyao Li
According to Chapter "CPUID Virtualization" in TDX module spec, CPUID
bits of TD can be classified into 6 types:


1 | As configured | configurable by VMM, independent of native value;

2 | As configured | configurable by VMM if the bit is supported natively
(if native)   | Otherwise it equals as native(0).

3 | Fixed | fixed to 0/1

4 | Native| reflect the native value

5 | Calculated| calculated by TDX module.

6 | Inducing #VE  | get #VE exception


Note:
1. All the configurable XFAM related features and TD attributes related
   features fall into type #2. And fixed0/1 bits of XFAM and TD
   attributes fall into type #3.

2. For CPUID leaves not listed in "CPUID virtualization Overview" table
   in TDX module spec, TDX module injects #VE to TDs when those are
   queried. For this case, TDs can request CPUID emulation from VMM via
   TDVMCALL and the values are fully controlled by VMM.

Due to TDX module has its own virtualization policy on CPUID bits, it leads
to what reported via KVM_GET_SUPPORTED_CPUID diverges from the supported
CPUID bits for TDs. In order to keep a consistent CPUID configuration
between VMM and TDs. Adjust supported CPUID for TDs based on TDX
restrictions.

Currently only focus on the CPUID leaves recognized by QEMU's
feature_word_info[] that are indexed by a FeatureWord.

Introduce a TDX CPUID lookup table, which maintains 1 entry for each
FeatureWord. Each entry has below fields:

 - tdx_fixed0/1: The bits that are fixed as 0/1;

 - depends_on_vmm_cap: The bits that are configurable from the view of
   TDX module. But they requires emulation of VMM
   when configured as enabled. For those, they are
   not supported if VMM doesn't report them as
   supported. So they need be fixed up by checking
   if VMM supports them.

 - inducing_ve: TD gets #VE when querying this CPUID leaf. The result is
totally configurable by VMM.

 - supported_on_ve: It's valid only when @inducing_ve is true. It represents
the maximum feature set supported that be emulated
for TDs.

By applying TDX CPUID lookup table and TDX capabilities reported from
TDX module, the supported CPUID for TDs can be obtained from following
steps:

- get the base of VMM supported feature set;

- if the leaf is not a FeatureWord just return VMM's value without
  modification;

- if the leaf is an inducing_ve type, applying supported_on_ve mask and
  return;

- include all native bits, it covers type #2, #4, and parts of type #1.
  (it also includes some unsupported bits. The following step will
   correct it.)

- apply fixed0/1 to it (it covers #3, and rectifies the previous step);

- add configurable bits (it covers the other part of type #1);

- fix the ones in vmm_fixup;

(Calculated type is ignored since it's determined at runtime).

Co-developed-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Signed-off-by: Xiaoyao Li 
---
 target/i386/cpu.h |  16 +++
 target/i386/kvm/kvm.c |   4 +
 target/i386/kvm/tdx.c | 263 ++
 target/i386/kvm/tdx.h |   3 +
 4 files changed, 286 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..7bd604f802a1 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -787,6 +787,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 
 /* Support RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE */
 #define CPUID_7_0_EBX_FSGSBASE  (1U << 0)
+/* Support for TSC adjustment MSR 0x3B */
+#define CPUID_7_0_EBX_TSC_ADJUST(1U << 1)
 /* Support SGX */
 #define CPUID_7_0_EBX_SGX   (1U << 2)
 /* 1st Group of Advanced Bit Manipulation Extensions */
@@ -805,8 +807,12 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 #define CPUID_7_0_EBX_INVPCID   (1U << 10)
 /* Restricted Transactional Memory */
 #define CPUID_7_0_EBX_RTM   (1U << 11)
+/* Cache QoS Monitoring */
+#define CPUID_7_0_EBX_PQM   (1U << 12)
 /* Memory Protection Extension */
 #define CPUID_7_0_EBX_MPX   (1U << 14)
+/* Resource Director Technology Allocation */
+#define CPUID_7_0_EBX_RDT_A (1U << 15)
 /* AVX-512 Foundation */
 #define CPUID_7_0_EBX_AVX512F   (1U << 16)
 /* AVX-512 Doubleword & Quadword Instruction */
@@ -862,10 

[PATCH v5 22/65] i386/kvm: Move architectural CPUID leaf generation to separate helper

2024-02-28 Thread Xiaoyao Li
From: Sean Christopherson 

Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.

Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/kvm.c  | 459 +++--
 target/i386/kvm/kvm_i386.h |   3 +
 2 files changed, 240 insertions(+), 222 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 389b631c03dd..315998c8f7e5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1731,6 +1731,241 @@ static void kvm_init_nested_state(CPUX86State *env)
 }
 }
 
+uint32_t kvm_x86_arch_cpuid(CPUX86State *env, struct kvm_cpuid_entry2 *entries,
+uint32_t cpuid_i)
+{
+uint32_t limit, i, j;
+uint32_t unused;
+struct kvm_cpuid_entry2 *c;
+
+if (cpuid_i >  KVM_MAX_CPUID_ENTRIES) {
+error_report("exceeded cpuid index (%d) for entries[]", cpuid_i);
+abort();
+}
+
+cpu_x86_cpuid(env, 0, 0, , , , );
+
+for (i = 0; i <= limit; i++) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "unsupported level value: 0x%x\n", limit);
+abort();
+}
+c = [cpuid_i++];
+
+switch (i) {
+case 2: {
+/* Keep reading function 2 till all the input is received */
+int times;
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC |
+   KVM_CPUID_FLAG_STATE_READ_NEXT;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax & 0xff;
+
+for (j = 1; j < times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "cpuid_data is full, no space for "
+"cpuid(eax:2):eax & 0xf = 0x%x\n", times);
+abort();
+}
+c = [cpuid_i++];
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+}
+break;
+}
+case 0x1f:
+if (env->nr_dies < 2) {
+cpuid_i--;
+break;
+}
+/* fallthrough */
+case 4:
+case 0xb:
+case 0xd:
+for (j = 0; ; j++) {
+if (i == 0xd && j == 64) {
+break;
+}
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (i == 4 && c->eax == 0) {
+break;
+}
+if (i == 0xb && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0x1f && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0xd && c->eax == 0) {
+continue;
+}
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "cpuid_data is full, no space for "
+"cpuid(eax:0x%x,ecx:0x%x)\n", i, j);
+abort();
+}
+c = [cpuid_i++];
+}
+break;
+case 0x12:
+for (j = 0; ; j++) {
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx);
+
+if (j > 1 && (c->eax & 0xf) != 1) {
+break;
+}
+
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "cpuid_data is full, no space for "
+"cpuid(eax:0x12,ecx:0x%x)\n", j);
+abort();
+}
+c = [cpuid_i++];
+}
+break;
+case 0x7:
+case 0x14:
+case 0x1d:
+case 0x1e: {
+uint32_t times;
+
+c->function = i;
+c->index = 0;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx);
+times = c->eax;
+
+for (j = 1; j <= times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+fprintf(stderr, "cpuid_data is full, no space for "
+"cpuid

[PATCH v5 53/65] pci-host/q35: Move PAM initialization above SMRAM initialization

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

In mch_realize(), process PAM initialization before SMRAM initialization so
that later patch can skill all the SMRAM related with a single check.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
 hw/pci-host/q35.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 0d7d4e3f0860..98d4a7c253a6 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -568,6 +568,16 @@ static void mch_realize(PCIDevice *d, Error **errp)
 /* setup pci memory mapping */
 pc_pci_as_mapping_init(mch->system_memory, mch->pci_address_space);
 
+/* PAM */
+init_pam(>pam_regions[0], OBJECT(mch), mch->ram_memory,
+ mch->system_memory, mch->pci_address_space,
+ PAM_BIOS_BASE, PAM_BIOS_SIZE);
+for (i = 0; i < ARRAY_SIZE(mch->pam_regions) - 1; ++i) {
+init_pam(>pam_regions[i + 1], OBJECT(mch), mch->ram_memory,
+ mch->system_memory, mch->pci_address_space,
+ PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
+}
+
 /* if *disabled* show SMRAM to all CPUs */
 memory_region_init_alias(>smram_region, OBJECT(mch), "smram-region",
  mch->pci_address_space, 
MCH_HOST_BRIDGE_SMRAM_C_BASE,
@@ -634,15 +644,6 @@ static void mch_realize(PCIDevice *d, Error **errp)
 
 object_property_add_const_link(qdev_get_machine(), "smram",
OBJECT(>smram));
-
-init_pam(>pam_regions[0], OBJECT(mch), mch->ram_memory,
- mch->system_memory, mch->pci_address_space,
- PAM_BIOS_BASE, PAM_BIOS_SIZE);
-for (i = 0; i < ARRAY_SIZE(mch->pam_regions) - 1; ++i) {
-init_pam(>pam_regions[i + 1], OBJECT(mch), mch->ram_memory,
- mch->system_memory, mch->pci_address_space,
- PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
-}
 }
 
 uint64_t mch_mcfg_base(void)
-- 
2.34.1




[PATCH v5 50/65] i386/tdx: handle TDG.VP.VMCALL hypercall

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

MapGPA is a hypercall to convert GPA from/to private GPA to/from shared GPA.
As the conversion function is already implemented as kvm_convert_memory,
wire it to TDX hypercall exit.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
 accel/kvm/kvm-all.c   |  2 +-
 include/sysemu/kvm.h  |  2 ++
 target/i386/kvm/tdx.c | 54 +++
 target/i386/kvm/tdx.h |  1 +
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 9dc17a1b5f43..2c83b6d270f7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2930,7 +2930,7 @@ static void kvm_eat_signals(CPUState *cpu)
 } while (sigismember(, SIG_IPI));
 }
 
-static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
 {
 MemoryRegionSection section;
 ram_addr_t offset;
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 82b547848130..78fcf1c3a617 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -550,4 +550,6 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, 
Error **errp);
 
 int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
 int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
+
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private);
 #endif
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 7dfda507cc8c..88d245577594 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -898,6 +898,58 @@ static hwaddr tdx_shared_bit(X86CPU *cpu)
 return (cpu->phys_bits > 48) ? BIT_ULL(51) : BIT_ULL(47);
 }
 
+/* 64MB at most in one call. What value is appropriate? */
+#define TDX_MAP_GPA_MAX_LEN (64 * 1024 * 1024)
+
+static int tdx_handle_map_gpa(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
+{
+hwaddr shared_bit = tdx_shared_bit(cpu);
+hwaddr gpa = vmcall->in_r12 & ~shared_bit;
+bool private = !(vmcall->in_r12 & shared_bit);
+hwaddr size = vmcall->in_r13;
+bool retry = false;
+int ret = 0;
+
+vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
+
+if (!QEMU_IS_ALIGNED(gpa, 4096) || !QEMU_IS_ALIGNED(size, 4096)) {
+vmcall->status_code = TDG_VP_VMCALL_ALIGN_ERROR;
+return 0;
+}
+
+/* Overflow case. */
+if (gpa + size < gpa) {
+return 0;
+}
+if (gpa >= (1ULL << cpu->phys_bits) ||
+gpa + size >= (1ULL << cpu->phys_bits)) {
+return 0;
+}
+
+if (size > TDX_MAP_GPA_MAX_LEN) {
+retry = true;
+size = TDX_MAP_GPA_MAX_LEN;
+}
+
+if (size > 0) {
+ret = kvm_convert_memory(gpa, size, private);
+}
+
+if (!ret) {
+if (retry) {
+vmcall->status_code = TDG_VP_VMCALL_RETRY;
+vmcall->out_r11 = gpa + size;
+if (!private) {
+vmcall->out_r11 |= shared_bit;
+}
+} else {
+vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
+}
+}
+
+return 0;
+}
+
 static void tdx_get_quote_completion(struct tdx_generate_quote_task *task)
 {
 int ret;
@@ -1070,6 +1122,8 @@ static int tdx_handle_vmcall(X86CPU *cpu, struct 
kvm_tdx_vmcall *vmcall)
 }
 
 switch (vmcall->subfunction) {
+case TDG_VP_VMCALL_MAP_GPA:
+return tdx_handle_map_gpa(cpu, vmcall);
 case TDG_VP_VMCALL_GET_QUOTE:
 return tdx_handle_get_quote(cpu, vmcall);
 case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index b7ce793d6037..b6b8742e79f7 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -18,6 +18,7 @@ typedef struct TdxGuestClass {
 ConfidentialGuestSupportClass parent_class;
 } TdxGuestClass;
 
+#define TDG_VP_VMCALL_MAP_GPA   0x10001ULL
 #define TDG_VP_VMCALL_GET_QUOTE 0x10002ULL
 #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT  0x10004ULL
 
-- 
2.34.1




[PATCH v5 55/65] i386/tdx: Disable SMM for TDX VMs

2024-02-28 Thread Xiaoyao Li
TDX doesn't support SMM and VMM cannot emulate SMM for TDX VMs because
VMM cannot manipulate TDX VM's memory.

Disable SMM for TDX VMs and error out if user requests to enable SMM.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/tdx.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 811a3b81af99..c3fadbc5c58e 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -707,11 +707,19 @@ static Notifier tdx_machine_done_notify = {
 static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
 {
 MachineState *ms = MACHINE(qdev_get_machine());
+X86MachineState *x86ms = X86_MACHINE(ms);
 TdxGuest *tdx = TDX_GUEST(cgs);
 int r = 0;
 
 ms->require_guest_memfd = true;
 
+if (x86ms->smm == ON_OFF_AUTO_AUTO) {
+x86ms->smm = ON_OFF_AUTO_OFF;
+} else if (x86ms->smm == ON_OFF_AUTO_ON) {
+error_setg(errp, "TDX VM doesn't support SMM");
+return -EINVAL;
+}
+
 if (!tdx_caps) {
 r = get_tdx_capabilities(errp);
 if (r) {
-- 
2.34.1




[PATCH v5 47/65] i386/tdx: Finalize TDX VM

2024-02-28 Thread Xiaoyao Li
Invoke KVM_TDX_FINALIZE_VM to finalize the TD's measurement and make
the TD vCPUs runnable once machine initialization is complete.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/tdx.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 4625f806920e..d445d4b70f77 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -685,6 +685,13 @@ static void tdx_finalize_vm(Notifier *notifier, void 
*unused)
  */
 ram_block = tdx_guest->tdvf_mr->ram_block;
 ram_block_discard_range(ram_block, 0, ram_block->max_length);
+
+r = tdx_vm_ioctl(KVM_TDX_FINALIZE_VM, 0, NULL);
+if (r < 0) {
+error_report("KVM_TDX_FINALIZE_VM failed %s", strerror(-r));
+exit(0);
+}
+tdx_guest->parent_obj.ready = true;
 }
 
 static Notifier tdx_machine_done_notify = {
-- 
2.34.1




[PATCH v5 37/65] i386/tdvf: Introduce function to parse TDVF metadata

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

TDX VM needs to boot with its specialized firmware, Trusted Domain
Virtual Firmware (TDVF). QEMU needs to parse TDVF and map it in TD
guest memory prior to running the TDX VM.

A TDVF Metadata in TDVF image describes the structure of firmware.
QEMU refers to it to setup memory for TDVF. Introduce function
tdvf_parse_metadata() to parse the metadata from TDVF image and store
the info of each TDVF section.

TDX metadata is located by a TDX metadata offset block, which is a
GUID-ed structure. The data portion of the GUID structure contains
only an 4-byte field that is the offset of TDX metadata to the end
of firmware file.

Select X86_FW_OVMF when TDX is enable to leverage existing functions
to parse and search OVMF's GUID-ed structures.

Signed-off-by: Isaku Yamahata 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 

---
Changes in v1:
 - rename tdvf_parse_section_entry() to
   tdvf_parse_and_check_section_entry()
Changes in RFC v4:
 - rename TDX_METADATA_GUID to TDX_METADATA_OFFSET_GUID
---
 hw/i386/Kconfig|   1 +
 hw/i386/meson.build|   1 +
 hw/i386/tdvf.c | 199 +
 include/hw/i386/tdvf.h |  51 +++
 4 files changed, 252 insertions(+)
 create mode 100644 hw/i386/tdvf.c
 create mode 100644 include/hw/i386/tdvf.h

diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index c0ccf50ac3ef..4e6c8905f077 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -12,6 +12,7 @@ config SGX
 
 config TDX
 bool
+select X86_FW_OVMF
 depends on KVM
 
 config PC
diff --git a/hw/i386/meson.build b/hw/i386/meson.build
index b9c1ca39cb05..f09441c1ea54 100644
--- a/hw/i386/meson.build
+++ b/hw/i386/meson.build
@@ -28,6 +28,7 @@ i386_ss.add(when: 'CONFIG_PC', if_true: files(
   'port92.c'))
 i386_ss.add(when: 'CONFIG_X86_FW_OVMF', if_true: files('pc_sysfw_ovmf.c'),
 if_false: 
files('pc_sysfw_ovmf-stubs.c'))
+i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c'))
 
 subdir('kvm')
 subdir('xen')
diff --git a/hw/i386/tdvf.c b/hw/i386/tdvf.c
new file mode 100644
index ..ff51f40088f0
--- /dev/null
+++ b/hw/i386/tdvf.c
@@ -0,0 +1,199 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+
+ * Copyright (c) 2020 Intel Corporation
+ * Author: Isaku Yamahata 
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+
+#include "hw/i386/pc.h"
+#include "hw/i386/tdvf.h"
+#include "sysemu/kvm.h"
+
+#define TDX_METADATA_OFFSET_GUID"e47a6535-984a-4798-865e-4685a7bf8ec2"
+#define TDX_METADATA_VERSION1
+#define TDVF_SIGNATURE  0x46564454 /* TDVF as little endian */
+
+typedef struct {
+uint32_t DataOffset;
+uint32_t RawDataSize;
+uint64_t MemoryAddress;
+uint64_t MemoryDataSize;
+uint32_t Type;
+uint32_t Attributes;
+} TdvfSectionEntry;
+
+typedef struct {
+uint32_t Signature;
+uint32_t Length;
+uint32_t Version;
+uint32_t NumberOfSectionEntries;
+TdvfSectionEntry SectionEntries[];
+} TdvfMetadata;
+
+struct tdx_metadata_offset {
+uint32_t offset;
+};
+
+static TdvfMetadata *tdvf_get_metadata(void *flash_ptr, int size)
+{
+TdvfMetadata *metadata;
+uint32_t offset = 0;
+uint8_t *data;
+
+if ((uint32_t) size != size) {
+return NULL;
+}
+
+if (pc_system_ovmf_table_find(TDX_METADATA_OFFSET_GUID, , NULL)) {
+offset = size - le32_to_cpu(((struct tdx_metadata_offset 
*)data)->offset);
+
+if (offset + sizeof(*metadata) > size) {
+return NULL;
+}
+} else {
+error_report("Cannot find TDX_METADATA_OFFSET_GUID");
+return NULL;
+}
+
+metadata = flash_ptr + offset;
+
+/* Finally, verify the signature to determine if this is a TDVF image. */
+metadata->Signature = le32_to_cpu(metadata->Signature);
+if (metadata->Signature != TDVF_SIGNATURE) {
+error_report("Invalid TDVF signature in metadata!");
+return NULL;
+}
+
+/* Sanity check that the TDVF doesn't overlap its own metadata. */
+metadata->Length = le32_to_cpu(metadata->Length);
+if (offset + metadata->Length > size) {
+re

[PATCH v5 19/65] i386/tdx: Update tdx_cpuid_lookup[].tdx_fixed0/1 by tdx_caps.cpuid_config[]

2024-02-28 Thread Xiaoyao Li
tdx_cpuid_lookup[].tdx_fixed0/1 is QEMU maintained data which reflects
TDX restrictions regrading what bits are fixed by TDX module.

It's retrieved from TDX spec and static. However, TDX may evolve and
change some fixed fields to configurable in the future. Update
tdx_cpuid.lookup[].tdx_fixed0/1 fields by removing the bits that
reported from TDX module as configurable. This can adapt with the
updated TDX (module) automatically.

Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/tdx.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 239170142e4f..424c0f3c0fbb 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -377,6 +377,38 @@ static int get_tdx_capabilities(Error **errp)
 return 0;
 }
 
+static void update_tdx_cpuid_lookup_by_tdx_caps(void)
+{
+KvmTdxCpuidLookup *entry;
+FeatureWordInfo *fi;
+uint32_t config;
+FeatureWord w;
+
+for (w = 0; w < FEATURE_WORDS; w++) {
+fi = _word_info[w];
+entry = _cpuid_lookup[w];
+
+if (fi->type != CPUID_FEATURE_WORD) {
+continue;
+}
+
+config = tdx_cap_cpuid_config(fi->cpuid.eax,
+  fi->cpuid.needs_ecx ? fi->cpuid.ecx : 
~0u,
+  fi->cpuid.reg);
+
+if (!config) {
+continue;
+}
+
+/*
+ * Remove the configurable bits from tdx_fixed0/1 in case QEMU
+ * maintained fixed0/1 values is outdated to TDX module.
+ */
+entry->tdx_fixed0 &= ~config;
+entry->tdx_fixed1 &= ~config;
+}
+}
+
 static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
 {
 MachineState *ms = MACHINE(qdev_get_machine());
@@ -392,6 +424,8 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 }
 }
 
+update_tdx_cpuid_lookup_by_tdx_caps();
+
 tdx_guest = tdx;
 return 0;
 }
-- 
2.34.1




[PATCH v5 63/65] i386/tdx: Skip kvm_put_apicbase() for TDs

2024-02-28 Thread Xiaoyao Li
KVM doesn't allow wirting to MSR_IA32_APICBASE for TDs.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/kvm.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index d23f94b77257..31aed1c9aae0 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -3052,6 +3052,11 @@ void kvm_put_apicbase(X86CPU *cpu, uint64_t value)
 {
 int ret;
 
+/* TODO: Allow accessing guest state for debug TDs. */
+if (is_tdx_vm()) {
+return;
+}
+
 ret = kvm_put_one_msr(cpu, MSR_IA32_APICBASE, value);
 assert(ret == 1);
 }
-- 
2.34.1




[PATCH v5 24/65] i386/tdx: Initialize TDX before creating TD vcpus

2024-02-28 Thread Xiaoyao Li
Invoke KVM_TDX_INIT in kvm_arch_pre_create_vcpu() that KVM_TDX_INIT
configures global TD configurations, e.g. the canonical CPUID config,
and must be executed prior to creating vCPUs.

Use kvm_x86_arch_cpuid() to setup the CPUID settings for TDX VM.

Note, this doesn't address the fact that QEMU may change the CPUID
configuration when creating vCPUs, i.e. punts on refactoring QEMU to
provide a stable CPUID config prior to kvm_arch_init().

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
Acked-by: Markus Armbruster 
---
Changes in v4:
- mark init_vm with g_autofree() and use QEMU_LOCK_GUARD() to eliminate
  the goto labels; (Daniel)
Changes in v3:
- Pass @errp in tdx_pre_create_vcpu() and pass error info to it. (Daniel)
---
 accel/kvm/kvm-all.c |  9 +++-
 target/i386/kvm/kvm.c   |  9 
 target/i386/kvm/meson.build |  2 +-
 target/i386/kvm/tdx-stub.c  |  8 
 target/i386/kvm/tdx.c   | 41 +
 target/i386/kvm/tdx.h   |  4 
 6 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100644 target/i386/kvm/tdx-stub.c

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a8a99d48e4ce..c9df41efa484 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -431,8 +431,15 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
 
 trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
 
+/*
+ * tdx_pre_create_vcpu() may call cpu_x86_cpuid(). It in turn may call
+ * kvm_vm_ioctl(). Set cpu->kvm_state in advance to avoid NULL pointer
+ * dereference.
+ */
+cpu->kvm_state = s;
 ret = kvm_arch_pre_create_vcpu(cpu, errp);
 if (ret < 0) {
+cpu->kvm_state = NULL;
 goto err;
 }
 
@@ -440,11 +447,11 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
 if (ret < 0) {
 error_setg_errno(errp, -ret, "kvm_init_vcpu: kvm_get_vcpu failed 
(%lu)",
  kvm_arch_vcpu_id(cpu));
+cpu->kvm_state = NULL;
 goto err;
 }
 
 cpu->kvm_fd = ret;
-cpu->kvm_state = s;
 cpu->vcpu_dirty = true;
 cpu->dirty_pages = 0;
 cpu->throttle_us_per_full = 0;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 315998c8f7e5..1664ac49005e 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2271,6 +2271,15 @@ int kvm_arch_init_vcpu(CPUState *cs)
 return r;
 }
 
+int kvm_arch_pre_create_vcpu(CPUState *cpu, Error **errp)
+{
+if (is_tdx_vm()) {
+return tdx_pre_create_vcpu(cpu, errp);
+}
+
+return 0;
+}
+
 int kvm_arch_destroy_vcpu(CPUState *cs)
 {
 X86CPU *cpu = X86_CPU(cs);
diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
index 26a1ab038513..460c5f8f85f3 100644
--- a/target/i386/kvm/meson.build
+++ b/target/i386/kvm/meson.build
@@ -7,7 +7,7 @@ i386_kvm_ss.add(files(
 
 i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
 
-i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c'))
+i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c'), if_false: 
files('tdx-stub.c'))
 
 i386_system_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'), 
if_false: files('hyperv-stub.c'))
 
diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c
new file mode 100644
index ..b614b46d3f4a
--- /dev/null
+++ b/target/i386/kvm/tdx-stub.c
@@ -0,0 +1,8 @@
+#include "qemu/osdep.h"
+
+#include "tdx.h"
+
+int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
+{
+return -EINVAL;
+}
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 144acd8c9912..d548ec340285 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -475,6 +475,45 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 return 0;
 }
 
+int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
+{
+MachineState *ms = MACHINE(qdev_get_machine());
+X86CPU *x86cpu = X86_CPU(cpu);
+CPUX86State *env = >env;
+g_autofree struct kvm_tdx_init_vm *init_vm = NULL;
+int r = 0;
+
+QEMU_LOCK_GUARD(_guest->lock);
+if (tdx_guest->initialized) {
+return r;
+}
+
+init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
+sizeof(struct kvm_cpuid_entry2) * 
KVM_MAX_CPUID_ENTRIES);
+
+r = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPUS, 0, ms->smp.cpus);
+if (r < 0) {
+error_setg(errp, "Unable to set MAX VCPUS to %d", ms->smp.cpus);
+return r;
+}
+
+init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0);
+
+init_vm->attributes = tdx_guest->attributes;
+
+do {
+r = tdx_vm_ioctl(KVM_TDX_INIT_VM, 0, init_vm);
+} while (r == -EAGAIN);
+if (r < 0) {
+error_setg_errno(errp, -r, "KVM_TDX_INIT_VM failed");
+return r;
+}
+
+tdx_guest->initialized = true;
+
+return 0;
+}
+
 /* tdx guest 

[PATCH v5 46/65] i386/tdx: Call KVM_TDX_INIT_VCPU to initialize TDX vcpu

2024-02-28 Thread Xiaoyao Li
TDX vcpu needs to be initialized by SEAMCALL(TDH.VP.INIT) and KVM
provides vcpu level IOCTL KVM_TDX_INIT_VCPU for it.

KVM_TDX_INIT_VCPU needs the address of the HOB as input. Invoke it for
each vcpu after HOB list is created.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/tdx.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index dcabe359eda5..4625f806920e 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -591,6 +591,22 @@ static void tdx_init_ram_entries(void)
 tdx_guest->nr_ram_entries = j;
 }
 
+static void tdx_post_init_vcpus(void)
+{
+TdxFirmwareEntry *hob;
+CPUState *cpu;
+int r;
+
+hob = tdx_get_hob_entry(tdx_guest);
+CPU_FOREACH(cpu) {
+r = tdx_vcpu_ioctl(cpu, KVM_TDX_INIT_VCPU, 0, (void *)hob->address);
+if (r < 0) {
+error_report("KVM_TDX_INIT_VCPU failed %s", strerror(-r));
+exit(1);
+}
+}
+}
+
 static void tdx_finalize_vm(Notifier *notifier, void *unused)
 {
 TdxFirmware *tdvf = _guest->tdvf;
@@ -623,6 +639,8 @@ static void tdx_finalize_vm(Notifier *notifier, void 
*unused)
 
 tdvf_hob_create(tdx_guest, tdx_get_hob_entry(tdx_guest));
 
+tdx_post_init_vcpus();
+
 for_each_tdx_fw_entry(tdvf, entry) {
 struct kvm_memory_mapping mapping = {
 .base_gfn = entry->address >> 12,
-- 
2.34.1




[PATCH v5 33/65] kvm/tdx: Don't complain when converting vMMIO region to shared

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

Because vMMIO region needs to be shared region, guest TD may explicitly
convert such region from private to shared.  Don't complain such
conversion.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
 accel/kvm/kvm-all.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c9df41efa484..d533e2611ad8 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2985,9 +2985,22 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, 
bool to_private)
 ret = ram_block_discard_guest_memfd_range(rb, offset, size);
 }
 } else {
-error_report("Convert non guest_memfd backed memory region "
-"(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
-start, size, to_private ? "private" : "shared");
+/*
+ * Because vMMIO region must be shared, guest TD may convert vMMIO
+ * region to shared explicitly.  Don't complain such case.  See
+ * memory_region_type() for checking if the region is MMIO region.
+ */
+if (!to_private &&
+!memory_region_is_ram(mr) &&
+!memory_region_is_ram_device(mr) &&
+!memory_region_is_rom(mr) &&
+!memory_region_is_romd(mr)) {
+   ret = 0;
+} else {
+error_report("Convert non guest_memfd backed memory region "
+"(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
+start, size, to_private ? "private" : "shared");
+}
 }
 
 memory_region_unref(section.mr);
-- 
2.34.1




[PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

Add property "quote-generation-socket" to tdx-guest, which is a property
of type SocketAddress to specify Quote Generation Service(QGS).

On request of GetQuote, it connects to the QGS socket, read request
data from shared guest memory, send the request data to the QGS,
and store the response into shared guest memory, at last notify
TD guest by interrupt.

command line example:
  qemu-system-x86_64 \
-object 
'{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": 
"vsock", "cid":"1","port":"1234"}}' \
-machine confidential-guest-support=tdx0

Note, above example uses vsock type socket because the QGS we used
implements the vsock socket. It can be other types, like UNIX socket,
which depends on the implementation of QGS.

To avoid no response from QGS server, setup a timer for the transaction.
If timeout, make it an error and interrupt guest. Define the threshold of
time to 30s at present, maybe change to other value if not appropriate.

Signed-off-by: Isaku Yamahata 
Codeveloped-by: Chenyi Qiang 
Signed-off-by: Chenyi Qiang 
Codeveloped-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 
---
Changes in v5:
- add more decription of quote-generation-socket property;

Changes in v4:
- merge next patch "i386/tdx: setup a timer for the qio channel";

Changes in v3:
- rename property "quote-generation-service" to "quote-generation-socket";
- change the type of "quote-generation-socket" from str to
  SocketAddress;
- squash next patch into this one;
---
 qapi/qom.json |   8 +-
 target/i386/kvm/meson.build   |   2 +-
 target/i386/kvm/tdx-quote-generator.c | 170 
 target/i386/kvm/tdx-quote-generator.h |  95 +++
 target/i386/kvm/tdx.c | 216 ++
 target/i386/kvm/tdx.h |   6 +
 6 files changed, 495 insertions(+), 2 deletions(-)
 create mode 100644 target/i386/kvm/tdx-quote-generator.c
 create mode 100644 target/i386/kvm/tdx-quote-generator.h

diff --git a/qapi/qom.json b/qapi/qom.json
index cac875349a3a..7b26b0a0d3aa 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -917,13 +917,19 @@
 # (base64 encoded SHA384 digest). (A default value 0 of SHA384 is
 # used when absent).
 #
+# @quote-generation-socket: socket address for Quote Generation
+# Service (QGS).  QGS is a daemon running on the host.  User in
+# TD guest cannot get TD quoting for attestation if QGS is not
+# provided.  So admin should always provide it.
+#
 # Since: 9.0
 ##
 { 'struct': 'TdxGuestProperties',
   'data': { '*sept-ve-disable': 'bool',
 '*mrconfigid': 'str',
 '*mrowner': 'str',
-'*mrownerconfig': 'str' } }
+'*mrownerconfig': 'str',
+'*quote-generation-socket': 'SocketAddress' } }
 
 ##
 # @ThreadContextProperties:
diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
index 460c5f8f85f3..d38b8e8e5fca 100644
--- a/target/i386/kvm/meson.build
+++ b/target/i386/kvm/meson.build
@@ -7,7 +7,7 @@ i386_kvm_ss.add(files(
 
 i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
 
-i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c'), if_false: 
files('tdx-stub.c'))
+i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c', 
'tdx-quote-generator.c'), if_false: files('tdx-stub.c'))
 
 i386_system_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'), 
if_false: files('hyperv-stub.c'))
 
diff --git a/target/i386/kvm/tdx-quote-generator.c 
b/target/i386/kvm/tdx-quote-generator.c
new file mode 100644
index ..057ad09e7e95
--- /dev/null
+++ b/target/i386/kvm/tdx-quote-generator.c
@@ -0,0 +1,170 @@
+/*
+ * QEMU TDX support
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *  Xiaoyao Li 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "qapi/error.h"
+#include "qapi/qapi-visit-sockets.h"
+
+#include "tdx-quote-generator.h"
+
+typedef struct TdxQuoteGeneratorClass {
+DeviceClass parent_class;
+} TdxQuoteGeneratorClass;
+
+OBJECT_DEFINE_TYPE(TdxQuoteGenerator, tdx_quote_generator, 
TDX_QUOTE_GENERATOR, OBJECT)
+
+static void tdx_quote_generator_finalize(Object *obj)
+{
+}
+
+static void tdx_quote_generator_class_init(ObjectClass *oc, void *data)
+{
+}
+
+static void tdx_quote_generator_init(Object *obj)
+{
+}
+
+static void tdx_generate_quote_cleanup(struct tdx_generate_quote_task *task)
+{
+timer_del(>timer);
+
+g_source_remove(task->watch);
+qio_channel_close(QIO_CHANNEL(task->sioc), NULL);
+object_unref(OBJECT(task->sioc));
+
+/* Maintain the number of in-flight requests. */

[PATCH v5 62/65] i386/tdx: Only configure MSR_IA32_UCODE_REV in kvm_init_msrs() for TDs

2024-02-28 Thread Xiaoyao Li
For TDs, only MSR_IA32_UCODE_REV in kvm_init_msrs() can be configured
by VMM, while the features enumerated/controlled by other MSRs except
MSR_IA32_UCODE_REV in kvm_init_msrs() are not under control of VMM.

Only configure MSR_IA32_UCODE_REV for TDs.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/kvm.c | 44 ++-
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 83276e23b19f..d23f94b77257 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -3270,32 +3270,34 @@ static void kvm_init_msrs(X86CPU *cpu)
 CPUX86State *env = >env;
 
 kvm_msr_buf_reset(cpu);
-if (has_msr_arch_capabs) {
-kvm_msr_entry_add(cpu, MSR_IA32_ARCH_CAPABILITIES,
-  env->features[FEAT_ARCH_CAPABILITIES]);
-}
-
-if (has_msr_core_capabs) {
-kvm_msr_entry_add(cpu, MSR_IA32_CORE_CAPABILITY,
-  env->features[FEAT_CORE_CAPABILITY]);
-}
-
-if (has_msr_perf_capabs && cpu->enable_pmu) {
-kvm_msr_entry_add_perf(cpu, env->features);
+
+if (!is_tdx_vm()) {
+if (has_msr_arch_capabs) {
+kvm_msr_entry_add(cpu, MSR_IA32_ARCH_CAPABILITIES,
+env->features[FEAT_ARCH_CAPABILITIES]);
+}
+
+if (has_msr_core_capabs) {
+kvm_msr_entry_add(cpu, MSR_IA32_CORE_CAPABILITY,
+env->features[FEAT_CORE_CAPABILITY]);
+}
+
+if (has_msr_perf_capabs && cpu->enable_pmu) {
+kvm_msr_entry_add_perf(cpu, env->features);
+}
+
+/*
+ * Older kernels do not include VMX MSRs in KVM_GET_MSR_INDEX_LIST, but
+ * all kernels with MSR features should have them.
+ */
+if (kvm_feature_msrs && cpu_has_vmx(env)) {
+kvm_msr_entry_add_vmx(cpu, env->features);
+}
 }
 
 if (has_msr_ucode_rev) {
 kvm_msr_entry_add(cpu, MSR_IA32_UCODE_REV, cpu->ucode_rev);
 }
-
-/*
- * Older kernels do not include VMX MSRs in KVM_GET_MSR_INDEX_LIST, but
- * all kernels with MSR features should have them.
- */
-if (kvm_feature_msrs && cpu_has_vmx(env)) {
-kvm_msr_entry_add_vmx(cpu, env->features);
-}
-
 assert(kvm_buf_set_msrs(cpu) == 0);
 }
 
-- 
2.34.1




[PATCH v5 64/65] i386/tdx: Don't get/put guest state for TDX VMs

2024-02-28 Thread Xiaoyao Li
From: Sean Christopherson 

Don't get/put state of TDX VMs since accessing/mutating guest state of
production TDs is not supported.

Note, it will be allowed for a debug TD. Corresponding support will be
introduced when debug TD support is implemented in the future.

Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/kvm.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 31aed1c9aae0..39113718ea14 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -4620,6 +4620,11 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
 
 assert(cpu_is_stopped(cpu) || qemu_cpu_is_self(cpu));
 
+/* TODO: Allow accessing guest state for debug TDs. */
+if (is_tdx_vm()) {
+return 0;
+}
+
 /*
  * Put MSR_IA32_FEATURE_CONTROL first, this ensures the VM gets out of VMX
  * root operation upon vCPU reset. kvm_put_msr_feature_control() should 
also
@@ -4720,6 +4725,12 @@ int kvm_arch_get_registers(CPUState *cs)
 if (ret < 0) {
 goto out;
 }
+
+/* TODO: Allow accessing guest state for debug TDs. */
+if (is_tdx_vm()) {
+return 0;
+}
+
 ret = kvm_getput_regs(cpu, 0);
 if (ret < 0) {
 goto out;
-- 
2.34.1




[PATCH v5 44/65] i386/tdx: Setup the TD HOB list

2024-02-28 Thread Xiaoyao Li
The TD HOB list is used to pass the information from VMM to TDVF. The TD
HOB must include PHIT HOB and Resource Descriptor HOB. More details can
be found in TDVF specification and PI specification.

Build the TD HOB in TDX's machine_init_done callback.

Co-developed-by: Isaku Yamahata 
Signed-off-by: Isaku Yamahata 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 

---
Changes in v1:
  - drop the code of adding mmio resources since OVMF prepares all the
MMIO hob itself.
---
 hw/i386/meson.build   |   2 +-
 hw/i386/tdvf-hob.c| 147 ++
 hw/i386/tdvf-hob.h|  24 +++
 target/i386/kvm/tdx.c |  16 +
 4 files changed, 188 insertions(+), 1 deletion(-)
 create mode 100644 hw/i386/tdvf-hob.c
 create mode 100644 hw/i386/tdvf-hob.h

diff --git a/hw/i386/meson.build b/hw/i386/meson.build
index f09441c1ea54..2345383940ab 100644
--- a/hw/i386/meson.build
+++ b/hw/i386/meson.build
@@ -28,7 +28,7 @@ i386_ss.add(when: 'CONFIG_PC', if_true: files(
   'port92.c'))
 i386_ss.add(when: 'CONFIG_X86_FW_OVMF', if_true: files('pc_sysfw_ovmf.c'),
 if_false: 
files('pc_sysfw_ovmf-stubs.c'))
-i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c'))
+i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c', 'tdvf-hob.c'))
 
 subdir('kvm')
 subdir('xen')
diff --git a/hw/i386/tdvf-hob.c b/hw/i386/tdvf-hob.c
new file mode 100644
index ..0da6ff2df576
--- /dev/null
+++ b/hw/i386/tdvf-hob.c
@@ -0,0 +1,147 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+
+ * Copyright (c) 2020 Intel Corporation
+ * Author: Isaku Yamahata 
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qemu/error-report.h"
+#include "e820_memory_layout.h"
+#include "hw/i386/pc.h"
+#include "hw/i386/x86.h"
+#include "hw/pci/pcie_host.h"
+#include "sysemu/kvm.h"
+#include "standard-headers/uefi/uefi.h"
+#include "tdvf-hob.h"
+
+typedef struct TdvfHob {
+hwaddr hob_addr;
+void *ptr;
+int size;
+
+/* working area */
+void *current;
+void *end;
+} TdvfHob;
+
+static uint64_t tdvf_current_guest_addr(const TdvfHob *hob)
+{
+return hob->hob_addr + (hob->current - hob->ptr);
+}
+
+static void tdvf_align(TdvfHob *hob, size_t align)
+{
+hob->current = QEMU_ALIGN_PTR_UP(hob->current, align);
+}
+
+static void *tdvf_get_area(TdvfHob *hob, uint64_t size)
+{
+void *ret;
+
+if (hob->current + size > hob->end) {
+error_report("TD_HOB overrun, size = 0x%" PRIx64, size);
+exit(1);
+}
+
+ret = hob->current;
+hob->current += size;
+tdvf_align(hob, 8);
+return ret;
+}
+
+static void tdvf_hob_add_memory_resources(TdxGuest *tdx, TdvfHob *hob)
+{
+EFI_HOB_RESOURCE_DESCRIPTOR *region;
+EFI_RESOURCE_ATTRIBUTE_TYPE attr;
+EFI_RESOURCE_TYPE resource_type;
+
+TdxRamEntry *e;
+int i;
+
+for (i = 0; i < tdx->nr_ram_entries; i++) {
+e = >ram_entries[i];
+
+if (e->type == TDX_RAM_UNACCEPTED) {
+resource_type = EFI_RESOURCE_MEMORY_UNACCEPTED;
+attr = EFI_RESOURCE_ATTRIBUTE_TDVF_UNACCEPTED;
+} else if (e->type == TDX_RAM_ADDED){
+resource_type = EFI_RESOURCE_SYSTEM_MEMORY;
+attr = EFI_RESOURCE_ATTRIBUTE_TDVF_PRIVATE;
+} else {
+error_report("unknown TDX_RAM_ENTRY type %d", e->type);
+exit(1);
+}
+
+region = tdvf_get_area(hob, sizeof(*region));
+*region = (EFI_HOB_RESOURCE_DESCRIPTOR) {
+.Header = {
+.HobType = EFI_HOB_TYPE_RESOURCE_DESCRIPTOR,
+.HobLength = cpu_to_le16(sizeof(*region)),
+.Reserved = cpu_to_le32(0),
+},
+.Owner = EFI_HOB_OWNER_ZERO,
+.ResourceType = cpu_to_le32(resource_type),
+.ResourceAttribute = cpu_to_le32(attr),
+.PhysicalStart = cpu_to_le64(e->address),
+.ResourceLength = cpu_to_le64(e->length),
+};
+}
+}
+
+void tdvf_hob_create(TdxGuest *tdx, TdxFirmwareEntry *td_hob

[PATCH v5 61/65] i386/tdx: Don't synchronize guest tsc for TDs

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

TSC of TDs is not accessible and KVM doesn't allow access of
MSR_IA32_TSC for TDs. To avoid the assert() in kvm_get_tsc, make
kvm_synchronize_all_tsc() noop for TDs,

Signed-off-by: Isaku Yamahata 
Reviewed-by: Connor Kuehl 
Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index b1b0384b0c5c..83276e23b19f 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -277,7 +277,7 @@ void kvm_synchronize_all_tsc(void)
 {
 CPUState *cpu;
 
-if (kvm_enabled()) {
+if (kvm_enabled() && !is_tdx_vm()) {
 CPU_FOREACH(cpu) {
 run_on_cpu(cpu, do_kvm_synchronize_tsc, RUN_ON_CPU_NULL);
 }
-- 
2.34.1




[PATCH v5 41/65] i386/tdx: Track mem_ptr for each firmware entry of TDVF

2024-02-28 Thread Xiaoyao Li
For each TDVF sections, QEMU needs to copy the content to guest
private memory via KVM API (KVM_TDX_INIT_MEM_REGION).

Introduce a field @mem_ptr for TdxFirmwareEntry to track the memory
pointer of each TDVF sections. So that QEMU can add/copy them to guest
private memory later.

TDVF sections can be classified into two groups:
 - Firmware itself, e.g., TDVF BFV and CFV, that located separately from
   guest RAM. Its memory pointer is the bios pointer.

 - Sections located at guest RAM, e.g., TEMP_MEM and TD_HOB.
   mmap a new memory range for them.

Register a machine_init_done callback to do the stuff.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 hw/i386/tdvf.c |  1 +
 include/hw/i386/tdvf.h |  7 +++
 target/i386/kvm/tdx.c  | 31 +++
 3 files changed, 39 insertions(+)

diff --git a/hw/i386/tdvf.c b/hw/i386/tdvf.c
index ff51f40088f0..0a6445705160 100644
--- a/hw/i386/tdvf.c
+++ b/hw/i386/tdvf.c
@@ -189,6 +189,7 @@ int tdvf_parse_metadata(TdxFirmware *fw, void *flash_ptr, 
int size)
 }
 g_free(sections);
 
+fw->mem_ptr = flash_ptr;
 return 0;
 
 err:
diff --git a/include/hw/i386/tdvf.h b/include/hw/i386/tdvf.h
index 593341eb2e93..d880af245a73 100644
--- a/include/hw/i386/tdvf.h
+++ b/include/hw/i386/tdvf.h
@@ -39,13 +39,20 @@ typedef struct TdxFirmwareEntry {
 uint64_t size;
 uint32_t type;
 uint32_t attributes;
+
+void *mem_ptr;
 } TdxFirmwareEntry;
 
 typedef struct TdxFirmware {
+void *mem_ptr;
+
 uint32_t nr_entries;
 TdxFirmwareEntry *entries;
 } TdxFirmware;
 
+#define for_each_tdx_fw_entry(fw, e)\
+for (e = (fw)->entries; e != (fw)->entries + (fw)->nr_entries; e++)
+
 int tdvf_parse_metadata(TdxFirmware *fw, void *flash_ptr, int size);
 
 #endif /* HW_I386_TDVF_H */
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 2bb6e9e9c392..1b4bca9cc3cd 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -14,6 +14,7 @@
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
 #include "qemu/base64.h"
+#include "qemu/mmap-alloc.h"
 #include "qapi/error.h"
 #include "qom/object_interfaces.h"
 #include "standard-headers/asm-x86/kvm_para.h"
@@ -22,6 +23,7 @@
 #include "exec/ramblock.h"
 
 #include "hw/i386/x86.h"
+#include "hw/i386/tdvf.h"
 #include "kvm_i386.h"
 #include "tdx.h"
 #include "../cpu-internal.h"
@@ -470,6 +472,33 @@ void tdx_set_tdvf_region(MemoryRegion *tdvf_mr)
 tdx_guest->tdvf_mr = tdvf_mr;
 }
 
+static void tdx_finalize_vm(Notifier *notifier, void *unused)
+{
+TdxFirmware *tdvf = _guest->tdvf;
+TdxFirmwareEntry *entry;
+
+for_each_tdx_fw_entry(tdvf, entry) {
+switch (entry->type) {
+case TDVF_SECTION_TYPE_BFV:
+case TDVF_SECTION_TYPE_CFV:
+entry->mem_ptr = tdvf->mem_ptr + entry->data_offset;
+break;
+case TDVF_SECTION_TYPE_TD_HOB:
+case TDVF_SECTION_TYPE_TEMP_MEM:
+entry->mem_ptr = qemu_ram_mmap(-1, entry->size,
+   qemu_real_host_page_size(), 0, 0);
+break;
+default:
+error_report("Unsupported TDVF section %d", entry->type);
+exit(1);
+}
+}
+}
+
+static Notifier tdx_machine_done_notify = {
+.notify = tdx_finalize_vm,
+};
+
 static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
 {
 MachineState *ms = MACHINE(qdev_get_machine());
@@ -496,6 +525,8 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
  */
 kvm_readonly_mem_allowed = false;
 
+qemu_add_machine_init_done_notifier(_machine_done_notify);
+
 tdx_guest = tdx;
 return 0;
 }
-- 
2.34.1




[PATCH v5 60/65] hw/i386: add option to forcibly report edge trigger in acpi tables

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

When level trigger isn't supported on x86 platform,
forcibly report edge trigger in acpi tables.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 hw/i386/acpi-build.c  | 99 ---
 hw/i386/acpi-common.c | 45 +++-
 2 files changed, 101 insertions(+), 43 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 15242b9096b5..8b493a2e9605 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -964,7 +964,8 @@ static void build_dbg_aml(Aml *table)
 aml_append(table, scope);
 }
 
-static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg)
+static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg,
+   bool level_trigger_unsupported)
 {
 Aml *dev;
 Aml *crs;
@@ -976,7 +977,10 @@ static Aml *build_link_dev(const char *name, uint8_t uid, 
Aml *reg)
 aml_append(dev, aml_name_decl("_UID", aml_int(uid)));
 
 crs = aml_resource_template();
-aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL, AML_ACTIVE_HIGH,
+aml_append(crs, aml_interrupt(AML_CONSUMER,
+  level_trigger_unsupported ?
+  AML_EDGE : AML_LEVEL,
+  AML_ACTIVE_HIGH,
   AML_SHARED, irqs, ARRAY_SIZE(irqs)));
 aml_append(dev, aml_name_decl("_PRS", crs));
 
@@ -1000,7 +1004,8 @@ static Aml *build_link_dev(const char *name, uint8_t uid, 
Aml *reg)
 return dev;
  }
 
-static Aml *build_gsi_link_dev(const char *name, uint8_t uid, uint8_t gsi)
+static Aml *build_gsi_link_dev(const char *name, uint8_t uid,
+   uint8_t gsi, bool level_trigger_unsupported)
 {
 Aml *dev;
 Aml *crs;
@@ -1013,7 +1018,10 @@ static Aml *build_gsi_link_dev(const char *name, uint8_t 
uid, uint8_t gsi)
 
 crs = aml_resource_template();
 irqs = gsi;
-aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL, AML_ACTIVE_HIGH,
+aml_append(crs, aml_interrupt(AML_CONSUMER,
+  level_trigger_unsupported ?
+  AML_EDGE : AML_LEVEL,
+  AML_ACTIVE_HIGH,
   AML_SHARED, , 1));
 aml_append(dev, aml_name_decl("_PRS", crs));
 
@@ -1032,7 +1040,7 @@ static Aml *build_gsi_link_dev(const char *name, uint8_t 
uid, uint8_t gsi)
 }
 
 /* _CRS method - get current settings */
-static Aml *build_iqcr_method(bool is_piix4)
+static Aml *build_iqcr_method(bool is_piix4, bool level_trigger_unsupported)
 {
 Aml *if_ctx;
 uint32_t irqs;
@@ -1040,7 +1048,9 @@ static Aml *build_iqcr_method(bool is_piix4)
 Aml *crs = aml_resource_template();
 
 irqs = 0;
-aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL,
+aml_append(crs, aml_interrupt(AML_CONSUMER,
+  level_trigger_unsupported ?
+  AML_EDGE : AML_LEVEL,
   AML_ACTIVE_HIGH, AML_SHARED, , 1));
 aml_append(method, aml_name_decl("PRR0", crs));
 
@@ -1074,7 +1084,7 @@ static Aml *build_irq_status_method(void)
 return method;
 }
 
-static void build_piix4_pci0_int(Aml *table)
+static void build_piix4_pci0_int(Aml *table, bool level_trigger_unsupported)
 {
 Aml *dev;
 Aml *crs;
@@ -1087,12 +1097,16 @@ static void build_piix4_pci0_int(Aml *table)
 aml_append(sb_scope, pci0_scope);
 
 aml_append(sb_scope, build_irq_status_method());
-aml_append(sb_scope, build_iqcr_method(true));
+aml_append(sb_scope, build_iqcr_method(true, level_trigger_unsupported));
 
-aml_append(sb_scope, build_link_dev("LNKA", 0, aml_name("PRQ0")));
-aml_append(sb_scope, build_link_dev("LNKB", 1, aml_name("PRQ1")));
-aml_append(sb_scope, build_link_dev("LNKC", 2, aml_name("PRQ2")));
-aml_append(sb_scope, build_link_dev("LNKD", 3, aml_name("PRQ3")));
+aml_append(sb_scope, build_link_dev("LNKA", 0, aml_name("PRQ0"),
+level_trigger_unsupported));
+aml_append(sb_scope, build_link_dev("LNKB", 1, aml_name("PRQ1"),
+level_trigger_unsupported));
+aml_append(sb_scope, build_link_dev("LNKC", 2, aml_name("PRQ2"),
+level_trigger_unsupported));
+aml_append(sb_scope, build_link_dev("LNKD", 3, aml_name("PRQ3"),
+level_trigger_unsupported));
 
 dev = aml_device("LNKS");
 {
@@ -1101,7 +1115,9 @@ static void build_piix4_pci0_int(Aml *table)
 
 crs = aml_resource_template();
 irqs = 9;
-aml_append(crs, aml_inte

[PATCH v5 56/65] i386/tdx: Disable PIC for TDX VMs

2024-02-28 Thread Xiaoyao Li
Legacy PIC (8259) cannot be supported for TDX VMs since TDX module
doesn't allow directly interrupt injection.  Using posted interrupts
for the PIC is not a viable option as the guest BIOS/kernel will not
do EOI for PIC IRQs, i.e. will leave the vIRR bit set.

Hence disable PIC for TDX VMs and error out if user wants PIC.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/tdx.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index c3fadbc5c58e..0225a9b79b36 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -720,6 +720,13 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 return -EINVAL;
 }
 
+if (x86ms->pic == ON_OFF_AUTO_AUTO) {
+x86ms->pic = ON_OFF_AUTO_OFF;
+} else if (x86ms->pic == ON_OFF_AUTO_ON) {
+error_setg(errp, "TDX VM doesn't support PIC");
+return -EINVAL;
+}
+
 if (!tdx_caps) {
 r = get_tdx_capabilities(errp);
 if (r) {
-- 
2.34.1




[PATCH v5 57/65] i386/tdx: Don't allow system reset for TDX VMs

2024-02-28 Thread Xiaoyao Li
TDX CPU state is protected and thus vcpu state cann't be reset by VMM.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 2748086231d5..b1b0384b0c5c 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5673,7 +5673,7 @@ bool kvm_has_waitpkg(void)
 
 bool kvm_arch_cpu_check_are_resettable(void)
 {
-return !sev_es_enabled();
+return !sev_es_enabled() && !is_tdx_vm();
 }
 
 #define ARCH_REQ_XCOMP_GUEST_PERM   0x1025
-- 
2.34.1




[PATCH v5 43/65] headers: Add definitions from UEFI spec for volumes, resources, etc...

2024-02-28 Thread Xiaoyao Li
Add UEFI definitions for literals, enums, structs, GUIDs, etc... that
will be used by TDX to build the UEFI Hand-Off Block (HOB) that is passed
to the Trusted Domain Virtual Firmware (TDVF).

All values come from the UEFI specification [1], PI spec [2] and TDVF
design guide[3].

[1] UEFI Specification v2.1.0 
https://uefi.org/sites/default/files/resources/UEFI_Spec_2_10_Aug29.pdf
[2] UEFI PI spec v1.8 
https://uefi.org/sites/default/files/resources/UEFI_PI_Spec_1_8_March3.pdf
[3] 
https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 include/standard-headers/uefi/uefi.h | 198 +++
 1 file changed, 198 insertions(+)
 create mode 100644 include/standard-headers/uefi/uefi.h

diff --git a/include/standard-headers/uefi/uefi.h 
b/include/standard-headers/uefi/uefi.h
new file mode 100644
index ..b15aba796156
--- /dev/null
+++ b/include/standard-headers/uefi/uefi.h
@@ -0,0 +1,198 @@
+/*
+ * Copyright (C) 2020 Intel Corporation
+ *
+ * Author: Isaku Yamahata 
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ *
+ */
+
+#ifndef HW_I386_UEFI_H
+#define HW_I386_UEFI_H
+
+/***/
+/*
+ * basic EFI definitions
+ * supplemented with UEFI Specification Version 2.8 (Errata A)
+ * released February 2020
+ */
+/* UEFI integer is little endian */
+
+typedef struct {
+uint32_t Data1;
+uint16_t Data2;
+uint16_t Data3;
+uint8_t Data4[8];
+} EFI_GUID;
+
+typedef enum {
+EfiReservedMemoryType,
+EfiLoaderCode,
+EfiLoaderData,
+EfiBootServicesCode,
+EfiBootServicesData,
+EfiRuntimeServicesCode,
+EfiRuntimeServicesData,
+EfiConventionalMemory,
+EfiUnusableMemory,
+EfiACPIReclaimMemory,
+EfiACPIMemoryNVS,
+EfiMemoryMappedIO,
+EfiMemoryMappedIOPortSpace,
+EfiPalCode,
+EfiPersistentMemory,
+EfiUnacceptedMemoryType,
+EfiMaxMemoryType
+} EFI_MEMORY_TYPE;
+
+#define EFI_HOB_HANDOFF_TABLE_VERSION 0x0009
+
+#define EFI_HOB_TYPE_HANDOFF  0x0001
+#define EFI_HOB_TYPE_MEMORY_ALLOCATION0x0002
+#define EFI_HOB_TYPE_RESOURCE_DESCRIPTOR  0x0003
+#define EFI_HOB_TYPE_GUID_EXTENSION   0x0004
+#define EFI_HOB_TYPE_FV   0x0005
+#define EFI_HOB_TYPE_CPU  0x0006
+#define EFI_HOB_TYPE_MEMORY_POOL  0x0007
+#define EFI_HOB_TYPE_FV2  0x0009
+#define EFI_HOB_TYPE_LOAD_PEIM_UNUSED 0x000A
+#define EFI_HOB_TYPE_UEFI_CAPSULE 0x000B
+#define EFI_HOB_TYPE_FV3  0x000C
+#define EFI_HOB_TYPE_UNUSED   0xFFFE
+#define EFI_HOB_TYPE_END_OF_HOB_LIST  0x
+
+typedef struct {
+uint16_t HobType;
+uint16_t HobLength;
+uint32_t Reserved;
+} EFI_HOB_GENERIC_HEADER;
+
+typedef uint64_t EFI_PHYSICAL_ADDRESS;
+typedef uint32_t EFI_BOOT_MODE;
+
+typedef struct {
+EFI_HOB_GENERIC_HEADER Header;
+uint32_t Version;
+EFI_BOOT_MODE BootMode;
+EFI_PHYSICAL_ADDRESS EfiMemoryTop;
+EFI_PHYSICAL_ADDRESS EfiMemoryBottom;
+EFI_PHYSICAL_ADDRESS EfiFreeMemoryTop;
+EFI_PHYSICAL_ADDRESS EfiFreeMemoryBottom;
+EFI_PHYSICAL_ADDRESS EfiEndOfHobList;
+} EFI_HOB_HANDOFF_INFO_TABLE;
+
+#define EFI_RESOURCE_SYSTEM_MEMORY  0x
+#define EFI_RESOURCE_MEMORY_MAPPED_IO   0x0001
+#define EFI_RESOURCE_IO 0x0002
+#define EFI_RESOURCE_FIRMWARE_DEVICE0x0003
+#define EFI_RESOURCE_MEMORY_MAPPED_IO_PORT  0x0004
+#define EFI_RESOURCE_MEMORY_RESERVED0x0005
+#define EFI_RESOURCE_IO_RESERVED0x0006
+#define EFI_RESOURCE_MEMORY_UNACCEPTED  0x0007
+#define EFI_RESOURCE_MAX_MEMORY_TYPE0x0008
+
+#define EFI_RESOURCE_ATTRIBUTE_PRESENT  0x0001
+#define EFI_RESOURCE_ATTRIBUTE_INITIALIZED  0x0002
+#define EFI_RESOURCE_ATTRIBUTE_TESTED   0x0004
+#define EFI_RESOURCE_ATTRIBUTE_SINGLE_BIT_ECC   0x0008
+#define EFI_RESOURCE_ATTRIBUTE_MULTIPLE_BIT_ECC 0x0010
+#define EFI_RESOURCE_ATTRIBUTE_ECC_RESERVED_1   0x0020
+#define EFI_RESOURCE_ATTRIBUTE_ECC_RESERVED_2   0x0040
+#define EFI_RESOURCE_ATTRIBUTE_READ_PROTECTED   0x00

[PATCH v5 45/65] i386/tdx: Populate TDVF private memory via KVM_MEMORY_MAPPING

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

TDVF firmware (CODE and VARS) needs to be copied to TD's private
memory, as well as TD HOB and TEMP memory.

If the TDVF section has TDVF_SECTION_ATTRIBUTES_MR_EXTEND set in the
flag, calling KVM_TDX_EXTEND_MEMORY to extend the measurement.

After populating the TDVF memory, the original image located in shared
ramblock can be discarded.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 

---
Changes in v1:
  - rename variable @metadata to @flags
---
 target/i386/kvm/tdx.c | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index fb9c60172fde..dcabe359eda5 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -595,6 +595,8 @@ static void tdx_finalize_vm(Notifier *notifier, void 
*unused)
 {
 TdxFirmware *tdvf = _guest->tdvf;
 TdxFirmwareEntry *entry;
+RAMBlock *ram_block;
+int r;
 
 tdx_init_ram_entries();
 
@@ -620,6 +622,51 @@ static void tdx_finalize_vm(Notifier *notifier, void 
*unused)
   sizeof(TdxRamEntry), _ram_entry_compare);
 
 tdvf_hob_create(tdx_guest, tdx_get_hob_entry(tdx_guest));
+
+for_each_tdx_fw_entry(tdvf, entry) {
+struct kvm_memory_mapping mapping = {
+.base_gfn = entry->address >> 12,
+.nr_pages = entry->size >> 12,
+.source = (__u64)entry->mem_ptr,
+};
+
+do {
+r = kvm_vcpu_ioctl(first_cpu, KVM_MEMORY_MAPPING, );
+} while (r == -EAGAIN);
+
+if (r < 0) {
+ error_report("KVM_MEMORY_MAPPING failed %s", strerror(-r));
+ exit(1);
+}
+
+if (entry->attributes & TDVF_SECTION_ATTRIBUTES_MR_EXTEND) {
+mapping = (struct kvm_memory_mapping) {
+.base_gfn = entry->address >> 12,
+.nr_pages = entry->size >> 12,
+};
+
+do {
+r = tdx_vm_ioctl(KVM_TDX_EXTEND_MEMORY, 0, );
+} while (r == -EAGAIN);
+if (r < 0) {
+error_report("KVM_TDX_EXTEND_MEMORY failed %s", strerror(-r));
+exit(1);
+}
+}
+
+if (entry->type == TDVF_SECTION_TYPE_TD_HOB ||
+entry->type == TDVF_SECTION_TYPE_TEMP_MEM) {
+qemu_ram_munmap(-1, entry->mem_ptr, entry->size);
+entry->mem_ptr = NULL;
+}
+}
+
+/*
+ * TDVF image has been copied into private region above via
+ * KVM_MEMORY_MAPPING. It becomes useless.
+ */
+ram_block = tdx_guest->tdvf_mr->ram_block;
+ram_block_discard_range(ram_block, 0, ram_block->max_length);
 }
 
 static Notifier tdx_machine_done_notify = {
-- 
2.34.1




[PATCH v5 31/65] i386/tdx: Implement user specified tsc frequency

2024-02-28 Thread Xiaoyao Li
Reuse "-cpu,tsc-frequency=" to get user wanted tsc frequency and call VM
scope VM_SET_TSC_KHZ to set the tsc frequency of TD before KVM_TDX_INIT_VM.

Besides, sanity check the tsc frequency to be in the legal range and
legal granularity (required by TDX module).

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
Changes in v3:
- use @errp to report error info; (Daniel)

Changes in v1:
- Use VM scope VM_SET_TSC_KHZ to set the TSC frequency of TD since KVM
  side drop the @tsc_khz field in struct kvm_tdx_init_vm
---
 target/i386/kvm/kvm.c |  9 +
 target/i386/kvm/tdx.c | 25 +
 2 files changed, 34 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 1664ac49005e..4f998b2d6d37 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -793,6 +793,15 @@ static int kvm_arch_set_tsc_khz(CPUState *cs)
 int r, cur_freq;
 bool set_ioctl = false;
 
+/*
+ * TSC of TD vcpu is immutable, it cannot be set/changed via vcpu scope
+ * VM_SET_TSC_KHZ, but only be initialized via VM scope VM_SET_TSC_KHZ
+ * before ioctl KVM_TDX_INIT_VM in tdx_pre_create_vcpu()
+ */
+if (is_tdx_vm()) {
+return 0;
+}
+
 if (!env->tsc_khz) {
 return 0;
 }
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 4ce2f1d082ce..42dbb5ce5c15 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -33,6 +33,9 @@
  (1U << KVM_FEATURE_PV_SCHED_YIELD) | \
  (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
+#define TDX_MIN_TSC_FREQUENCY_KHZ   (100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ   (10 * 1000 * 1000)
+
 #define TDX_TD_ATTRIBUTES_DEBUG BIT_ULL(0)
 #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE   BIT_ULL(28)
 #define TDX_TD_ATTRIBUTES_PKS   BIT_ULL(30)
@@ -568,6 +571,28 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 return r;
 }
 
+if (env->tsc_khz && (env->tsc_khz < TDX_MIN_TSC_FREQUENCY_KHZ ||
+ env->tsc_khz > TDX_MAX_TSC_FREQUENCY_KHZ)) {
+error_setg(errp, "Invalid TSC %ld KHz, must specify cpu_frequency 
between [%d, %d] kHz",
+   env->tsc_khz, TDX_MIN_TSC_FREQUENCY_KHZ,
+   TDX_MAX_TSC_FREQUENCY_KHZ);
+   return -EINVAL;
+}
+
+if (env->tsc_khz % (25 * 1000)) {
+error_setg(errp, "Invalid TSC %ld KHz, it must be multiple of 25MHz",
+   env->tsc_khz);
+return -EINVAL;
+}
+
+/* it's safe even env->tsc_khz is 0. KVM uses host's tsc_khz in this case 
*/
+r = kvm_vm_ioctl(kvm_state, KVM_SET_TSC_KHZ, env->tsc_khz);
+if (r < 0) {
+error_setg_errno(errp, -r, "Unable to set TSC frequency to %" PRId64 " 
kHz",
+ env->tsc_khz);
+return r;
+}
+
 r = setup_td_guest_attributes(x86cpu, errp);
 if (r) {
 return r;
-- 
2.34.1




[PATCH v5 48/65] i386/tdx: handle TDG.VP.VMCALL

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

For SetupEventNotifyInterrupt, record interrupt vector and the apic id
of the vcpu that received this TDVMCALL.

Later it can inject interrupt with given vector to the specific vcpu
that received SetupEventNotifyInterrupt.

Signed-off-by: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/kvm.c  |  8 ++
 target/i386/kvm/tdx-stub.c |  5 
 target/i386/kvm/tdx.c  | 53 ++
 target/i386/kvm/tdx.h  | 14 ++
 4 files changed, 80 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 4f998b2d6d37..2748086231d5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5413,6 +5413,14 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run 
*run)
 ret = kvm_xen_handle_exit(cpu, >xen);
 break;
 #endif
+case KVM_EXIT_TDX:
+if (!is_tdx_vm()) {
+error_report("KVM: get KVM_EXIT_TDX for a non-TDX VM.");
+ret = -1;
+break;
+}
+ret = tdx_handle_exit(cpu, >tdx);
+break;
 default:
 fprintf(stderr, "KVM: unknown exit reason %d\n", run->exit_reason);
 ret = -1;
diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c
index a064d583d393..57cd25793842 100644
--- a/target/i386/kvm/tdx-stub.c
+++ b/target/i386/kvm/tdx-stub.c
@@ -11,3 +11,8 @@ int tdx_parse_tdvf(void *flash_ptr, int size)
 {
 return -EINVAL;
 }
+
+int tdx_handle_exit(X86CPU *cpu, struct kvm_tdx_exit *tdx_exit)
+{
+return -EINVAL;
+}
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d445d4b70f77..49f94d9d46f4 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -866,6 +866,56 @@ int tdx_parse_tdvf(void *flash_ptr, int size)
 return tdvf_parse_metadata(_guest->tdvf, flash_ptr, size);
 }
 
+static int tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
+   struct kvm_tdx_vmcall 
*vmcall)
+{
+int vector = vmcall->in_r12;
+
+if (32 <= vector && vector <= 255) {
+qemu_mutex_lock(_guest->lock);
+tdx_guest->event_notify_vector = vector;
+tdx_guest->event_notify_apicid = cpu->apic_id;
+qemu_mutex_unlock(_guest->lock);
+vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
+} else {
+vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
+}
+
+return 0;
+}
+
+static int tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
+{
+vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
+
+/* For now handle only TDG.VP.VMCALL leaf defined in TDX GHCI */
+if (vmcall->type != 0) {
+error_report("Unknown TDG.VP.VMCALL type 0x%llx subfunction 0x%llx",
+ vmcall->type, vmcall->subfunction);
+return -1;
+}
+
+switch (vmcall->subfunction) {
+case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
+return tdx_handle_setup_event_notify_interrupt(cpu, vmcall);
+default:
+error_report("Unknown TDG.VP.VMCALL type 0x%llx subfunction 0x%llx",
+ vmcall->type, vmcall->subfunction);
+return -1;
+}
+}
+
+int tdx_handle_exit(X86CPU *cpu, struct kvm_tdx_exit *tdx_exit)
+{
+switch (tdx_exit->type) {
+case KVM_EXIT_TDX_VMCALL:
+return tdx_handle_vmcall(cpu, _exit->u.vmcall);
+default:
+error_report("unknown tdx exit type 0x%x", tdx_exit->type);
+return -1;
+}
+}
+
 static bool tdx_guest_get_sept_ve_disable(Object *obj, Error **errp)
 {
 TdxGuest *tdx = TDX_GUEST(obj);
@@ -956,6 +1006,9 @@ static void tdx_guest_init(Object *obj)
 object_property_add_str(obj, "mrownerconfig",
 tdx_guest_get_mrownerconfig,
 tdx_guest_set_mrownerconfig);
+
+tdx->event_notify_vector = -1;
+tdx->event_notify_apicid = -1;
 }
 
 static void tdx_guest_finalize(Object *obj)
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 3fb4069268f6..b3d4462fe718 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -7,6 +7,7 @@
 
 #include "exec/confidential-guest-support.h"
 #include "hw/i386/tdvf.h"
+#include "sysemu/kvm.h"
 
 #define TYPE_TDX_GUEST "tdx-guest"
 #define TDX_GUEST(obj)  OBJECT_CHECK(TdxGuest, (obj), TYPE_TDX_GUEST)
@@ -15,6 +16,14 @@ typedef struct TdxGuestClass {
 ConfidentialGuestSupportClass parent_class;
 } TdxGuestClass;
 
+#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT  0x10004ULL
+
+#define TDG_VP_VMCALL_SUCCESS   0xULL
+#define TDG_VP_VMCALL_RETRY 0x0001ULL
+#define TDG_VP_VMCALL_INVALID_OPERAND   0x8000ULL
+#define TDG_VP_VMCALL_GPA_INUSE 0x8001ULL
+#define TDG_VP_VMCALL_ALIGN_ERROR   0x8

[PATCH v5 42/65] i386/tdx: Track RAM entries for TDX VM

2024-02-28 Thread Xiaoyao Li
The RAM of TDX VM can be classified into two types:

 - TDX_RAM_UNACCEPTED: default type of TDX memory, which needs to be
   accepted by TDX guest before it can be used and will be all-zeros
   after being accepted.

 - TDX_RAM_ADDED: the RAM that is ADD'ed to TD guest before running, and
   can be used directly. E.g., TD HOB and TEMP MEM that needed by TDVF.

Maintain TdxRamEntries[] which grabs the initial RAM info from e820 table
and mark each RAM range as default type TDX_RAM_UNACCEPTED.

Then turn the range of TD HOB and TEMP MEM to TDX_RAM_ADDED since these
ranges will be ADD'ed before TD runs and no need to be accepted runtime.

The TdxRamEntries[] are later used to setup the memory TD resource HOB
that passes memory info from QEMU to TDVF.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
Changes in v3:
- use enum TdxRamType in struct TdxRamEntry; (Isaku)
- Fix the indention; (Daniel)

Changes in v1:
  - simplify the algorithm of tdx_accept_ram_range() (Suggested-by: Gerd 
Hoffman)
(1) Change the existing entry to cover the accepted ram range.
(2) If there is room before the accepted ram range add a
TDX_RAM_UNACCEPTED entry for that.
(3) If there is room after the accepted ram range add a
TDX_RAM_UNACCEPTED entry for that.
---
 target/i386/kvm/tdx.c | 111 ++
 target/i386/kvm/tdx.h |  14 ++
 2 files changed, 125 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 1b4bca9cc3cd..98b2cfd40651 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -22,6 +22,7 @@
 #include "sysemu/sysemu.h"
 #include "exec/ramblock.h"
 
+#include "hw/i386/e820_memory_layout.h"
 #include "hw/i386/x86.h"
 #include "hw/i386/tdvf.h"
 #include "kvm_i386.h"
@@ -472,11 +473,117 @@ void tdx_set_tdvf_region(MemoryRegion *tdvf_mr)
 tdx_guest->tdvf_mr = tdvf_mr;
 }
 
+static void tdx_add_ram_entry(uint64_t address, uint64_t length,
+  enum TdxRamType type)
+{
+uint32_t nr_entries = tdx_guest->nr_ram_entries;
+tdx_guest->ram_entries = g_renew(TdxRamEntry, tdx_guest->ram_entries,
+ nr_entries + 1);
+
+tdx_guest->ram_entries[nr_entries].address = address;
+tdx_guest->ram_entries[nr_entries].length = length;
+tdx_guest->ram_entries[nr_entries].type = type;
+tdx_guest->nr_ram_entries++;
+}
+
+static int tdx_accept_ram_range(uint64_t address, uint64_t length)
+{
+uint64_t head_start, tail_start, head_length, tail_length;
+uint64_t tmp_address, tmp_length;
+TdxRamEntry *e;
+int i;
+
+for (i = 0; i < tdx_guest->nr_ram_entries; i++) {
+e = _guest->ram_entries[i];
+
+if (address + length <= e->address ||
+e->address + e->length <= address) {
+continue;
+}
+
+/*
+ * The to-be-accepted ram range must be fully contained by one
+ * RAM entry.
+ */
+if (e->address > address ||
+e->address + e->length < address + length) {
+return -EINVAL;
+}
+
+if (e->type == TDX_RAM_ADDED) {
+return -EINVAL;
+}
+
+break;
+}
+
+if (i == tdx_guest->nr_ram_entries) {
+return -1;
+}
+
+tmp_address = e->address;
+tmp_length = e->length;
+
+e->address = address;
+e->length = length;
+e->type = TDX_RAM_ADDED;
+
+head_length = address - tmp_address;
+if (head_length > 0) {
+head_start = tmp_address;
+tdx_add_ram_entry(head_start, head_length, TDX_RAM_UNACCEPTED);
+}
+
+tail_start = address + length;
+if (tail_start < tmp_address + tmp_length) {
+tail_length = tmp_address + tmp_length - tail_start;
+tdx_add_ram_entry(tail_start, tail_length, TDX_RAM_UNACCEPTED);
+}
+
+return 0;
+}
+
+static int tdx_ram_entry_compare(const void *lhs_, const void* rhs_)
+{
+const TdxRamEntry *lhs = lhs_;
+const TdxRamEntry *rhs = rhs_;
+
+if (lhs->address == rhs->address) {
+return 0;
+}
+if (le64_to_cpu(lhs->address) > le64_to_cpu(rhs->address)) {
+return 1;
+}
+return -1;
+}
+
+static void tdx_init_ram_entries(void)
+{
+unsigned i, j, nr_e820_entries;
+
+nr_e820_entries = e820_get_num_entries();
+tdx_guest->ram_entries = g_new(TdxRamEntry, nr_e820_entries);
+
+for (i = 0, j = 0; i < nr_e820_entries; i++) {
+uint64_t addr, len;
+
+if (e820_get_entry(i, E820_RAM, , )) {
+tdx_guest->ram_entries[j].address = addr;
+tdx_guest->ram_entries[j].length = len;
+tdx_guest->ram_entries[j].type = TDX_RAM_UNACCEPTED;
+j++;
+}
+}
+tdx_guest->nr_ram_entries = j;
+}
+
 static void tdx_finalize_vm(Notifier *not

[PATCH v5 58/65] i386/tdx: LMCE is not supported for TDX

2024-02-28 Thread Xiaoyao Li
LMCE is not supported TDX since KVM doesn't provide emulation for
MSR_IA32_FEAT_CTL.

Signed-off-by: Xiaoyao Li 
---
 target/i386/kvm/kvm-cpu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index 9c791b7b0520..8c618869533c 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -15,6 +15,7 @@
 #include "sysemu/sysemu.h"
 #include "hw/boards.h"
 
+#include "tdx.h"
 #include "kvm_i386.h"
 #include "hw/core/accel-cpu.h"
 
@@ -60,6 +61,10 @@ static bool lmce_supported(void)
 if (kvm_ioctl(kvm_state, KVM_X86_GET_MCE_CAP_SUPPORTED, _cap) < 0) {
 return false;
 }
+
+if (is_tdx_vm())
+return false;
+
 return !!(mce_cap & MCG_LMCE_P);
 }
 
-- 
2.34.1




[PATCH v5 54/65] q35: Introduce smm_ranges property for q35-pci-host

2024-02-28 Thread Xiaoyao Li
From: Isaku Yamahata 

Add a q35 property to check whether or not SMM ranges, e.g. SMRAM, TSEG,
etc... exist for the target platform.  TDX doesn't support SMM and doesn't
play nice with QEMU modifying related guest memory ranges.

Signed-off-by: Isaku Yamahata 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
---
 hw/i386/pc_q35.c  |  2 ++
 hw/pci-host/q35.c | 42 +++
 include/hw/i386/pc.h  |  1 +
 include/hw/pci-host/q35.h |  1 +
 4 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 45a4102e75f5..dc3f68676ea0 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -212,6 +212,8 @@ static void pc_q35_init(MachineState *machine)
 x86ms->above_4g_mem_size, NULL);
 object_property_set_bool(phb, PCI_HOST_BYPASS_IOMMU,
  pcms->default_bus_bypass_iommu, NULL);
+object_property_set_bool(phb, PCI_HOST_PROP_SMM_RANGES,
+ x86_machine_is_smm_enabled(x86ms), NULL);
 sysbus_realize_and_unref(SYS_BUS_DEVICE(phb), _fatal);
 
 /* pci */
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 98d4a7c253a6..0b6cbaed7ed5 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -179,6 +179,8 @@ static Property q35_host_props[] = {
  mch.below_4g_mem_size, 0),
 DEFINE_PROP_SIZE(PCI_HOST_ABOVE_4G_MEM_SIZE, Q35PCIHost,
  mch.above_4g_mem_size, 0),
+DEFINE_PROP_BOOL(PCI_HOST_PROP_SMM_RANGES, Q35PCIHost,
+ mch.has_smm_ranges, true),
 DEFINE_PROP_BOOL("x-pci-hole64-fix", Q35PCIHost, pci_hole64_fix, true),
 DEFINE_PROP_END_OF_LIST(),
 };
@@ -214,6 +216,7 @@ static void q35_host_initfn(Object *obj)
 /* mch's object_initialize resets the default value, set it again */
 qdev_prop_set_uint64(DEVICE(s), PCI_HOST_PROP_PCI_HOLE64_SIZE,
  Q35_PCI_HOST_HOLE64_SIZE_DEFAULT);
+
 object_property_add(obj, PCI_HOST_PROP_PCI_HOLE_START, "uint32",
 q35_host_get_pci_hole_start,
 NULL, NULL, NULL);
@@ -476,6 +479,10 @@ static void mch_write_config(PCIDevice *d,
 mch_update_pciexbar(mch);
 }
 
+if (!mch->has_smm_ranges) {
+return;
+}
+
 if (ranges_overlap(address, len, MCH_HOST_BRIDGE_SMRAM,
MCH_HOST_BRIDGE_SMRAM_SIZE)) {
 mch_update_smram(mch);
@@ -494,10 +501,13 @@ static void mch_write_config(PCIDevice *d,
 static void mch_update(MCHPCIState *mch)
 {
 mch_update_pciexbar(mch);
+
 mch_update_pam(mch);
-mch_update_smram(mch);
-mch_update_ext_tseg_mbytes(mch);
-mch_update_smbase_smram(mch);
+if (mch->has_smm_ranges) {
+mch_update_smram(mch);
+mch_update_ext_tseg_mbytes(mch);
+mch_update_smbase_smram(mch);
+}
 
 /*
  * pci hole goes from end-of-low-ram to io-apic.
@@ -538,19 +548,21 @@ static void mch_reset(DeviceState *qdev)
 pci_set_quad(d->config + MCH_HOST_BRIDGE_PCIEXBAR,
  MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT);
 
-d->config[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_DEFAULT;
-d->config[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_DEFAULT;
-d->wmask[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_WMASK;
-d->wmask[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_WMASK;
+if (mch->has_smm_ranges) {
+d->config[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_DEFAULT;
+d->config[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_DEFAULT;
+d->wmask[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_WMASK;
+d->wmask[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_WMASK;
 
-if (mch->ext_tseg_mbytes > 0) {
-pci_set_word(d->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES,
- MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY);
+if (mch->ext_tseg_mbytes > 0) {
+pci_set_word(d->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES,
+MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY);
+}
+
+d->config[MCH_HOST_BRIDGE_F_SMBASE] = 0;
+d->wmask[MCH_HOST_BRIDGE_F_SMBASE] = 0xff;
 }
 
-d->config[MCH_HOST_BRIDGE_F_SMBASE] = 0;
-d->wmask[MCH_HOST_BRIDGE_F_SMBASE] = 0xff;
-
 mch_update(mch);
 }
 
@@ -578,6 +590,10 @@ static void mch_realize(PCIDevice *d, Error **errp)
  PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
 }
 
+if (!mch->has_smm_ranges) {
+return;
+}
+
 /* if *disabled* show SMRAM to all CPUs */
 memory_region_init_alias(>smram_region, OBJECT(mch), "smram-region",
  mch->pci_address_space, 
MCH_HOST_BRIDGE_SMRAM_C_BASE,
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 5

[PATCH v5 36/65] i386/tdx: load TDVF for TD guest

2024-02-28 Thread Xiaoyao Li
From: Chao Peng 

TDVF(OVMF) needs to run at private memory for TD guest. TDX cannot
support pflash device since it doesn't support read-only private memory.
Thus load TDVF(OVMF) with -bios option for TDs.

Use memory_region_init_ram_guest_memfd() to allocate the MemoryRegion
for TDVF because it needs to be located at private memory.

Also store the MemoryRegion pointer of TDVF since the shared ramblock of
it can be discared after it gets copied to private ramblock.

Signed-off-by: Chao Peng 
Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 
---
 hw/i386/x86.c | 13 +++--
 target/i386/kvm/tdx.c |  7 +++
 target/i386/kvm/tdx.h |  3 +++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index fa7095310f37..5a0cadc88c4f 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -47,6 +47,7 @@
 #include "hw/intc/i8259.h"
 #include "hw/rtc/mc146818rtc.h"
 #include "target/i386/sev.h"
+#include "kvm/tdx.h"
 
 #include "hw/acpi/cpu_hotplug.h"
 #include "hw/irq.h"
@@ -1157,9 +1158,17 @@ void x86_bios_rom_init(MachineState *ms, const char 
*default_firmware,
 (bios_size % 65536) != 0) {
 goto bios_error;
 }
+
 bios = g_malloc(sizeof(*bios));
-memory_region_init_ram(bios, NULL, "pc.bios", bios_size, _fatal);
-if (sev_enabled()) {
+if (is_tdx_vm()) {
+memory_region_init_ram_guest_memfd(bios, NULL, "pc.bios", bios_size,
+   _fatal);
+tdx_set_tdvf_region(bios);
+} else {
+memory_region_init_ram(bios, NULL, "pc.bios", bios_size, _fatal);
+}
+
+if (sev_enabled() || is_tdx_vm()) {
 /*
  * The concept of a "reset" simply doesn't exist for
  * confidential computing guests, we have to destroy and
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 13f069171db7..7c8e14e3cc58 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -19,6 +19,7 @@
 #include "standard-headers/asm-x86/kvm_para.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
+#include "exec/ramblock.h"
 
 #include "hw/i386/x86.h"
 #include "kvm_i386.h"
@@ -463,6 +464,12 @@ static void update_tdx_cpuid_lookup_by_tdx_caps(void)
 (tdx_caps->xfam_fixed1 & CPUID_XSTATE_XSS_MASK) >> 32;
 }
 
+void tdx_set_tdvf_region(MemoryRegion *tdvf_mr)
+{
+assert(!tdx_guest->tdvf_mr);
+tdx_guest->tdvf_mr = tdvf_mr;
+}
+
 static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
 {
 MachineState *ms = MACHINE(qdev_get_machine());
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 2697e6bdfb1d..c021223001a5 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -24,6 +24,8 @@ typedef struct TdxGuest {
 char *mrconfigid;   /* base64 encoded sha348 digest */
 char *mrowner;  /* base64 encoded sha348 digest */
 char *mrownerconfig;/* base64 encoded sha348 digest */
+
+MemoryRegion *tdvf_mr;
 } TdxGuest;
 
 #ifdef CONFIG_TDX
@@ -35,5 +37,6 @@ bool is_tdx_vm(void);
 void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg,
  uint32_t *ret);
 int tdx_pre_create_vcpu(CPUState *cpu, Error **errp);
+void tdx_set_tdvf_region(MemoryRegion *tdvf_mr);
 
 #endif /* QEMU_I386_TDX_H */
-- 
2.34.1




[PATCH v5 27/65] i386/tdx: Wire CPU features up with attributes of TD guest

2024-02-28 Thread Xiaoyao Li
For QEMU VMs, PKS is configured via CPUID_7_0_ECX_PKS and PMU is
configured by x86cpu->enable_pmu. Reuse the existing configuration
interface for TDX VMs.

Signed-off-by: Xiaoyao Li 
Acked-by: Gerd Hoffmann 
---
 target/i386/kvm/tdx.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index b6295a644566..262e86fd2c67 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -33,6 +33,8 @@
  (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
 #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE   BIT_ULL(28)
+#define TDX_TD_ATTRIBUTES_PKS   BIT_ULL(30)
+#define TDX_TD_ATTRIBUTES_PERFMON   BIT_ULL(63)
 
 #define TDX_ATTRIBUTES_MAX_BITS  64
 
@@ -477,6 +479,15 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, 
Error **errp)
 return 0;
 }
 
+static void setup_td_guest_attributes(X86CPU *x86cpu)
+{
+CPUX86State *env = >env;
+
+tdx_guest->attributes |= (env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_PKS) 
?
+ TDX_TD_ATTRIBUTES_PKS : 0;
+tdx_guest->attributes |= x86cpu->enable_pmu ? TDX_TD_ATTRIBUTES_PERFMON : 
0;
+}
+
 int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 {
 MachineState *ms = MACHINE(qdev_get_machine());
@@ -499,6 +510,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 return r;
 }
 
+setup_td_guest_attributes(x86cpu);
+
 init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0);
 
 init_vm->attributes = tdx_guest->attributes;
-- 
2.34.1




  1   2   3   4   5   6   7   8   9   >