Re: [PATCH 2/2] tests: add testing of parameter=1 for SMP topology
On 5/13/2024 8:33 PM, Daniel P. Berrangé wrote: Validate that it is possible to pass 'parameter=1' for any SMP topology parameter, since unsupported parameters are implicitly considered to always have a value of 1. Signed-off-by: Daniel P. Berrangé --- tests/unit/test-smp-parse.c | 8 1 file changed, 8 insertions(+) diff --git a/tests/unit/test-smp-parse.c b/tests/unit/test-smp-parse.c index 56165e6644..56ce5128f1 100644 --- a/tests/unit/test-smp-parse.c +++ b/tests/unit/test-smp-parse.c @@ -330,6 +330,14 @@ static const struct SMPTestData data_generic_valid[] = { .config = SMP_CONFIG_GENERIC(T, 8, T, 2, T, 4, T, 2, T, 16), .expect_prefer_sockets = CPU_TOPOLOGY_GENERIC(8, 2, 4, 2, 16), .expect_prefer_cores = CPU_TOPOLOGY_GENERIC(8, 2, 4, 2, 16), +}, { +/* + * Unsupported parameters are always allowed to be set to '1' + * config: -smp 8,books=1,drawers=1,sockets=2,modules=1,dies=1,cores=4,threads=2,maxcpus=8 cores=2 not 4. + * expect: cpus=8,sockets=2,cores=2,threads=2,maxcpus=8 */ +.config = SMP_CONFIG_WITH_FULL_TOPO(8, 1, 1, 2, 1, 1, 2, 2, 8), +.expect_prefer_sockets = CPU_TOPOLOGY_GENERIC(8, 2, 2, 2, 8), +.expect_prefer_cores = CPU_TOPOLOGY_GENERIC(8, 2, 2, 2, 8), }, };
Re: [PATCH 6/6] target/i386/confidential-guest: Fix comment of x86_confidential_guest_kvm_type()
On 4/26/2024 6:07 PM, Zhao Liu wrote: Update the comment to match the X86ConfidentialGuestClass implementation. Suggested-by: Xiaoyao Li I think it should be "Reported-by" Signed-off-by: Zhao Liu --- target/i386/confidential-guest.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/target/i386/confidential-guest.h b/target/i386/confidential-guest.h index 532e172a60b6..06d54a120227 100644 --- a/target/i386/confidential-guest.h +++ b/target/i386/confidential-guest.h @@ -44,7 +44,7 @@ struct X86ConfidentialGuestClass { /** * x86_confidential_guest_kvm_type: * - * Calls #X86ConfidentialGuestClass.unplug callback of @plug_handler. + * Calls #X86ConfidentialGuestClass.kvm_type() callback. */ static inline int x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg) {
Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name
On 4/25/2024 6:29 PM, Zhao Liu wrote: On Thu, Apr 25, 2024 at 04:40:10PM +0800, Xiaoyao Li wrote: Date: Thu, 25 Apr 2024 16:40:10 +0800 From: Xiaoyao Li Subject: Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name On 4/25/2024 3:17 PM, Zhao Liu wrote: Hi Xiaoyao, On Wed, Apr 24, 2024 at 11:57:11PM +0800, Xiaoyao Li wrote: Date: Wed, 24 Apr 2024 23:57:11 +0800 From: Xiaoyao Li Subject: Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name On 3/29/2024 6:19 PM, Zhao Liu wrote: From: Zhao Liu Hi list, This series is based on Paolo's guest_phys_bits patchset [1]. Currently, the old and new kvmclocks have the same feature name "kvmclock" in FeatureWordInfo[FEAT_KVM]. When I tried to dig into the history of this unusual naming and fix it, I realized that Tim was already trying to rename it, so I picked up his renaming patch [2] (with a new commit message and other minor changes). 13 years age, the same name was introduced in [3], and its main purpose is to make it easy for users to enable/disable 2 kvmclocks. Then, in 2012, Don tried to rename the new kvmclock, but the follow-up did not address Igor and Eduardo's comments about compatibility. Tim [2], not long ago, and I just now, were both puzzled by the naming one after the other. The commit message of [3] said the reason clearly: When we tweak flags involving this value - specially when we use "-", we have to act on both. So you are trying to change it to "when people want to disable kvmclock, they need to use '-kvmclock,-kvmclock2' instead of '-kvmclock'" IMHO, I prefer existing code and I don't see much value of differentiating them. If the current code puzzles you, then we can add comment to explain. It's enough to just enable kvmclock2 for Guest; kvmclock (old) is redundant in the presence of kvmclock2. So operating both feature bits at the same time is not a reasonable choice, we should only keep kvmclock2 for Guest. It's possible because the oldest linux (v4.5) which QEMU i386 supports has kvmclock2. who said the oldest Linux QEMU supports is from 4.5? what about 2.x kernel? For Host (docs/system/target-i386.rst). Besides, not only the Linux guest, whatever guest OS is, it will be broken if the guest is using kvmclock and QEMU starts to drop support of kvmclock. I'm not aware of any minimum version requirements for Guest supported by KVM, but there are no commitment. the common commitment is at least keeping backwards compatibility. So, again, hard NAK to drop the support of kvmclock. It breaks existing guests that use kvmclock. You cannot force people to upgrade their existing VMs to use kvmclock2 instead of kvmclock. I agree, legacy kvmclock can be left out, if the old kernel does not support kvmclock2 and strongly requires kvmclock, it can be enabled using 9.1 and earlier machines or legacy-kvmclock, as long as Host still supports it. What's the gap in handling it this way? especially considering that kvmclock2 was introduced in v2.6.35, and earlier kernels are no longer maintained. The availability of the PV feature requires compatibility for both Host and Guest. Anyway, the above discussion here is about future plans, and this series does not prevent any Guest from ignoring kvmclock2 in favor of kvmclock. What this series is doing, i.e. separating the current two kvmclock and ensuring CPUID compatibility via legacy-kvmclock, could balance the compatibility requirements of the ancient (unmaintained kernel) with the need for future feature changes. You introduce a user-visible change that people need to use "-kvmclock,-kvmclock2" to totally disable kvmclock. The only difference between kvmclock and kvmclock2 is the MSR index. And from users' perspective, they don't care this difference. The existing usage is simple. When I want kvmclock, use "+kvmclock". When I don't want it, use "-kvmclock". You are complicating things and I don't see a strong reason that we have to do it. Thanks, Zhao
Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name
On 4/25/2024 3:17 PM, Zhao Liu wrote: Hi Xiaoyao, On Wed, Apr 24, 2024 at 11:57:11PM +0800, Xiaoyao Li wrote: Date: Wed, 24 Apr 2024 23:57:11 +0800 From: Xiaoyao Li Subject: Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name On 3/29/2024 6:19 PM, Zhao Liu wrote: From: Zhao Liu Hi list, This series is based on Paolo's guest_phys_bits patchset [1]. Currently, the old and new kvmclocks have the same feature name "kvmclock" in FeatureWordInfo[FEAT_KVM]. When I tried to dig into the history of this unusual naming and fix it, I realized that Tim was already trying to rename it, so I picked up his renaming patch [2] (with a new commit message and other minor changes). 13 years age, the same name was introduced in [3], and its main purpose is to make it easy for users to enable/disable 2 kvmclocks. Then, in 2012, Don tried to rename the new kvmclock, but the follow-up did not address Igor and Eduardo's comments about compatibility. Tim [2], not long ago, and I just now, were both puzzled by the naming one after the other. The commit message of [3] said the reason clearly: When we tweak flags involving this value - specially when we use "-", we have to act on both. So you are trying to change it to "when people want to disable kvmclock, they need to use '-kvmclock,-kvmclock2' instead of '-kvmclock'" IMHO, I prefer existing code and I don't see much value of differentiating them. If the current code puzzles you, then we can add comment to explain. It's enough to just enable kvmclock2 for Guest; kvmclock (old) is redundant in the presence of kvmclock2. So operating both feature bits at the same time is not a reasonable choice, we should only keep kvmclock2 for Guest. It's possible because the oldest linux (v4.5) which QEMU i386 supports has kvmclock2. who said the oldest Linux QEMU supports is from 4.5? what about 2.x kernel? Besides, not only the Linux guest, whatever guest OS is, it will be broken if the guest is using kvmclock and QEMU starts to drop support of kvmclock. So, again, hard NAK to drop the support of kvmclock. It breaks existing guests that use kvmclock. You cannot force people to upgrade their existing VMs to use kvmclock2 instead of kvmclock. Pls see: https://elixir.bootlin.com/linux/v4.5/source/arch/x86/include/uapi/asm/kvm_para.h#L22 With this in mind, I'm trying to implement a silky smooth transition to "only kcmclock2 support". This series is now separating kvmclock and kvmclock2, and then I plan to go to the KVM side and deprecate kvmclock2, and then finally remove kvmclock (old) in QEMU. Separating the two features makes the process clearer. Continuing to use the same name equally would have silently caused the CPUID to change on the Guest side, which would have hurt compatibility as well. So, this series is to push for renaming the new kvmclock feature to "kvmclock2" and adding compatibility support for older machines (PC 9.0 and older). Finally, let's put an end to decades of doubt about this name. Next Step = This series just separates the two kvmclocks from the naming, and in subsequent patches I plan to stop setting kvmclock(old kcmclock) by default as long as KVM supports kvmclock2 (new kvmclock). No. It will break existing guests that use KVM_FEATURE_CLOCKSOURCE. Please refer to my elaboration on v4.5 above, where kvmclock2 is available, it should be used in priority. Also, try to deprecate the old kvmclock in KVM side. [1]: https://lore.kernel.org/qemu-devel/20240325141422.1380087-1-pbonz...@redhat.com/ [2]: https://lore.kernel.org/qemu-devel/20230908124534.25027-4-twied...@redhat.com/ [3]: https://lore.kernel.org/qemu-devel/1300401727-5235-3-git-send-email-glom...@redhat.com/ [4]: https://lore.kernel.org/qemu-devel/1348171412-23669-3-git-send-email-...@cloudswitch.com/ Thanks, Zhao
Re: [PATCH for-9.1 0/7] target/i386/kvm: Cleanup the kvmclock feature name
On 3/29/2024 6:19 PM, Zhao Liu wrote: From: Zhao Liu Hi list, This series is based on Paolo's guest_phys_bits patchset [1]. Currently, the old and new kvmclocks have the same feature name "kvmclock" in FeatureWordInfo[FEAT_KVM]. When I tried to dig into the history of this unusual naming and fix it, I realized that Tim was already trying to rename it, so I picked up his renaming patch [2] (with a new commit message and other minor changes). 13 years age, the same name was introduced in [3], and its main purpose is to make it easy for users to enable/disable 2 kvmclocks. Then, in 2012, Don tried to rename the new kvmclock, but the follow-up did not address Igor and Eduardo's comments about compatibility. Tim [2], not long ago, and I just now, were both puzzled by the naming one after the other. The commit message of [3] said the reason clearly: When we tweak flags involving this value - specially when we use "-", we have to act on both. So you are trying to change it to "when people want to disable kvmclock, they need to use '-kvmclock,-kvmclock2' instead of '-kvmclock'" IMHO, I prefer existing code and I don't see much value of differentiating them. If the current code puzzles you, then we can add comment to explain. So, this series is to push for renaming the new kvmclock feature to "kvmclock2" and adding compatibility support for older machines (PC 9.0 and older). Finally, let's put an end to decades of doubt about this name. Next Step = This series just separates the two kvmclocks from the naming, and in subsequent patches I plan to stop setting kvmclock(old kcmclock) by default as long as KVM supports kvmclock2 (new kvmclock). No. It will break existing guests that use KVM_FEATURE_CLOCKSOURCE. Also, try to deprecate the old kvmclock in KVM side. [1]: https://lore.kernel.org/qemu-devel/20240325141422.1380087-1-pbonz...@redhat.com/ [2]: https://lore.kernel.org/qemu-devel/20230908124534.25027-4-twied...@redhat.com/ [3]: https://lore.kernel.org/qemu-devel/1300401727-5235-3-git-send-email-glom...@redhat.com/ [4]: https://lore.kernel.org/qemu-devel/1348171412-23669-3-git-send-email-...@cloudswitch.com/ Thanks and Best Regards, Zhao --- Tim Wiederhake (1): target/i386: Fix duplicated kvmclock name in FEAT_KVM Zhao Liu (6): target/i386/kvm: Add feature bit definitions for KVM CPUID target/i386/kvm: Remove local MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME definitions target/i386/kvm: Only Save/load kvmclock MSRs when kvmclock enabled target/i386/kvm: Save/load MSRs of new kvmclock (KVM_FEATURE_CLOCKSOURCE2) target/i386/kvm: Add legacy_kvmclock cpu property target/i386/kvm: Update comment in kvm_cpu_realizefn() hw/i386/kvm/clock.c | 5 ++-- hw/i386/pc.c | 1 + target/i386/cpu.c | 3 +- target/i386/cpu.h | 32 + target/i386/kvm/kvm-cpu.c | 25 - target/i386/kvm/kvm.c | 59 +-- 6 files changed, 99 insertions(+), 26 deletions(-)
Re: [PATCH for-9.1 2/7] target/i386/kvm: Remove local MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME definitions
On 3/29/2024 6:19 PM, Zhao Liu wrote: From: Zhao Liu These 2 MSRs have been already defined in the kvm_para header (standard-headers/asm-x86/kvm_para.h). Remove QEMU local definitions to avoid duplication. Signed-off-by: Zhao Liu Reviewed-by: Xiaoyao Li --- target/i386/kvm/kvm.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 2f3c8bc3a4ed..be88339fb8bd 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -78,9 +78,6 @@ #define KVM_APIC_BUS_CYCLE_NS 1 #define KVM_APIC_BUS_FREQUENCY (10ULL / KVM_APIC_BUS_CYCLE_NS) -#define MSR_KVM_WALL_CLOCK 0x11 -#define MSR_KVM_SYSTEM_TIME 0x12 - /* A 4096-byte buffer can hold the 8-byte kvm_msrs header, plus * 255 kvm_msr_entry structs */ #define MSR_BUF_SIZE 4096
Re: [PATCH for-9.1 1/7] target/i386/kvm: Add feature bit definitions for KVM CPUID
On 3/29/2024 6:19 PM, Zhao Liu wrote: From: Zhao Liu Add feature definiations for KVM_CPUID_FEATURES in CPUID ( CPUID[4000_0001].EAX and CPUID[4000_0001].EDX), to get rid of lots of offset calculations. Signed-off-by: Zhao Liu --- hw/i386/kvm/clock.c | 5 ++--- target/i386/cpu.h | 23 +++ target/i386/kvm/kvm.c | 28 ++-- 3 files changed, 39 insertions(+), 17 deletions(-) diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c index 40aa9a32c32c..7c9752d5036f 100644 --- a/hw/i386/kvm/clock.c +++ b/hw/i386/kvm/clock.c @@ -27,7 +27,6 @@ #include "qapi/error.h" #include -#include "standard-headers/asm-x86/kvm_para.h" #include "qom/object.h" #define TYPE_KVM_CLOCK "kvmclock" @@ -334,8 +333,8 @@ void kvmclock_create(bool create_always) assert(kvm_enabled()); if (create_always || -cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | - (1ULL << KVM_FEATURE_CLOCKSOURCE2))) { +cpu->env.features[FEAT_KVM] & (CPUID_FEAT_KVM_CLOCK | + CPUID_FEAT_KVM_CLOCK2)) { sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); } } diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 83e473584517..b1b8d11cb0fe 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -27,6 +27,7 @@ #include "qapi/qapi-types-common.h" #include "qemu/cpu-float.h" #include "qemu/timer.h" +#include "standard-headers/asm-x86/kvm_para.h" #define XEN_NR_VIRQS 24 @@ -951,6 +952,28 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w, /* Packets which contain IP payload have LIP values */ #define CPUID_14_0_ECX_LIP (1U << 31) +/* (Old) KVM paravirtualized clocksource */ +#define CPUID_FEAT_KVM_CLOCK(1U << KVM_FEATURE_CLOCKSOURCE) we can drop the _FEAT_, just name it as CPUID_KVM_CLOCK +/* (New) KVM specific paravirtualized clocksource */ +#define CPUID_FEAT_KVM_CLOCK2 (1U << KVM_FEATURE_CLOCKSOURCE2) +/* KVM asynchronous page fault */ +#define CPUID_FEAT_KVM_ASYNCPF (1U << KVM_FEATURE_ASYNC_PF) +/* KVM stolen (when guest vCPU is not running) time accounting */ +#define CPUID_FEAT_KVM_STEAL_TIME (1U << KVM_FEATURE_STEAL_TIME) +/* KVM paravirtualized end-of-interrupt signaling */ +#define CPUID_FEAT_KVM_PV_EOI (1U << KVM_FEATURE_PV_EOI) +/* KVM paravirtualized spinlocks support */ +#define CPUID_FEAT_KVM_PV_UNHALT(1U << KVM_FEATURE_PV_UNHALT) +/* KVM host-side polling on HLT control from the guest */ +#define CPUID_FEAT_KVM_POLL_CONTROL (1U << KVM_FEATURE_POLL_CONTROL) +/* KVM interrupt based asynchronous page fault*/ +#define CPUID_FEAT_KVM_ASYNCPF_INT (1U << KVM_FEATURE_ASYNC_PF_INT) +/* KVM 'Extended Destination ID' support for external interrupts */ +#define CPUID_FEAT_KVM_MSI_EXT_DEST_ID (1U << KVM_FEATURE_MSI_EXT_DEST_ID) + +/* Hint to KVM that vCPUs expect never preempted for an unlimited time */ +#define CPUID_FEAT_KVM_HINTS_REALTIME(1U << KVM_HINTS_REALTIME) + /* CLZERO instruction */ #define CPUID_8000_0008_EBX_CLZERO (1U << 0) /* Always save/restore FP error pointers */ diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index e68cbe929302..2f3c8bc3a4ed 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -481,13 +481,13 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function, * be enabled without the in-kernel irqchip */ if (!kvm_irqchip_in_kernel()) { -ret &= ~(1U << KVM_FEATURE_PV_UNHALT); +ret &= ~CPUID_FEAT_KVM_PV_UNHALT; } if (kvm_irqchip_is_split()) { -ret |= 1U << KVM_FEATURE_MSI_EXT_DEST_ID; +ret |= CPUID_FEAT_KVM_MSI_EXT_DEST_ID; } } else if (function == KVM_CPUID_FEATURES && reg == R_EDX) { -ret |= 1U << KVM_HINTS_REALTIME; +ret |= CPUID_FEAT_KVM_HINTS_REALTIME; } return ret; @@ -3324,20 +3324,20 @@ static int kvm_put_msrs(X86CPU *cpu, int level) kvm_msr_entry_add(cpu, MSR_IA32_TSC, env->tsc); kvm_msr_entry_add(cpu, MSR_KVM_SYSTEM_TIME, env->system_time_msr); kvm_msr_entry_add(cpu, MSR_KVM_WALL_CLOCK, env->wall_clock_msr); -if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF_INT)) { +if (env->features[FEAT_KVM] & CPUID_FEAT_KVM_ASYNCPF_INT) { kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_INT, env->async_pf_int_msr); } -if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF)) { +if (env->features[FEAT_KVM] & CPUID_FEAT_KVM_ASYNCPF) { kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_EN, env->async_pf_en_msr); } -if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_PV_EOI)) { +if (env->features[FEAT_KVM] & CPUID_FEAT_KVM_PV_EOI) { kvm_msr_entry_add(cpu, MSR_KVM_PV_EOI_EN, env->pv_eoi_en_msr);
Re: [PULL 43/63] target/i386: Implement mc->kvm_type() to get VM type
On 4/23/2024 11:09 PM, Paolo Bonzini wrote: + +/** + * x86_confidential_guest_kvm_type: + * + * Calls #X86ConfidentialGuestClass.unplug callback of @plug_handler. the comment needs to be updated: Calls #X86ConfidentialGuestClass.kvm_type() callback + */ +static inline int x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg) +{ +X86ConfidentialGuestClass *klass = X86_CONFIDENTIAL_GUEST_GET_CLASS(cg); + +if (klass->kvm_type) { +return klass->kvm_type(cg); +} else { +return 0; +} +}
Re: [PULL 25/63] i386/kvm: Move architectural CPUID leaf generation to separate helper
On 4/23/2024 11:09 PM, Paolo Bonzini wrote: From: Sean Christopherson Move the architectural (for lack of a better term) CPUID leaf generation to a separate helper so that the generation code can be reused by TDX, which needs to generate a canonical VM-scoped configuration. For now this is just a cleanup, so keep the function static. Signed-off-by: Sean Christopherson Signed-off-by: Xiaoyao Li Message-ID: <20240229063726.610065-23-xiaoyao...@intel.com> Reviewed-by: Xiaoyao Li Signed-off-by: Paolo Bonzini --- target/i386/kvm/kvm.c | 449 +- 1 file changed, 227 insertions(+), 222 deletions(-) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index e68cbe92930..fcf9603d3e6 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -1706,6 +1706,231 @@ static void kvm_init_nested_state(CPUX86State *env) } } +static uint32_t kvm_x86_build_cpuid(CPUX86State *env, +struct kvm_cpuid_entry2 *entries, +uint32_t cpuid_i) +{ +uint32_t limit, i, j; +uint32_t unused; +struct kvm_cpuid_entry2 *c; + +cpu_x86_cpuid(env, 0, 0, , , , ); + +for (i = 0; i <= limit; i++) { +j = 0; +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +switch (i) { +case 2: { +/* Keep reading function 2 till all the input is received */ +int times; + +c->function = i; +c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC | + KVM_CPUID_FLAG_STATE_READ_NEXT; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +times = c->eax & 0xff; + +for (j = 1; j < times; ++j) { +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +c->function = i; +c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +} +break; +} +case 0x1f: +if (env->nr_dies < 2) { +cpuid_i--; +break; +} +/* fallthrough */ +case 4: +case 0xb: +case 0xd: +for (j = 0; ; j++) { +if (i == 0xd && j == 64) { +break; +} + +c->function = i; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +c->index = j; +cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx); + +if (i == 4 && c->eax == 0) { +break; +} +if (i == 0xb && !(c->ecx & 0xff00)) { +break; +} +if (i == 0x1f && !(c->ecx & 0xff00)) { +break; +} +if (i == 0xd && c->eax == 0) { +continue; +} +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +} +break; +case 0x12: +for (j = 0; ; j++) { +c->function = i; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +c->index = j; +cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx); + +if (j > 1 && (c->eax & 0xf) != 1) { +break; +} + +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +} +break; +case 0x7: +case 0x14: +case 0x1d: +case 0x1e: { +uint32_t times; + +c->function = i; +c->index = 0; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +times = c->eax; + +for (j = 1; j <= times; ++j) { +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +c->function = i; +c->index = j; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx); +} +break; +} +default: +c->function = i; +c->flags = 0; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +if (!c->eax && !c->ebx && !c->ecx &a
Re: [PATCH v5 28/65] i386/tdx: Disable pmu for TD guest
On 4/16/2024 4:32 PM, Chenyi Qiang wrote: On 2/29/2024 2:36 PM, Xiaoyao Li wrote: Current KVM doesn't support PMU for TD guest. It returns error if TD is created with PMU bit being set in attributes. Disable PMU for TD guest on QEMU side. Signed-off-by: Xiaoyao Li --- target/i386/kvm/tdx.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 262e86fd2c67..1c12cda002b8 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -496,6 +496,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) g_autofree struct kvm_tdx_init_vm *init_vm = NULL; int r = 0; +object_property_set_bool(OBJECT(cpu), "pmu", false, _abort); Is it necessary to output some prompt if the user wants to enable pmu by "-cpu host,pmu=on"? As in patch 27, it mentions PMU is configured by x86cpu->enable_pmu, but PMU is actually not support in this series and will be disabled silently. We do this in QEMU is mainly for KVM, because KVM will fail to init TD if ATTRIBUTE.PERFMON is set. It's expected that KVM reports PERFMON in attributes.fixed0 when KVM cannot provide support of it. Then QEMU will print error message automatically when validate the attributes. For QEMU part, next version is going to set the default value of "pmu" to false in kvm_cpu_max_instance_init(), so that "-cpu host" will not enable pmu for TDX VMs by default. I suppose both the KVM and QEMU change will show up in the next version. + QEMU_LOCK_GUARD(_guest->lock); if (tdx_guest->initialized) { return r;
Re: [PATCH v2] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled
On 4/3/2024 11:12 PM, Igor Mammedov wrote: On Wed, 3 Apr 2024 10:59:53 -0400 Xiaoyao Li wrote: A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system also has a PC-AT-compatible dual-8259 setup, i.e., the PIC. When PIC is not enabled (pic=off) for x86 machine, the PCAT_COMPAT bit needs to be cleared. Otherwise, the guest thinks there is a present PIC. Can you add to commit message reproducer (aka qemu CLI and relevant logs/symptoms observed on guest side)? When booting a VM with "-machine xx,pic=off", there is supposed to be no PIC for the guest. When guest probes PIC, it should find nothing and log of below should be printed: [0.155970] Using NULL legacy PIC However, the fact is that no such log printed in guest kernel, with the VM created with "pic=off". This is because guest think there is a present due to pcat_compat is reporte as 1 in MADT. See Linux code of probe_8259A() in arch/x86/kernel/i8259.c Signed-off-by: Xiaoyao Li --- changes in v2: - Clarify more in commit message; --- hw/i386/acpi-common.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c index 20f19269da40..0cc2919bb851 100644 --- a/hw/i386/acpi-common.c +++ b/hw/i386/acpi-common.c @@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker, acpi_table_begin(, table_data); /* Local APIC Address */ build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4); -build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */ +/* Flags. bit 0: PCAT_COMPAT */ +build_append_int_noprefix(table_data, + x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4); for (i = 0; i < apic_ids->len; i++) { pc_madt_cpu_entry(i, apic_ids, table_data, false);
[PATCH v2] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled
A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system also has a PC-AT-compatible dual-8259 setup, i.e., the PIC. When PIC is not enabled (pic=off) for x86 machine, the PCAT_COMPAT bit needs to be cleared. Otherwise, the guest thinks there is a present PIC. Signed-off-by: Xiaoyao Li --- changes in v2: - Clarify more in commit message; --- hw/i386/acpi-common.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c index 20f19269da40..0cc2919bb851 100644 --- a/hw/i386/acpi-common.c +++ b/hw/i386/acpi-common.c @@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker, acpi_table_begin(, table_data); /* Local APIC Address */ build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4); -build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */ +/* Flags. bit 0: PCAT_COMPAT */ +build_append_int_noprefix(table_data, + x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4); for (i = 0; i < apic_ids->len; i++) { pc_madt_cpu_entry(i, apic_ids, table_data, false); -- 2.34.1
Re: [PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled
On 4/2/2024 10:31 PM, Michael S. Tsirkin wrote: On Tue, Apr 02, 2024 at 09:18:44PM +0800, Xiaoyao Li wrote: On 4/2/2024 6:02 PM, Michael S. Tsirkin wrote: On Tue, Apr 02, 2024 at 04:25:16AM -0400, Xiaoyao Li wrote: Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic. Signed-off-by: Xiaoyao Li Please include more info in the commit log: what is the behaviour you observe, why it is wrong, how does the patch fix it, what is guest behaviour before and after. Sorry, I thought it was straightforward. A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system also has a PC-AT-compatible dual-8259 setup, i.e., the PIC. When PIC is not enabled for x86 machine, the PCAT_COMPAT bit needs to be cleared. Otherwise, the guest thinks there is a present PIC even it is booted with pic=off on QEMU. (I haven't seen real issue from Linux guest. The user of PIC inside guest seems only the pit calibration. Whether pit calibration is triggered depends on other things. But logically, current code is wrong, we need to fix it anyway. @Isaku, please share more info if you have) + Kirill, It seems to have issue with legacy irqs with PCAT_COMPAT set 1 while no PIC on QEMU side. Kirill, could you elaborate it? That's sufficient, thanks! Pls put this in commit log and resubmit. The commit log and the subject should not repeat what the diff already states. --- hw/i386/acpi-common.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c index 20f19269da40..0cc2919bb851 100644 --- a/hw/i386/acpi-common.c +++ b/hw/i386/acpi-common.c @@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker, acpi_table_begin(, table_data); /* Local APIC Address */ build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4); -build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */ +/* Flags. bit 0: PCAT_COMPAT */ +build_append_int_noprefix(table_data, + x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4); for (i = 0; i < apic_ids->len; i++) { pc_madt_cpu_entry(i, apic_ids, table_data, false); -- 2.34.1
Re: [PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled
On 4/2/2024 6:02 PM, Michael S. Tsirkin wrote: On Tue, Apr 02, 2024 at 04:25:16AM -0400, Xiaoyao Li wrote: Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic. Signed-off-by: Xiaoyao Li Please include more info in the commit log: what is the behaviour you observe, why it is wrong, how does the patch fix it, what is guest behaviour before and after. Sorry, I thought it was straightforward. A value 1 of PCAT_COMPAT (bit 0) of MADT.Flags indicates that the system also has a PC-AT-compatible dual-8259 setup, i.e., the PIC. When PIC is not enabled for x86 machine, the PCAT_COMPAT bit needs to be cleared. Otherwise, the guest thinks there is a present PIC even it is booted with pic=off on QEMU. (I haven't seen real issue from Linux guest. The user of PIC inside guest seems only the pit calibration. Whether pit calibration is triggered depends on other things. But logically, current code is wrong, we need to fix it anyway. @Isaku, please share more info if you have) The commit log and the subject should not repeat what the diff already states. --- hw/i386/acpi-common.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c index 20f19269da40..0cc2919bb851 100644 --- a/hw/i386/acpi-common.c +++ b/hw/i386/acpi-common.c @@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker, acpi_table_begin(, table_data); /* Local APIC Address */ build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4); -build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */ +/* Flags. bit 0: PCAT_COMPAT */ +build_append_int_noprefix(table_data, + x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4); for (i = 0; i < apic_ids->len; i++) { pc_madt_cpu_entry(i, apic_ids, table_data, false); -- 2.34.1
[PATCH] hw/i386/acpi: Set PCAT_COMPAT bit only when pic is not disabled
Set MADT.FLAGS[bit 0].PCAT_COMPAT based on x86ms->pic. Signed-off-by: Xiaoyao Li --- hw/i386/acpi-common.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c index 20f19269da40..0cc2919bb851 100644 --- a/hw/i386/acpi-common.c +++ b/hw/i386/acpi-common.c @@ -107,7 +107,9 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker, acpi_table_begin(, table_data); /* Local APIC Address */ build_append_int_noprefix(table_data, APIC_DEFAULT_ADDRESS, 4); -build_append_int_noprefix(table_data, 1 /* PCAT_COMPAT */, 4); /* Flags */ +/* Flags. bit 0: PCAT_COMPAT */ +build_append_int_noprefix(table_data, + x86ms->pic != ON_OFF_AUTO_OFF ? 1 : 0 , 4); for (i = 0; i < apic_ids->len; i++) { pc_madt_cpu_entry(i, apic_ids, table_data, false); -- 2.34.1
Re: [PATCH 26/26] i386/kvm: Move architectural CPUID leaf generation to separate helper
On 3/23/2024 2:11 AM, Paolo Bonzini wrote: From: Sean Christopherson Move the architectural (for lack of a better term) CPUID leaf generation to a separate helper so that the generation code can be reused by TDX, which needs to generate a canonical VM-scoped configuration. For now this is just a cleanup, so keep the function static. Signed-off-by: Sean Christopherson Signed-off-by: Xiaoyao Li Message-ID: <20240229063726.610065-23-xiaoyao...@intel.com> [Unify error reporting, rename function. - Paolo] Signed-off-by: Paolo Bonzini --- target/i386/kvm/kvm.c | 446 +- 1 file changed, 224 insertions(+), 222 deletions(-) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 2577e345502..eab6261e1f5 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -1752,6 +1752,228 @@ static void kvm_init_nested_state(CPUX86State *env) } } +static uint32_t kvm_x86_build_cpuid(CPUX86State *env, +struct kvm_cpuid_entry2 *entries, +uint32_t cpuid_i) +{ +uint32_t limit, i, j; +uint32_t unused; +struct kvm_cpuid_entry2 *c; + +cpu_x86_cpuid(env, 0, 0, , , , ); + +for (i = 0; i <= limit; i++) { +j = 0; +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +switch (i) { +case 2: { +/* Keep reading function 2 till all the input is received */ +int times; + +c->function = i; +c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC | + KVM_CPUID_FLAG_STATE_READ_NEXT; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +times = c->eax & 0xff; + +for (j = 1; j < times; ++j) { +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +c->function = i; +c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +} +break; +} +case 0x1f: +if (env->nr_dies < 2) { +cpuid_i--; +break; +} +/* fallthrough */ +case 4: +case 0xb: +case 0xd: +for (j = 0; ; j++) { +if (i == 0xd && j == 64) { +break; +} + +c->function = i; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +c->index = j; +cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx); + +if (i == 4 && c->eax == 0) { +break; +} +if (i == 0xb && !(c->ecx & 0xff00)) { +break; +} +if (i == 0x1f && !(c->ecx & 0xff00)) { +break; +} +if (i == 0xd && c->eax == 0) { +continue; +} +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +} +break; +case 0x12: +for (j = 0; ; j++) { +c->function = i; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +c->index = j; +cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx); + +if (j > 1 && (c->eax & 0xf) != 1) { +break; +} + +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +} +break; +case 0x7: +case 0x14: +case 0x1d: +case 0x1e: { +uint32_t times; + +c->function = i; +c->index = 0; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +times = c->eax; + +for (j = 1; j <= times; ++j) { +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +goto full; +} +c = [cpuid_i++]; +c->function = i; +c->index = j; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx); +} +break; +} +default: +c->function = i; +c->flags = 0; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +if (!c->eax &&
Re: [PATCH v3 48/49] hw/i386/sev: Use guest_memfd for legacy ROMs
On 3/21/2024 2:12 AM, Isaku Yamahata wrote: On Wed, Mar 20, 2024 at 03:39:44AM -0500, Michael Roth wrote: TODO: make this SNP-specific if TDX disables legacy ROMs in general TDX disables pc.rom, not disable isa-bios. IIRC, TDX doesn't need pc pflash. Not TDX doesn't need pc pflash, but TDX cannot support pflash. Can SNP support the behavior of pflash? That what's got changed will be synced back to OVMF file? Xiaoyao can chime in. Thanks, Current SNP guest kernels will attempt to access these regions with with C-bit set, so guest_memfd is needed to handle that. Otherwise, kvm_convert_memory() will fail when the guest kernel tries to access it and QEMU attempts to call KVM_SET_MEMORY_ATTRIBUTES to set these ranges to private. Whether guests should actually try to access ROM regions in this way (or need to deal with legacy ROM regions at all), is a separate issue to be addressed on kernel side, but current SNP guest kernels will exhibit this behavior and so this handling is needed to allow QEMU to continue running existing SNP guest kernels. Signed-off-by: Michael Roth --- hw/i386/pc.c | 13 + hw/i386/pc_sysfw.c | 13 ++--- 2 files changed, 19 insertions(+), 7 deletions(-) diff --git a/hw/i386/pc.c b/hw/i386/pc.c index feb7a93083..5feaeb43ee 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -1011,10 +1011,15 @@ void pc_memory_init(PCMachineState *pcms, pc_system_firmware_init(pcms, rom_memory); option_rom_mr = g_malloc(sizeof(*option_rom_mr)); -memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE, - _fatal); -if (pcmc->pci_enabled) { -memory_region_set_readonly(option_rom_mr, true); +if (machine_require_guest_memfd(machine)) { +memory_region_init_ram_guest_memfd(option_rom_mr, NULL, "pc.rom", + PC_ROM_SIZE, _fatal); +} else { +memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE, + _fatal); +if (pcmc->pci_enabled) { +memory_region_set_readonly(option_rom_mr, true); +} } memory_region_add_subregion_overlap(rom_memory, PC_ROM_MIN_VGA, diff --git a/hw/i386/pc_sysfw.c b/hw/i386/pc_sysfw.c index 9dbb3f7337..850f86edd4 100644 --- a/hw/i386/pc_sysfw.c +++ b/hw/i386/pc_sysfw.c @@ -54,8 +54,13 @@ static void pc_isa_bios_init(MemoryRegion *rom_memory, /* map the last 128KB of the BIOS in ISA space */ isa_bios_size = MIN(flash_size, 128 * KiB); isa_bios = g_malloc(sizeof(*isa_bios)); -memory_region_init_ram(isa_bios, NULL, "isa-bios", isa_bios_size, - _fatal); +if (machine_require_guest_memfd(current_machine)) { +memory_region_init_ram_guest_memfd(isa_bios, NULL, "isa-bios", + isa_bios_size, _fatal); +} else { +memory_region_init_ram(isa_bios, NULL, "isa-bios", isa_bios_size, + _fatal); +} memory_region_add_subregion_overlap(rom_memory, 0x10 - isa_bios_size, isa_bios, @@ -68,7 +73,9 @@ static void pc_isa_bios_init(MemoryRegion *rom_memory, ((uint8_t*)flash_ptr) + (flash_size - isa_bios_size), isa_bios_size); -memory_region_set_readonly(isa_bios, true); +if (!machine_require_guest_memfd(current_machine)) { +memory_region_set_readonly(isa_bios, true); +} } static PFlashCFI01 *pc_pflash_create(PCMachineState *pcms, -- 2.25.1
Re: [PATCH for-9.1 v5 2/3] target/i386: add guest-phys-bits cpu property
On 3/25/2024 10:14 PM, Paolo Bonzini wrote: From: Gerd Hoffmann Allows to set guest-phys-bits (cpuid leaf 8008, eax[23:16]) via -cpu $model,guest-phys-bits=$nr. Signed-off-by: Gerd Hoffmann Message-ID: <20240318155336.156197-3-kra...@redhat.com> Signed-off-by: Paolo Bonzini Reviewed-by: Xiaoyao Li --- v4->v5: - move here all non-KVM parts - add compat property and support for special value "-1" (accelerator defines value) target/i386/cpu.h | 1 + hw/i386/pc.c | 4 +++- target/i386/cpu.c | 22 ++ 3 files changed, 26 insertions(+), 1 deletion(-) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 6b057380791..83e47358451 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -2026,6 +2026,7 @@ struct ArchCPU { /* Number of physical address bits supported */ uint32_t phys_bits; +uint32_t guest_phys_bits; /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/hw/i386/pc.c b/hw/i386/pc.c index 461fcaa1b48..9c4b3969cc8 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -78,7 +78,9 @@ { "qemu64-" TYPE_X86_CPU, "model-id", "QEMU Virtual CPU version " v, },\ { "athlon-" TYPE_X86_CPU, "model-id", "QEMU Virtual CPU version " v, }, -GlobalProperty pc_compat_9_0[] = {}; +GlobalProperty pc_compat_9_0[] = { +{ TYPE_X86_CPU, "guest-phys-bits", "0" }, +}; const size_t pc_compat_9_0_len = G_N_ELEMENTS(pc_compat_9_0); GlobalProperty pc_compat_8_2[] = {}; diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 33760a2ee16..eef3d08473e 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) { /* 64 bit processor */ *eax |= (cpu_x86_virtual_addr_width(env) << 8); + *eax |= (cpu->guest_phys_bits << 16); } *ebx = env->features[FEAT_8000_0008_EBX]; if (cs->nr_cores * cs->nr_threads > 1) { @@ -7329,6 +7330,14 @@ static void x86_cpu_realizefn(DeviceState *dev, Error **errp) goto out; } +if (cpu->guest_phys_bits == -1) { +/* + * If it was not set by the user, or by the accelerator via + * cpu_exec_realizefn, clear. + */ +cpu->guest_phys_bits = 0; +} + if (cpu->ucode_rev == 0) { /* * The default is the same as KVM's. Note that this check @@ -7379,6 +7388,14 @@ static void x86_cpu_realizefn(DeviceState *dev, Error **errp) if (cpu->phys_bits == 0) { cpu->phys_bits = TCG_PHYS_ADDR_BITS; } +if (cpu->guest_phys_bits && +(cpu->guest_phys_bits > cpu->phys_bits || +cpu->guest_phys_bits < 32)) { +error_setg(errp, "guest-phys-bits should be between 32 and %u " + " (but is %u)", + cpu->phys_bits, cpu->guest_phys_bits); +return; +} } else { /* For 32 bit systems don't use the user set value, but keep * phys_bits consistent with what we tell the guest. @@ -7387,6 +7404,10 @@ static void x86_cpu_realizefn(DeviceState *dev, Error **errp) error_setg(errp, "phys-bits is not user-configurable in 32 bit"); return; } +if (cpu->guest_phys_bits != 0) { +error_setg(errp, "guest-phys-bits is not user-configurable in 32 bit"); +return; +} if (env->features[FEAT_1_EDX] & (CPUID_PSE36 | CPUID_PAE)) { cpu->phys_bits = 36; @@ -7887,6 +7908,7 @@ static Property x86_cpu_properties[] = { DEFINE_PROP_BOOL("x-force-features", X86CPU, force_features, false), DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true), DEFINE_PROP_UINT32("phys-bits", X86CPU, phys_bits, 0), +DEFINE_PROP_UINT32("guest-phys-bits", X86CPU, guest_phys_bits, -1), DEFINE_PROP_BOOL("host-phys-bits", X86CPU, host_phys_bits, false), DEFINE_PROP_UINT8("host-phys-bits-limit", X86CPU, host_phys_bits_limit, 0), DEFINE_PROP_BOOL("fill-mtrr-mask", X86CPU, fill_mtrr_mask, true),
Re: [PATCH 12/26] KVM: track whether guest state is encrypted
On 3/23/2024 2:11 AM, Paolo Bonzini wrote: So far, KVM has allowed KVM_GET/SET_* ioctls to execute even if the guest state is encrypted, in which case they do nothing. For the new API using VM types, instead, the ioctls will fail which is a safer and more robust approach. The new API will be the only one available for SEV-SNP and TDX, but it is also usable for SEV and SEV-ES. In preparation for that, require architecture-specific KVM code to communicate the point at which guest state is protected (which must be after kvm_cpu_synchronize_post_init(), though that might change in the future in order to suppor migration). From that point, skip reading registers so that cpu->vcpu_dirty is never true: if it ever becomes true, kvm_arch_put_registers() will fail miserably. Signed-off-by: Paolo Bonzini --- include/sysemu/kvm.h | 2 ++ include/sysemu/kvm_int.h | 1 + accel/kvm/kvm-all.c | 14 -- target/i386/sev.c| 1 + 4 files changed, 16 insertions(+), 2 deletions(-) diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index fad9a7e8ff3..302e8f6f1e5 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -539,6 +539,8 @@ bool kvm_dirty_ring_enabled(void); uint32_t kvm_dirty_ring_size(void); +void kvm_mark_guest_state_protected(void); + /** * kvm_hwpoisoned_mem - indicate if there is any hwpoisoned page * reported for the VM. diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h index 882e37e12c5..3496be7997a 100644 --- a/include/sysemu/kvm_int.h +++ b/include/sysemu/kvm_int.h @@ -87,6 +87,7 @@ struct KVMState bool kernel_irqchip_required; OnOffAuto kernel_irqchip_split; bool sync_mmu; +bool guest_state_protected; uint64_t manual_dirty_log_protect; /* The man page (and posix) say ioctl numbers are signed int, but * they're not. Linux, glibc and *BSD all treat ioctl numbers as diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index a8cecd040eb..05fa3533c66 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2698,7 +2698,7 @@ bool kvm_cpu_check_are_resettable(void) static void do_kvm_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg) { -if (!cpu->vcpu_dirty) { +if (!cpu->vcpu_dirty && !kvm_state->guest_state_protected) { int ret = kvm_arch_get_registers(cpu); if (ret) { error_report("Failed to get registers: %s", strerror(-ret)); @@ -2712,7 +2712,7 @@ static void do_kvm_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg) void kvm_cpu_synchronize_state(CPUState *cpu) { -if (!cpu->vcpu_dirty) { +if (!cpu->vcpu_dirty && !kvm_state->guest_state_protected) { run_on_cpu(cpu, do_kvm_cpu_synchronize_state, RUN_ON_CPU_NULL); } } @@ -2747,6 +2747,11 @@ static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg) void kvm_cpu_synchronize_post_init(CPUState *cpu) { +/* + * This runs before the machine_init_done notifiers, and is the last + * opportunity to synchronize the state of confidential guests. + */ > +assert(!kvm_state->guest_state_protected); So, this requires confidential guests to call kvm_mark_guest_state_protected() in its machine_init_done notifier callback? But for TDX, the guest_state is protected at the beginning, not some time later when machine_init_done. run_on_cpu(cpu, do_kvm_cpu_synchronize_post_init, RUN_ON_CPU_NULL); } @@ -4094,3 +4099,8 @@ void query_stats_schemas_cb(StatsSchemaList **result, Error **errp) query_stats_schema_vcpu(first_cpu, _args); } } + +void kvm_mark_guest_state_protected(void) +{ +kvm_state->guest_state_protected = true; +} diff --git a/target/i386/sev.c b/target/i386/sev.c index b8f79d34d19..c49a8fd55eb 100644 --- a/target/i386/sev.c +++ b/target/i386/sev.c @@ -755,6 +755,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused) if (ret) { exit(1); } +kvm_mark_guest_state_protected(); } /* query the measurement blob length */
Re: [PATCH 21/26] kvm/memory: Make memory type private by default if it has guest memfd backend
On 3/23/2024 2:11 AM, Paolo Bonzini wrote: From: Xiaoyao Li KVM side leaves the memory to shared by default, while may incur the /s/while/which/ fix typo from myself. overhead of paging conversion on the first visit of each page. Because the expectation is that page is likely to private for the VMs that require private memory (has guest memfd). Explicitly set the memory to private when memory region has valid guest memfd backend. Signed-off-by: Xiaoyao Li Signed-off-by: Michael Roth Message-ID: <20240320083945.991426-16-michael.r...@amd.com> Signed-off-by: Paolo Bonzini --- accel/kvm/kvm-all.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 7fbaf31cbaf..56b17cbd8aa 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -1430,6 +1430,16 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml, strerror(-err)); abort(); } + +if (memory_region_has_guest_memfd(mr)) { +err = kvm_set_memory_attributes_private(start_addr, slot_size); +if (err) { +error_report("%s: failed to set memory attribute private: %s\n", + __func__, strerror(-err)); +exit(1); +} +} + start_addr += slot_size; ram_start_offset += slot_size; ram += slot_size;
Re: [PATCH 25/26] kvm: handle KVM_EXIT_MEMORY_FAULT
On 3/23/2024 2:11 AM, Paolo Bonzini wrote: From: Chao Peng When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to typo: /s/geeting/getting do the memory conversion on the RAMBlock to turn the memory into desired attribute, i.e., private/shared. Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when KVM_EXIT_MEMORY_FAULT happens. Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has guest_memfd memory backend. Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is added. When page is converted from shared to private, the original shared memory can be discarded via ram_block_discard_range(). Note, shared memory can be discarded only when it's not back'ed by hugetlb because hugetlb is supposed to be pre-allocated and no need for discarding. Signed-off-by: Chao Peng Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li Message-ID: <20240320083945.991426-13-michael.r...@amd.com> Signed-off-by: Paolo Bonzini --- include/sysemu/kvm.h | 2 + accel/kvm/kvm-all.c| 99 +- accel/kvm/trace-events | 2 + 3 files changed, 93 insertions(+), 10 deletions(-) diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index 2cb31925091..698f1640fe2 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -541,4 +541,6 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp); int kvm_set_memory_attributes_private(hwaddr start, hwaddr size); int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size); + +int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private); #endif diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 56b17cbd8aa..afd7f992e39 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2893,6 +2893,70 @@ static void kvm_eat_signals(CPUState *cpu) } while (sigismember(, SIG_IPI)); } +int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) +{ +MemoryRegionSection section; +ram_addr_t offset; +MemoryRegion *mr; +RAMBlock *rb; +void *addr; +int ret = -1; + +trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : "private_to_shared"); + +if (!QEMU_PTR_IS_ALIGNED(start, qemu_real_host_page_size()) || +!QEMU_PTR_IS_ALIGNED(size, qemu_real_host_page_size())) { +return -1; +} + +if (!size) { +return -1; +} + +section = memory_region_find(get_system_memory(), start, size); +mr = section.mr; +if (!mr) { +return -1; +} + +if (!memory_region_has_guest_memfd(mr)) { +error_report("Converting non guest_memfd backed memory region " + "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s", + start, size, to_private ? "private" : "shared"); +ret = -1; No need for it. ret is initialized as -1 at the function start. +goto out_unref; +} + +if (to_private) { +ret = kvm_set_memory_attributes_private(start, size); +} else { +ret = kvm_set_memory_attributes_shared(start, size); +} +if (ret) { +goto out_unref; +} + +addr = memory_region_get_ram_ptr(mr) + section.offset_within_region; +rb = qemu_ram_block_from_host(addr, false, ); + +if (to_private) { +if (rb->page_size == qemu_real_host_page_size()) { +/* +* shared memory is back'ed by hugetlb, which is supposed to be Please fix the bad comment indentation for me, as well as the extra space before 'hugetlb' +* pre-allocated and doesn't need to be discarded +*/ +goto out_unref; +} +ret = ram_block_discard_range(rb, offset, size); +} else { +ret = ram_block_discard_guest_memfd_range(rb, offset, size); +} + +out_unref: +memory_region_unref(section.mr); +return ret; +} + int kvm_cpu_exec(CPUState *cpu) { struct kvm_run *run = cpu->kvm_run; @@ -2960,18 +3024,20 @@ int kvm_cpu_exec(CPUState *cpu) ret = EXCP_INTERRUPT; break; } -fprintf(stderr, "error: kvm run failed %s\n", -strerror(-run_ret)); +if (!(run_ret == -EFAULT && run->exit_reason == KVM_EXIT_MEMORY_FAULT)) { +fprintf(stderr, "error: kvm run failed %s\n", +strerror(-run_ret)); #ifdef TARGET_PPC -if (run_ret == -EBUSY) { -fprintf(stderr, -"This is probably because your SMT is enabled.\n" -"VCPU can only run on primary threads with all " -"secondary threads offline.\n"); -} +if (run_ret == -EBUSY) { +fprintf(stderr, +
Re: [PATCH 3/7] KVM: track whether guest state is encrypted
On 3/19/2024 9:59 PM, Paolo Bonzini wrote: So far, KVM has allowed KVM_GET/SET_* ioctls to execute even if the guest state is encrypted, in which case they do nothing. For the new API using VM types, instead, the ioctls will fail which is a safer and more robust approach. The new API will be the only one available for SEV-SNP and TDX, but it is also usable for SEV and SEV-ES. In preparation for that, require architecture-specific KVM code to communicate the point at which guest state is protected (which must be after kvm_cpu_synchronize_post_init(), though that might change in the future in order to suppor migration). From that point, skip reading registers so that cpu->vcpu_dirty is never true: if it ever becomes true, kvm_arch_put_registers() will fail miserably. Signed-off-by: Paolo Bonzini Reviewed-by: Xiaoyao Li
Re: [PATCH 4/7] KVM: remove kvm_arch_cpu_check_are_resettable
On 3/19/2024 9:59 PM, Paolo Bonzini wrote: Board reset requires writing a fresh CPU state. As far as KVM is concerned, the only thing that blocks reset is that CPU state is encrypted; therefore, kvm_cpus_are_resettable() can simply check if that is the case. Signed-off-by: Paolo Bonzini Reviewed-by: Xiaoyao Li
Re: [PATCH 5/7] target/i386: introduce x86-confidential-guest
On 3/19/2024 9:59 PM, Paolo Bonzini wrote: Introduce a common superclass for x86 confidential guest implementations. It will extend ConfidentialGuestSupportClass with a method that provides the VM type to be passed to KVM_CREATE_VM. Signed-off-by: Paolo Bonzini Reviewed-by: Xiaoyao Li
Re: [PATCH 6/7] target/i386: Implement mc->kvm_type() to get VM type
On 3/19/2024 9:59 PM, Paolo Bonzini wrote: From: Xiaoyao Li KVM is introducing a new API to create confidential guests, which will be used by TDX and SEV-SNP but is also available for SEV and SEV-ES. The API uses the VM type argument to KVM_CREATE_VM to identify which confidential computing technology to use. Since there are no other expected uses of VM types, delegate mc->kvm_type() for x86 boards to the confidential-guest-support object pointed to by ms->cgs. For example, if a sev-guest object is specified to confidential-guest-support, like, qemu -machine ...,confidential-guest-support=sev0 \ -object sev-guest,id=sev0,... it will check if a VM type KVM_X86_SEV_VM or KVM_X86_SEV_ES_VM is supported, and if so use them together with the KVM_SEV_INIT2 function of the KVM_MEMORY_ENCRYPT_OP ioctl. If not, it will fall back to KVM_SEV_INIT and KVM_SEV_ES_INIT. This is a preparatory work towards TDX and SEV-SNP support, but it will also enable support for VMSA features such as DebugSwap, which are only available via KVM_SEV_INIT2. Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li Signed-off-by: Paolo Bonzini Reviewed-by: Xiaoyao Li some nits below. --- target/i386/confidential-guest.h | 19 ++ target/i386/kvm/kvm_i386.h | 2 ++ hw/i386/x86.c| 6 + target/i386/kvm/kvm.c| 44 4 files changed, 71 insertions(+) diff --git a/target/i386/confidential-guest.h b/target/i386/confidential-guest.h index ca12d5a8fba..532e172a60b 100644 --- a/target/i386/confidential-guest.h +++ b/target/i386/confidential-guest.h @@ -36,5 +36,24 @@ struct X86ConfidentialGuest { struct X86ConfidentialGuestClass { /* */ ConfidentialGuestSupportClass parent; + +/* */ +int (*kvm_type)(X86ConfidentialGuest *cg); }; + +/** + * x86_confidential_guest_kvm_type: + * + * Calls #X86ConfidentialGuestClass.unplug callback of @plug_handler. ah, forgot to change the callback name after copy+paste. + */ +static inline int x86_confidential_guest_kvm_type(X86ConfidentialGuest *cg) +{ +X86ConfidentialGuestClass *klass = X86_CONFIDENTIAL_GUEST_GET_CLASS(cg); + +if (klass->kvm_type) { +return klass->kvm_type(cg); +} else { +return 0; +} +} #endif diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h index 30fedcffea3..02168122787 100644 --- a/target/i386/kvm/kvm_i386.h +++ b/target/i386/kvm/kvm_i386.h @@ -37,6 +37,7 @@ bool kvm_hv_vpindex_settable(void); bool kvm_enable_sgx_provisioning(KVMState *s); bool kvm_hyperv_expand_features(X86CPU *cpu, Error **errp); +int kvm_get_vm_type(MachineState *ms, const char *vm_type); void kvm_arch_reset_vcpu(X86CPU *cs); void kvm_arch_after_reset_vcpu(X86CPU *cpu); void kvm_arch_do_init_vcpu(X86CPU *cs); @@ -49,6 +50,7 @@ void kvm_request_xsave_components(X86CPU *cpu, uint64_t mask); #ifdef CONFIG_KVM +bool kvm_is_vm_type_supported(int type); bool kvm_has_adjust_clock_stable(void); bool kvm_has_exception_payload(void); void kvm_synchronize_all_tsc(void); diff --git a/hw/i386/x86.c b/hw/i386/x86.c index ffbda48917f..2d4b148cd25 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -1389,6 +1389,11 @@ static void machine_set_sgx_epc(Object *obj, Visitor *v, const char *name, qapi_free_SgxEPCList(list); } +static int x86_kvm_type(MachineState *ms, const char *vm_type) +{ +return kvm_enabled() ? kvm_get_vm_type(ms, vm_type) : 0; +} + static void x86_machine_initfn(Object *obj) { X86MachineState *x86ms = X86_MACHINE(obj); @@ -1413,6 +1418,7 @@ static void x86_machine_class_init(ObjectClass *oc, void *data) mc->cpu_index_to_instance_props = x86_cpu_index_to_props; mc->get_default_cpu_node_id = x86_get_default_cpu_node_id; mc->possible_cpu_arch_ids = x86_possible_cpu_arch_ids; +mc->kvm_type = x86_kvm_type; x86mc->save_tsc_khz = true; x86mc->fwcfg_dma_enabled = true; nc->nmi_monitor_handler = x86_nmi; diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 0ec69109a2b..e109648f260 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -31,6 +31,7 @@ #include "sysemu/kvm_int.h" #include "sysemu/runstate.h" #include "kvm_i386.h" +#include "../confidential-guest.h" #include "sev.h" #include "xen-emu.h" #include "hyperv.h" @@ -161,6 +162,49 @@ static KVMMSRHandlers msr_handlers[KVM_MSR_FILTER_MAX_RANGES]; static RateLimit bus_lock_ratelimit_ctrl; static int kvm_get_one_msr(X86CPU *cpu, int index, uint64_t *value); +static const char *vm_type_name[] = { +[KVM_X86_DEFAULT_VM] = "default", +}; + +bool kvm_is_vm_type_supported(int type) +{ +uint32_t machine_types; The name of machine_types confuses me a lot. why not supported_vm_types? + +/* + * old KVM doesn't support
Re: [PATCH RFC v3 00/49] Add AMD Secure Nested Paging (SEV-SNP) support
On 3/21/2024 1:08 AM, Paolo Bonzini wrote: On Wed, Mar 20, 2024 at 10:59 AM Paolo Bonzini wrote: I will now focus on reviewing patches 6-20. This way we can prepare a common tree for SEV_INIT2/SNP/TDX, for both vendors to build upon. Ok, the attachment is the delta that I have. The only major change is requiring discard (thus effectively blocking VFIO support for SEV-SNP/TDX, at least for now). I will push it shortly to the same sevinit2 branch, and will post the patches sometime soon. Xiaoyao, you can use that branch too (it's on https://gitlab.com/bonzini/qemu) as the basis for your TDX work. Sure, it's really a good news for us. BTW, there are some minor comments on guest_memfd patches of my v5 post[*]. Could you please resolve them it your branch? [*] https://lore.kernel.org/qemu-devel/20240229063726.610065-1-xiaoyao...@intel.com/ Paolo
Re: [PATCH v5 08/65] kvm: handle KVM_EXIT_MEMORY_FAULT
On 3/19/2024 10:14 AM, Wang, Lei wrote: On 2/29/2024 14:36, Xiaoyao Li wrote: From: Chao Peng When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to do the memory conversion on the RAMBlock to turn the memory into desired attribute, i.e., private/shared. Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when KVM_EXIT_MEMORY_FAULT happens. Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has guest_memfd memory backend. Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is added. When page is converted from shared to private, the original shared memory can be discarded via ram_block_discard_range(). Note, shared memory can be discarded only when it's not back'ed by hugetlb because hugetlb is supposed to be pre-allocated and no need for discarding. Signed-off-by: Chao Peng Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li --- Changes in v4: - open-coded ram_block_discard logic; - change warn_report() to error_report(); (Daniel) --- accel/kvm/kvm-all.c | 94 - 1 file changed, 84 insertions(+), 10 deletions(-) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 70d482a2c936..87e4275932a7 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2903,6 +2903,68 @@ static void kvm_eat_signals(CPUState *cpu) } while (sigismember(, SIG_IPI)); } +static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) +{ +MemoryRegionSection section; +ram_addr_t offset; +MemoryRegion *mr; +RAMBlock *rb; +void *addr; +int ret = -1; + +if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) || +!QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) { +return -1; +} + +if (!size) { +return -1; +} + +section = memory_region_find(get_system_memory(), start, size); +mr = section.mr; +if (!mr) { +return -1; +} + +if (memory_region_has_guest_memfd(mr)) { +if (to_private) { +ret = kvm_set_memory_attributes_private(start, size); +} else { +ret = kvm_set_memory_attributes_shared(start, size); +} + +if (ret) { +memory_region_unref(section.mr); +return ret; +} + +addr = memory_region_get_ram_ptr(mr) + section.offset_within_region; +rb = qemu_ram_block_from_host(addr, false, ); + +if (to_private) { +if (rb->page_size != qemu_host_page_size) { +/* +* shared memory is back'ed by hugetlb, which is supposed to be +* pre-allocated and doesn't need to be discarded +*/ Nit: comment indentation is broken here. +return 0; +} else { +ret = ram_block_discard_range(rb, offset, size); +} +} else { +ret = ram_block_discard_guest_memfd_range(rb, offset, size); +} +} else { +error_report("Convert non guest_memfd backed memory region " +"(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s", Same as above. Fixed. thanks!
Re: [PATCH v3 13/49] [FIXUP] "kvm: handle KVM_EXIT_MEMORY_FAULT": drop qemu_host_page_size
On 3/20/2024 4:39 PM, Michael Roth wrote: TODO: squash into "kvm: handle KVM_EXIT_MEMORY_FAULT" qemu_host_page_size has been superseded by qemu_real_host_page_size() in newer QEMU, so update the patch accordingly. I found it today as well when rebase to qemu v9.0.0-rc0. Fix it locally, will show up on my next post of TDX-QEMU patches. :) Signed-off-by: Michael Roth --- accel/kvm/kvm-all.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 2fdc07a472..a9c19ab9a1 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2912,8 +2912,8 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) void *addr; int ret = -1; -if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) || -!QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) { +if (!QEMU_PTR_IS_ALIGNED(start, qemu_real_host_page_size()) || +!QEMU_PTR_IS_ALIGNED(size, qemu_real_host_page_size())) { return -1; } @@ -2943,7 +2943,7 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) rb = qemu_ram_block_from_host(addr, false, ); if (to_private) { -if (rb->page_size != qemu_host_page_size) { +if (rb->page_size != qemu_real_host_page_size()) { /* * shared memory is back'ed by hugetlb, which is supposed to be * pre-allocated and doesn't need to be discarded
Re: [PATCH v5 06/65] kvm: Introduce support for memory_attributes
On 3/19/2024 10:03 AM, Wang, Lei wrote: On 2/29/2024 14:36, Xiaoyao Li wrote:> Introduce the helper functions to set the attributes of a range of memory to private or shared. This is necessary to notify KVM the private/shared attribute of each gpa range. KVM needs the information to decide the GPA needs to be mapped at hva-based shared memory or guest_memfd based private memory. Signed-off-by: Xiaoyao Li --- Changes in v4: - move the check of kvm_supported_memory_attributes to the common kvm_set_memory_attributes(); (Wang Wei) - change warn_report() to error_report() in kvm_set_memory_attributes() and drop the __func__; (Daniel) --- accel/kvm/kvm-all.c | 44 include/sysemu/kvm.h | 3 +++ 2 files changed, 47 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index cd0aa7545a1f..70d482a2c936 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -92,6 +92,7 @@ static bool kvm_has_guest_debug; static int kvm_sstep_flags; static bool kvm_immediate_exit; static bool kvm_guest_memfd_supported; +static uint64_t kvm_supported_memory_attributes; static hwaddr kvm_max_slot_size = ~0; static const KVMCapabilityInfo kvm_required_capabilites[] = { @@ -1304,6 +1305,46 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size) kvm_max_slot_size = max_slot_size; } +static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr) +{ +struct kvm_memory_attributes attrs; +int r; + +if (kvm_supported_memory_attributes == 0) { +error_report("No memory attribute supported by KVM\n"); +return -EINVAL; +} + +if ((attr & kvm_supported_memory_attributes) != attr) { +error_report("memory attribute 0x%lx not supported by KVM," + " supported bits are 0x%lx\n", + attr, kvm_supported_memory_attributes); +return -EINVAL; +} + +attrs.attributes = attr; +attrs.address = start; +attrs.size = size; +attrs.flags = 0; + +r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, ); +if (r) { +error_report("failed to set memory (0x%lx+%#zx) with attr 0x%lx error '%s'", + start, size, attr, strerror(errno)); +} +return r; +} + +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size) +{ +return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE); +} + +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size) +{ +return kvm_set_memory_attributes(start, size, 0); +} + /* Called with KVMMemoryListener.slots_lock held */ static void kvm_set_phys_mem(KVMMemoryListener *kml, MemoryRegionSection *section, bool add) @@ -2439,6 +2480,9 @@ static int kvm_init(MachineState *ms) kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD); +ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES); +kvm_supported_memory_attributes = ret > 0 ? ret : 0; kvm_check_extension() only returns non-negative value, so we can just kvm_supported_memory_attributes = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES); Good catch, I will update it in next version. + if (object_property_find(OBJECT(current_machine), "kvm-type")) { g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine), "kvm-type", diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index 6cdf82de8372..8e83adfbbd19 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -546,4 +546,7 @@ uint32_t kvm_dirty_ring_size(void); bool kvm_hwpoisoned_mem(void); int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp); + +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size); +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size); #endif
Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()
On 3/20/2024 5:37 PM, David Hildenbrand wrote: On 20.03.24 09:39, Michael Roth wrote: From: Xiaoyao Li When memory page is converted from private to shared, the original private memory is back'ed by guest_memfd. Introduce ram_block_discard_guest_memfd_range() for discarding memory in guest_memfd. Originally-from: Isaku Yamahata Codeveloped-by: Xiaoyao Li "Co-developed-by" Michael is using the patch from my TDX-QEMU v5 series[1]. I need to fix it. Signed-off-by: Xiaoyao Li Reviewed-by: David Hildenbrand Your SOB should go here. --- Changes in v5: - Collect Reviewed-by from David; Changes in in v4: - Drop ram_block_convert_range() and open code its implementation in the   next Patch. Signed-off-by: Michael Roth I only received 3 patches from this series, and now I am confused: changelog talks about v5 and this is "PATCH v3" As above, because the guest_memfd patches in my TDX-QEMU v5[1] were directly picked for this series, so the change history says v5. They are needed by SEV-SNP as well. I want to raise the question, how do we want to proceed with the guest memfd patches (patch 2 to 10 in [1])? Can they be merged separately before TDX/SNP patches? Please make sure to send at least the cover letter along (I might not need the other 46 patches :D ). [1] https://lore.kernel.org/qemu-devel/20240229063726.610065-1-xiaoyao...@intel.com/
Re: [PATCH v4 1/2] kvm: add support for guest physical bits
On 3/18/2024 11:53 PM, Gerd Hoffmann wrote: Query kvm for supported guest physical address bits, in cpuid function 8008, eax[23:16]. Usually this is identical to host physical address bits. With NPT or EPT being used this might be restricted to 48 (max 4-level paging address space size) even if the host cpu supports more physical address bits. When set pass this to the guest, using cpuid too. Guest firmware can use this to figure how big the usable guest physical address space is, so PCI bar mapping are actually reachable. Signed-off-by: Gerd Hoffmann --- target/i386/cpu.h | 1 + target/i386/cpu.c | 1 + target/i386/kvm/kvm-cpu.c | 31 ++- 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 952174bb6f52..d427218827f6 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -2026,6 +2026,7 @@ struct ArchCPU { /* Number of physical address bits supported */ uint32_t phys_bits; +uint32_t guest_phys_bits; /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 9a210d8d9290..c88c895a5b3e 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) { /* 64 bit processor */ *eax |= (cpu_x86_virtual_addr_width(env) << 8); + *eax |= (cpu->guest_phys_bits << 16); } *ebx = env->features[FEAT_8000_0008_EBX]; if (cs->nr_cores * cs->nr_threads > 1) { diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c index 9c791b7b0520..5132bb96abd5 100644 --- a/target/i386/kvm/kvm-cpu.c +++ b/target/i386/kvm/kvm-cpu.c @@ -18,10 +18,33 @@ #include "kvm_i386.h" #include "hw/core/accel-cpu.h" +static void kvm_set_guest_phys_bits(CPUState *cs) +{ +X86CPU *cpu = X86_CPU(cs); +uint32_t eax, guest_phys_bits; + +eax = kvm_arch_get_supported_cpuid(cs->kvm_state, 0x8008, 0, R_EAX); +guest_phys_bits = (eax >> 16) & 0xff; +if (!guest_phys_bits) { +return; +} + +if (cpu->guest_phys_bits == 0 || +cpu->guest_phys_bits > guest_phys_bits) { +cpu->guest_phys_bits = guest_phys_bits; +} + +if (cpu->host_phys_bits_limit && +cpu->guest_phys_bits > cpu->host_phys_bits_limit) { +cpu->guest_phys_bits = cpu->host_phys_bits_limit; host_phys_bits_limit takes effect only when cpu->host_phys_bits is set. If users pass configuration like "-cpu qemu64,phys-bits=52,host-phys-bits-limit=45", the cpu->guest_phys_bits will be set to 45. I think this is not what we want, though the usage seems insane. We can guard it as if (cpu->host_phys_bits && cpu->host_phys_bits_limit && cpu->guest_phys_bits > cpu->host_phys_bits_limt) { } Simpler, we can guard with cpu->phys_bits like below, because cpu->host_phys_bits_limit is used to guard cpu->phys_bits in host_cpu_realizefn() if (cpu->guest_phys_bits > cpu->phys_bits) { cpu->guest_phys_bits = cpu->phys_bits; } with this resolved, Reviewed-by: Xiaoyao Li +} +} + static bool kvm_cpu_realizefn(CPUState *cs, Error **errp) { X86CPU *cpu = X86_CPU(cs); CPUX86State *env = >env; +bool ret; /* * The realize order is important, since x86_cpu_realize() checks if @@ -50,7 +73,13 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp) MSR_IA32_UCODE_REV); } } -return host_cpu_realizefn(cs, errp); +ret = host_cpu_realizefn(cs, errp); +if (!ret) { +return ret; +} + +kvm_set_guest_phys_bits(cs); +return true; } static bool lmce_supported(void)
Re: [PATCH] target/i386: Export RFDS bit to guests
On 3/19/2024 11:08 PM, Pawan Gupta wrote: On Tue, Mar 19, 2024 at 12:22:08PM +0800, Xiaoyao Li wrote: On 3/13/2024 10:53 PM, Pawan Gupta wrote: Register File Data Sampling (RFDS) is a CPU side-channel vulnerability that may expose stale register value. CPUs that set RFDS_NO bit in MSR IA32_ARCH_CAPABILITIES indicate that they are not vulnerable to RFDS. Similarly, RFDS_CLEAR indicates that CPU is affected by RFDS, and has the microcode to help mitigate RFDS. Make RFDS_CLEAR and RFDS_NO bits available to guests. What's the status of KVM part? KVM part is already upstreamed and backported: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.8.1=50d33b98b1e23d1cd8743b3cac7a0ae5718b8b00 I see. It was not sent to kvm maillist and not merged through KVM tree. With KVM part in palce, Reviewed-by: Xiaoyao Li
Re: [PATCH] target/i386: Export RFDS bit to guests
On 3/13/2024 10:53 PM, Pawan Gupta wrote: Register File Data Sampling (RFDS) is a CPU side-channel vulnerability that may expose stale register value. CPUs that set RFDS_NO bit in MSR IA32_ARCH_CAPABILITIES indicate that they are not vulnerable to RFDS. Similarly, RFDS_CLEAR indicates that CPU is affected by RFDS, and has the microcode to help mitigate RFDS. Make RFDS_CLEAR and RFDS_NO bits available to guests. What's the status of KVM part? Signed-off-by: Pawan Gupta --- target/i386/cpu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 9a210d8d9290..693a5e0fb2ce 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -1158,8 +1158,8 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = { NULL, "sbdr-ssdp-no", "fbsdp-no", "psdp-no", NULL, "fb-clear", NULL, NULL, NULL, NULL, NULL, NULL, -"pbrsb-no", NULL, "gds-no", NULL, -NULL, NULL, NULL, NULL, +"pbrsb-no", NULL, "gds-no", "rfds-no", +"rfds-clear", NULL, NULL, NULL, }, .msr = { .index = MSR_IA32_ARCH_CAPABILITIES, base-commit: a1932d7cd6507d4d9db2044a54731fff3e749bac
Re: [PATCH 2/4] i386/sev: Switch to use confidential_guest_kvm_init()
On 3/19/2024 5:51 AM, Paolo Bonzini wrote: On Thu, Feb 29, 2024 at 7:01 AM Xiaoyao Li wrote: Use confidential_guest_kvm_init() instead of calling SEV specific sev_kvm_init(). As a bouns, it fits to future TDX when TDX implements its own confidential_guest_support and .kvm_init(). Move the "TypeInfo sev_guest_info" definition and related functions to the end of the file, to avoid declaring the sev_kvm_init() ahead. Delete the sve-stub.c since it's not needed anymore. Signed-off-by: Xiaoyao Li --- Changes from rfc v1: - check ms->cgs not NULL before calling confidential_guest_kvm_init(); - delete the sev-stub.c; Queued, with just one small simplification that can be done on top: thank you, Paolo! diff --git a/target/i386/sev.c b/target/i386/sev.c index e89d64fa52..b8f79d34d1 100644 --- a/target/i386/sev.c +++ b/target/i386/sev.c @@ -851,18 +851,13 @@ sev_vm_state_change(void *opaque, bool running, RunState state) static int sev_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) { -SevGuestState *sev -= (SevGuestState *)object_dynamic_cast(OBJECT(cgs), TYPE_SEV_GUEST); +SevGuestState *sev = SEV_GUEST(cgs); char *devname; int ret, fw_error, cmd; uint32_t ebx; uint32_t host_cbitpos; struct sev_user_data_status status = {}; -if (!sev) { -return 0; -} - ret = ram_block_discard_disable(true); if (ret) { error_report("%s: cannot disable RAM discard", __func__); It looks good. Thanks! Paolo
Re: [PATCH v3 2/3] kvm: add support for guest physical bits
On 3/13/2024 9:27 PM, Gerd Hoffmann wrote: Query kvm for supported guest physical address bits, in cpuid function 8008, eax[23:16]. Usually this is identical to host physical address bits. With NPT or EPT being used this might be restricted to 48 (max 4-level paging address space size) even if the host cpu supports more physical address bits. When set pass this to the guest, using cpuid too. Guest firmware can use this to figure how big the usable guest physical address space is, so PCI bar mapping are actually reachable. Signed-off-by: Gerd Hoffmann --- target/i386/cpu.h | 1 + target/i386/cpu.c | 1 + target/i386/kvm/kvm-cpu.c | 32 +++- 3 files changed, 33 insertions(+), 1 deletion(-) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 952174bb6f52..d427218827f6 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -2026,6 +2026,7 @@ struct ArchCPU { /* Number of physical address bits supported */ uint32_t phys_bits; +uint32_t guest_phys_bits; /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 9a210d8d9290..c88c895a5b3e 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) { /* 64 bit processor */ *eax |= (cpu_x86_virtual_addr_width(env) << 8); + *eax |= (cpu->guest_phys_bits << 16); } *ebx = env->features[FEAT_8000_0008_EBX]; if (cs->nr_cores * cs->nr_threads > 1) { diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c index 9c791b7b0520..a2b7bfaeadf8 100644 --- a/target/i386/kvm/kvm-cpu.c +++ b/target/i386/kvm/kvm-cpu.c @@ -18,10 +18,36 @@ #include "kvm_i386.h" #include "hw/core/accel-cpu.h" +static void kvm_set_guest_phys_bits(CPUState *cs) +{ +X86CPU *cpu = X86_CPU(cs); +uint32_t eax, guest_phys_bits; + +if (!cpu->host_phys_bits) { +return; +} This needs explanation of why. What if users set the phys-bits to exactly host's value, via "-cpu xxx,phys-bits=host's value"? +eax = kvm_arch_get_supported_cpuid(cs->kvm_state, 0x8008, 0, R_EAX); +guest_phys_bits = (eax >> 16) & 0xff; +if (!guest_phys_bits) { +return; +} + +if (cpu->guest_phys_bits == 0 || +cpu->guest_phys_bits > guest_phys_bits) { +cpu->guest_phys_bits = guest_phys_bits; +} + +if (cpu->guest_phys_bits > cpu->host_phys_bits_limit) { +cpu->guest_phys_bits = cpu->host_phys_bits_limit; +} +} + static bool kvm_cpu_realizefn(CPUState *cs, Error **errp) { X86CPU *cpu = X86_CPU(cs); CPUX86State *env = >env; +bool ret; /* * The realize order is important, since x86_cpu_realize() checks if @@ -50,7 +76,11 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp) MSR_IA32_UCODE_REV); } } -return host_cpu_realizefn(cs, errp); +ret = host_cpu_realizefn(cs, errp); We need to check ret and return if !ret; +kvm_set_guest_phys_bits(cs); + +return ret; } static bool lmce_supported(void)
Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL
On 3/13/2024 11:31 PM, Daniel P. Berrangé wrote: On Tue, Mar 12, 2024 at 03:44:32PM +0800, Xiaoyao Li wrote: On 3/11/2024 5:27 PM, Daniel P. Berrangé wrote: On Thu, Feb 29, 2024 at 01:37:10AM -0500, Xiaoyao Li wrote: From: Isaku Yamahata Add property "quote-generation-socket" to tdx-guest, which is a property of type SocketAddress to specify Quote Generation Service(QGS). On request of GetQuote, it connects to the QGS socket, read request data from shared guest memory, send the request data to the QGS, and store the response into shared guest memory, at last notify TD guest by interrupt. command line example: qemu-system-x86_64 \ -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"1","port":"1234"}}' \ Can you illustrate this with 'unix' sockets, not 'vsock'. Are you suggesting only updating the commit message to an example of unix socket? Or you want the code to test with some unix socket QGS? (It seems the QGS I got for testing, only supports vsock socket. Because at the time when it got developed, it was supposed to communicate with drivers inside TD guest directly not via VMM (KVM+QEMU). Anyway, I will talk to internal folks to see if any plan to support unix socket.) The QGS provided as part of DCAP supports running with both UNIX sockets and VSOCK, and I would expect QEMU to be made to work with this, since its is Intel's OSS reference impl. After synced with internal folks, yes, the QGS I used does support unix socket. I tested it and it worked. -object '{"qom-type":"tdx-guest","id":"tdx","quote-generation-socket":{"type":"unix", "path":"/var/run/tdx-qgs/qgs.socket"}}' Exposing QGS to the guest when we only intend for it to be used by the host QEMU is needlessly expanding the attack surface. With regards, Daniel
Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL
On 3/11/2024 5:27 PM, Daniel P. Berrangé wrote: On Thu, Feb 29, 2024 at 01:37:10AM -0500, Xiaoyao Li wrote: From: Isaku Yamahata Add property "quote-generation-socket" to tdx-guest, which is a property of type SocketAddress to specify Quote Generation Service(QGS). On request of GetQuote, it connects to the QGS socket, read request data from shared guest memory, send the request data to the QGS, and store the response into shared guest memory, at last notify TD guest by interrupt. command line example: qemu-system-x86_64 \ -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"1","port":"1234"}}' \ Can you illustrate this with 'unix' sockets, not 'vsock'. Are you suggesting only updating the commit message to an example of unix socket? Or you want the code to test with some unix socket QGS? (It seems the QGS I got for testing, only supports vsock socket. Because at the time when it got developed, it was supposed to communicate with drivers inside TD guest directly not via VMM (KVM+QEMU). Anyway, I will talk to internal folks to see if any plan to support unix socket.) It makes no conceptual sense to be using vsock for two processes on the host to be using vsock to talk to each other. vsock is only needed for the guest to talk to the host. -machine confidential-guest-support=tdx0 Note, above example uses vsock type socket because the QGS we used implements the vsock socket. It can be other types, like UNIX socket, which depends on the implementation of QGS. To avoid no response from QGS server, setup a timer for the transaction. If timeout, make it an error and interrupt guest. Define the threshold of time to 30s at present, maybe change to other value if not appropriate. Signed-off-by: Isaku Yamahata Codeveloped-by: Chenyi Qiang Signed-off-by: Chenyi Qiang Codeveloped-by: Xiaoyao Li Signed-off-by: Xiaoyao Li --- Changes in v5: - add more decription of quote-generation-socket property; Changes in v4: - merge next patch "i386/tdx: setup a timer for the qio channel"; Changes in v3: - rename property "quote-generation-service" to "quote-generation-socket"; - change the type of "quote-generation-socket" from str to SocketAddress; - squash next patch into this one; --- qapi/qom.json | 8 +- target/i386/kvm/meson.build | 2 +- target/i386/kvm/tdx-quote-generator.c | 170 target/i386/kvm/tdx-quote-generator.h | 95 +++ target/i386/kvm/tdx.c | 216 ++ target/i386/kvm/tdx.h | 6 + 6 files changed, 495 insertions(+), 2 deletions(-) create mode 100644 target/i386/kvm/tdx-quote-generator.c create mode 100644 target/i386/kvm/tdx-quote-generator.h With regards, Daniel
Re: [PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility
On 3/11/2024 3:29 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 3/7/2024 9:51 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 2/29/2024 4:51 PM, Markus Armbruster wrote: Xiaoyao Li writes: Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility Originated-from: Isaku Yamahata Signed-off-by: Xiaoyao Li --- Changes in v5: - mention additional error information in gpa when it presents; - refine the documentation; (Markus) Changes in v4: - refine the documentation; (Markus) Changes in v3: - Add docmentation of new type and struct; (Daniel) - refine the error message handling; (Daniel) --- qapi/run-state.json | 31 +-- system/runstate.c | 58 +++ target/i386/kvm/tdx.c | 24 +- 3 files changed, 110 insertions(+), 3 deletions(-) diff --git a/qapi/run-state.json b/qapi/run-state.json index dd0770b379e5..b71dd1884eb6 100644 --- a/qapi/run-state.json +++ b/qapi/run-state.json [...] @@ -564,6 +567,30 @@ 'psw-addr': 'uint64', 'reason': 'S390CrashReason'}} +## +# @GuestPanicInformationTdx: +# +# TDX Guest panic information specific to TDX, as specified in the +# "Guest-Hypervisor Communication Interface (GHCI) Specification", +# section TDG.VP.VMCALL. +# +# @error-code: TD-specific error code +# +# @message: Human-readable error message provided by the guest. Not +# to be trusted. +# +# @gpa: guest-physical address of a page that contains more verbose +# error information, as zero-terminated string. Present when the +# "GPA valid" bit (bit 63) is set in @error-code. Uh, peeking at GHCI Spec section 3.4 TDG.VP.VMCALL, I see operand R12 consists of bitsnamedescription 31:0TD-specific error code TD-specific error code Panic – 0x0. Values – 0x1 to 0x reserved. 62:32 TD-specific extendedTD-specific extended error code. error code TD software defined. 63 GPA Valid Set if the TD specified additional information in the GPA parameter (R13). Is @error-code all of R12, or just bits 31:0? If it's all of R12, description of @error-code as "TD-specific error code" is misleading. We pass all of R12 to @error_code. Here it wants to use "error_code" as generic as the whole R12. Do you have any better description of it ? Sadly, the spec is of no help: it doesn't name the entire thing, only the three sub-fields TD-specific error code, TD-specific extended error code, GPA valid. We could take the hint, and provide the sub-fields instead: * @error-code contains the TD-specific error code (bits 31:0) * @extended-error-code contains the TD-specific extended error code (bits 62:32) * we don't need @gpa-valid, because it's the same as "@gpa is present" If we decide to keep the single member, we do need another name for it. @error-codes (plural) doesn't exactly feel wonderful, but it gives at least a subtle hint that it's not just *the* error code. The reason we only defined one single member, is that the extended-error-code is not used now, and I believe it won't be used in the near future. Aha! Then I recommend * @error-code contains the TD-specific error code (bits 31:0) * Omit bits 62:32 from the reply; if we later find an actual use for them, we can add a suitable member * Omit bit 63, because it's the same as "@gpa is present" If no objection from others, I will use @error-codes (plural) in the next version. I recommend to keep the @error-code name, but narrow its value to the actual error code, i.e. bits 31:0. It works for me. I will got this direction in the next version. If it's just bits 31:0, then 'Present when the "GPA valid" bit (bit 63) is set in @error-code' is wrong. Could go with 'Only present when the guest provides this information'. [...]
Re: [PATCH v9 06/21] i386/cpu: Use APIC ID info to encode cache topo in CPUID[4]
On 3/10/2024 9:38 PM, Zhao Liu wrote: Hi Xiaoyao, case 3: /* L3 cache info */ -die_offset = apicid_die_offset(_info); if (cpu->enable_l3_cache) { +addressable_threads_width = apicid_die_offset(_info); Please get rid of the local variable @addressable_threads_width. It is truly confusing. There're several reasons for this: 1. This commit is trying to use APIC ID topology layout to decode 2 cache topology fields in CPUID[4], CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26]. When there's a addressable_cores_width to map to CPUID.04H:EAX[bits 31:26], it's more clear to also map CPUID.04H:EAX[bits 25:14] to another variable. I don't dislike using a variable. I dislike the name of that variable since it's misleading 2. All these 2 variables are temporary in this commit, and they will be replaed by 2 helpers in follow-up cleanup of this series. you mean patch 20? I don't see how removing the local variable @addressable_threads_width conflicts with patch 20. As a con, it introduces code churn. 3. Similarly, to make it easier to clean up later with the helper and for more people to review, it's neater to explicitly indicate the CPUID.04H:EAX[bits 25:14] with a variable here. If you do want keeping the variable. Please add a comment above it to explain the meaning. 4. I call this field as addressable_threads_width since it's "Maximum number of addressable IDs for logical processors sharing this cache". Its own name is long, but given the length, only individual words could be selected as names. TBH, "addressable" deserves more emphasis than "sharing". The former emphasizes the fact that the number of threads chosen here is based on the APIC ID layout and does not necessarily correspond to actual threads. Therefore, this variable is needed here. Thanks, Zhao
Re: [PATCH v9 11/21] i386/cpu: Decouple CPUID[0x1F] subleaf with specific topology level
On 2/27/2024 6:32 PM, Zhao Liu wrote: From: Zhao Liu At present, the subleaf 0x02 of CPUID[0x1F] is bound to the "die" level. In fact, the specific topology level exposed in 0x1F depends on the platform's support for extension levels (module, tile and die). To help expose "module" level in 0x1F, decouple CPUID[0x1F] subleaf with specific topology level. Tested-by: Yongwei Ma Signed-off-by: Zhao Liu Reviewed-by: Xiaoyao Li Besides, some nits below. --- Changes since v7: * Refactored the encode_topo_cpuid1f() to use traversal to search the encoded level and avoid using static variables. (Xiaoyao) - Since the total number of levels in the bitmap is not too large, the overhead of traversing is supposed to be acceptable. * Renamed the variable num_cpus_next_level to num_threads_next_level. (Xiaoyao) * Renamed the helper num_cpus_by_topo_level() to num_threads_by_topo_level(). (Xiaoyao) * Dropped Michael/Babu's Acked/Tested tags since the code change. * Re-added Yongwei's Tested tag For his re-testing. Changes since v3: * New patch to prepare to expose module level in 0x1F. * Moved the CPUTopoLevel enumeration definition from "i386: Add cache topology info in CPUCacheInfo" to this patch. Note, to align with topology types in SDM, revert the name of CPU_TOPO_LEVEL_UNKNOW to CPU_TOPO_LEVEL_INVALID. --- target/i386/cpu.c | 138 +- 1 file changed, 113 insertions(+), 25 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 88dffd2b52e3..b0f171c6a465 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -269,6 +269,118 @@ static void encode_cache_cpuid4(CPUCacheInfo *cache, (cache->complex_indexing ? CACHE_COMPLEX_IDX : 0); } +static uint32_t num_threads_by_topo_level(X86CPUTopoInfo *topo_info, + enum CPUTopoLevel topo_level) +{ +switch (topo_level) { +case CPU_TOPO_LEVEL_SMT: +return 1; +case CPU_TOPO_LEVEL_CORE: +return topo_info->threads_per_core; +case CPU_TOPO_LEVEL_DIE: +return topo_info->threads_per_core * topo_info->cores_per_die; +case CPU_TOPO_LEVEL_PACKAGE: +return topo_info->threads_per_core * topo_info->cores_per_die * + topo_info->dies_per_pkg; +default: +g_assert_not_reached(); +} +return 0; +} + +static uint32_t apicid_offset_by_topo_level(X86CPUTopoInfo *topo_info, +enum CPUTopoLevel topo_level) +{ +switch (topo_level) { +case CPU_TOPO_LEVEL_SMT: +return 0; +case CPU_TOPO_LEVEL_CORE: +return apicid_core_offset(topo_info); +case CPU_TOPO_LEVEL_DIE: +return apicid_die_offset(topo_info); +case CPU_TOPO_LEVEL_PACKAGE: +return apicid_pkg_offset(topo_info); +default: +g_assert_not_reached(); +} +return 0; +} + +static uint32_t cpuid1f_topo_type(enum CPUTopoLevel topo_level) +{ +switch (topo_level) { +case CPU_TOPO_LEVEL_INVALID: +return CPUID_1F_ECX_TOPO_LEVEL_INVALID; +case CPU_TOPO_LEVEL_SMT: +return CPUID_1F_ECX_TOPO_LEVEL_SMT; +case CPU_TOPO_LEVEL_CORE: +return CPUID_1F_ECX_TOPO_LEVEL_CORE; +case CPU_TOPO_LEVEL_DIE: +return CPUID_1F_ECX_TOPO_LEVEL_DIE; +default: +/* Other types are not supported in QEMU. */ +g_assert_not_reached(); +} +return 0; +} + +static void encode_topo_cpuid1f(CPUX86State *env, uint32_t count, +X86CPUTopoInfo *topo_info, +uint32_t *eax, uint32_t *ebx, +uint32_t *ecx, uint32_t *edx) +{ +X86CPU *cpu = env_archcpu(env); +unsigned long level; +uint32_t num_threads_next_level, offset_next_level; + +assert(count + 1 < CPU_TOPO_LEVEL_MAX); + +/* + * Find the No.count topology levels in avail_cpu_topo bitmap. + * Start from bit 0 (CPU_TOPO_LEVEL_INVALID). AFAICS, it starts from bit 1 (CPU_TOPO_LEVEL_SMT). Because the initial value of level is CPU_TOPO_LEVEL_INVALID, but the first round of the loop is to find the valid bit starting from (level + 1). + */ +level = CPU_TOPO_LEVEL_INVALID; +for (int i = 0; i <= count; i++) { +level = find_next_bit(env->avail_cpu_topo, + CPU_TOPO_LEVEL_PACKAGE, + level + 1); + +/* + * CPUID[0x1f] doesn't explicitly encode the package level, + * and it just encode the invalid level (all fields are 0) + * into the last subleaf of 0x1f. + */ QEMU will never set bit CPU_TOPO_LEVEL_PACKAGE in env->avail_cpu_topo. So I think we should assert() it instead of fixing it silently. +if (level == CPU_TOPO_LEVEL_PACKAGE) { +level = CPU_TOPO_LEVEL_INVALID; +brea
Re: [PATCH v9 09/21] i386/cpu: Introduce bitmap to cache available CPU topology levels
On 2/27/2024 6:32 PM, Zhao Liu wrote: From: Zhao Liu Currently, QEMU checks the specify number of topology domains to detect if there's extended topology levels (e.g., checking nr_dies). With this bitmap, the extended CPU topology (the levels other than SMT, core and package) could be easier to detect without touching the topology details. This is also in preparation for the follow-up to decouple CPUID[0x1F] subleaf with specific topology level. Tested-by: Yongwei Ma Signed-off-by: Zhao Liu Reviewed-by: Xiaoyao Li
Re: [PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility
On 3/7/2024 9:51 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 2/29/2024 4:51 PM, Markus Armbruster wrote: Xiaoyao Li writes: Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility Originated-from: Isaku Yamahata Signed-off-by: Xiaoyao Li --- Changes in v5: - mention additional error information in gpa when it presents; - refine the documentation; (Markus) Changes in v4: - refine the documentation; (Markus) Changes in v3: - Add docmentation of new type and struct; (Daniel) - refine the error message handling; (Daniel) --- qapi/run-state.json | 31 +-- system/runstate.c | 58 +++ target/i386/kvm/tdx.c | 24 +- 3 files changed, 110 insertions(+), 3 deletions(-) diff --git a/qapi/run-state.json b/qapi/run-state.json index dd0770b379e5..b71dd1884eb6 100644 --- a/qapi/run-state.json +++ b/qapi/run-state.json @@ -483,10 +483,12 @@ # # @s390: s390 guest panic information type (Since: 2.12) # +# @tdx: tdx guest panic information type (Since: 9.0) +# # Since: 2.9 ## { 'enum': 'GuestPanicInformationType', - 'data': [ 'hyper-v', 's390' ] } + 'data': [ 'hyper-v', 's390', 'tdx' ] } ## # @GuestPanicInformation: @@ -501,7 +503,8 @@ 'base': {'type': 'GuestPanicInformationType'}, 'discriminator': 'type', 'data': {'hyper-v': 'GuestPanicInformationHyperV', - 's390': 'GuestPanicInformationS390'}} + 's390': 'GuestPanicInformationS390', + 'tdx' : 'GuestPanicInformationTdx'}} ## # @GuestPanicInformationHyperV: @@ -564,6 +567,30 @@ 'psw-addr': 'uint64', 'reason': 'S390CrashReason'}} +## +# @GuestPanicInformationTdx: +# +# TDX Guest panic information specific to TDX, as specified in the +# "Guest-Hypervisor Communication Interface (GHCI) Specification", +# section TDG.VP.VMCALL. +# +# @error-code: TD-specific error code +# +# @message: Human-readable error message provided by the guest. Not +# to be trusted. +# +# @gpa: guest-physical address of a page that contains more verbose +# error information, as zero-terminated string. Present when the +# "GPA valid" bit (bit 63) is set in @error-code. Uh, peeking at GHCI Spec section 3.4 TDG.VP.VMCALL, I see operand R12 consists of bitsnamedescription 31:0TD-specific error code TD-specific error code Panic – 0x0. Values – 0x1 to 0x reserved. 62:32 TD-specific extendedTD-specific extended error code. error code TD software defined. 63 GPA Valid Set if the TD specified additional information in the GPA parameter (R13). Is @error-code all of R12, or just bits 31:0? If it's all of R12, description of @error-code as "TD-specific error code" is misleading. We pass all of R12 to @error_code. Here it wants to use "error_code" as generic as the whole R12. Do you have any better description of it ? Sadly, the spec is of no help: it doesn't name the entire thing, only the three sub-fields TD-specific error code, TD-specific extended error code, GPA valid. We could take the hint, and provide the sub-fields instead: * @error-code contains the TD-specific error code (bits 31:0) * @extended-error-code contains the TD-specific extended error code (bits 62:32) * we don't need @gpa-valid, because it's the same as "@gpa is present" If we decide to keep the single member, we do need another name for it. @error-codes (plural) doesn't exactly feel wonderful, but it gives at least a subtle hint that it's not just *the* error code. The reason we only defined one single member, is that the extended-error-code is not used now, and I believe it won't be used in the near future. If no objection from others, I will use @error-codes (plural) in the next version. If it's just bits 31:0, then 'Present when the "GPA valid" bit (bit 63) is set in @error-code' is wrong. Could go with 'Only present when the guest provides this information'. +# +# Drop one of these two lines, please. +# Since: 9.0 +## +{'struct': 'GuestPanicInformationTdx', + 'data': {'error-code': 'uint64', + 'message': 'str', + '*gpa': 'uint64'}} + ## # @MEMORY_FAILURE: #
Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig
On 3/7/2024 9:56 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 3/7/2024 4:39 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 2/29/2024 9:25 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 2/29/2024 4:37 PM, Markus Armbruster wrote: Xiaoyao Li writes: From: Isaku Yamahata Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD can be provided for TDX attestation. Detailed meaning of them can be found: https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/ Allow user to specify those values via property mrconfigid, mrowner and mrownerconfig. They are all in base64 format. example -object tdx-guest, \ mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v Signed-off-by: Isaku Yamahata Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li [...] diff --git a/qapi/qom.json b/qapi/qom.json index 89ed89b9b46e..cac875349a3a 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -905,10 +905,25 @@ # pages. Some guest OS (e.g., Linux TD guest) may require this to # be set, otherwise they refuse to boot. # +# @mrconfigid: ID for non-owner-defined configuration of the guest TD, +# e.g., run-time or OS configuration (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). Suggest to drop the parenthesis in the last sentence. @mrconfigid is a string, so the default value can't be 0. Actually, it's not just any string, but a base64 encoded SHA384 digest, which means it must be exactly 96 hex digits. So it can't be "0", either. It could be "". I thought value 0 of SHA384 just means it. That's my fault and my poor english. "Fault" is too harsh :) It's not as precise as I want our interface documentation to be. We work together to get there. More on this below. +# +# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). +# +# @mrownerconfig: ID for owner-defined configuration of the guest TD, +# e.g., specific to the workload rather than the run-time or OS +# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is +# used when absent). +# # Since: 9.0 ## { 'struct': 'TdxGuestProperties', - 'data': { '*sept-ve-disable': 'bool' } } + 'data': { '*sept-ve-disable': 'bool', +'*mrconfigid': 'str', +'*mrowner': 'str', +'*mrownerconfig': 'str' } } ## # @ThreadContextProperties: diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index d0ad4f57b5d0..4ce2f1d082ce 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -13,6 +13,7 @@ #include "qemu/osdep.h" #include "qemu/error-report.h" +#include "qemu/base64.h" #include "qapi/error.h" #include "qom/object_interfaces.h" #include "standard-headers/asm-x86/kvm_para.h" @@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) X86CPU *x86cpu = X86_CPU(cpu); CPUX86State *env = >env; g_autofree struct kvm_tdx_init_vm *init_vm = NULL; +size_t data_len; int r = 0; object_property_set_bool(OBJECT(cpu), "pmu", false, _abort); @@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) + sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES); +#define SHA384_DIGEST_SIZE 48 + +if (tdx_guest->mrconfigid) { +g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid, + strlen(tdx_guest->mrconfigid), _len, errp); +if (!data || data_len != SHA384_DIGEST_SIZE) { +error_setg(errp, "TDX: failed to decode mrconfigid"); +return -1; +} +memcpy(init_vm->mrconfigid, data, data_len); +} When @mrconfigid is absent, the property remains null, and this conditional is not executed. init_vm->mrconfigid[], an array of 6 __u64, remains all zero. How does the kernel treat that? A all-zero SHA384 value is still a valid value, isn't it? KVM treats it with no difference. Can you point me to the spot in the kernel where mrconfigid is used? https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2322 KVM just copy what QEMU provides into its own data structure @td_params. The format @of td_params is defined by TDX spec, and @td_params needs to be passed to TDX module when initialize the context of TD via SEAMCALL(TDH.MNG.INIT): https://github
Re: [PATCH v9 08/21] i386/cpu: Consolidate the use of topo_info in cpu_x86_cpuid()
On 2/27/2024 6:32 PM, Zhao Liu wrote: From: Zhao Liu In cpu_x86_cpuid(), there are many variables in representing the cpu topology, e.g., topo_info, cs->nr_cores and cs->nr_threads. Since the names of cs->nr_cores/cs->nr_threads does not accurately Again as in v7, please changes to "cs->nr_cores and cs->nr_threads", "cs->nr_cores/cs->nr_threads" looks like a single variable of division represent its meaning, the use of cs->nr_cores/cs->nr_threads is prone to confusion and mistakes. And the structure X86CPUTopoInfo names its members clearly, thus the variable "topo_info" should be preferred. In addition, in cpu_x86_cpuid(), to uniformly use the topology variable, replace env->dies with topo_info.dies_per_pkg as well.
Re: [PATCH v9 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
On 2/27/2024 6:32 PM, Zhao Liu wrote: From: Zhao Liu The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information for cpuid 0x801D") adds the cache topology for AMD CPU by encoding the number of sharing threads directly. From AMD's APM, NumSharingCache (CPUID[0x801D].EAX[bits 25:14]) means [1]: The number of logical processors sharing this cache is the value of this field incremented by 1. To determine which logical processors are sharing a cache, determine a Share Id for each processor as follows: ShareId = LocalApicId >> log2(NumSharingCache+1) Logical processors with the same ShareId then share a cache. If NumSharingCache+1 is not a power of two, round it up to the next power of two. From the description above, the calculation of this field should be same as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of APIC ID to calculate this field. [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology Information Cc: Babu Moger Tested-by: Yongwei Ma Signed-off-by: Zhao Liu Reviewed-by: Xiaoyao Li --- Changes since v7: * Moved this patch after CPUID[4]'s similar change ("i386/cpu: Use APIC ID offset to encode cache topo in CPUID[4]"). (Xiaoyao) * Dropped Michael/Babu's Acked/Reviewed/Tested tags since the code change due to the rebase. * Re-added Yongwei's Tested tag For his re-testing (compilation on Intel platforms). Changes since v3: * Rewrote the subject. (Babu) * Deleted the original "comment/help" expression, as this behavior is confirmed for AMD CPUs. (Babu) * Renamed "num_apic_ids" (v3) to "num_sharing_cache" to match spec definition. (Babu) Changes since v1: * Renamed "l3_threads" to "num_apic_ids" in encode_cache_cpuid801d(). (Yanan) * Added the description of the original commit and add Cc. --- target/i386/cpu.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index c77bcbc44d59..df56c7a449c8 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -331,7 +331,7 @@ static void encode_cache_cpuid801d(CPUCacheInfo *cache, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) { -uint32_t l3_threads; +uint32_t num_sharing_cache; assert(cache->size == cache->line_size * cache->associativity * cache->partitions * cache->sets); @@ -340,11 +340,11 @@ static void encode_cache_cpuid801d(CPUCacheInfo *cache, /* L3 is shared among multiple cores */ if (cache->level == 3) { -l3_threads = topo_info->cores_per_die * topo_info->threads_per_core; -*eax |= (l3_threads - 1) << 14; +num_sharing_cache = 1 << apicid_die_offset(topo_info); } else { -*eax |= ((topo_info->threads_per_core - 1) << 14); +num_sharing_cache = 1 << apicid_core_offset(topo_info); } +*eax |= (num_sharing_cache - 1) << 14; assert(cache->line_size > 0); assert(cache->partitions > 0);
Re: [PATCH v9 06/21] i386/cpu: Use APIC ID info to encode cache topo in CPUID[4]
picid_pkg_offset(_info); + *eax &= ~0x3FFC000; -*eax |= (pow2ceil(vcpus_per_socket) - 1) << 14; +*eax |= ((1 << addressable_threads_width) - 1) << 14; } } } else if (cpu->vendor_cpuid_only && IS_AMD_CPU(env)) { *eax = *ebx = *ecx = *edx = 0; } else { *eax = 0; +addressable_cores_width = apicid_pkg_offset(_info) - + apicid_core_offset(_info); + switch (count) { case 0: /* L1 dcache info */ +addressable_threads_width = apicid_core_offset(_info); encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache, -cs->nr_threads, cs->nr_cores, +(1 << addressable_threads_width), +(1 << addressable_cores_width), eax, ebx, ecx, edx); break; case 1: /* L1 icache info */ +addressable_threads_width = apicid_core_offset(_info); encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache, -cs->nr_threads, cs->nr_cores, +(1 << addressable_threads_width), +(1 << addressable_cores_width), eax, ebx, ecx, edx); break; case 2: /* L2 cache info */ +addressable_threads_width = apicid_core_offset(_info); encode_cache_cpuid4(env->cache_info_cpuid4.l2_cache, -cs->nr_threads, cs->nr_cores, +(1 << addressable_threads_width), +(1 << addressable_cores_width), eax, ebx, ecx, edx); break; case 3: /* L3 cache info */ -die_offset = apicid_die_offset(_info); if (cpu->enable_l3_cache) { +addressable_threads_width = apicid_die_offset(_info); Please get rid of the local variable @addressable_threads_width. It is truly confusing. In this patch, it is assigned to - apicid_pkg_offset(_info); - apicid_core_offset(_info); - apicid_die_offset(_info); in different cases. I only suggested the name of addressable_core_width in v7, which is apicid_pkg_offset(_info) - apicid_core_offset(_info); And it is straightforward that it means the number of bits in x2APICID to encode different addressable cores. But it is not similar to addressable_threads_width, the semantic changes per different cache level. In fact, you want something like bit_width_of_addressable_threads_sharing_this_level_of_cache. So I suggest stop using the variable of "address_therads_width". Instead just apicid_*_offset() directly. With above fixed, feel free to add my Reviewed-by: Xiaoyao Li encode_cache_cpuid4(env->cache_info_cpuid4.l3_cache, -(1 << die_offset), cs->nr_cores, +(1 << addressable_threads_width), +(1 << addressable_cores_width), eax, ebx, ecx, edx); break; } @@ -6141,6 +6159,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, } } break; +} case 5: /* MONITOR/MWAIT Leaf */ *eax = cpu->mwait.eax; /* Smallest monitor-line size in bytes */
Re: [PATCH v2 2/2] kvm: add support for guest physical bits
On 3/5/2024 6:52 PM, Gerd Hoffmann wrote: Query kvm for supported guest physical address bits, in cpuid function 8008, eax[23:16]. Usually this is identical to host physical address bits. With NPT or EPT being used this might be restricted to 48 (max 4-level paging address space size) even if the host cpu supports more physical address bits. When set pass this to the guest, using cpuid too. Guest firmware can use this to figure how big the usable guest physical address space is, so PCI bar mapping are actually reachable. Signed-off-by: Gerd Hoffmann --- target/i386/cpu.h | 1 + target/i386/cpu.c | 1 + target/i386/kvm/kvm.c | 17 + 3 files changed, 19 insertions(+) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 952174bb6f52..d427218827f6 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -2026,6 +2026,7 @@ struct ArchCPU { /* Number of physical address bits supported */ uint32_t phys_bits; +uint32_t guest_phys_bits; /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 2666ef380891..1a6cfc75951e 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) { /* 64 bit processor */ *eax |= (cpu_x86_virtual_addr_width(env) << 8); + *eax |= (cpu->guest_phys_bits << 16); } *ebx = env->features[FEAT_8000_0008_EBX]; if (cs->nr_cores * cs->nr_threads > 1) { diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 7298822cb511..ce22dfcaa661 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -238,6 +238,15 @@ static int kvm_get_tsc(CPUState *cs) return 0; } +/* return cpuid fn 8000_0008 eax[23:16] aka GuestPhysBits */ +static int kvm_get_guest_phys_bits(KVMState *s) +{ +uint32_t eax; + +eax = kvm_arch_get_supported_cpuid(s, 0x8008, 0, R_EAX); +return (eax >> 16) & 0xff; +} + static inline void do_kvm_synchronize_tsc(CPUState *cpu, run_on_cpu_data arg) { kvm_get_tsc(cpu); @@ -1730,6 +1739,7 @@ int kvm_arch_init_vcpu(CPUState *cs) X86CPU *cpu = X86_CPU(cs); CPUX86State *env = >env; uint32_t limit, i, j, cpuid_i; +uint32_t guest_phys_bits; uint32_t unused; struct kvm_cpuid_entry2 *c; uint32_t signature[3]; @@ -1765,6 +1775,13 @@ int kvm_arch_init_vcpu(CPUState *cs) env->apic_bus_freq = KVM_APIC_BUS_FREQUENCY; +guest_phys_bits = kvm_get_guest_phys_bits(cs->kvm_state); +if (guest_phys_bits && +(cpu->guest_phys_bits == 0 || + cpu->guest_phys_bits > guest_phys_bits)) { +cpu->guest_phys_bits = guest_phys_bits; +} suggest to move this added code block to kvm_cpu_realizefn(). Main code in kvm_arch_init_vcpu() is to consume the data in cpu or env->features to configure KVM specific configuration via KVM ioctls. The preparation work of initializing the required data usually not happen in kvm_arch_init_vcpu(); /* * kvm_hyperv_expand_features() is called here for the second time in case * KVM_CAP_SYS_HYPERV_CPUID is not supported. While we can't possibly handle
Re: [PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility
On 2/29/2024 4:51 PM, Markus Armbruster wrote: Xiaoyao Li writes: Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility Originated-from: Isaku Yamahata Signed-off-by: Xiaoyao Li --- Changes in v5: - mention additional error information in gpa when it presents; - refine the documentation; (Markus) Changes in v4: - refine the documentation; (Markus) Changes in v3: - Add docmentation of new type and struct; (Daniel) - refine the error message handling; (Daniel) --- qapi/run-state.json | 31 +-- system/runstate.c | 58 +++ target/i386/kvm/tdx.c | 24 +- 3 files changed, 110 insertions(+), 3 deletions(-) diff --git a/qapi/run-state.json b/qapi/run-state.json index dd0770b379e5..b71dd1884eb6 100644 --- a/qapi/run-state.json +++ b/qapi/run-state.json @@ -483,10 +483,12 @@ # # @s390: s390 guest panic information type (Since: 2.12) # +# @tdx: tdx guest panic information type (Since: 9.0) +# # Since: 2.9 ## { 'enum': 'GuestPanicInformationType', - 'data': [ 'hyper-v', 's390' ] } + 'data': [ 'hyper-v', 's390', 'tdx' ] } ## # @GuestPanicInformation: @@ -501,7 +503,8 @@ 'base': {'type': 'GuestPanicInformationType'}, 'discriminator': 'type', 'data': {'hyper-v': 'GuestPanicInformationHyperV', - 's390': 'GuestPanicInformationS390'}} + 's390': 'GuestPanicInformationS390', + 'tdx' : 'GuestPanicInformationTdx'}} ## # @GuestPanicInformationHyperV: @@ -564,6 +567,30 @@ 'psw-addr': 'uint64', 'reason': 'S390CrashReason'}} +## +# @GuestPanicInformationTdx: +# +# TDX Guest panic information specific to TDX, as specified in the +# "Guest-Hypervisor Communication Interface (GHCI) Specification", +# section TDG.VP.VMCALL. +# +# @error-code: TD-specific error code +# +# @message: Human-readable error message provided by the guest. Not +# to be trusted. +# +# @gpa: guest-physical address of a page that contains more verbose +# error information, as zero-terminated string. Present when the +# "GPA valid" bit (bit 63) is set in @error-code. Uh, peeking at GHCI Spec section 3.4 TDG.VP.VMCALL, I see operand R12 consists of bitsnamedescription 31:0TD-specific error code TD-specific error code Panic – 0x0. Values – 0x1 to 0x reserved. 62:32 TD-specific extendedTD-specific extended error code. error code TD software defined. 63 GPA Valid Set if the TD specified additional information in the GPA parameter (R13). Is @error-code all of R12, or just bits 31:0? If it's all of R12, description of @error-code as "TD-specific error code" is misleading. We pass all of R12 to @error_code. Here it wants to use "error_code" as generic as the whole R12. Do you have any better description of it ? If it's just bits 31:0, then 'Present when the "GPA valid" bit (bit 63) is set in @error-code' is wrong. Could go with 'Only present when the guest provides this information'. +# +# Drop one of these two lines, please. +# Since: 9.0 +## +{'struct': 'GuestPanicInformationTdx', + 'data': {'error-code': 'uint64', + 'message': 'str', + '*gpa': 'uint64'}} + ## # @MEMORY_FAILURE: #
Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL
On 2/29/2024 9:28 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 2/29/2024 4:40 PM, Markus Armbruster wrote: Xiaoyao Li writes: From: Isaku Yamahata Add property "quote-generation-socket" to tdx-guest, which is a property of type SocketAddress to specify Quote Generation Service(QGS). On request of GetQuote, it connects to the QGS socket, read request data from shared guest memory, send the request data to the QGS, and store the response into shared guest memory, at last notify TD guest by interrupt. command line example: qemu-system-x86_64 \ -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"1","port":"1234"}}' \ -machine confidential-guest-support=tdx0 Note, above example uses vsock type socket because the QGS we used implements the vsock socket. It can be other types, like UNIX socket, which depends on the implementation of QGS. To avoid no response from QGS server, setup a timer for the transaction. If timeout, make it an error and interrupt guest. Define the threshold of time to 30s at present, maybe change to other value if not appropriate. Signed-off-by: Isaku Yamahata Codeveloped-by: Chenyi Qiang Signed-off-by: Chenyi Qiang Codeveloped-by: Xiaoyao Li Signed-off-by: Xiaoyao Li [...] diff --git a/qapi/qom.json b/qapi/qom.json index cac875349a3a..7b26b0a0d3aa 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -917,13 +917,19 @@ # (base64 encoded SHA384 digest). (A default value 0 of SHA384 is # used when absent). # +# @quote-generation-socket: socket address for Quote Generation +# Service (QGS). QGS is a daemon running on the host. User in +# TD guest cannot get TD quoting for attestation if QGS is not +# provided. So admin should always provide it. This makes me wonder why it's optional. Can you describe a use case for *not* specifying @quote-generation-socket? Maybe at last when all the TDX support lands on all the components, attestation will become a must for a TD guest to be usable. However, at least for today, booting and running a TD guest don't require attestation. So not provide it, doesn't affect anything excepting cannot get a Quote. Maybe # @quote-generation-socket: Socket address for Quote Generation # Service (QGS). QGS is a daemon running on the host. Without # it, the guest will not be able to get a TD quote for # attestation. Thanks! will update to it. [...]
Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig
On 3/7/2024 4:39 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 2/29/2024 9:25 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 2/29/2024 4:37 PM, Markus Armbruster wrote: Xiaoyao Li writes: From: Isaku Yamahata Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD can be provided for TDX attestation. Detailed meaning of them can be found: https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/ Allow user to specify those values via property mrconfigid, mrowner and mrownerconfig. They are all in base64 format. example -object tdx-guest, \ mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v Signed-off-by: Isaku Yamahata Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li [...] diff --git a/qapi/qom.json b/qapi/qom.json index 89ed89b9b46e..cac875349a3a 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -905,10 +905,25 @@ # pages. Some guest OS (e.g., Linux TD guest) may require this to # be set, otherwise they refuse to boot. # +# @mrconfigid: ID for non-owner-defined configuration of the guest TD, +# e.g., run-time or OS configuration (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). Suggest to drop the parenthesis in the last sentence. @mrconfigid is a string, so the default value can't be 0. Actually, it's not just any string, but a base64 encoded SHA384 digest, which means it must be exactly 96 hex digits. So it can't be "0", either. It could be "". I thought value 0 of SHA384 just means it. That's my fault and my poor english. "Fault" is too harsh :) It's not as precise as I want our interface documentation to be. We work together to get there. More on this below. +# +# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). +# +# @mrownerconfig: ID for owner-defined configuration of the guest TD, +# e.g., specific to the workload rather than the run-time or OS +# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is +# used when absent). +# # Since: 9.0 ## { 'struct': 'TdxGuestProperties', - 'data': { '*sept-ve-disable': 'bool' } } + 'data': { '*sept-ve-disable': 'bool', +'*mrconfigid': 'str', +'*mrowner': 'str', +'*mrownerconfig': 'str' } } ## # @ThreadContextProperties: diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index d0ad4f57b5d0..4ce2f1d082ce 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -13,6 +13,7 @@ #include "qemu/osdep.h" #include "qemu/error-report.h" +#include "qemu/base64.h" #include "qapi/error.h" #include "qom/object_interfaces.h" #include "standard-headers/asm-x86/kvm_para.h" @@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) X86CPU *x86cpu = X86_CPU(cpu); CPUX86State *env = >env; g_autofree struct kvm_tdx_init_vm *init_vm = NULL; +size_t data_len; int r = 0; object_property_set_bool(OBJECT(cpu), "pmu", false, _abort); @@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) + sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES); +#define SHA384_DIGEST_SIZE 48 + +if (tdx_guest->mrconfigid) { +g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid, + strlen(tdx_guest->mrconfigid), _len, errp); +if (!data || data_len != SHA384_DIGEST_SIZE) { +error_setg(errp, "TDX: failed to decode mrconfigid"); +return -1; +} +memcpy(init_vm->mrconfigid, data, data_len); +} When @mrconfigid is absent, the property remains null, and this conditional is not executed. init_vm->mrconfigid[], an array of 6 __u64, remains all zero. How does the kernel treat that? A all-zero SHA384 value is still a valid value, isn't it? KVM treats it with no difference. Can you point me to the spot in the kernel where mrconfigid is used? https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2322 KVM just copy what QEMU provides into its own data structure @td_params. The format @of td_params is defined by TDX spec, and @td_params needs to be passed to TDX module when initialize the context of TD via SEAMCALL(TDH.MNG.INIT): https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2450 In fact, a
Re: [PATCH v5 08/65] kvm: handle KVM_EXIT_MEMORY_FAULT
On 3/5/2024 5:10 PM, Isaku Yamahata wrote: On Thu, Feb 29, 2024 at 01:36:29AM -0500, Xiaoyao Li wrote: From: Chao Peng When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to do the memory conversion on the RAMBlock to turn the memory into desired attribute, i.e., private/shared. Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when KVM_EXIT_MEMORY_FAULT happens. Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has guest_memfd memory backend. Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is added. When page is converted from shared to private, the original shared memory can be discarded via ram_block_discard_range(). Note, shared memory can be discarded only when it's not back'ed by hugetlb because hugetlb is supposed to be pre-allocated and no need for discarding. Signed-off-by: Chao Peng Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li --- Changes in v4: - open-coded ram_block_discard logic; - change warn_report() to error_report(); (Daniel) --- accel/kvm/kvm-all.c | 94 - 1 file changed, 84 insertions(+), 10 deletions(-) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 70d482a2c936..87e4275932a7 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2903,6 +2903,68 @@ static void kvm_eat_signals(CPUState *cpu) } while (sigismember(, SIG_IPI)); } +static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) +{ +MemoryRegionSection section; +ram_addr_t offset; +MemoryRegion *mr; +RAMBlock *rb; +void *addr; +int ret = -1; + +if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) || +!QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) { +return -1; +} + +if (!size) { +return -1; +} + +section = memory_region_find(get_system_memory(), start, size); +mr = section.mr; +if (!mr) { +return -1; +} + +if (memory_region_has_guest_memfd(mr)) { +if (to_private) { +ret = kvm_set_memory_attributes_private(start, size); +} else { +ret = kvm_set_memory_attributes_shared(start, size); +} + +if (ret) { +memory_region_unref(section.mr); +return ret; +} + +addr = memory_region_get_ram_ptr(mr) + section.offset_within_region; +rb = qemu_ram_block_from_host(addr, false, ); + +if (to_private) { +if (rb->page_size != qemu_host_page_size) { +/* +* shared memory is back'ed by hugetlb, which is supposed to be +* pre-allocated and doesn't need to be discarded +*/ +return 0; The reference count leaks. Add memory_region_unref() is needed. thanks for catching it. Will fix it in next version. Otherwise looks good to me. Reviewed-by: Isaku Yamahata
Re: [PATCH 1/1] kvm: add support for guest physical bits
On 3/4/2024 10:58 PM, Gerd Hoffmann wrote: On Mon, Mar 04, 2024 at 09:54:40AM +0800, Xiaoyao Li wrote: On 3/1/2024 6:17 PM, Gerd Hoffmann wrote: query kvm for supported guest physical address bits using KVM_CAP_VM_GPA_BITS. Expose the value to the guest via cpuid (leaf 0x8008, eax, bits 16-23). Signed-off-by: Gerd Hoffmann --- target/i386/cpu.h | 1 + target/i386/cpu.c | 1 + target/i386/kvm/kvm.c | 8 3 files changed, 10 insertions(+) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 952174bb6f52..d427218827f6 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -2026,6 +2026,7 @@ struct ArchCPU { /* Number of physical address bits supported */ uint32_t phys_bits; +uint32_t guest_phys_bits; /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 2666ef380891..1a6cfc75951e 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) { /* 64 bit processor */ *eax |= (cpu_x86_virtual_addr_width(env) << 8); + *eax |= (cpu->guest_phys_bits << 16); I think you misunderstand this field. If you expose this field to guest, it's the information for nested guest. i.e., the guest itself runs as a hypervisor will know its nested guest can have guest_phys_bits for physical addr. I think those limits (l1 + l2 guest phys-bits) are identical, no? Sorry, I didn't know this patch was based on the off-list proposal made by Paolo that changing the definition of CPUID.0x8008:EAX[23:16] to advertise the "maximum usable physical address bits". If you call out this in the change log, it can avoid the misunderstanding. As I replied to KVM series, I think the info is better to setup by KVM and reported by GET_SUPPORTED_CPUID. The problem this tries to solve is that making the guest phys-bits smaller than the host phys-bits is problematic (which why we have allow_smaller_maxphyaddr), but nevertheless there are cases where the usable guest physical address space is smaller than the host physical address space. One case is intel processors with phys-bits larger than 48 and 4-level EPT. Another case is amd processors with phys-bits larger than 48 and the l0 hypervisor using 4-level paging. The guest needs to know that limit, specifically the guest firmware so it knows where it can map PCI bars. take care, Gerd
Re: [PATCH 1/1] kvm: add support for guest physical bits
On 3/1/2024 6:17 PM, Gerd Hoffmann wrote: query kvm for supported guest physical address bits using KVM_CAP_VM_GPA_BITS. Expose the value to the guest via cpuid (leaf 0x8008, eax, bits 16-23). Signed-off-by: Gerd Hoffmann --- target/i386/cpu.h | 1 + target/i386/cpu.c | 1 + target/i386/kvm/kvm.c | 8 3 files changed, 10 insertions(+) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 952174bb6f52..d427218827f6 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -2026,6 +2026,7 @@ struct ArchCPU { /* Number of physical address bits supported */ uint32_t phys_bits; +uint32_t guest_phys_bits; /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 2666ef380891..1a6cfc75951e 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -6570,6 +6570,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) { /* 64 bit processor */ *eax |= (cpu_x86_virtual_addr_width(env) << 8); + *eax |= (cpu->guest_phys_bits << 16); I think you misunderstand this field. If you expose this field to guest, it's the information for nested guest. i.e., the guest itself runs as a hypervisor will know its nested guest can have guest_phys_bits for physical addr. } *ebx = env->features[FEAT_8000_0008_EBX]; if (cs->nr_cores * cs->nr_threads > 1) { diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 42970ab046fa..e06c9d66bb01 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -1716,6 +1716,7 @@ int kvm_arch_init_vcpu(CPUState *cs) X86CPU *cpu = X86_CPU(cs); CPUX86State *env = >env; uint32_t limit, i, j, cpuid_i; +uint32_t guest_phys_bits; uint32_t unused; struct kvm_cpuid_entry2 *c; uint32_t signature[3]; @@ -1751,6 +1752,13 @@ int kvm_arch_init_vcpu(CPUState *cs) env->apic_bus_freq = KVM_APIC_BUS_FREQUENCY; +guest_phys_bits = kvm_check_extension(cs->kvm_state, KVM_CAP_VM_GPA_BITS); +if (guest_phys_bits && +(cpu->guest_phys_bits == 0 || + cpu->guest_phys_bits > guest_phys_bits)) { +cpu->guest_phys_bits = guest_phys_bits; +} + /* * kvm_hyperv_expand_features() is called here for the second time in case * KVM_CAP_SYS_HYPERV_CPUID is not supported. While we can't possibly handle
Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig
On 2/29/2024 9:25 PM, Markus Armbruster wrote: Xiaoyao Li writes: On 2/29/2024 4:37 PM, Markus Armbruster wrote: Xiaoyao Li writes: From: Isaku Yamahata Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD can be provided for TDX attestation. Detailed meaning of them can be found: https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/ Allow user to specify those values via property mrconfigid, mrowner and mrownerconfig. They are all in base64 format. example -object tdx-guest, \ mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v Signed-off-by: Isaku Yamahata Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li [...] diff --git a/qapi/qom.json b/qapi/qom.json index 89ed89b9b46e..cac875349a3a 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -905,10 +905,25 @@ # pages. Some guest OS (e.g., Linux TD guest) may require this to # be set, otherwise they refuse to boot. # +# @mrconfigid: ID for non-owner-defined configuration of the guest TD, +# e.g., run-time or OS configuration (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). Suggest to drop the parenthesis in the last sentence. @mrconfigid is a string, so the default value can't be 0. Actually, it's not just any string, but a base64 encoded SHA384 digest, which means it must be exactly 96 hex digits. So it can't be "0", either. It could be "". I thought value 0 of SHA384 just means it. That's my fault and my poor english. "Fault" is too harsh :) It's not as precise as I want our interface documentation to be. We work together to get there. More on this below. +# +# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). +# +# @mrownerconfig: ID for owner-defined configuration of the guest TD, +# e.g., specific to the workload rather than the run-time or OS +# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is +# used when absent). +# # Since: 9.0 ## { 'struct': 'TdxGuestProperties', - 'data': { '*sept-ve-disable': 'bool' } } + 'data': { '*sept-ve-disable': 'bool', +'*mrconfigid': 'str', +'*mrowner': 'str', +'*mrownerconfig': 'str' } } ## # @ThreadContextProperties: diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index d0ad4f57b5d0..4ce2f1d082ce 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -13,6 +13,7 @@ #include "qemu/osdep.h" #include "qemu/error-report.h" +#include "qemu/base64.h" #include "qapi/error.h" #include "qom/object_interfaces.h" #include "standard-headers/asm-x86/kvm_para.h" @@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) X86CPU *x86cpu = X86_CPU(cpu); CPUX86State *env = >env; g_autofree struct kvm_tdx_init_vm *init_vm = NULL; +size_t data_len; int r = 0; object_property_set_bool(OBJECT(cpu), "pmu", false, _abort); @@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) + sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES); +#define SHA384_DIGEST_SIZE 48 + +if (tdx_guest->mrconfigid) { +g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid, + strlen(tdx_guest->mrconfigid), _len, errp); +if (!data || data_len != SHA384_DIGEST_SIZE) { +error_setg(errp, "TDX: failed to decode mrconfigid"); +return -1; +} +memcpy(init_vm->mrconfigid, data, data_len); +} When @mrconfigid is absent, the property remains null, and this conditional is not executed. init_vm->mrconfigid[], an array of 6 __u64, remains all zero. How does the kernel treat that? A all-zero SHA384 value is still a valid value, isn't it? KVM treats it with no difference. Can you point me to the spot in the kernel where mrconfigid is used? https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2322 KVM just copy what QEMU provides into its own data structure @td_params. The format @of td_params is defined by TDX spec, and @td_params needs to be passed to TDX module when initialize the context of TD via SEAMCALL(TDH.MNG.INIT): https://github.com/intel/tdx/blob/66a10e258636fa8ec9f5ce687607bf2196a92341/arch/x86/kvm/vmx/tdx.c#L2450 In fact, all the three SHA384 fields, will be hashed together with some other fields (in td_params and other content of TD) to compromise the initial measurement of TD. TDX module doesn't care the value of td_params->mrconfigid.
Re: [PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL
On 2/29/2024 4:40 PM, Markus Armbruster wrote: Xiaoyao Li writes: From: Isaku Yamahata Add property "quote-generation-socket" to tdx-guest, which is a property of type SocketAddress to specify Quote Generation Service(QGS). On request of GetQuote, it connects to the QGS socket, read request data from shared guest memory, send the request data to the QGS, and store the response into shared guest memory, at last notify TD guest by interrupt. command line example: qemu-system-x86_64 \ -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"1","port":"1234"}}' \ -machine confidential-guest-support=tdx0 Note, above example uses vsock type socket because the QGS we used implements the vsock socket. It can be other types, like UNIX socket, which depends on the implementation of QGS. To avoid no response from QGS server, setup a timer for the transaction. If timeout, make it an error and interrupt guest. Define the threshold of time to 30s at present, maybe change to other value if not appropriate. Signed-off-by: Isaku Yamahata Codeveloped-by: Chenyi Qiang Signed-off-by: Chenyi Qiang Codeveloped-by: Xiaoyao Li Signed-off-by: Xiaoyao Li [...] diff --git a/qapi/qom.json b/qapi/qom.json index cac875349a3a..7b26b0a0d3aa 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -917,13 +917,19 @@ # (base64 encoded SHA384 digest). (A default value 0 of SHA384 is # used when absent). # +# @quote-generation-socket: socket address for Quote Generation +# Service (QGS). QGS is a daemon running on the host. User in +# TD guest cannot get TD quoting for attestation if QGS is not +# provided. So admin should always provide it. This makes me wonder why it's optional. Can you describe a use case for *not* specifying @quote-generation-socket? Maybe at last when all the TDX support lands on all the components, attestation will become a must for a TD guest to be usable. However, at least for today, booting and running a TD guest don't require attestation. So not provide it, doesn't affect anything excepting cannot get a Quote. +# # Since: 9.0 ## { 'struct': 'TdxGuestProperties', 'data': { '*sept-ve-disable': 'bool', '*mrconfigid': 'str', '*mrowner': 'str', -'*mrownerconfig': 'str' } } +'*mrownerconfig': 'str', +'*quote-generation-socket': 'SocketAddress' } } ## # @ThreadContextProperties: [...]
Re: [PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig
On 2/29/2024 4:37 PM, Markus Armbruster wrote: Xiaoyao Li writes: From: Isaku Yamahata Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD can be provided for TDX attestation. Detailed meaning of them can be found: https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/ Allow user to specify those values via property mrconfigid, mrowner and mrownerconfig. They are all in base64 format. example -object tdx-guest, \ mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v Signed-off-by: Isaku Yamahata Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li --- Changes in v5: - refine the description of QAPI properties and add description of default value when not specified; Changes in v4: - describe more of there fields in qom.json - free the old value before set new value to avoid memory leak in _setter(); (Daniel) Changes in v3: - use base64 encoding instread of hex-string; --- qapi/qom.json | 17 - target/i386/kvm/tdx.c | 87 +++ target/i386/kvm/tdx.h | 3 ++ 3 files changed, 106 insertions(+), 1 deletion(-) diff --git a/qapi/qom.json b/qapi/qom.json index 89ed89b9b46e..cac875349a3a 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -905,10 +905,25 @@ # pages. Some guest OS (e.g., Linux TD guest) may require this to # be set, otherwise they refuse to boot. # +# @mrconfigid: ID for non-owner-defined configuration of the guest TD, +# e.g., run-time or OS configuration (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). Suggest to drop the parenthesis in the last sentence. @mrconfigid is a string, so the default value can't be 0. Actually, it's not just any string, but a base64 encoded SHA384 digest, which means it must be exactly 96 hex digits. So it can't be "0", either. It could be "". I thought value 0 of SHA384 just means it. That's my fault and my poor english. More on this below. +# +# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). +# +# @mrownerconfig: ID for owner-defined configuration of the guest TD, +# e.g., specific to the workload rather than the run-time or OS +# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is +# used when absent). +# # Since: 9.0 ## { 'struct': 'TdxGuestProperties', - 'data': { '*sept-ve-disable': 'bool' } } + 'data': { '*sept-ve-disable': 'bool', +'*mrconfigid': 'str', +'*mrowner': 'str', +'*mrownerconfig': 'str' } } ## # @ThreadContextProperties: diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index d0ad4f57b5d0..4ce2f1d082ce 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -13,6 +13,7 @@ #include "qemu/osdep.h" #include "qemu/error-report.h" +#include "qemu/base64.h" #include "qapi/error.h" #include "qom/object_interfaces.h" #include "standard-headers/asm-x86/kvm_para.h" @@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) X86CPU *x86cpu = X86_CPU(cpu); CPUX86State *env = >env; g_autofree struct kvm_tdx_init_vm *init_vm = NULL; +size_t data_len; int r = 0; object_property_set_bool(OBJECT(cpu), "pmu", false, _abort); @@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) + sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES); +#define SHA384_DIGEST_SIZE 48 + +if (tdx_guest->mrconfigid) { +g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid, + strlen(tdx_guest->mrconfigid), _len, errp); +if (!data || data_len != SHA384_DIGEST_SIZE) { +error_setg(errp, "TDX: failed to decode mrconfigid"); +return -1; +} +memcpy(init_vm->mrconfigid, data, data_len); +} When @mrconfigid is absent, the property remains null, and this conditional is not executed. init_vm->mrconfigid[], an array of 6 __u64, remains all zero. How does the kernel treat that? A all-zero SHA384 value is still a valid value, isn't it? KVM treats it with no difference. + +if (tdx_guest->mrowner) { +g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrowner, + strlen(tdx_guest->mrowner), _len, errp); +if (!data || data_len != SHA384_DIGEST_SIZE) { +error_s
[PATCH v5 39/65] i386/tdx: Skip BIOS shadowing setup
TDX doesn't support map different GPAs to same private memory. Thus, aliasing top 128KB of BIOS as isa-bios is not supported. On the other hand, TDX guest cannot go to real mode, it can work fine without isa-bios. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- Changes in v1: - update commit message and comment to clarify --- hw/i386/x86.c | 25 ++--- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 5a0cadc88c4f..61c45dfc14dd 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -1190,17 +1190,20 @@ void x86_bios_rom_init(MachineState *ms, const char *default_firmware, } g_free(filename); -/* map the last 128KB of the BIOS in ISA space */ -isa_bios_size = MIN(bios_size, 128 * KiB); -isa_bios = g_malloc(sizeof(*isa_bios)); -memory_region_init_alias(isa_bios, NULL, "isa-bios", bios, - bios_size - isa_bios_size, isa_bios_size); -memory_region_add_subregion_overlap(rom_memory, -0x10 - isa_bios_size, -isa_bios, -1); -if (!isapc_ram_fw) { -memory_region_set_readonly(isa_bios, true); +/* For TDX, alias different GPAs to same private memory is not supported */ +if (!is_tdx_vm()) { +/* map the last 128KB of the BIOS in ISA space */ +isa_bios_size = MIN(bios_size, 128 * KiB); +isa_bios = g_malloc(sizeof(*isa_bios)); +memory_region_init_alias(isa_bios, NULL, "isa-bios", bios, +bios_size - isa_bios_size, isa_bios_size); +memory_region_add_subregion_overlap(rom_memory, +0x10 - isa_bios_size, +isa_bios, +1); +if (!isapc_ram_fw) { +memory_region_set_readonly(isa_bios, true); +} } /* map all the bios at the top of memory */ -- 2.34.1
[PATCH v5 40/65] i386/tdx: Don't initialize pc.rom for TDX VMs
For TDX, the address below 1MB are entirely general RAM. No need to initialize pc.rom memory region for TDs. Signed-off-by: Xiaoyao Li --- This is more as a workaround of the issue that for q35 machine type, the real memslot update (which requires memslot deletion )for pc.rom happens after tdx_init_memory_region. It leads to the private memory ADD'ed before get lost. I haven't work out a good solution to resolve the order issue. So just skip the pc.rom setup to avoid memslot deletion. --- hw/i386/pc.c | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/hw/i386/pc.c b/hw/i386/pc.c index f5ff970acfa0..3f8dd218eb08 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -43,6 +43,7 @@ #include "sysemu/xen.h" #include "sysemu/reset.h" #include "kvm/kvm_i386.h" +#include "kvm/tdx.h" #include "hw/xen/xen.h" #include "qapi/qmp/qlist.h" #include "qemu/error-report.h" @@ -1028,16 +1029,18 @@ void pc_memory_init(PCMachineState *pcms, /* Initialize PC system firmware */ pc_system_firmware_init(pcms, rom_memory); -option_rom_mr = g_malloc(sizeof(*option_rom_mr)); -memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE, - _fatal); -if (pcmc->pci_enabled) { -memory_region_set_readonly(option_rom_mr, true); +if (!is_tdx_vm()) { +option_rom_mr = g_malloc(sizeof(*option_rom_mr)); +memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE, +_fatal); +if (pcmc->pci_enabled) { +memory_region_set_readonly(option_rom_mr, true); +} +memory_region_add_subregion_overlap(rom_memory, +PC_ROM_MIN_VGA, +option_rom_mr, +1); } -memory_region_add_subregion_overlap(rom_memory, -PC_ROM_MIN_VGA, -option_rom_mr, -1); fw_cfg = fw_cfg_arch_create(machine, x86ms->boot_cpus, x86ms->apic_id_limit); -- 2.34.1
[PATCH v5 52/65] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility
Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility Originated-from: Isaku Yamahata Signed-off-by: Xiaoyao Li --- Changes in v5: - mention additional error information in gpa when it presents; - refine the documentation; (Markus) Changes in v4: - refine the documentation; (Markus) Changes in v3: - Add docmentation of new type and struct; (Daniel) - refine the error message handling; (Daniel) --- qapi/run-state.json | 31 +-- system/runstate.c | 58 +++ target/i386/kvm/tdx.c | 24 +- 3 files changed, 110 insertions(+), 3 deletions(-) diff --git a/qapi/run-state.json b/qapi/run-state.json index dd0770b379e5..b71dd1884eb6 100644 --- a/qapi/run-state.json +++ b/qapi/run-state.json @@ -483,10 +483,12 @@ # # @s390: s390 guest panic information type (Since: 2.12) # +# @tdx: tdx guest panic information type (Since: 9.0) +# # Since: 2.9 ## { 'enum': 'GuestPanicInformationType', - 'data': [ 'hyper-v', 's390' ] } + 'data': [ 'hyper-v', 's390', 'tdx' ] } ## # @GuestPanicInformation: @@ -501,7 +503,8 @@ 'base': {'type': 'GuestPanicInformationType'}, 'discriminator': 'type', 'data': {'hyper-v': 'GuestPanicInformationHyperV', - 's390': 'GuestPanicInformationS390'}} + 's390': 'GuestPanicInformationS390', + 'tdx' : 'GuestPanicInformationTdx'}} ## # @GuestPanicInformationHyperV: @@ -564,6 +567,30 @@ 'psw-addr': 'uint64', 'reason': 'S390CrashReason'}} +## +# @GuestPanicInformationTdx: +# +# TDX Guest panic information specific to TDX, as specified in the +# "Guest-Hypervisor Communication Interface (GHCI) Specification", +# section TDG.VP.VMCALL. +# +# @error-code: TD-specific error code +# +# @message: Human-readable error message provided by the guest. Not +# to be trusted. +# +# @gpa: guest-physical address of a page that contains more verbose +# error information, as zero-terminated string. Present when the +# "GPA valid" bit (bit 63) is set in @error-code. +# +# +# Since: 9.0 +## +{'struct': 'GuestPanicInformationTdx', + 'data': {'error-code': 'uint64', + 'message': 'str', + '*gpa': 'uint64'}} + ## # @MEMORY_FAILURE: # diff --git a/system/runstate.c b/system/runstate.c index d6ab860ecaa7..3b628c38739d 100644 --- a/system/runstate.c +++ b/system/runstate.c @@ -519,6 +519,52 @@ static void qemu_system_wakeup(void) } } +static char* tdx_parse_panic_message(char *message) +{ +bool printable = false; +char *buf = NULL; +int len = 0, i; + +/* + * Although message is defined as a json string, we shouldn't + * unconditionally treat it as is because the guest generated it and + * it's not necessarily trustable. + */ +if (message) { +/* The caller guarantees the NUL-terminated string. */ +len = strlen(message); + +printable = len > 0; +for (i = 0; i < len; i++) { +if (!(0x20 <= message[i] && message[i] <= 0x7e)) { +printable = false; +break; +} +} +} + +if (!printable && len) { +/* 3 = length of "%02x " */ +buf = g_malloc(len * 3); +for (i = 0; i < len; i++) { +if (message[i] == '\0') { +break; +} else { +sprintf(buf + 3 * i, "%02x ", message[i]); +} +} +if (i > 0) +/* replace the last ' '(space) to NUL */ +buf[i * 3 - 1] = '\0'; +else +buf[0] = '\0'; + +return buf; +} + +return message; +} + void qemu_system_guest_panicked(GuestPanicInformation *info) { qemu_log_mask(LOG_GUEST_ERROR, "Guest crashed"); @@ -560,7 +606,19 @@ void qemu_system_guest_panicked(GuestPanicInformation *info) S390CrashReason_str(info->u.s390.reason), info->u.s390.psw_mask, info->u.s390.psw_addr); +} else if (info->type == GUEST_PANIC_INFORMATION_TYPE_TDX) { +qemu_log_mask(LOG_GUEST_ERROR, + " TDX guest reports fatal error:" + " error code: 0x%#" PRIx64 " error message:\"%s\"\n", + info->u.tdx.error_code, + tdx_parse_panic_message(info->u.tdx.message)); +if (info->u.tdx.error_code & (1ull << 63)) { +qemu_log_mask(LOG_GUEST_ERROR, "Additional error information " + "can be found at gpa page: 0x%#" PRIx64 "\n", + info->u.tdx.gpa); +} } + qapi_free_GuestPanicInformation(info); } } diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm
[PATCH v5 32/65] i386/tdx: Set kvm_readonly_mem_enabled to false for TDX VM
TDX only supports readonly for shared memory but not for private memory. In the view of QEMU, it has no idea whether a memslot is used as shared memory of private. Thus just mark kvm_readonly_mem_enabled to false to TDX VM for simplicity. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/tdx.c | 9 + 1 file changed, 9 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 42dbb5ce5c15..13f069171db7 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -480,6 +480,15 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) update_tdx_cpuid_lookup_by_tdx_caps(); +/* + * Set kvm_readonly_mem_allowed to false, because TDX only supports readonly + * memory for shared memory but not for private memory. Besides, whether a + * memslot is private or shared is not determined by QEMU. + * + * Thus, just mark readonly memory not supported for simplicity. + */ +kvm_readonly_mem_allowed = false; + tdx_guest = tdx; return 0; } -- 2.34.1
[PATCH v5 65/65] docs: Add TDX documentation
Add docs/system/i386/tdx.rst for TDX support, and add tdx in confidential-guest-support.rst Signed-off-by: Xiaoyao Li --- Changes in v5: - Add TD attestation section and update the QEMU parameter; Changes since v1: - Add prerequisite of private gmem; - update example command to launch TD; Changes since RFC v4: - add the restriction that kernel-irqchip must be split --- docs/system/confidential-guest-support.rst | 1 + docs/system/i386/tdx.rst | 143 + docs/system/target-i386.rst| 1 + 3 files changed, 145 insertions(+) create mode 100644 docs/system/i386/tdx.rst diff --git a/docs/system/confidential-guest-support.rst b/docs/system/confidential-guest-support.rst index 0c490dbda2b7..66129fbab64c 100644 --- a/docs/system/confidential-guest-support.rst +++ b/docs/system/confidential-guest-support.rst @@ -38,6 +38,7 @@ Supported mechanisms Currently supported confidential guest mechanisms are: * AMD Secure Encrypted Virtualization (SEV) (see :doc:`i386/amd-memory-encryption`) +* Intel Trust Domain Extension (TDX) (see :doc:`i386/tdx`) * POWER Protected Execution Facility (PEF) (see :ref:`power-papr-protected-execution-facility-pef`) * s390x Protected Virtualization (PV) (see :doc:`s390x/protvirt`) diff --git a/docs/system/i386/tdx.rst b/docs/system/i386/tdx.rst new file mode 100644 index ..8491cdcfa163 --- /dev/null +++ b/docs/system/i386/tdx.rst @@ -0,0 +1,143 @@ +Intel Trusted Domain eXtension (TDX) + + +Intel Trusted Domain eXtensions (TDX) refers to an Intel technology that extends +Virtual Machine Extensions (VMX) and Multi-Key Total Memory Encryption (MKTME) +with a new kind of virtual machine guest called a Trust Domain (TD). A TD runs +in a CPU mode that is designed to protect the confidentiality of its memory +contents and its CPU state from any other software, including the hosting +Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself. + +Prerequisites +- + +To run TD, the physical machine needs to have TDX module loaded and initialized +while KVM hypervisor has TDX support and has TDX enabled. If those requirements +are met, the ``KVM_CAP_VM_TYPES`` will report the support of ``KVM_X86_TDX_VM``. + +Trust Domain Virtual Firmware (TDVF) + + +Trust Domain Virtual Firmware (TDVF) is required to provide TD services to boot +TD Guest OS. TDVF needs to be copied to guest private memory and measured before +the TD boots. + +KVM vcpu ioctl ``KVM_MEMORY_MAPPING`` can be used to populates the TDVF content +into its private memory. + +Since TDX doesn't support readonly memslot, TDVF cannot be mapped as pflash +device and it actually works as RAM. "-bios" option is chosen to load TDVF. + +OVMF is the opensource firmware that implements the TDVF support. Thus the +command line to specify and load TDVF is ``-bios OVMF.fd`` + +KVM private memory + + +TD's memory (RAM) needs to be able to be transformed between private and shared. +Its BIOS (OVMF/TDVF) needs to be mapped as private as well. Thus QEMU needs to +allocate private guest memfd for them via KVM's IOCTL (KVM_CREATE_GUEST_MEMFD), +which requires KVM is newer enough that reports KVM_CAP_GUEST_MEMFD. + +Feature Control +--- + +Unlike non-TDX VM, the CPU features (enumerated by CPU or MSR) of a TD is not +under full control of VMM. VMM can only configure part of features of a TD on +``KVM_TDX_INIT_VM`` command of VM scope ``MEMORY_ENCRYPT_OP`` ioctl. + +The configurable features have three types: + +- Attributes: + - PKS (bit 30) controls whether Supervisor Protection Keys is exposed to TD, + which determines related CPUID bit and CR4 bit; + - PERFMON (bit 63) controls whether PMU is exposed to TD. + +- XSAVE related features (XFAM): + XFAM is a 64b mask, which has the same format as XCR0 or IA32_XSS MSR. It + determines the set of extended features available for use by the guest TD. + +- CPUID features: + Only some bits of some CPUID leaves are directly configurable by VMM. + +What features can be configured is reported via TDX capabilities. + +TDX capabilities + + +The VM scope ``MEMORY_ENCRYPT_OP`` ioctl provides command ``KVM_TDX_CAPABILITIES`` +to get the TDX capabilities from KVM. It returns a data structure of +``struct kvm_tdx_capabilites``, which tells the supported configuration of +attributes, XFAM and CPUIDs. + +TD attestation +-- + +In TD guest, the attestation process is used to verify the TDX guest +trustworthiness to other entities before provisioning secrets to the guest. + +TD attestation is initiated first by calling TDG.MR.REPORT inside TD to get the +REPORT. Then the REPORT data needs to be converted into a remotely verifiable +Quote by SGX Quoting Enclave (QE). + +A host daemon, Quote Generation Service (QGS), provides the functionality of +SGX GE. I
[PATCH v5 28/65] i386/tdx: Disable pmu for TD guest
Current KVM doesn't support PMU for TD guest. It returns error if TD is created with PMU bit being set in attributes. Disable PMU for TD guest on QEMU side. Signed-off-by: Xiaoyao Li --- target/i386/kvm/tdx.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 262e86fd2c67..1c12cda002b8 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -496,6 +496,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) g_autofree struct kvm_tdx_init_vm *init_vm = NULL; int r = 0; +object_property_set_bool(OBJECT(cpu), "pmu", false, _abort); + QEMU_LOCK_GUARD(_guest->lock); if (tdx_guest->initialized) { return r; -- 2.34.1
[PATCH v5 34/65] kvm/tdx: Ignore memory conversion to shared of unassigned region
From: Isaku Yamahata TDX requires vMMIO region to be shared. For KVM, MMIO region is the region which kvm memslot isn't assigned to (except in-kernel emulation). qemu has the memory region for vMMIO at each device level. While OVMF issues MapGPA(to-shared) conservatively on 32bit PCI MMIO region, qemu doesn't find corresponding vMMIO region because it's before PCI device allocation and memory_region_find() finds the device region, not PCI bus region. It's safe to ignore MapGPA(to-shared) because when guest accesses those region they use GPA with shared bit set for vMMIO. Ignore memory conversion request of non-assigned region to shared and return success. Otherwise OVMF is confused and panics there. Signed-off-by: Isaku Yamahata Signed-off-by: Xiaoyao Li --- accel/kvm/kvm-all.c | 12 1 file changed, 12 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index d533e2611ad8..9dc17a1b5f43 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2953,6 +2953,18 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) section = memory_region_find(get_system_memory(), start, size); mr = section.mr; if (!mr) { +/* + * Ignore converting non-assigned region to shared. + * + * TDX requires vMMIO region to be shared to inject #VE to guest. + * OVMF issues conservatively MapGPA(shared) on 32bit PCI MMIO region, + * and vIO-APIC 0xFEC0 4K page. + * OVMF assigns 32bit PCI MMIO region to + * [top of low memory: typically 2GB=0xC00, 0xFC0) + */ +if (!to_private) { +return 0; +} return -1; } -- 2.34.1
[PATCH v5 29/65] i386/tdx: Validate TD attributes
Validate TD attributes with tdx_caps that fixed-0 bits must be zero and fixed-1 bits must be set. Besides, sanity check the attribute bits that have not been supported by QEMU yet. e.g., debug bit, it will be allowed in the future when debug TD support lands in QEMU. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- Changes in v3: - using error_setg() for error report; (Daniel) --- target/i386/kvm/tdx.c | 29 +++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 1c12cda002b8..d0ad4f57b5d0 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -32,6 +32,7 @@ (1U << KVM_FEATURE_PV_SCHED_YIELD) | \ (1U << KVM_FEATURE_MSI_EXT_DEST_ID)) +#define TDX_TD_ATTRIBUTES_DEBUG BIT_ULL(0) #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE BIT_ULL(28) #define TDX_TD_ATTRIBUTES_PKS BIT_ULL(30) #define TDX_TD_ATTRIBUTES_PERFMON BIT_ULL(63) @@ -479,13 +480,34 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) return 0; } -static void setup_td_guest_attributes(X86CPU *x86cpu) +static int tdx_validate_attributes(TdxGuest *tdx, Error **errp) +{ +if (((tdx->attributes & tdx_caps->attrs_fixed0) | tdx_caps->attrs_fixed1) != +tdx->attributes) { +error_setg(errp, "Invalid attributes 0x%lx for TDX VM " + "(fixed0 0x%llx, fixed1 0x%llx)", + tdx->attributes, tdx_caps->attrs_fixed0, + tdx_caps->attrs_fixed1); +return -1; +} + +if (tdx->attributes & TDX_TD_ATTRIBUTES_DEBUG) { +error_setg(errp, "Current QEMU doesn't support attributes.debug[bit 0] for TDX VM"); +return -1; +} + +return 0; +} + +static int setup_td_guest_attributes(X86CPU *x86cpu, Error **errp) { CPUX86State *env = >env; tdx_guest->attributes |= (env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_PKS) ? TDX_TD_ATTRIBUTES_PKS : 0; tdx_guest->attributes |= x86cpu->enable_pmu ? TDX_TD_ATTRIBUTES_PERFMON : 0; + +return tdx_validate_attributes(tdx_guest, errp); } int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) @@ -512,7 +534,10 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) return r; } -setup_td_guest_attributes(x86cpu); +r = setup_td_guest_attributes(x86cpu, errp); +if (r) { +return r; +} init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0); -- 2.34.1
[PATCH v5 35/65] memory: Introduce memory_region_init_ram_guest_memfd()
Introduce memory_region_init_ram_guest_memfd() to allocate private guset memfd on the MemoryRegion initialization. It's for the use case of TDVF, which must be private on TDX case. Signed-off-by: Xiaoyao Li --- Changes in v5: - drop memory_region_set_default_private() because this function is dropped in this v5 series; --- include/exec/memory.h | 6 ++ system/memory.c | 25 + 2 files changed, 31 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 679a8476852e..1e351f6fc875 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1603,6 +1603,12 @@ bool memory_region_init_ram(MemoryRegion *mr, uint64_t size, Error **errp); +bool memory_region_init_ram_guest_memfd(MemoryRegion *mr, +Object *owner, +const char *name, +uint64_t size, +Error **errp); + /** * memory_region_init_rom: Initialize a ROM memory region. * diff --git a/system/memory.c b/system/memory.c index c756950c0c0f..85a22408e9a4 100644 --- a/system/memory.c +++ b/system/memory.c @@ -3606,6 +3606,31 @@ bool memory_region_init_ram(MemoryRegion *mr, return true; } +bool memory_region_init_ram_guest_memfd(MemoryRegion *mr, +Object *owner, +const char *name, +uint64_t size, +Error **errp) +{ +DeviceState *owner_dev; + +if (!memory_region_init_ram_flags_nomigrate(mr, owner, name, size, +RAM_GUEST_MEMFD, errp)) { +return false; +} + +/* This will assert if owner is neither NULL nor a DeviceState. + * We only want the owner here for the purposes of defining a + * unique name for migration. TODO: Ideally we should implement + * a naming scheme for Objects which are not DeviceStates, in + * which case we can relax this restriction. + */ +owner_dev = DEVICE(owner); +vmstate_register_ram(mr, owner_dev); + +return true; +} + bool memory_region_init_rom(MemoryRegion *mr, Object *owner, const char *name, -- 2.34.1
[PATCH v5 06/65] kvm: Introduce support for memory_attributes
Introduce the helper functions to set the attributes of a range of memory to private or shared. This is necessary to notify KVM the private/shared attribute of each gpa range. KVM needs the information to decide the GPA needs to be mapped at hva-based shared memory or guest_memfd based private memory. Signed-off-by: Xiaoyao Li --- Changes in v4: - move the check of kvm_supported_memory_attributes to the common kvm_set_memory_attributes(); (Wang Wei) - change warn_report() to error_report() in kvm_set_memory_attributes() and drop the __func__; (Daniel) --- accel/kvm/kvm-all.c | 44 include/sysemu/kvm.h | 3 +++ 2 files changed, 47 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index cd0aa7545a1f..70d482a2c936 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -92,6 +92,7 @@ static bool kvm_has_guest_debug; static int kvm_sstep_flags; static bool kvm_immediate_exit; static bool kvm_guest_memfd_supported; +static uint64_t kvm_supported_memory_attributes; static hwaddr kvm_max_slot_size = ~0; static const KVMCapabilityInfo kvm_required_capabilites[] = { @@ -1304,6 +1305,46 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size) kvm_max_slot_size = max_slot_size; } +static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr) +{ +struct kvm_memory_attributes attrs; +int r; + +if (kvm_supported_memory_attributes == 0) { +error_report("No memory attribute supported by KVM\n"); +return -EINVAL; +} + +if ((attr & kvm_supported_memory_attributes) != attr) { +error_report("memory attribute 0x%lx not supported by KVM," + " supported bits are 0x%lx\n", + attr, kvm_supported_memory_attributes); +return -EINVAL; +} + +attrs.attributes = attr; +attrs.address = start; +attrs.size = size; +attrs.flags = 0; + +r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, ); +if (r) { +error_report("failed to set memory (0x%lx+%#zx) with attr 0x%lx error '%s'", + start, size, attr, strerror(errno)); +} +return r; +} + +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size) +{ +return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE); +} + +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size) +{ +return kvm_set_memory_attributes(start, size, 0); +} + /* Called with KVMMemoryListener.slots_lock held */ static void kvm_set_phys_mem(KVMMemoryListener *kml, MemoryRegionSection *section, bool add) @@ -2439,6 +2480,9 @@ static int kvm_init(MachineState *ms) kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD); +ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES); +kvm_supported_memory_attributes = ret > 0 ? ret : 0; + if (object_property_find(OBJECT(current_machine), "kvm-type")) { g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine), "kvm-type", diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index 6cdf82de8372..8e83adfbbd19 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -546,4 +546,7 @@ uint32_t kvm_dirty_ring_size(void); bool kvm_hwpoisoned_mem(void); int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp); + +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size); +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size); #endif -- 2.34.1
[PATCH v5 30/65] i386/tdx: Support user configurable mrconfigid/mrowner/mrownerconfig
From: Isaku Yamahata Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD can be provided for TDX attestation. Detailed meaning of them can be found: https://lore.kernel.org/qemu-devel/31d6dbc1-f453-4cef-ab08-4813f4e0f...@intel.com/ Allow user to specify those values via property mrconfigid, mrowner and mrownerconfig. They are all in base64 format. example -object tdx-guest, \ mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\ mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v Signed-off-by: Isaku Yamahata Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li --- Changes in v5: - refine the description of QAPI properties and add description of default value when not specified; Changes in v4: - describe more of there fields in qom.json - free the old value before set new value to avoid memory leak in _setter(); (Daniel) Changes in v3: - use base64 encoding instread of hex-string; --- qapi/qom.json | 17 - target/i386/kvm/tdx.c | 87 +++ target/i386/kvm/tdx.h | 3 ++ 3 files changed, 106 insertions(+), 1 deletion(-) diff --git a/qapi/qom.json b/qapi/qom.json index 89ed89b9b46e..cac875349a3a 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -905,10 +905,25 @@ # pages. Some guest OS (e.g., Linux TD guest) may require this to # be set, otherwise they refuse to boot. # +# @mrconfigid: ID for non-owner-defined configuration of the guest TD, +# e.g., run-time or OS configuration (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). +# +# @mrowner: ID for the guest TD’s owner (base64 encoded SHA384 digest). +# (A default value 0 of SHA384 is used when absent). +# +# @mrownerconfig: ID for owner-defined configuration of the guest TD, +# e.g., specific to the workload rather than the run-time or OS +# (base64 encoded SHA384 digest). (A default value 0 of SHA384 is +# used when absent). +# # Since: 9.0 ## { 'struct': 'TdxGuestProperties', - 'data': { '*sept-ve-disable': 'bool' } } + 'data': { '*sept-ve-disable': 'bool', +'*mrconfigid': 'str', +'*mrowner': 'str', +'*mrownerconfig': 'str' } } ## # @ThreadContextProperties: diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index d0ad4f57b5d0..4ce2f1d082ce 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -13,6 +13,7 @@ #include "qemu/osdep.h" #include "qemu/error-report.h" +#include "qemu/base64.h" #include "qapi/error.h" #include "qom/object_interfaces.h" #include "standard-headers/asm-x86/kvm_para.h" @@ -516,6 +517,7 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) X86CPU *x86cpu = X86_CPU(cpu); CPUX86State *env = >env; g_autofree struct kvm_tdx_init_vm *init_vm = NULL; +size_t data_len; int r = 0; object_property_set_bool(OBJECT(cpu), "pmu", false, _abort); @@ -528,6 +530,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) + sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES); +#define SHA384_DIGEST_SIZE 48 + +if (tdx_guest->mrconfigid) { +g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrconfigid, + strlen(tdx_guest->mrconfigid), _len, errp); +if (!data || data_len != SHA384_DIGEST_SIZE) { +error_setg(errp, "TDX: failed to decode mrconfigid"); +return -1; +} +memcpy(init_vm->mrconfigid, data, data_len); +} + +if (tdx_guest->mrowner) { +g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrowner, + strlen(tdx_guest->mrowner), _len, errp); +if (!data || data_len != SHA384_DIGEST_SIZE) { +error_setg(errp, "TDX: failed to decode mrowner"); +return -1; +} +memcpy(init_vm->mrowner, data, data_len); +} + +if (tdx_guest->mrownerconfig) { +g_autofree uint8_t *data = qbase64_decode(tdx_guest->mrownerconfig, + strlen(tdx_guest->mrownerconfig), _len, errp); +if (!data || data_len != SHA384_DIGEST_SIZE) { +error_setg(errp, "TDX: failed to decode mrownerconfig"); +return -1; +} +memcpy(init_vm->mrownerconfig, data, data_len); +} + r = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPUS, 0, ms->smp.cpus); if (r < 0) { error_setg(errp, "Unable to set MAX VCPUS to %d", ms->smp.cpus); @@ -574,6 +608,51 @@ static void tdx_guest_set_sept_ve_disable(Object *obj, bool value, Error **errp) } } +static
[PATCH v5 51/65] i386/tdx: Handle TDG.VP.VMCALL
TD guest can use TDG.VP.VMCALL to request termination with error message encoded in GPRs. Parse and print the error message, and terminate the TD guest in the handler. Signed-off-by: Xiaoyao Li --- target/i386/kvm/tdx.c | 39 +++ target/i386/kvm/tdx.h | 1 + 2 files changed, 40 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 88d245577594..6cd72a26b970 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -1092,6 +1092,43 @@ static int tdx_handle_get_quote(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall) return 0; } +static int tdx_handle_report_fatal_error(X86CPU *cpu, + struct kvm_tdx_vmcall *vmcall) +{ +uint64_t error_code = vmcall->in_r12; +char *message = NULL; + +if (error_code & 0x) { +error_report("TDX: REPORT_FATAL_ERROR: invalid error code: " + "0x%lx\n", error_code); +return -1; +} + +/* it has optional message */ +if (vmcall->in_r14) { +uint64_t * tmp; + +#define GUEST_PANIC_INFO_TDX_MESSAGE_MAX64 +message = g_malloc0(GUEST_PANIC_INFO_TDX_MESSAGE_MAX + 1); + +tmp = (uint64_t *)message; +/* The order is defined in TDX GHCI spec */ +*(tmp++) = cpu_to_le64(vmcall->in_r14); +*(tmp++) = cpu_to_le64(vmcall->in_r15); +*(tmp++) = cpu_to_le64(vmcall->in_rbx); +*(tmp++) = cpu_to_le64(vmcall->in_rdi); +*(tmp++) = cpu_to_le64(vmcall->in_rsi); +*(tmp++) = cpu_to_le64(vmcall->in_r8); +*(tmp++) = cpu_to_le64(vmcall->in_r9); +*(tmp++) = cpu_to_le64(vmcall->in_rdx); +message[GUEST_PANIC_INFO_TDX_MESSAGE_MAX] = '\0'; +assert((char *)tmp == message + GUEST_PANIC_INFO_TDX_MESSAGE_MAX); +} + +error_report("TD guest reports fatal error. %s\n", message ? : ""); +return -1; +} + static int tdx_handle_setup_event_notify_interrupt(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall) { @@ -1126,6 +1163,8 @@ static int tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall) return tdx_handle_map_gpa(cpu, vmcall); case TDG_VP_VMCALL_GET_QUOTE: return tdx_handle_get_quote(cpu, vmcall); +case TDG_VP_VMCALL_REPORT_FATAL_ERROR: +return tdx_handle_report_fatal_error(cpu, vmcall); case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT: return tdx_handle_setup_event_notify_interrupt(cpu, vmcall); default: diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h index b6b8742e79f7..86b1f80d6f49 100644 --- a/target/i386/kvm/tdx.h +++ b/target/i386/kvm/tdx.h @@ -20,6 +20,7 @@ typedef struct TdxGuestClass { #define TDG_VP_VMCALL_MAP_GPA 0x10001ULL #define TDG_VP_VMCALL_GET_QUOTE 0x10002ULL +#define TDG_VP_VMCALL_REPORT_FATAL_ERROR0x10003ULL #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004ULL #define TDG_VP_VMCALL_SUCCESS 0xULL -- 2.34.1
[PATCH v5 59/65] hw/i386: add eoi_intercept_unsupported member to X86MachineState
Add a new bool member, eoi_intercept_unsupported, to X86MachineState with default value false. Set true for TDX VM. Inability to intercept eoi causes impossibility to emulate level triggered interrupt to be re-injected when level is still kept active. which affects interrupt controller emulation. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- hw/i386/x86.c | 1 + include/hw/i386/x86.h | 1 + target/i386/kvm/tdx.c | 2 ++ 3 files changed, 4 insertions(+) diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 61c45dfc14dd..6ff2475535bc 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -1425,6 +1425,7 @@ static void x86_machine_initfn(Object *obj) x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8); x86ms->bus_lock_ratelimit = 0; x86ms->above_4g_mem_start = 4 * GiB; +x86ms->eoi_intercept_unsupported = false; } static void x86_machine_class_init(ObjectClass *oc, void *data) diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h index d28e79cc484a..033f0a34891b 100644 --- a/include/hw/i386/x86.h +++ b/include/hw/i386/x86.h @@ -60,6 +60,7 @@ struct X86MachineState { uint64_t above_4g_mem_start; /* CPU and apic information: */ +bool eoi_intercept_unsupported; unsigned pci_irq_mask; unsigned apic_id_limit; uint16_t boot_cpus; diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 0225a9b79b36..b1fb326bd395 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -727,6 +727,8 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) return -EINVAL; } +x86ms->eoi_intercept_unsupported = true; + if (!tdx_caps) { r = get_tdx_capabilities(errp); if (r) { -- 2.34.1
[PATCH v5 18/65] i386/tdx: Make Intel-PT unsupported for TD guest
Due to the fact that Intel-PT virtualization support has been broken in QEMU since Sapphire Rapids generation[1], below warning is triggered when luanching TD guest: warning: host doesn't support requested feature: CPUID.07H:EBX.intel-pt [bit 25] Before Intel-pt is fixed in QEMU, just make Intel-PT unsupported for TD guest, to avoid the confusing warning. [1] https://lore.kernel.org/qemu-devel/20230531084311.3807277-1-xiaoyao...@intel.com/ Signed-off-by: Xiaoyao Li --- Changes in v4: - newly added patch; --- target/i386/kvm/tdx.c | 5 + 1 file changed, 5 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 85d96140b450..239170142e4f 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -292,6 +292,11 @@ void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg, if (function == 1 && reg == R_ECX && !enable_cpu_pm) { *ret &= ~CPUID_EXT_MONITOR; } + +/* QEMU Intel-pt support is broken, don't advertise Intel-PT */ +if (function == 7 && reg == R_EBX) { +*ret &= ~CPUID_7_0_EBX_INTEL_PT; +} } enum tdx_ioctl_level{ -- 2.34.1
[PATCH v5 17/65] i386/tdx: Adjust the supported CPUID based on TDX restrictions
According to Chapter "CPUID Virtualization" in TDX module spec, CPUID bits of TD can be classified into 6 types: 1 | As configured | configurable by VMM, independent of native value; 2 | As configured | configurable by VMM if the bit is supported natively (if native) | Otherwise it equals as native(0). 3 | Fixed | fixed to 0/1 4 | Native| reflect the native value 5 | Calculated| calculated by TDX module. 6 | Inducing #VE | get #VE exception Note: 1. All the configurable XFAM related features and TD attributes related features fall into type #2. And fixed0/1 bits of XFAM and TD attributes fall into type #3. 2. For CPUID leaves not listed in "CPUID virtualization Overview" table in TDX module spec, TDX module injects #VE to TDs when those are queried. For this case, TDs can request CPUID emulation from VMM via TDVMCALL and the values are fully controlled by VMM. Due to TDX module has its own virtualization policy on CPUID bits, it leads to what reported via KVM_GET_SUPPORTED_CPUID diverges from the supported CPUID bits for TDs. In order to keep a consistent CPUID configuration between VMM and TDs. Adjust supported CPUID for TDs based on TDX restrictions. Currently only focus on the CPUID leaves recognized by QEMU's feature_word_info[] that are indexed by a FeatureWord. Introduce a TDX CPUID lookup table, which maintains 1 entry for each FeatureWord. Each entry has below fields: - tdx_fixed0/1: The bits that are fixed as 0/1; - depends_on_vmm_cap: The bits that are configurable from the view of TDX module. But they requires emulation of VMM when configured as enabled. For those, they are not supported if VMM doesn't report them as supported. So they need be fixed up by checking if VMM supports them. - inducing_ve: TD gets #VE when querying this CPUID leaf. The result is totally configurable by VMM. - supported_on_ve: It's valid only when @inducing_ve is true. It represents the maximum feature set supported that be emulated for TDs. By applying TDX CPUID lookup table and TDX capabilities reported from TDX module, the supported CPUID for TDs can be obtained from following steps: - get the base of VMM supported feature set; - if the leaf is not a FeatureWord just return VMM's value without modification; - if the leaf is an inducing_ve type, applying supported_on_ve mask and return; - include all native bits, it covers type #2, #4, and parts of type #1. (it also includes some unsupported bits. The following step will correct it.) - apply fixed0/1 to it (it covers #3, and rectifies the previous step); - add configurable bits (it covers the other part of type #1); - fix the ones in vmm_fixup; (Calculated type is ignored since it's determined at runtime). Co-developed-by: Chenyi Qiang Signed-off-by: Chenyi Qiang Signed-off-by: Xiaoyao Li --- target/i386/cpu.h | 16 +++ target/i386/kvm/kvm.c | 4 + target/i386/kvm/tdx.c | 263 ++ target/i386/kvm/tdx.h | 3 + 4 files changed, 286 insertions(+) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 952174bb6f52..7bd604f802a1 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -787,6 +787,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w, /* Support RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE */ #define CPUID_7_0_EBX_FSGSBASE (1U << 0) +/* Support for TSC adjustment MSR 0x3B */ +#define CPUID_7_0_EBX_TSC_ADJUST(1U << 1) /* Support SGX */ #define CPUID_7_0_EBX_SGX (1U << 2) /* 1st Group of Advanced Bit Manipulation Extensions */ @@ -805,8 +807,12 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w, #define CPUID_7_0_EBX_INVPCID (1U << 10) /* Restricted Transactional Memory */ #define CPUID_7_0_EBX_RTM (1U << 11) +/* Cache QoS Monitoring */ +#define CPUID_7_0_EBX_PQM (1U << 12) /* Memory Protection Extension */ #define CPUID_7_0_EBX_MPX (1U << 14) +/* Resource Director Technology Allocation */ +#define CPUID_7_0_EBX_RDT_A (1U << 15) /* AVX-512 Foundation */ #define CPUID_7_0_EBX_AVX512F (1U << 16) /* AVX-512 Doubleword & Quadword Instruction */ @@ -862,10
[PATCH v5 22/65] i386/kvm: Move architectural CPUID leaf generation to separate helper
From: Sean Christopherson Move the architectural (for lack of a better term) CPUID leaf generation to a separate helper so that the generation code can be reused by TDX, which needs to generate a canonical VM-scoped configuration. Signed-off-by: Sean Christopherson Signed-off-by: Xiaoyao Li --- target/i386/kvm/kvm.c | 459 +++-- target/i386/kvm/kvm_i386.h | 3 + 2 files changed, 240 insertions(+), 222 deletions(-) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 389b631c03dd..315998c8f7e5 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -1731,6 +1731,241 @@ static void kvm_init_nested_state(CPUX86State *env) } } +uint32_t kvm_x86_arch_cpuid(CPUX86State *env, struct kvm_cpuid_entry2 *entries, +uint32_t cpuid_i) +{ +uint32_t limit, i, j; +uint32_t unused; +struct kvm_cpuid_entry2 *c; + +if (cpuid_i > KVM_MAX_CPUID_ENTRIES) { +error_report("exceeded cpuid index (%d) for entries[]", cpuid_i); +abort(); +} + +cpu_x86_cpuid(env, 0, 0, , , , ); + +for (i = 0; i <= limit; i++) { +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +fprintf(stderr, "unsupported level value: 0x%x\n", limit); +abort(); +} +c = [cpuid_i++]; + +switch (i) { +case 2: { +/* Keep reading function 2 till all the input is received */ +int times; + +c->function = i; +c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC | + KVM_CPUID_FLAG_STATE_READ_NEXT; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +times = c->eax & 0xff; + +for (j = 1; j < times; ++j) { +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +fprintf(stderr, "cpuid_data is full, no space for " +"cpuid(eax:2):eax & 0xf = 0x%x\n", times); +abort(); +} +c = [cpuid_i++]; +c->function = i; +c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +} +break; +} +case 0x1f: +if (env->nr_dies < 2) { +cpuid_i--; +break; +} +/* fallthrough */ +case 4: +case 0xb: +case 0xd: +for (j = 0; ; j++) { +if (i == 0xd && j == 64) { +break; +} + +c->function = i; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +c->index = j; +cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx); + +if (i == 4 && c->eax == 0) { +break; +} +if (i == 0xb && !(c->ecx & 0xff00)) { +break; +} +if (i == 0x1f && !(c->ecx & 0xff00)) { +break; +} +if (i == 0xd && c->eax == 0) { +continue; +} +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +fprintf(stderr, "cpuid_data is full, no space for " +"cpuid(eax:0x%x,ecx:0x%x)\n", i, j); +abort(); +} +c = [cpuid_i++]; +} +break; +case 0x12: +for (j = 0; ; j++) { +c->function = i; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +c->index = j; +cpu_x86_cpuid(env, i, j, >eax, >ebx, >ecx, >edx); + +if (j > 1 && (c->eax & 0xf) != 1) { +break; +} + +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +fprintf(stderr, "cpuid_data is full, no space for " +"cpuid(eax:0x12,ecx:0x%x)\n", j); +abort(); +} +c = [cpuid_i++]; +} +break; +case 0x7: +case 0x14: +case 0x1d: +case 0x1e: { +uint32_t times; + +c->function = i; +c->index = 0; +c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX; +cpu_x86_cpuid(env, i, 0, >eax, >ebx, >ecx, >edx); +times = c->eax; + +for (j = 1; j <= times; ++j) { +if (cpuid_i == KVM_MAX_CPUID_ENTRIES) { +fprintf(stderr, "cpuid_data is full, no space for " +"cpuid
[PATCH v5 53/65] pci-host/q35: Move PAM initialization above SMRAM initialization
From: Isaku Yamahata In mch_realize(), process PAM initialization before SMRAM initialization so that later patch can skill all the SMRAM related with a single check. Signed-off-by: Isaku Yamahata Signed-off-by: Xiaoyao Li --- hw/pci-host/q35.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c index 0d7d4e3f0860..98d4a7c253a6 100644 --- a/hw/pci-host/q35.c +++ b/hw/pci-host/q35.c @@ -568,6 +568,16 @@ static void mch_realize(PCIDevice *d, Error **errp) /* setup pci memory mapping */ pc_pci_as_mapping_init(mch->system_memory, mch->pci_address_space); +/* PAM */ +init_pam(>pam_regions[0], OBJECT(mch), mch->ram_memory, + mch->system_memory, mch->pci_address_space, + PAM_BIOS_BASE, PAM_BIOS_SIZE); +for (i = 0; i < ARRAY_SIZE(mch->pam_regions) - 1; ++i) { +init_pam(>pam_regions[i + 1], OBJECT(mch), mch->ram_memory, + mch->system_memory, mch->pci_address_space, + PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE); +} + /* if *disabled* show SMRAM to all CPUs */ memory_region_init_alias(>smram_region, OBJECT(mch), "smram-region", mch->pci_address_space, MCH_HOST_BRIDGE_SMRAM_C_BASE, @@ -634,15 +644,6 @@ static void mch_realize(PCIDevice *d, Error **errp) object_property_add_const_link(qdev_get_machine(), "smram", OBJECT(>smram)); - -init_pam(>pam_regions[0], OBJECT(mch), mch->ram_memory, - mch->system_memory, mch->pci_address_space, - PAM_BIOS_BASE, PAM_BIOS_SIZE); -for (i = 0; i < ARRAY_SIZE(mch->pam_regions) - 1; ++i) { -init_pam(>pam_regions[i + 1], OBJECT(mch), mch->ram_memory, - mch->system_memory, mch->pci_address_space, - PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE); -} } uint64_t mch_mcfg_base(void) -- 2.34.1
[PATCH v5 50/65] i386/tdx: handle TDG.VP.VMCALL hypercall
From: Isaku Yamahata MapGPA is a hypercall to convert GPA from/to private GPA to/from shared GPA. As the conversion function is already implemented as kvm_convert_memory, wire it to TDX hypercall exit. Signed-off-by: Isaku Yamahata Signed-off-by: Xiaoyao Li --- accel/kvm/kvm-all.c | 2 +- include/sysemu/kvm.h | 2 ++ target/i386/kvm/tdx.c | 54 +++ target/i386/kvm/tdx.h | 1 + 4 files changed, 58 insertions(+), 1 deletion(-) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 9dc17a1b5f43..2c83b6d270f7 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2930,7 +2930,7 @@ static void kvm_eat_signals(CPUState *cpu) } while (sigismember(, SIG_IPI)); } -static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) +int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) { MemoryRegionSection section; ram_addr_t offset; diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index 82b547848130..78fcf1c3a617 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -550,4 +550,6 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp); int kvm_set_memory_attributes_private(hwaddr start, hwaddr size); int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size); + +int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private); #endif diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 7dfda507cc8c..88d245577594 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -898,6 +898,58 @@ static hwaddr tdx_shared_bit(X86CPU *cpu) return (cpu->phys_bits > 48) ? BIT_ULL(51) : BIT_ULL(47); } +/* 64MB at most in one call. What value is appropriate? */ +#define TDX_MAP_GPA_MAX_LEN (64 * 1024 * 1024) + +static int tdx_handle_map_gpa(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall) +{ +hwaddr shared_bit = tdx_shared_bit(cpu); +hwaddr gpa = vmcall->in_r12 & ~shared_bit; +bool private = !(vmcall->in_r12 & shared_bit); +hwaddr size = vmcall->in_r13; +bool retry = false; +int ret = 0; + +vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND; + +if (!QEMU_IS_ALIGNED(gpa, 4096) || !QEMU_IS_ALIGNED(size, 4096)) { +vmcall->status_code = TDG_VP_VMCALL_ALIGN_ERROR; +return 0; +} + +/* Overflow case. */ +if (gpa + size < gpa) { +return 0; +} +if (gpa >= (1ULL << cpu->phys_bits) || +gpa + size >= (1ULL << cpu->phys_bits)) { +return 0; +} + +if (size > TDX_MAP_GPA_MAX_LEN) { +retry = true; +size = TDX_MAP_GPA_MAX_LEN; +} + +if (size > 0) { +ret = kvm_convert_memory(gpa, size, private); +} + +if (!ret) { +if (retry) { +vmcall->status_code = TDG_VP_VMCALL_RETRY; +vmcall->out_r11 = gpa + size; +if (!private) { +vmcall->out_r11 |= shared_bit; +} +} else { +vmcall->status_code = TDG_VP_VMCALL_SUCCESS; +} +} + +return 0; +} + static void tdx_get_quote_completion(struct tdx_generate_quote_task *task) { int ret; @@ -1070,6 +1122,8 @@ static int tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall) } switch (vmcall->subfunction) { +case TDG_VP_VMCALL_MAP_GPA: +return tdx_handle_map_gpa(cpu, vmcall); case TDG_VP_VMCALL_GET_QUOTE: return tdx_handle_get_quote(cpu, vmcall); case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT: diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h index b7ce793d6037..b6b8742e79f7 100644 --- a/target/i386/kvm/tdx.h +++ b/target/i386/kvm/tdx.h @@ -18,6 +18,7 @@ typedef struct TdxGuestClass { ConfidentialGuestSupportClass parent_class; } TdxGuestClass; +#define TDG_VP_VMCALL_MAP_GPA 0x10001ULL #define TDG_VP_VMCALL_GET_QUOTE 0x10002ULL #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004ULL -- 2.34.1
[PATCH v5 55/65] i386/tdx: Disable SMM for TDX VMs
TDX doesn't support SMM and VMM cannot emulate SMM for TDX VMs because VMM cannot manipulate TDX VM's memory. Disable SMM for TDX VMs and error out if user requests to enable SMM. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/tdx.c | 8 1 file changed, 8 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 811a3b81af99..c3fadbc5c58e 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -707,11 +707,19 @@ static Notifier tdx_machine_done_notify = { static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) { MachineState *ms = MACHINE(qdev_get_machine()); +X86MachineState *x86ms = X86_MACHINE(ms); TdxGuest *tdx = TDX_GUEST(cgs); int r = 0; ms->require_guest_memfd = true; +if (x86ms->smm == ON_OFF_AUTO_AUTO) { +x86ms->smm = ON_OFF_AUTO_OFF; +} else if (x86ms->smm == ON_OFF_AUTO_ON) { +error_setg(errp, "TDX VM doesn't support SMM"); +return -EINVAL; +} + if (!tdx_caps) { r = get_tdx_capabilities(errp); if (r) { -- 2.34.1
[PATCH v5 47/65] i386/tdx: Finalize TDX VM
Invoke KVM_TDX_FINALIZE_VM to finalize the TD's measurement and make the TD vCPUs runnable once machine initialization is complete. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/tdx.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 4625f806920e..d445d4b70f77 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -685,6 +685,13 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused) */ ram_block = tdx_guest->tdvf_mr->ram_block; ram_block_discard_range(ram_block, 0, ram_block->max_length); + +r = tdx_vm_ioctl(KVM_TDX_FINALIZE_VM, 0, NULL); +if (r < 0) { +error_report("KVM_TDX_FINALIZE_VM failed %s", strerror(-r)); +exit(0); +} +tdx_guest->parent_obj.ready = true; } static Notifier tdx_machine_done_notify = { -- 2.34.1
[PATCH v5 37/65] i386/tdvf: Introduce function to parse TDVF metadata
From: Isaku Yamahata TDX VM needs to boot with its specialized firmware, Trusted Domain Virtual Firmware (TDVF). QEMU needs to parse TDVF and map it in TD guest memory prior to running the TDX VM. A TDVF Metadata in TDVF image describes the structure of firmware. QEMU refers to it to setup memory for TDVF. Introduce function tdvf_parse_metadata() to parse the metadata from TDVF image and store the info of each TDVF section. TDX metadata is located by a TDX metadata offset block, which is a GUID-ed structure. The data portion of the GUID structure contains only an 4-byte field that is the offset of TDX metadata to the end of firmware file. Select X86_FW_OVMF when TDX is enable to leverage existing functions to parse and search OVMF's GUID-ed structures. Signed-off-by: Isaku Yamahata Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- Changes in v1: - rename tdvf_parse_section_entry() to tdvf_parse_and_check_section_entry() Changes in RFC v4: - rename TDX_METADATA_GUID to TDX_METADATA_OFFSET_GUID --- hw/i386/Kconfig| 1 + hw/i386/meson.build| 1 + hw/i386/tdvf.c | 199 + include/hw/i386/tdvf.h | 51 +++ 4 files changed, 252 insertions(+) create mode 100644 hw/i386/tdvf.c create mode 100644 include/hw/i386/tdvf.h diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig index c0ccf50ac3ef..4e6c8905f077 100644 --- a/hw/i386/Kconfig +++ b/hw/i386/Kconfig @@ -12,6 +12,7 @@ config SGX config TDX bool +select X86_FW_OVMF depends on KVM config PC diff --git a/hw/i386/meson.build b/hw/i386/meson.build index b9c1ca39cb05..f09441c1ea54 100644 --- a/hw/i386/meson.build +++ b/hw/i386/meson.build @@ -28,6 +28,7 @@ i386_ss.add(when: 'CONFIG_PC', if_true: files( 'port92.c')) i386_ss.add(when: 'CONFIG_X86_FW_OVMF', if_true: files('pc_sysfw_ovmf.c'), if_false: files('pc_sysfw_ovmf-stubs.c')) +i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c')) subdir('kvm') subdir('xen') diff --git a/hw/i386/tdvf.c b/hw/i386/tdvf.c new file mode 100644 index ..ff51f40088f0 --- /dev/null +++ b/hw/i386/tdvf.c @@ -0,0 +1,199 @@ +/* + * SPDX-License-Identifier: GPL-2.0-or-later + + * Copyright (c) 2020 Intel Corporation + * Author: Isaku Yamahata + * + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see <http://www.gnu.org/licenses/>. + */ + +#include "qemu/osdep.h" +#include "qemu/error-report.h" + +#include "hw/i386/pc.h" +#include "hw/i386/tdvf.h" +#include "sysemu/kvm.h" + +#define TDX_METADATA_OFFSET_GUID"e47a6535-984a-4798-865e-4685a7bf8ec2" +#define TDX_METADATA_VERSION1 +#define TDVF_SIGNATURE 0x46564454 /* TDVF as little endian */ + +typedef struct { +uint32_t DataOffset; +uint32_t RawDataSize; +uint64_t MemoryAddress; +uint64_t MemoryDataSize; +uint32_t Type; +uint32_t Attributes; +} TdvfSectionEntry; + +typedef struct { +uint32_t Signature; +uint32_t Length; +uint32_t Version; +uint32_t NumberOfSectionEntries; +TdvfSectionEntry SectionEntries[]; +} TdvfMetadata; + +struct tdx_metadata_offset { +uint32_t offset; +}; + +static TdvfMetadata *tdvf_get_metadata(void *flash_ptr, int size) +{ +TdvfMetadata *metadata; +uint32_t offset = 0; +uint8_t *data; + +if ((uint32_t) size != size) { +return NULL; +} + +if (pc_system_ovmf_table_find(TDX_METADATA_OFFSET_GUID, , NULL)) { +offset = size - le32_to_cpu(((struct tdx_metadata_offset *)data)->offset); + +if (offset + sizeof(*metadata) > size) { +return NULL; +} +} else { +error_report("Cannot find TDX_METADATA_OFFSET_GUID"); +return NULL; +} + +metadata = flash_ptr + offset; + +/* Finally, verify the signature to determine if this is a TDVF image. */ +metadata->Signature = le32_to_cpu(metadata->Signature); +if (metadata->Signature != TDVF_SIGNATURE) { +error_report("Invalid TDVF signature in metadata!"); +return NULL; +} + +/* Sanity check that the TDVF doesn't overlap its own metadata. */ +metadata->Length = le32_to_cpu(metadata->Length); +if (offset + metadata->Length > size) { +re
[PATCH v5 19/65] i386/tdx: Update tdx_cpuid_lookup[].tdx_fixed0/1 by tdx_caps.cpuid_config[]
tdx_cpuid_lookup[].tdx_fixed0/1 is QEMU maintained data which reflects TDX restrictions regrading what bits are fixed by TDX module. It's retrieved from TDX spec and static. However, TDX may evolve and change some fixed fields to configurable in the future. Update tdx_cpuid.lookup[].tdx_fixed0/1 fields by removing the bits that reported from TDX module as configurable. This can adapt with the updated TDX (module) automatically. Signed-off-by: Xiaoyao Li --- target/i386/kvm/tdx.c | 34 ++ 1 file changed, 34 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 239170142e4f..424c0f3c0fbb 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -377,6 +377,38 @@ static int get_tdx_capabilities(Error **errp) return 0; } +static void update_tdx_cpuid_lookup_by_tdx_caps(void) +{ +KvmTdxCpuidLookup *entry; +FeatureWordInfo *fi; +uint32_t config; +FeatureWord w; + +for (w = 0; w < FEATURE_WORDS; w++) { +fi = _word_info[w]; +entry = _cpuid_lookup[w]; + +if (fi->type != CPUID_FEATURE_WORD) { +continue; +} + +config = tdx_cap_cpuid_config(fi->cpuid.eax, + fi->cpuid.needs_ecx ? fi->cpuid.ecx : ~0u, + fi->cpuid.reg); + +if (!config) { +continue; +} + +/* + * Remove the configurable bits from tdx_fixed0/1 in case QEMU + * maintained fixed0/1 values is outdated to TDX module. + */ +entry->tdx_fixed0 &= ~config; +entry->tdx_fixed1 &= ~config; +} +} + static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) { MachineState *ms = MACHINE(qdev_get_machine()); @@ -392,6 +424,8 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) } } +update_tdx_cpuid_lookup_by_tdx_caps(); + tdx_guest = tdx; return 0; } -- 2.34.1
[PATCH v5 63/65] i386/tdx: Skip kvm_put_apicbase() for TDs
KVM doesn't allow wirting to MSR_IA32_APICBASE for TDs. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/kvm.c | 5 + 1 file changed, 5 insertions(+) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index d23f94b77257..31aed1c9aae0 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -3052,6 +3052,11 @@ void kvm_put_apicbase(X86CPU *cpu, uint64_t value) { int ret; +/* TODO: Allow accessing guest state for debug TDs. */ +if (is_tdx_vm()) { +return; +} + ret = kvm_put_one_msr(cpu, MSR_IA32_APICBASE, value); assert(ret == 1); } -- 2.34.1
[PATCH v5 24/65] i386/tdx: Initialize TDX before creating TD vcpus
Invoke KVM_TDX_INIT in kvm_arch_pre_create_vcpu() that KVM_TDX_INIT configures global TD configurations, e.g. the canonical CPUID config, and must be executed prior to creating vCPUs. Use kvm_x86_arch_cpuid() to setup the CPUID settings for TDX VM. Note, this doesn't address the fact that QEMU may change the CPUID configuration when creating vCPUs, i.e. punts on refactoring QEMU to provide a stable CPUID config prior to kvm_arch_init(). Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann Acked-by: Markus Armbruster --- Changes in v4: - mark init_vm with g_autofree() and use QEMU_LOCK_GUARD() to eliminate the goto labels; (Daniel) Changes in v3: - Pass @errp in tdx_pre_create_vcpu() and pass error info to it. (Daniel) --- accel/kvm/kvm-all.c | 9 +++- target/i386/kvm/kvm.c | 9 target/i386/kvm/meson.build | 2 +- target/i386/kvm/tdx-stub.c | 8 target/i386/kvm/tdx.c | 41 + target/i386/kvm/tdx.h | 4 6 files changed, 71 insertions(+), 2 deletions(-) create mode 100644 target/i386/kvm/tdx-stub.c diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index a8a99d48e4ce..c9df41efa484 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -431,8 +431,15 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp) trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu)); +/* + * tdx_pre_create_vcpu() may call cpu_x86_cpuid(). It in turn may call + * kvm_vm_ioctl(). Set cpu->kvm_state in advance to avoid NULL pointer + * dereference. + */ +cpu->kvm_state = s; ret = kvm_arch_pre_create_vcpu(cpu, errp); if (ret < 0) { +cpu->kvm_state = NULL; goto err; } @@ -440,11 +447,11 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp) if (ret < 0) { error_setg_errno(errp, -ret, "kvm_init_vcpu: kvm_get_vcpu failed (%lu)", kvm_arch_vcpu_id(cpu)); +cpu->kvm_state = NULL; goto err; } cpu->kvm_fd = ret; -cpu->kvm_state = s; cpu->vcpu_dirty = true; cpu->dirty_pages = 0; cpu->throttle_us_per_full = 0; diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 315998c8f7e5..1664ac49005e 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -2271,6 +2271,15 @@ int kvm_arch_init_vcpu(CPUState *cs) return r; } +int kvm_arch_pre_create_vcpu(CPUState *cpu, Error **errp) +{ +if (is_tdx_vm()) { +return tdx_pre_create_vcpu(cpu, errp); +} + +return 0; +} + int kvm_arch_destroy_vcpu(CPUState *cs) { X86CPU *cpu = X86_CPU(cs); diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build index 26a1ab038513..460c5f8f85f3 100644 --- a/target/i386/kvm/meson.build +++ b/target/i386/kvm/meson.build @@ -7,7 +7,7 @@ i386_kvm_ss.add(files( i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c')) -i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c')) +i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c'), if_false: files('tdx-stub.c')) i386_system_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'), if_false: files('hyperv-stub.c')) diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c new file mode 100644 index ..b614b46d3f4a --- /dev/null +++ b/target/i386/kvm/tdx-stub.c @@ -0,0 +1,8 @@ +#include "qemu/osdep.h" + +#include "tdx.h" + +int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) +{ +return -EINVAL; +} diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 144acd8c9912..d548ec340285 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -475,6 +475,45 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) return 0; } +int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) +{ +MachineState *ms = MACHINE(qdev_get_machine()); +X86CPU *x86cpu = X86_CPU(cpu); +CPUX86State *env = >env; +g_autofree struct kvm_tdx_init_vm *init_vm = NULL; +int r = 0; + +QEMU_LOCK_GUARD(_guest->lock); +if (tdx_guest->initialized) { +return r; +} + +init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) + +sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES); + +r = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPUS, 0, ms->smp.cpus); +if (r < 0) { +error_setg(errp, "Unable to set MAX VCPUS to %d", ms->smp.cpus); +return r; +} + +init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0); + +init_vm->attributes = tdx_guest->attributes; + +do { +r = tdx_vm_ioctl(KVM_TDX_INIT_VM, 0, init_vm); +} while (r == -EAGAIN); +if (r < 0) { +error_setg_errno(errp, -r, "KVM_TDX_INIT_VM failed"); +return r; +} + +tdx_guest->initialized = true; + +return 0; +} + /* tdx guest
[PATCH v5 46/65] i386/tdx: Call KVM_TDX_INIT_VCPU to initialize TDX vcpu
TDX vcpu needs to be initialized by SEAMCALL(TDH.VP.INIT) and KVM provides vcpu level IOCTL KVM_TDX_INIT_VCPU for it. KVM_TDX_INIT_VCPU needs the address of the HOB as input. Invoke it for each vcpu after HOB list is created. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/tdx.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index dcabe359eda5..4625f806920e 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -591,6 +591,22 @@ static void tdx_init_ram_entries(void) tdx_guest->nr_ram_entries = j; } +static void tdx_post_init_vcpus(void) +{ +TdxFirmwareEntry *hob; +CPUState *cpu; +int r; + +hob = tdx_get_hob_entry(tdx_guest); +CPU_FOREACH(cpu) { +r = tdx_vcpu_ioctl(cpu, KVM_TDX_INIT_VCPU, 0, (void *)hob->address); +if (r < 0) { +error_report("KVM_TDX_INIT_VCPU failed %s", strerror(-r)); +exit(1); +} +} +} + static void tdx_finalize_vm(Notifier *notifier, void *unused) { TdxFirmware *tdvf = _guest->tdvf; @@ -623,6 +639,8 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused) tdvf_hob_create(tdx_guest, tdx_get_hob_entry(tdx_guest)); +tdx_post_init_vcpus(); + for_each_tdx_fw_entry(tdvf, entry) { struct kvm_memory_mapping mapping = { .base_gfn = entry->address >> 12, -- 2.34.1
[PATCH v5 33/65] kvm/tdx: Don't complain when converting vMMIO region to shared
From: Isaku Yamahata Because vMMIO region needs to be shared region, guest TD may explicitly convert such region from private to shared. Don't complain such conversion. Signed-off-by: Isaku Yamahata Signed-off-by: Xiaoyao Li --- accel/kvm/kvm-all.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index c9df41efa484..d533e2611ad8 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2985,9 +2985,22 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) ret = ram_block_discard_guest_memfd_range(rb, offset, size); } } else { -error_report("Convert non guest_memfd backed memory region " -"(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s", -start, size, to_private ? "private" : "shared"); +/* + * Because vMMIO region must be shared, guest TD may convert vMMIO + * region to shared explicitly. Don't complain such case. See + * memory_region_type() for checking if the region is MMIO region. + */ +if (!to_private && +!memory_region_is_ram(mr) && +!memory_region_is_ram_device(mr) && +!memory_region_is_rom(mr) && +!memory_region_is_romd(mr)) { + ret = 0; +} else { +error_report("Convert non guest_memfd backed memory region " +"(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s", +start, size, to_private ? "private" : "shared"); +} } memory_region_unref(section.mr); -- 2.34.1
[PATCH v5 49/65] i386/tdx: handle TDG.VP.VMCALL
From: Isaku Yamahata Add property "quote-generation-socket" to tdx-guest, which is a property of type SocketAddress to specify Quote Generation Service(QGS). On request of GetQuote, it connects to the QGS socket, read request data from shared guest memory, send the request data to the QGS, and store the response into shared guest memory, at last notify TD guest by interrupt. command line example: qemu-system-x86_64 \ -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"1","port":"1234"}}' \ -machine confidential-guest-support=tdx0 Note, above example uses vsock type socket because the QGS we used implements the vsock socket. It can be other types, like UNIX socket, which depends on the implementation of QGS. To avoid no response from QGS server, setup a timer for the transaction. If timeout, make it an error and interrupt guest. Define the threshold of time to 30s at present, maybe change to other value if not appropriate. Signed-off-by: Isaku Yamahata Codeveloped-by: Chenyi Qiang Signed-off-by: Chenyi Qiang Codeveloped-by: Xiaoyao Li Signed-off-by: Xiaoyao Li --- Changes in v5: - add more decription of quote-generation-socket property; Changes in v4: - merge next patch "i386/tdx: setup a timer for the qio channel"; Changes in v3: - rename property "quote-generation-service" to "quote-generation-socket"; - change the type of "quote-generation-socket" from str to SocketAddress; - squash next patch into this one; --- qapi/qom.json | 8 +- target/i386/kvm/meson.build | 2 +- target/i386/kvm/tdx-quote-generator.c | 170 target/i386/kvm/tdx-quote-generator.h | 95 +++ target/i386/kvm/tdx.c | 216 ++ target/i386/kvm/tdx.h | 6 + 6 files changed, 495 insertions(+), 2 deletions(-) create mode 100644 target/i386/kvm/tdx-quote-generator.c create mode 100644 target/i386/kvm/tdx-quote-generator.h diff --git a/qapi/qom.json b/qapi/qom.json index cac875349a3a..7b26b0a0d3aa 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -917,13 +917,19 @@ # (base64 encoded SHA384 digest). (A default value 0 of SHA384 is # used when absent). # +# @quote-generation-socket: socket address for Quote Generation +# Service (QGS). QGS is a daemon running on the host. User in +# TD guest cannot get TD quoting for attestation if QGS is not +# provided. So admin should always provide it. +# # Since: 9.0 ## { 'struct': 'TdxGuestProperties', 'data': { '*sept-ve-disable': 'bool', '*mrconfigid': 'str', '*mrowner': 'str', -'*mrownerconfig': 'str' } } +'*mrownerconfig': 'str', +'*quote-generation-socket': 'SocketAddress' } } ## # @ThreadContextProperties: diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build index 460c5f8f85f3..d38b8e8e5fca 100644 --- a/target/i386/kvm/meson.build +++ b/target/i386/kvm/meson.build @@ -7,7 +7,7 @@ i386_kvm_ss.add(files( i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c')) -i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c'), if_false: files('tdx-stub.c')) +i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c', 'tdx-quote-generator.c'), if_false: files('tdx-stub.c')) i386_system_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'), if_false: files('hyperv-stub.c')) diff --git a/target/i386/kvm/tdx-quote-generator.c b/target/i386/kvm/tdx-quote-generator.c new file mode 100644 index ..057ad09e7e95 --- /dev/null +++ b/target/i386/kvm/tdx-quote-generator.c @@ -0,0 +1,170 @@ +/* + * QEMU TDX support + * + * Copyright Intel + * + * Author: + * Xiaoyao Li + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory + * + */ + +#include "qemu/osdep.h" +#include "qemu/error-report.h" +#include "qapi/error.h" +#include "qapi/qapi-visit-sockets.h" + +#include "tdx-quote-generator.h" + +typedef struct TdxQuoteGeneratorClass { +DeviceClass parent_class; +} TdxQuoteGeneratorClass; + +OBJECT_DEFINE_TYPE(TdxQuoteGenerator, tdx_quote_generator, TDX_QUOTE_GENERATOR, OBJECT) + +static void tdx_quote_generator_finalize(Object *obj) +{ +} + +static void tdx_quote_generator_class_init(ObjectClass *oc, void *data) +{ +} + +static void tdx_quote_generator_init(Object *obj) +{ +} + +static void tdx_generate_quote_cleanup(struct tdx_generate_quote_task *task) +{ +timer_del(>timer); + +g_source_remove(task->watch); +qio_channel_close(QIO_CHANNEL(task->sioc), NULL); +object_unref(OBJECT(task->sioc)); + +/* Maintain the number of in-flight requests. */
[PATCH v5 62/65] i386/tdx: Only configure MSR_IA32_UCODE_REV in kvm_init_msrs() for TDs
For TDs, only MSR_IA32_UCODE_REV in kvm_init_msrs() can be configured by VMM, while the features enumerated/controlled by other MSRs except MSR_IA32_UCODE_REV in kvm_init_msrs() are not under control of VMM. Only configure MSR_IA32_UCODE_REV for TDs. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/kvm.c | 44 ++- 1 file changed, 23 insertions(+), 21 deletions(-) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 83276e23b19f..d23f94b77257 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -3270,32 +3270,34 @@ static void kvm_init_msrs(X86CPU *cpu) CPUX86State *env = >env; kvm_msr_buf_reset(cpu); -if (has_msr_arch_capabs) { -kvm_msr_entry_add(cpu, MSR_IA32_ARCH_CAPABILITIES, - env->features[FEAT_ARCH_CAPABILITIES]); -} - -if (has_msr_core_capabs) { -kvm_msr_entry_add(cpu, MSR_IA32_CORE_CAPABILITY, - env->features[FEAT_CORE_CAPABILITY]); -} - -if (has_msr_perf_capabs && cpu->enable_pmu) { -kvm_msr_entry_add_perf(cpu, env->features); + +if (!is_tdx_vm()) { +if (has_msr_arch_capabs) { +kvm_msr_entry_add(cpu, MSR_IA32_ARCH_CAPABILITIES, +env->features[FEAT_ARCH_CAPABILITIES]); +} + +if (has_msr_core_capabs) { +kvm_msr_entry_add(cpu, MSR_IA32_CORE_CAPABILITY, +env->features[FEAT_CORE_CAPABILITY]); +} + +if (has_msr_perf_capabs && cpu->enable_pmu) { +kvm_msr_entry_add_perf(cpu, env->features); +} + +/* + * Older kernels do not include VMX MSRs in KVM_GET_MSR_INDEX_LIST, but + * all kernels with MSR features should have them. + */ +if (kvm_feature_msrs && cpu_has_vmx(env)) { +kvm_msr_entry_add_vmx(cpu, env->features); +} } if (has_msr_ucode_rev) { kvm_msr_entry_add(cpu, MSR_IA32_UCODE_REV, cpu->ucode_rev); } - -/* - * Older kernels do not include VMX MSRs in KVM_GET_MSR_INDEX_LIST, but - * all kernels with MSR features should have them. - */ -if (kvm_feature_msrs && cpu_has_vmx(env)) { -kvm_msr_entry_add_vmx(cpu, env->features); -} - assert(kvm_buf_set_msrs(cpu) == 0); } -- 2.34.1
[PATCH v5 64/65] i386/tdx: Don't get/put guest state for TDX VMs
From: Sean Christopherson Don't get/put state of TDX VMs since accessing/mutating guest state of production TDs is not supported. Note, it will be allowed for a debug TD. Corresponding support will be introduced when debug TD support is implemented in the future. Signed-off-by: Sean Christopherson Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/kvm.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 31aed1c9aae0..39113718ea14 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -4620,6 +4620,11 @@ int kvm_arch_put_registers(CPUState *cpu, int level) assert(cpu_is_stopped(cpu) || qemu_cpu_is_self(cpu)); +/* TODO: Allow accessing guest state for debug TDs. */ +if (is_tdx_vm()) { +return 0; +} + /* * Put MSR_IA32_FEATURE_CONTROL first, this ensures the VM gets out of VMX * root operation upon vCPU reset. kvm_put_msr_feature_control() should also @@ -4720,6 +4725,12 @@ int kvm_arch_get_registers(CPUState *cs) if (ret < 0) { goto out; } + +/* TODO: Allow accessing guest state for debug TDs. */ +if (is_tdx_vm()) { +return 0; +} + ret = kvm_getput_regs(cpu, 0); if (ret < 0) { goto out; -- 2.34.1
[PATCH v5 44/65] i386/tdx: Setup the TD HOB list
The TD HOB list is used to pass the information from VMM to TDVF. The TD HOB must include PHIT HOB and Resource Descriptor HOB. More details can be found in TDVF specification and PI specification. Build the TD HOB in TDX's machine_init_done callback. Co-developed-by: Isaku Yamahata Signed-off-by: Isaku Yamahata Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- Changes in v1: - drop the code of adding mmio resources since OVMF prepares all the MMIO hob itself. --- hw/i386/meson.build | 2 +- hw/i386/tdvf-hob.c| 147 ++ hw/i386/tdvf-hob.h| 24 +++ target/i386/kvm/tdx.c | 16 + 4 files changed, 188 insertions(+), 1 deletion(-) create mode 100644 hw/i386/tdvf-hob.c create mode 100644 hw/i386/tdvf-hob.h diff --git a/hw/i386/meson.build b/hw/i386/meson.build index f09441c1ea54..2345383940ab 100644 --- a/hw/i386/meson.build +++ b/hw/i386/meson.build @@ -28,7 +28,7 @@ i386_ss.add(when: 'CONFIG_PC', if_true: files( 'port92.c')) i386_ss.add(when: 'CONFIG_X86_FW_OVMF', if_true: files('pc_sysfw_ovmf.c'), if_false: files('pc_sysfw_ovmf-stubs.c')) -i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c')) +i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c', 'tdvf-hob.c')) subdir('kvm') subdir('xen') diff --git a/hw/i386/tdvf-hob.c b/hw/i386/tdvf-hob.c new file mode 100644 index ..0da6ff2df576 --- /dev/null +++ b/hw/i386/tdvf-hob.c @@ -0,0 +1,147 @@ +/* + * SPDX-License-Identifier: GPL-2.0-or-later + + * Copyright (c) 2020 Intel Corporation + * Author: Isaku Yamahata + * + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see <http://www.gnu.org/licenses/>. + */ + +#include "qemu/osdep.h" +#include "qemu/log.h" +#include "qemu/error-report.h" +#include "e820_memory_layout.h" +#include "hw/i386/pc.h" +#include "hw/i386/x86.h" +#include "hw/pci/pcie_host.h" +#include "sysemu/kvm.h" +#include "standard-headers/uefi/uefi.h" +#include "tdvf-hob.h" + +typedef struct TdvfHob { +hwaddr hob_addr; +void *ptr; +int size; + +/* working area */ +void *current; +void *end; +} TdvfHob; + +static uint64_t tdvf_current_guest_addr(const TdvfHob *hob) +{ +return hob->hob_addr + (hob->current - hob->ptr); +} + +static void tdvf_align(TdvfHob *hob, size_t align) +{ +hob->current = QEMU_ALIGN_PTR_UP(hob->current, align); +} + +static void *tdvf_get_area(TdvfHob *hob, uint64_t size) +{ +void *ret; + +if (hob->current + size > hob->end) { +error_report("TD_HOB overrun, size = 0x%" PRIx64, size); +exit(1); +} + +ret = hob->current; +hob->current += size; +tdvf_align(hob, 8); +return ret; +} + +static void tdvf_hob_add_memory_resources(TdxGuest *tdx, TdvfHob *hob) +{ +EFI_HOB_RESOURCE_DESCRIPTOR *region; +EFI_RESOURCE_ATTRIBUTE_TYPE attr; +EFI_RESOURCE_TYPE resource_type; + +TdxRamEntry *e; +int i; + +for (i = 0; i < tdx->nr_ram_entries; i++) { +e = >ram_entries[i]; + +if (e->type == TDX_RAM_UNACCEPTED) { +resource_type = EFI_RESOURCE_MEMORY_UNACCEPTED; +attr = EFI_RESOURCE_ATTRIBUTE_TDVF_UNACCEPTED; +} else if (e->type == TDX_RAM_ADDED){ +resource_type = EFI_RESOURCE_SYSTEM_MEMORY; +attr = EFI_RESOURCE_ATTRIBUTE_TDVF_PRIVATE; +} else { +error_report("unknown TDX_RAM_ENTRY type %d", e->type); +exit(1); +} + +region = tdvf_get_area(hob, sizeof(*region)); +*region = (EFI_HOB_RESOURCE_DESCRIPTOR) { +.Header = { +.HobType = EFI_HOB_TYPE_RESOURCE_DESCRIPTOR, +.HobLength = cpu_to_le16(sizeof(*region)), +.Reserved = cpu_to_le32(0), +}, +.Owner = EFI_HOB_OWNER_ZERO, +.ResourceType = cpu_to_le32(resource_type), +.ResourceAttribute = cpu_to_le32(attr), +.PhysicalStart = cpu_to_le64(e->address), +.ResourceLength = cpu_to_le64(e->length), +}; +} +} + +void tdvf_hob_create(TdxGuest *tdx, TdxFirmwareEntry *td_hob
[PATCH v5 61/65] i386/tdx: Don't synchronize guest tsc for TDs
From: Isaku Yamahata TSC of TDs is not accessible and KVM doesn't allow access of MSR_IA32_TSC for TDs. To avoid the assert() in kvm_get_tsc, make kvm_synchronize_all_tsc() noop for TDs, Signed-off-by: Isaku Yamahata Reviewed-by: Connor Kuehl Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/kvm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index b1b0384b0c5c..83276e23b19f 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -277,7 +277,7 @@ void kvm_synchronize_all_tsc(void) { CPUState *cpu; -if (kvm_enabled()) { +if (kvm_enabled() && !is_tdx_vm()) { CPU_FOREACH(cpu) { run_on_cpu(cpu, do_kvm_synchronize_tsc, RUN_ON_CPU_NULL); } -- 2.34.1
[PATCH v5 41/65] i386/tdx: Track mem_ptr for each firmware entry of TDVF
For each TDVF sections, QEMU needs to copy the content to guest private memory via KVM API (KVM_TDX_INIT_MEM_REGION). Introduce a field @mem_ptr for TdxFirmwareEntry to track the memory pointer of each TDVF sections. So that QEMU can add/copy them to guest private memory later. TDVF sections can be classified into two groups: - Firmware itself, e.g., TDVF BFV and CFV, that located separately from guest RAM. Its memory pointer is the bios pointer. - Sections located at guest RAM, e.g., TEMP_MEM and TD_HOB. mmap a new memory range for them. Register a machine_init_done callback to do the stuff. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- hw/i386/tdvf.c | 1 + include/hw/i386/tdvf.h | 7 +++ target/i386/kvm/tdx.c | 31 +++ 3 files changed, 39 insertions(+) diff --git a/hw/i386/tdvf.c b/hw/i386/tdvf.c index ff51f40088f0..0a6445705160 100644 --- a/hw/i386/tdvf.c +++ b/hw/i386/tdvf.c @@ -189,6 +189,7 @@ int tdvf_parse_metadata(TdxFirmware *fw, void *flash_ptr, int size) } g_free(sections); +fw->mem_ptr = flash_ptr; return 0; err: diff --git a/include/hw/i386/tdvf.h b/include/hw/i386/tdvf.h index 593341eb2e93..d880af245a73 100644 --- a/include/hw/i386/tdvf.h +++ b/include/hw/i386/tdvf.h @@ -39,13 +39,20 @@ typedef struct TdxFirmwareEntry { uint64_t size; uint32_t type; uint32_t attributes; + +void *mem_ptr; } TdxFirmwareEntry; typedef struct TdxFirmware { +void *mem_ptr; + uint32_t nr_entries; TdxFirmwareEntry *entries; } TdxFirmware; +#define for_each_tdx_fw_entry(fw, e)\ +for (e = (fw)->entries; e != (fw)->entries + (fw)->nr_entries; e++) + int tdvf_parse_metadata(TdxFirmware *fw, void *flash_ptr, int size); #endif /* HW_I386_TDVF_H */ diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 2bb6e9e9c392..1b4bca9cc3cd 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -14,6 +14,7 @@ #include "qemu/osdep.h" #include "qemu/error-report.h" #include "qemu/base64.h" +#include "qemu/mmap-alloc.h" #include "qapi/error.h" #include "qom/object_interfaces.h" #include "standard-headers/asm-x86/kvm_para.h" @@ -22,6 +23,7 @@ #include "exec/ramblock.h" #include "hw/i386/x86.h" +#include "hw/i386/tdvf.h" #include "kvm_i386.h" #include "tdx.h" #include "../cpu-internal.h" @@ -470,6 +472,33 @@ void tdx_set_tdvf_region(MemoryRegion *tdvf_mr) tdx_guest->tdvf_mr = tdvf_mr; } +static void tdx_finalize_vm(Notifier *notifier, void *unused) +{ +TdxFirmware *tdvf = _guest->tdvf; +TdxFirmwareEntry *entry; + +for_each_tdx_fw_entry(tdvf, entry) { +switch (entry->type) { +case TDVF_SECTION_TYPE_BFV: +case TDVF_SECTION_TYPE_CFV: +entry->mem_ptr = tdvf->mem_ptr + entry->data_offset; +break; +case TDVF_SECTION_TYPE_TD_HOB: +case TDVF_SECTION_TYPE_TEMP_MEM: +entry->mem_ptr = qemu_ram_mmap(-1, entry->size, + qemu_real_host_page_size(), 0, 0); +break; +default: +error_report("Unsupported TDVF section %d", entry->type); +exit(1); +} +} +} + +static Notifier tdx_machine_done_notify = { +.notify = tdx_finalize_vm, +}; + static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) { MachineState *ms = MACHINE(qdev_get_machine()); @@ -496,6 +525,8 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) */ kvm_readonly_mem_allowed = false; +qemu_add_machine_init_done_notifier(_machine_done_notify); + tdx_guest = tdx; return 0; } -- 2.34.1
[PATCH v5 60/65] hw/i386: add option to forcibly report edge trigger in acpi tables
From: Isaku Yamahata When level trigger isn't supported on x86 platform, forcibly report edge trigger in acpi tables. Signed-off-by: Isaku Yamahata Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- hw/i386/acpi-build.c | 99 --- hw/i386/acpi-common.c | 45 +++- 2 files changed, 101 insertions(+), 43 deletions(-) diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c index 15242b9096b5..8b493a2e9605 100644 --- a/hw/i386/acpi-build.c +++ b/hw/i386/acpi-build.c @@ -964,7 +964,8 @@ static void build_dbg_aml(Aml *table) aml_append(table, scope); } -static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg) +static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg, + bool level_trigger_unsupported) { Aml *dev; Aml *crs; @@ -976,7 +977,10 @@ static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg) aml_append(dev, aml_name_decl("_UID", aml_int(uid))); crs = aml_resource_template(); -aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL, AML_ACTIVE_HIGH, +aml_append(crs, aml_interrupt(AML_CONSUMER, + level_trigger_unsupported ? + AML_EDGE : AML_LEVEL, + AML_ACTIVE_HIGH, AML_SHARED, irqs, ARRAY_SIZE(irqs))); aml_append(dev, aml_name_decl("_PRS", crs)); @@ -1000,7 +1004,8 @@ static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg) return dev; } -static Aml *build_gsi_link_dev(const char *name, uint8_t uid, uint8_t gsi) +static Aml *build_gsi_link_dev(const char *name, uint8_t uid, + uint8_t gsi, bool level_trigger_unsupported) { Aml *dev; Aml *crs; @@ -1013,7 +1018,10 @@ static Aml *build_gsi_link_dev(const char *name, uint8_t uid, uint8_t gsi) crs = aml_resource_template(); irqs = gsi; -aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL, AML_ACTIVE_HIGH, +aml_append(crs, aml_interrupt(AML_CONSUMER, + level_trigger_unsupported ? + AML_EDGE : AML_LEVEL, + AML_ACTIVE_HIGH, AML_SHARED, , 1)); aml_append(dev, aml_name_decl("_PRS", crs)); @@ -1032,7 +1040,7 @@ static Aml *build_gsi_link_dev(const char *name, uint8_t uid, uint8_t gsi) } /* _CRS method - get current settings */ -static Aml *build_iqcr_method(bool is_piix4) +static Aml *build_iqcr_method(bool is_piix4, bool level_trigger_unsupported) { Aml *if_ctx; uint32_t irqs; @@ -1040,7 +1048,9 @@ static Aml *build_iqcr_method(bool is_piix4) Aml *crs = aml_resource_template(); irqs = 0; -aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL, +aml_append(crs, aml_interrupt(AML_CONSUMER, + level_trigger_unsupported ? + AML_EDGE : AML_LEVEL, AML_ACTIVE_HIGH, AML_SHARED, , 1)); aml_append(method, aml_name_decl("PRR0", crs)); @@ -1074,7 +1084,7 @@ static Aml *build_irq_status_method(void) return method; } -static void build_piix4_pci0_int(Aml *table) +static void build_piix4_pci0_int(Aml *table, bool level_trigger_unsupported) { Aml *dev; Aml *crs; @@ -1087,12 +1097,16 @@ static void build_piix4_pci0_int(Aml *table) aml_append(sb_scope, pci0_scope); aml_append(sb_scope, build_irq_status_method()); -aml_append(sb_scope, build_iqcr_method(true)); +aml_append(sb_scope, build_iqcr_method(true, level_trigger_unsupported)); -aml_append(sb_scope, build_link_dev("LNKA", 0, aml_name("PRQ0"))); -aml_append(sb_scope, build_link_dev("LNKB", 1, aml_name("PRQ1"))); -aml_append(sb_scope, build_link_dev("LNKC", 2, aml_name("PRQ2"))); -aml_append(sb_scope, build_link_dev("LNKD", 3, aml_name("PRQ3"))); +aml_append(sb_scope, build_link_dev("LNKA", 0, aml_name("PRQ0"), +level_trigger_unsupported)); +aml_append(sb_scope, build_link_dev("LNKB", 1, aml_name("PRQ1"), +level_trigger_unsupported)); +aml_append(sb_scope, build_link_dev("LNKC", 2, aml_name("PRQ2"), +level_trigger_unsupported)); +aml_append(sb_scope, build_link_dev("LNKD", 3, aml_name("PRQ3"), +level_trigger_unsupported)); dev = aml_device("LNKS"); { @@ -1101,7 +1115,9 @@ static void build_piix4_pci0_int(Aml *table) crs = aml_resource_template(); irqs = 9; -aml_append(crs, aml_inte
[PATCH v5 56/65] i386/tdx: Disable PIC for TDX VMs
Legacy PIC (8259) cannot be supported for TDX VMs since TDX module doesn't allow directly interrupt injection. Using posted interrupts for the PIC is not a viable option as the guest BIOS/kernel will not do EOI for PIC IRQs, i.e. will leave the vIRR bit set. Hence disable PIC for TDX VMs and error out if user wants PIC. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/tdx.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index c3fadbc5c58e..0225a9b79b36 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -720,6 +720,13 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) return -EINVAL; } +if (x86ms->pic == ON_OFF_AUTO_AUTO) { +x86ms->pic = ON_OFF_AUTO_OFF; +} else if (x86ms->pic == ON_OFF_AUTO_ON) { +error_setg(errp, "TDX VM doesn't support PIC"); +return -EINVAL; +} + if (!tdx_caps) { r = get_tdx_capabilities(errp); if (r) { -- 2.34.1
[PATCH v5 57/65] i386/tdx: Don't allow system reset for TDX VMs
TDX CPU state is protected and thus vcpu state cann't be reset by VMM. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/kvm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 2748086231d5..b1b0384b0c5c 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -5673,7 +5673,7 @@ bool kvm_has_waitpkg(void) bool kvm_arch_cpu_check_are_resettable(void) { -return !sev_es_enabled(); +return !sev_es_enabled() && !is_tdx_vm(); } #define ARCH_REQ_XCOMP_GUEST_PERM 0x1025 -- 2.34.1
[PATCH v5 43/65] headers: Add definitions from UEFI spec for volumes, resources, etc...
Add UEFI definitions for literals, enums, structs, GUIDs, etc... that will be used by TDX to build the UEFI Hand-Off Block (HOB) that is passed to the Trusted Domain Virtual Firmware (TDVF). All values come from the UEFI specification [1], PI spec [2] and TDVF design guide[3]. [1] UEFI Specification v2.1.0 https://uefi.org/sites/default/files/resources/UEFI_Spec_2_10_Aug29.pdf [2] UEFI PI spec v1.8 https://uefi.org/sites/default/files/resources/UEFI_PI_Spec_1_8_March3.pdf [3] https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- include/standard-headers/uefi/uefi.h | 198 +++ 1 file changed, 198 insertions(+) create mode 100644 include/standard-headers/uefi/uefi.h diff --git a/include/standard-headers/uefi/uefi.h b/include/standard-headers/uefi/uefi.h new file mode 100644 index ..b15aba796156 --- /dev/null +++ b/include/standard-headers/uefi/uefi.h @@ -0,0 +1,198 @@ +/* + * Copyright (C) 2020 Intel Corporation + * + * Author: Isaku Yamahata + * + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see <http://www.gnu.org/licenses/>. + * + */ + +#ifndef HW_I386_UEFI_H +#define HW_I386_UEFI_H + +/***/ +/* + * basic EFI definitions + * supplemented with UEFI Specification Version 2.8 (Errata A) + * released February 2020 + */ +/* UEFI integer is little endian */ + +typedef struct { +uint32_t Data1; +uint16_t Data2; +uint16_t Data3; +uint8_t Data4[8]; +} EFI_GUID; + +typedef enum { +EfiReservedMemoryType, +EfiLoaderCode, +EfiLoaderData, +EfiBootServicesCode, +EfiBootServicesData, +EfiRuntimeServicesCode, +EfiRuntimeServicesData, +EfiConventionalMemory, +EfiUnusableMemory, +EfiACPIReclaimMemory, +EfiACPIMemoryNVS, +EfiMemoryMappedIO, +EfiMemoryMappedIOPortSpace, +EfiPalCode, +EfiPersistentMemory, +EfiUnacceptedMemoryType, +EfiMaxMemoryType +} EFI_MEMORY_TYPE; + +#define EFI_HOB_HANDOFF_TABLE_VERSION 0x0009 + +#define EFI_HOB_TYPE_HANDOFF 0x0001 +#define EFI_HOB_TYPE_MEMORY_ALLOCATION0x0002 +#define EFI_HOB_TYPE_RESOURCE_DESCRIPTOR 0x0003 +#define EFI_HOB_TYPE_GUID_EXTENSION 0x0004 +#define EFI_HOB_TYPE_FV 0x0005 +#define EFI_HOB_TYPE_CPU 0x0006 +#define EFI_HOB_TYPE_MEMORY_POOL 0x0007 +#define EFI_HOB_TYPE_FV2 0x0009 +#define EFI_HOB_TYPE_LOAD_PEIM_UNUSED 0x000A +#define EFI_HOB_TYPE_UEFI_CAPSULE 0x000B +#define EFI_HOB_TYPE_FV3 0x000C +#define EFI_HOB_TYPE_UNUSED 0xFFFE +#define EFI_HOB_TYPE_END_OF_HOB_LIST 0x + +typedef struct { +uint16_t HobType; +uint16_t HobLength; +uint32_t Reserved; +} EFI_HOB_GENERIC_HEADER; + +typedef uint64_t EFI_PHYSICAL_ADDRESS; +typedef uint32_t EFI_BOOT_MODE; + +typedef struct { +EFI_HOB_GENERIC_HEADER Header; +uint32_t Version; +EFI_BOOT_MODE BootMode; +EFI_PHYSICAL_ADDRESS EfiMemoryTop; +EFI_PHYSICAL_ADDRESS EfiMemoryBottom; +EFI_PHYSICAL_ADDRESS EfiFreeMemoryTop; +EFI_PHYSICAL_ADDRESS EfiFreeMemoryBottom; +EFI_PHYSICAL_ADDRESS EfiEndOfHobList; +} EFI_HOB_HANDOFF_INFO_TABLE; + +#define EFI_RESOURCE_SYSTEM_MEMORY 0x +#define EFI_RESOURCE_MEMORY_MAPPED_IO 0x0001 +#define EFI_RESOURCE_IO 0x0002 +#define EFI_RESOURCE_FIRMWARE_DEVICE0x0003 +#define EFI_RESOURCE_MEMORY_MAPPED_IO_PORT 0x0004 +#define EFI_RESOURCE_MEMORY_RESERVED0x0005 +#define EFI_RESOURCE_IO_RESERVED0x0006 +#define EFI_RESOURCE_MEMORY_UNACCEPTED 0x0007 +#define EFI_RESOURCE_MAX_MEMORY_TYPE0x0008 + +#define EFI_RESOURCE_ATTRIBUTE_PRESENT 0x0001 +#define EFI_RESOURCE_ATTRIBUTE_INITIALIZED 0x0002 +#define EFI_RESOURCE_ATTRIBUTE_TESTED 0x0004 +#define EFI_RESOURCE_ATTRIBUTE_SINGLE_BIT_ECC 0x0008 +#define EFI_RESOURCE_ATTRIBUTE_MULTIPLE_BIT_ECC 0x0010 +#define EFI_RESOURCE_ATTRIBUTE_ECC_RESERVED_1 0x0020 +#define EFI_RESOURCE_ATTRIBUTE_ECC_RESERVED_2 0x0040 +#define EFI_RESOURCE_ATTRIBUTE_READ_PROTECTED 0x00
[PATCH v5 45/65] i386/tdx: Populate TDVF private memory via KVM_MEMORY_MAPPING
From: Isaku Yamahata TDVF firmware (CODE and VARS) needs to be copied to TD's private memory, as well as TD HOB and TEMP memory. If the TDVF section has TDVF_SECTION_ATTRIBUTES_MR_EXTEND set in the flag, calling KVM_TDX_EXTEND_MEMORY to extend the measurement. After populating the TDVF memory, the original image located in shared ramblock can be discarded. Signed-off-by: Isaku Yamahata Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- Changes in v1: - rename variable @metadata to @flags --- target/i386/kvm/tdx.c | 47 +++ 1 file changed, 47 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index fb9c60172fde..dcabe359eda5 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -595,6 +595,8 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused) { TdxFirmware *tdvf = _guest->tdvf; TdxFirmwareEntry *entry; +RAMBlock *ram_block; +int r; tdx_init_ram_entries(); @@ -620,6 +622,51 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused) sizeof(TdxRamEntry), _ram_entry_compare); tdvf_hob_create(tdx_guest, tdx_get_hob_entry(tdx_guest)); + +for_each_tdx_fw_entry(tdvf, entry) { +struct kvm_memory_mapping mapping = { +.base_gfn = entry->address >> 12, +.nr_pages = entry->size >> 12, +.source = (__u64)entry->mem_ptr, +}; + +do { +r = kvm_vcpu_ioctl(first_cpu, KVM_MEMORY_MAPPING, ); +} while (r == -EAGAIN); + +if (r < 0) { + error_report("KVM_MEMORY_MAPPING failed %s", strerror(-r)); + exit(1); +} + +if (entry->attributes & TDVF_SECTION_ATTRIBUTES_MR_EXTEND) { +mapping = (struct kvm_memory_mapping) { +.base_gfn = entry->address >> 12, +.nr_pages = entry->size >> 12, +}; + +do { +r = tdx_vm_ioctl(KVM_TDX_EXTEND_MEMORY, 0, ); +} while (r == -EAGAIN); +if (r < 0) { +error_report("KVM_TDX_EXTEND_MEMORY failed %s", strerror(-r)); +exit(1); +} +} + +if (entry->type == TDVF_SECTION_TYPE_TD_HOB || +entry->type == TDVF_SECTION_TYPE_TEMP_MEM) { +qemu_ram_munmap(-1, entry->mem_ptr, entry->size); +entry->mem_ptr = NULL; +} +} + +/* + * TDVF image has been copied into private region above via + * KVM_MEMORY_MAPPING. It becomes useless. + */ +ram_block = tdx_guest->tdvf_mr->ram_block; +ram_block_discard_range(ram_block, 0, ram_block->max_length); } static Notifier tdx_machine_done_notify = { -- 2.34.1
[PATCH v5 31/65] i386/tdx: Implement user specified tsc frequency
Reuse "-cpu,tsc-frequency=" to get user wanted tsc frequency and call VM scope VM_SET_TSC_KHZ to set the tsc frequency of TD before KVM_TDX_INIT_VM. Besides, sanity check the tsc frequency to be in the legal range and legal granularity (required by TDX module). Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- Changes in v3: - use @errp to report error info; (Daniel) Changes in v1: - Use VM scope VM_SET_TSC_KHZ to set the TSC frequency of TD since KVM side drop the @tsc_khz field in struct kvm_tdx_init_vm --- target/i386/kvm/kvm.c | 9 + target/i386/kvm/tdx.c | 25 + 2 files changed, 34 insertions(+) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 1664ac49005e..4f998b2d6d37 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -793,6 +793,15 @@ static int kvm_arch_set_tsc_khz(CPUState *cs) int r, cur_freq; bool set_ioctl = false; +/* + * TSC of TD vcpu is immutable, it cannot be set/changed via vcpu scope + * VM_SET_TSC_KHZ, but only be initialized via VM scope VM_SET_TSC_KHZ + * before ioctl KVM_TDX_INIT_VM in tdx_pre_create_vcpu() + */ +if (is_tdx_vm()) { +return 0; +} + if (!env->tsc_khz) { return 0; } diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 4ce2f1d082ce..42dbb5ce5c15 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -33,6 +33,9 @@ (1U << KVM_FEATURE_PV_SCHED_YIELD) | \ (1U << KVM_FEATURE_MSI_EXT_DEST_ID)) +#define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000) +#define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000) + #define TDX_TD_ATTRIBUTES_DEBUG BIT_ULL(0) #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE BIT_ULL(28) #define TDX_TD_ATTRIBUTES_PKS BIT_ULL(30) @@ -568,6 +571,28 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) return r; } +if (env->tsc_khz && (env->tsc_khz < TDX_MIN_TSC_FREQUENCY_KHZ || + env->tsc_khz > TDX_MAX_TSC_FREQUENCY_KHZ)) { +error_setg(errp, "Invalid TSC %ld KHz, must specify cpu_frequency between [%d, %d] kHz", + env->tsc_khz, TDX_MIN_TSC_FREQUENCY_KHZ, + TDX_MAX_TSC_FREQUENCY_KHZ); + return -EINVAL; +} + +if (env->tsc_khz % (25 * 1000)) { +error_setg(errp, "Invalid TSC %ld KHz, it must be multiple of 25MHz", + env->tsc_khz); +return -EINVAL; +} + +/* it's safe even env->tsc_khz is 0. KVM uses host's tsc_khz in this case */ +r = kvm_vm_ioctl(kvm_state, KVM_SET_TSC_KHZ, env->tsc_khz); +if (r < 0) { +error_setg_errno(errp, -r, "Unable to set TSC frequency to %" PRId64 " kHz", + env->tsc_khz); +return r; +} + r = setup_td_guest_attributes(x86cpu, errp); if (r) { return r; -- 2.34.1
[PATCH v5 48/65] i386/tdx: handle TDG.VP.VMCALL
From: Isaku Yamahata For SetupEventNotifyInterrupt, record interrupt vector and the apic id of the vcpu that received this TDVMCALL. Later it can inject interrupt with given vector to the specific vcpu that received SetupEventNotifyInterrupt. Signed-off-by: Isaku Yamahata Signed-off-by: Xiaoyao Li --- target/i386/kvm/kvm.c | 8 ++ target/i386/kvm/tdx-stub.c | 5 target/i386/kvm/tdx.c | 53 ++ target/i386/kvm/tdx.h | 14 ++ 4 files changed, 80 insertions(+) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 4f998b2d6d37..2748086231d5 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -5413,6 +5413,14 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run) ret = kvm_xen_handle_exit(cpu, >xen); break; #endif +case KVM_EXIT_TDX: +if (!is_tdx_vm()) { +error_report("KVM: get KVM_EXIT_TDX for a non-TDX VM."); +ret = -1; +break; +} +ret = tdx_handle_exit(cpu, >tdx); +break; default: fprintf(stderr, "KVM: unknown exit reason %d\n", run->exit_reason); ret = -1; diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c index a064d583d393..57cd25793842 100644 --- a/target/i386/kvm/tdx-stub.c +++ b/target/i386/kvm/tdx-stub.c @@ -11,3 +11,8 @@ int tdx_parse_tdvf(void *flash_ptr, int size) { return -EINVAL; } + +int tdx_handle_exit(X86CPU *cpu, struct kvm_tdx_exit *tdx_exit) +{ +return -EINVAL; +} diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index d445d4b70f77..49f94d9d46f4 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -866,6 +866,56 @@ int tdx_parse_tdvf(void *flash_ptr, int size) return tdvf_parse_metadata(_guest->tdvf, flash_ptr, size); } +static int tdx_handle_setup_event_notify_interrupt(X86CPU *cpu, + struct kvm_tdx_vmcall *vmcall) +{ +int vector = vmcall->in_r12; + +if (32 <= vector && vector <= 255) { +qemu_mutex_lock(_guest->lock); +tdx_guest->event_notify_vector = vector; +tdx_guest->event_notify_apicid = cpu->apic_id; +qemu_mutex_unlock(_guest->lock); +vmcall->status_code = TDG_VP_VMCALL_SUCCESS; +} else { +vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND; +} + +return 0; +} + +static int tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall) +{ +vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND; + +/* For now handle only TDG.VP.VMCALL leaf defined in TDX GHCI */ +if (vmcall->type != 0) { +error_report("Unknown TDG.VP.VMCALL type 0x%llx subfunction 0x%llx", + vmcall->type, vmcall->subfunction); +return -1; +} + +switch (vmcall->subfunction) { +case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT: +return tdx_handle_setup_event_notify_interrupt(cpu, vmcall); +default: +error_report("Unknown TDG.VP.VMCALL type 0x%llx subfunction 0x%llx", + vmcall->type, vmcall->subfunction); +return -1; +} +} + +int tdx_handle_exit(X86CPU *cpu, struct kvm_tdx_exit *tdx_exit) +{ +switch (tdx_exit->type) { +case KVM_EXIT_TDX_VMCALL: +return tdx_handle_vmcall(cpu, _exit->u.vmcall); +default: +error_report("unknown tdx exit type 0x%x", tdx_exit->type); +return -1; +} +} + static bool tdx_guest_get_sept_ve_disable(Object *obj, Error **errp) { TdxGuest *tdx = TDX_GUEST(obj); @@ -956,6 +1006,9 @@ static void tdx_guest_init(Object *obj) object_property_add_str(obj, "mrownerconfig", tdx_guest_get_mrownerconfig, tdx_guest_set_mrownerconfig); + +tdx->event_notify_vector = -1; +tdx->event_notify_apicid = -1; } static void tdx_guest_finalize(Object *obj) diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h index 3fb4069268f6..b3d4462fe718 100644 --- a/target/i386/kvm/tdx.h +++ b/target/i386/kvm/tdx.h @@ -7,6 +7,7 @@ #include "exec/confidential-guest-support.h" #include "hw/i386/tdvf.h" +#include "sysemu/kvm.h" #define TYPE_TDX_GUEST "tdx-guest" #define TDX_GUEST(obj) OBJECT_CHECK(TdxGuest, (obj), TYPE_TDX_GUEST) @@ -15,6 +16,14 @@ typedef struct TdxGuestClass { ConfidentialGuestSupportClass parent_class; } TdxGuestClass; +#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004ULL + +#define TDG_VP_VMCALL_SUCCESS 0xULL +#define TDG_VP_VMCALL_RETRY 0x0001ULL +#define TDG_VP_VMCALL_INVALID_OPERAND 0x8000ULL +#define TDG_VP_VMCALL_GPA_INUSE 0x8001ULL +#define TDG_VP_VMCALL_ALIGN_ERROR 0x8
[PATCH v5 42/65] i386/tdx: Track RAM entries for TDX VM
The RAM of TDX VM can be classified into two types: - TDX_RAM_UNACCEPTED: default type of TDX memory, which needs to be accepted by TDX guest before it can be used and will be all-zeros after being accepted. - TDX_RAM_ADDED: the RAM that is ADD'ed to TD guest before running, and can be used directly. E.g., TD HOB and TEMP MEM that needed by TDVF. Maintain TdxRamEntries[] which grabs the initial RAM info from e820 table and mark each RAM range as default type TDX_RAM_UNACCEPTED. Then turn the range of TD HOB and TEMP MEM to TDX_RAM_ADDED since these ranges will be ADD'ed before TD runs and no need to be accepted runtime. The TdxRamEntries[] are later used to setup the memory TD resource HOB that passes memory info from QEMU to TDVF. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- Changes in v3: - use enum TdxRamType in struct TdxRamEntry; (Isaku) - Fix the indention; (Daniel) Changes in v1: - simplify the algorithm of tdx_accept_ram_range() (Suggested-by: Gerd Hoffman) (1) Change the existing entry to cover the accepted ram range. (2) If there is room before the accepted ram range add a TDX_RAM_UNACCEPTED entry for that. (3) If there is room after the accepted ram range add a TDX_RAM_UNACCEPTED entry for that. --- target/i386/kvm/tdx.c | 111 ++ target/i386/kvm/tdx.h | 14 ++ 2 files changed, 125 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 1b4bca9cc3cd..98b2cfd40651 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -22,6 +22,7 @@ #include "sysemu/sysemu.h" #include "exec/ramblock.h" +#include "hw/i386/e820_memory_layout.h" #include "hw/i386/x86.h" #include "hw/i386/tdvf.h" #include "kvm_i386.h" @@ -472,11 +473,117 @@ void tdx_set_tdvf_region(MemoryRegion *tdvf_mr) tdx_guest->tdvf_mr = tdvf_mr; } +static void tdx_add_ram_entry(uint64_t address, uint64_t length, + enum TdxRamType type) +{ +uint32_t nr_entries = tdx_guest->nr_ram_entries; +tdx_guest->ram_entries = g_renew(TdxRamEntry, tdx_guest->ram_entries, + nr_entries + 1); + +tdx_guest->ram_entries[nr_entries].address = address; +tdx_guest->ram_entries[nr_entries].length = length; +tdx_guest->ram_entries[nr_entries].type = type; +tdx_guest->nr_ram_entries++; +} + +static int tdx_accept_ram_range(uint64_t address, uint64_t length) +{ +uint64_t head_start, tail_start, head_length, tail_length; +uint64_t tmp_address, tmp_length; +TdxRamEntry *e; +int i; + +for (i = 0; i < tdx_guest->nr_ram_entries; i++) { +e = _guest->ram_entries[i]; + +if (address + length <= e->address || +e->address + e->length <= address) { +continue; +} + +/* + * The to-be-accepted ram range must be fully contained by one + * RAM entry. + */ +if (e->address > address || +e->address + e->length < address + length) { +return -EINVAL; +} + +if (e->type == TDX_RAM_ADDED) { +return -EINVAL; +} + +break; +} + +if (i == tdx_guest->nr_ram_entries) { +return -1; +} + +tmp_address = e->address; +tmp_length = e->length; + +e->address = address; +e->length = length; +e->type = TDX_RAM_ADDED; + +head_length = address - tmp_address; +if (head_length > 0) { +head_start = tmp_address; +tdx_add_ram_entry(head_start, head_length, TDX_RAM_UNACCEPTED); +} + +tail_start = address + length; +if (tail_start < tmp_address + tmp_length) { +tail_length = tmp_address + tmp_length - tail_start; +tdx_add_ram_entry(tail_start, tail_length, TDX_RAM_UNACCEPTED); +} + +return 0; +} + +static int tdx_ram_entry_compare(const void *lhs_, const void* rhs_) +{ +const TdxRamEntry *lhs = lhs_; +const TdxRamEntry *rhs = rhs_; + +if (lhs->address == rhs->address) { +return 0; +} +if (le64_to_cpu(lhs->address) > le64_to_cpu(rhs->address)) { +return 1; +} +return -1; +} + +static void tdx_init_ram_entries(void) +{ +unsigned i, j, nr_e820_entries; + +nr_e820_entries = e820_get_num_entries(); +tdx_guest->ram_entries = g_new(TdxRamEntry, nr_e820_entries); + +for (i = 0, j = 0; i < nr_e820_entries; i++) { +uint64_t addr, len; + +if (e820_get_entry(i, E820_RAM, , )) { +tdx_guest->ram_entries[j].address = addr; +tdx_guest->ram_entries[j].length = len; +tdx_guest->ram_entries[j].type = TDX_RAM_UNACCEPTED; +j++; +} +} +tdx_guest->nr_ram_entries = j; +} + static void tdx_finalize_vm(Notifier *not
[PATCH v5 58/65] i386/tdx: LMCE is not supported for TDX
LMCE is not supported TDX since KVM doesn't provide emulation for MSR_IA32_FEAT_CTL. Signed-off-by: Xiaoyao Li --- target/i386/kvm/kvm-cpu.c | 5 + 1 file changed, 5 insertions(+) diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c index 9c791b7b0520..8c618869533c 100644 --- a/target/i386/kvm/kvm-cpu.c +++ b/target/i386/kvm/kvm-cpu.c @@ -15,6 +15,7 @@ #include "sysemu/sysemu.h" #include "hw/boards.h" +#include "tdx.h" #include "kvm_i386.h" #include "hw/core/accel-cpu.h" @@ -60,6 +61,10 @@ static bool lmce_supported(void) if (kvm_ioctl(kvm_state, KVM_X86_GET_MCE_CAP_SUPPORTED, _cap) < 0) { return false; } + +if (is_tdx_vm()) +return false; + return !!(mce_cap & MCG_LMCE_P); } -- 2.34.1
[PATCH v5 54/65] q35: Introduce smm_ranges property for q35-pci-host
From: Isaku Yamahata Add a q35 property to check whether or not SMM ranges, e.g. SMRAM, TSEG, etc... exist for the target platform. TDX doesn't support SMM and doesn't play nice with QEMU modifying related guest memory ranges. Signed-off-by: Isaku Yamahata Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Xiaoyao Li --- hw/i386/pc_q35.c | 2 ++ hw/pci-host/q35.c | 42 +++ include/hw/i386/pc.h | 1 + include/hw/pci-host/q35.h | 1 + 4 files changed, 33 insertions(+), 13 deletions(-) diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c index 45a4102e75f5..dc3f68676ea0 100644 --- a/hw/i386/pc_q35.c +++ b/hw/i386/pc_q35.c @@ -212,6 +212,8 @@ static void pc_q35_init(MachineState *machine) x86ms->above_4g_mem_size, NULL); object_property_set_bool(phb, PCI_HOST_BYPASS_IOMMU, pcms->default_bus_bypass_iommu, NULL); +object_property_set_bool(phb, PCI_HOST_PROP_SMM_RANGES, + x86_machine_is_smm_enabled(x86ms), NULL); sysbus_realize_and_unref(SYS_BUS_DEVICE(phb), _fatal); /* pci */ diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c index 98d4a7c253a6..0b6cbaed7ed5 100644 --- a/hw/pci-host/q35.c +++ b/hw/pci-host/q35.c @@ -179,6 +179,8 @@ static Property q35_host_props[] = { mch.below_4g_mem_size, 0), DEFINE_PROP_SIZE(PCI_HOST_ABOVE_4G_MEM_SIZE, Q35PCIHost, mch.above_4g_mem_size, 0), +DEFINE_PROP_BOOL(PCI_HOST_PROP_SMM_RANGES, Q35PCIHost, + mch.has_smm_ranges, true), DEFINE_PROP_BOOL("x-pci-hole64-fix", Q35PCIHost, pci_hole64_fix, true), DEFINE_PROP_END_OF_LIST(), }; @@ -214,6 +216,7 @@ static void q35_host_initfn(Object *obj) /* mch's object_initialize resets the default value, set it again */ qdev_prop_set_uint64(DEVICE(s), PCI_HOST_PROP_PCI_HOLE64_SIZE, Q35_PCI_HOST_HOLE64_SIZE_DEFAULT); + object_property_add(obj, PCI_HOST_PROP_PCI_HOLE_START, "uint32", q35_host_get_pci_hole_start, NULL, NULL, NULL); @@ -476,6 +479,10 @@ static void mch_write_config(PCIDevice *d, mch_update_pciexbar(mch); } +if (!mch->has_smm_ranges) { +return; +} + if (ranges_overlap(address, len, MCH_HOST_BRIDGE_SMRAM, MCH_HOST_BRIDGE_SMRAM_SIZE)) { mch_update_smram(mch); @@ -494,10 +501,13 @@ static void mch_write_config(PCIDevice *d, static void mch_update(MCHPCIState *mch) { mch_update_pciexbar(mch); + mch_update_pam(mch); -mch_update_smram(mch); -mch_update_ext_tseg_mbytes(mch); -mch_update_smbase_smram(mch); +if (mch->has_smm_ranges) { +mch_update_smram(mch); +mch_update_ext_tseg_mbytes(mch); +mch_update_smbase_smram(mch); +} /* * pci hole goes from end-of-low-ram to io-apic. @@ -538,19 +548,21 @@ static void mch_reset(DeviceState *qdev) pci_set_quad(d->config + MCH_HOST_BRIDGE_PCIEXBAR, MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT); -d->config[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_DEFAULT; -d->config[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_DEFAULT; -d->wmask[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_WMASK; -d->wmask[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_WMASK; +if (mch->has_smm_ranges) { +d->config[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_DEFAULT; +d->config[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_DEFAULT; +d->wmask[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_WMASK; +d->wmask[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_WMASK; -if (mch->ext_tseg_mbytes > 0) { -pci_set_word(d->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES, - MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY); +if (mch->ext_tseg_mbytes > 0) { +pci_set_word(d->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES, +MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY); +} + +d->config[MCH_HOST_BRIDGE_F_SMBASE] = 0; +d->wmask[MCH_HOST_BRIDGE_F_SMBASE] = 0xff; } -d->config[MCH_HOST_BRIDGE_F_SMBASE] = 0; -d->wmask[MCH_HOST_BRIDGE_F_SMBASE] = 0xff; - mch_update(mch); } @@ -578,6 +590,10 @@ static void mch_realize(PCIDevice *d, Error **errp) PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE); } +if (!mch->has_smm_ranges) { +return; +} + /* if *disabled* show SMRAM to all CPUs */ memory_region_init_alias(>smram_region, OBJECT(mch), "smram-region", mch->pci_address_space, MCH_HOST_BRIDGE_SMRAM_C_BASE, diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h index 5
[PATCH v5 36/65] i386/tdx: load TDVF for TD guest
From: Chao Peng TDVF(OVMF) needs to run at private memory for TD guest. TDX cannot support pflash device since it doesn't support read-only private memory. Thus load TDVF(OVMF) with -bios option for TDs. Use memory_region_init_ram_guest_memfd() to allocate the MemoryRegion for TDVF because it needs to be located at private memory. Also store the MemoryRegion pointer of TDVF since the shared ramblock of it can be discared after it gets copied to private ramblock. Signed-off-by: Chao Peng Co-developed-by: Xiaoyao Li Signed-off-by: Xiaoyao Li --- hw/i386/x86.c | 13 +++-- target/i386/kvm/tdx.c | 7 +++ target/i386/kvm/tdx.h | 3 +++ 3 files changed, 21 insertions(+), 2 deletions(-) diff --git a/hw/i386/x86.c b/hw/i386/x86.c index fa7095310f37..5a0cadc88c4f 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -47,6 +47,7 @@ #include "hw/intc/i8259.h" #include "hw/rtc/mc146818rtc.h" #include "target/i386/sev.h" +#include "kvm/tdx.h" #include "hw/acpi/cpu_hotplug.h" #include "hw/irq.h" @@ -1157,9 +1158,17 @@ void x86_bios_rom_init(MachineState *ms, const char *default_firmware, (bios_size % 65536) != 0) { goto bios_error; } + bios = g_malloc(sizeof(*bios)); -memory_region_init_ram(bios, NULL, "pc.bios", bios_size, _fatal); -if (sev_enabled()) { +if (is_tdx_vm()) { +memory_region_init_ram_guest_memfd(bios, NULL, "pc.bios", bios_size, + _fatal); +tdx_set_tdvf_region(bios); +} else { +memory_region_init_ram(bios, NULL, "pc.bios", bios_size, _fatal); +} + +if (sev_enabled() || is_tdx_vm()) { /* * The concept of a "reset" simply doesn't exist for * confidential computing guests, we have to destroy and diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 13f069171db7..7c8e14e3cc58 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -19,6 +19,7 @@ #include "standard-headers/asm-x86/kvm_para.h" #include "sysemu/kvm.h" #include "sysemu/sysemu.h" +#include "exec/ramblock.h" #include "hw/i386/x86.h" #include "kvm_i386.h" @@ -463,6 +464,12 @@ static void update_tdx_cpuid_lookup_by_tdx_caps(void) (tdx_caps->xfam_fixed1 & CPUID_XSTATE_XSS_MASK) >> 32; } +void tdx_set_tdvf_region(MemoryRegion *tdvf_mr) +{ +assert(!tdx_guest->tdvf_mr); +tdx_guest->tdvf_mr = tdvf_mr; +} + static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) { MachineState *ms = MACHINE(qdev_get_machine()); diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h index 2697e6bdfb1d..c021223001a5 100644 --- a/target/i386/kvm/tdx.h +++ b/target/i386/kvm/tdx.h @@ -24,6 +24,8 @@ typedef struct TdxGuest { char *mrconfigid; /* base64 encoded sha348 digest */ char *mrowner; /* base64 encoded sha348 digest */ char *mrownerconfig;/* base64 encoded sha348 digest */ + +MemoryRegion *tdvf_mr; } TdxGuest; #ifdef CONFIG_TDX @@ -35,5 +37,6 @@ bool is_tdx_vm(void); void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg, uint32_t *ret); int tdx_pre_create_vcpu(CPUState *cpu, Error **errp); +void tdx_set_tdvf_region(MemoryRegion *tdvf_mr); #endif /* QEMU_I386_TDX_H */ -- 2.34.1
[PATCH v5 27/65] i386/tdx: Wire CPU features up with attributes of TD guest
For QEMU VMs, PKS is configured via CPUID_7_0_ECX_PKS and PMU is configured by x86cpu->enable_pmu. Reuse the existing configuration interface for TDX VMs. Signed-off-by: Xiaoyao Li Acked-by: Gerd Hoffmann --- target/i386/kvm/tdx.c | 13 + 1 file changed, 13 insertions(+) diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index b6295a644566..262e86fd2c67 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -33,6 +33,8 @@ (1U << KVM_FEATURE_MSI_EXT_DEST_ID)) #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE BIT_ULL(28) +#define TDX_TD_ATTRIBUTES_PKS BIT_ULL(30) +#define TDX_TD_ATTRIBUTES_PERFMON BIT_ULL(63) #define TDX_ATTRIBUTES_MAX_BITS 64 @@ -477,6 +479,15 @@ static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) return 0; } +static void setup_td_guest_attributes(X86CPU *x86cpu) +{ +CPUX86State *env = >env; + +tdx_guest->attributes |= (env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_PKS) ? + TDX_TD_ATTRIBUTES_PKS : 0; +tdx_guest->attributes |= x86cpu->enable_pmu ? TDX_TD_ATTRIBUTES_PERFMON : 0; +} + int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) { MachineState *ms = MACHINE(qdev_get_machine()); @@ -499,6 +510,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) return r; } +setup_td_guest_attributes(x86cpu); + init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0); init_vm->attributes = tdx_guest->attributes; -- 2.34.1