Re: [PATCH v2 02/22] target/arm: Add confidential guest support
On Fri, Apr 19, 2024 at 05:25:12PM +0100, Daniel P. Berrangé wrote:
> On Fri, Apr 19, 2024 at 04:56:50PM +0100, Jean-Philippe Brucker wrote:
> > Add a new RmeGuest object, inheriting from ConfidentialGuestSupport, to
> > support the Arm Realm Management Extension (RME). It is instantiated by
> > passing on the command-line:
> >
> >   -M virt,confidential-guest-support=<id>
> >   -object guest-rme,id=<id>[,options...]

Hm, the commit message is wrong, it should say "rme-guest".

> How about either "arm-rme" or merely 'rme' as the object name

I don't feel strongly about the name, but picked this one to be consistent
with existing confidential-guest-support objects: sev-guest, pef-guest,
s390-pv-guest, and upcoming tdx-guest [1]

Thanks,
Jean

[1] https://lore.kernel.org/qemu-devel/20240229063726.610065-13-xiaoyao...@intel.com/
[PATCH v2 06/22] hw/arm/virt: Disable DTB randomness for confidential VMs
The dtb-randomness feature, which adds random seeds to the DTB, isn't
really compatible with confidential VMs since it randomizes the Realm
Initial Measurement. Enabling it is not an error, but it prevents
attestation. It also isn't useful to a Realm, which doesn't trust host
input.

Currently the feature is automatically enabled, unless the user disables
it on the command-line. Change it to OnOffAuto, and automatically
disable it for confidential VMs, unless the user explicitly enables it.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: separate patch, use OnOffAuto
---
 docs/system/arm/virt.rst |  9 +
 include/hw/arm/virt.h    |  2 +-
 hw/arm/virt.c            | 41 +---
 3 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/docs/system/arm/virt.rst b/docs/system/arm/virt.rst
index 26fcba00b7..e4bbfec662 100644
--- a/docs/system/arm/virt.rst
+++ b/docs/system/arm/virt.rst
@@ -172,10 +172,11 @@ dtb-randomness
   rng-seed and kaslr-seed nodes (in both "/chosen" and
   "/secure-chosen") to use for features like the random number
   generator and address space randomisation. The default is
-  ``on``. You will want to disable it if your trusted boot chain
-  will verify the DTB it is passed, since this option causes the
-  DTB to be non-deterministic. It would be the responsibility of
-  the firmware to come up with a seed and pass it on if it wants to.
+  ``off`` for confidential VMs, and ``on`` otherwise. You will want
+  to disable it if your trusted boot chain will verify the DTB it is
+  passed, since this option causes the DTB to be non-deterministic.
+  It would be the responsibility of the firmware to come up with a
+  seed and pass it on if it wants to.

 dtb-kaslr-seed
   A deprecated synonym for dtb-randomness.
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index bb486d36b1..90a148dac2 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -150,7 +150,7 @@ struct VirtMachineState {
     bool virt;
     bool ras;
     bool mte;
-    bool dtb_randomness;
+    OnOffAuto dtb_randomness;
     OnOffAuto acpi;
     VirtGICType gic_version;
     VirtIOMMUType iommu;
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 07ad31876e..f300f100b5 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -259,6 +259,7 @@ static bool ns_el2_virt_timer_present(void)

 static void create_fdt(VirtMachineState *vms)
 {
+    bool dtb_randomness = true;
     MachineState *ms = MACHINE(vms);
     int nb_numa_nodes = ms->numa_state->num_nodes;
     void *fdt = create_device_tree(&vms->fdt_size);
@@ -268,6 +269,16 @@ static void create_fdt(VirtMachineState *vms)
         exit(1);
     }

+    /*
+     * Including random data in the DTB causes random initial measurement on
+     * CCA, so disable it for confidential VMs.
+     */
+    if (vms->dtb_randomness == ON_OFF_AUTO_OFF ||
+        (vms->dtb_randomness == ON_OFF_AUTO_AUTO &&
+         virt_machine_is_confidential(vms))) {
+        dtb_randomness = false;
+    }
+
     ms->fdt = fdt;

     /* Header */
@@ -278,13 +289,13 @@ static void create_fdt(VirtMachineState *vms)

     /* /chosen must exist for load_dtb to fill in necessary properties later */
     qemu_fdt_add_subnode(fdt, "/chosen");
-    if (vms->dtb_randomness) {
+    if (dtb_randomness) {
         create_randomness(ms, "/chosen");
     }

     if (vms->secure) {
         qemu_fdt_add_subnode(fdt, "/secure-chosen");
-        if (vms->dtb_randomness) {
+        if (dtb_randomness) {
             create_randomness(ms, "/secure-chosen");
         }
     }
@@ -2474,18 +2485,21 @@ static void virt_set_its(Object *obj, bool value, Error **errp)
     vms->its = value;
 }

-static bool virt_get_dtb_randomness(Object *obj, Error **errp)
+static void virt_get_dtb_randomness(Object *obj, Visitor *v, const char *name,
+                                    void *opaque, Error **errp)
 {
     VirtMachineState *vms = VIRT_MACHINE(obj);
+    OnOffAuto dtb_randomness = vms->dtb_randomness;

-    return vms->dtb_randomness;
+    visit_type_OnOffAuto(v, name, &dtb_randomness, errp);
 }

-static void virt_set_dtb_randomness(Object *obj, bool value, Error **errp)
+static void virt_set_dtb_randomness(Object *obj, Visitor *v, const char *name,
+                                    void *opaque, Error **errp)
 {
     VirtMachineState *vms = VIRT_MACHINE(obj);

-    vms->dtb_randomness = value;
+    visit_type_OnOffAuto(v, name, &vms->dtb_randomness, errp);
 }

 static char *virt_get_oem_id(Object *obj, Error **errp)
@@ -3123,16 +3137,16 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
                                           "Set on/off to enable/disable "
                                           "ITS instantiation");

-    object_class_property_add_bool(oc, "dtb-randomness",
-
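For readers skimming the archive, the effective setting described in the commit message can be sketched as a standalone tri-state resolution. This is only an illustration under stated assumptions, not QEMU code: the OnOffAuto enum is redefined locally, and resolve_dtb_randomness() and its is_confidential parameter are hypothetical names standing in for vms->dtb_randomness and virt_machine_is_confidential(vms).

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for QEMU's OnOffAuto; the numeric values are illustrative. */
typedef enum {
    ON_OFF_AUTO_AUTO,
    ON_OFF_AUTO_ON,
    ON_OFF_AUTO_OFF,
} OnOffAuto;

/*
 * Hypothetical helper: returns true when random seeds should be added to
 * the DTB. "auto" resolves to on for ordinary VMs and off for confidential
 * VMs; an explicit "on" or "off" from the user always wins.
 */
static bool resolve_dtb_randomness(OnOffAuto setting, bool is_confidential)
{
    switch (setting) {
    case ON_OFF_AUTO_ON:
        return true;
    case ON_OFF_AUTO_OFF:
        return false;
    case ON_OFF_AUTO_AUTO:
        return !is_confidential;
    }
    assert(!"invalid OnOffAuto value");
    return false;
}
```

With this shape, a user who passes an explicit "on" for a confidential VM still gets the seeds, at the cost of a non-reproducible measurement, which matches the "not an error, but it prevents attestation" wording above.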
[PATCH v2 18/22] target/arm/kvm: Disable Realm reboot
A realm cannot be reset, it must be recreated from scratch. The RMM
specification defines states of a Realm as NEW -> ACTIVE -> SYSTEM_OFF,
after which the Realm can only be destroyed. A PSCI_SYSTEM_RESET call,
which normally reboots the system, puts the Realm in SYSTEM_OFF state.

QEMU does not support recreating a VM. Normally, a reboot request by the
guest causes all devices to reset, which cannot work for a Realm.
Indeed, loading images into Realm memory and changing the PC is only
allowed for a Realm in NEW state. Resetting the images for a Realm in
SYSTEM_OFF state will cause QEMU to crash with a bus error.

Handle reboot requests by the guest more gracefully, by indicating to
runstate.c that the vCPUs of a Realm are not resettable, and that QEMU
should exit.

Reviewed-by: Richard Henderson
Signed-off-by: Jean-Philippe Brucker
---
 target/arm/kvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 9855cadb1b..60c2ef9388 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1694,7 +1694,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)

 bool kvm_arch_cpu_check_are_resettable(void)
 {
-    return true;
+    /* A Realm cannot be reset */
+    return !kvm_arm_rme_enabled();
 }

 static void kvm_arch_get_eager_split_size(Object *obj, Visitor *v,
-- 
2.44.0
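The lifecycle argument in this commit message can be pictured as a small one-way state machine. The sketch below is an assumption-laden model, not RMM or QEMU code: the type and function names are hypothetical, and it assumes the reset request arrives while the Realm is ACTIVE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Realm states as named in the commit message. */
typedef enum {
    REALM_NEW,
    REALM_ACTIVE,
    REALM_SYSTEM_OFF,
} RealmState;

/* Hypothetical events, for illustration only. */
typedef enum {
    REALM_EV_ACTIVATE,      /* host activates the Realm */
    REALM_EV_SYSTEM_RESET,  /* guest issues PSCI_SYSTEM_RESET */
} RealmEvent;

/* Images and the PC may only be set while the Realm is still NEW. */
static bool realm_can_load_images(RealmState s)
{
    return s == REALM_NEW;
}

/*
 * Apply an event and report whether the transition is allowed. No edge
 * leads back to NEW, which is why a guest reboot cannot be served by
 * resetting devices and reloading images.
 */
static bool realm_transition(RealmState *s, RealmEvent ev)
{
    assert(s != NULL);
    switch (ev) {
    case REALM_EV_ACTIVATE:
        if (*s != REALM_NEW) {
            return false;
        }
        *s = REALM_ACTIVE;
        return true;
    case REALM_EV_SYSTEM_RESET:
        if (*s != REALM_ACTIVE) {
            return false;
        }
        *s = REALM_SYSTEM_OFF;
        return true;
    }
    return false;
}
```

Once the model reaches REALM_SYSTEM_OFF, realm_can_load_images() stays false forever, so the only graceful option left to the VMM is to exit.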
[PATCH v2 19/22] target/arm/cpu: Inform about reading confidential CPU registers
The host cannot access registers of a Realm. Instead of showing all
registers as zero in "info registers", display a message about this
restriction.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: new
---
 target/arm/cpu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index ab8d007a86..18d1b88e2f 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1070,6 +1070,11 @@ static void aarch64_cpu_dump_state(CPUState *cs, FILE *f, int flags)
     const char *ns_status;
     bool sve;

+    if (cpu->kvm_rme) {
+        qemu_fprintf(f, "the CPU registers are confidential to the realm\n");
+        return;
+    }
+
     qemu_fprintf(f, " PC=%016" PRIx64 " ", env->pc);
     for (i = 0; i < 32; i++) {
         if (i == 31) {
-- 
2.44.0
[PATCH v2 13/22] hw/arm/boot: Register Linux BSS section for confidential guests
Although the BSS section is not currently part of the kernel blob, it
needs to be registered as guest RAM for confidential guest support,
because the kernel needs to access it before it is able to set up its
RAM regions.

It would be tempting to simply add the BSS as part of the ROM blob (i.e.
pass kernel_size as max_len argument to rom_add_blob()) and let the ROM
loader notifier deal with the full image size generically, but that
would add zero-initialization of the BSS region by the loader, which
adds a significant overhead. For a 40MB kernel with a 17MB BSS, I
measured an average boot time regression of 2.8ms on a fast desktop
(5.7% of the QEMU setup time). On a slower host, the regression could be
much larger.

Instead, add a special case to initialize the kernel's BSS IPA range.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: new
---
 target/arm/kvm_arm.h |  5 +
 hw/arm/boot.c        | 11 +++
 target/arm/kvm-rme.c | 10 ++
 3 files changed, 26 insertions(+)

diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 4386b0..4b787dd628 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -218,6 +218,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level);

 int kvm_arm_rme_init(MachineState *ms);
 int kvm_arm_rme_vm_type(MachineState *ms);
+void kvm_arm_rme_init_guest_ram(hwaddr base, size_t size);

 bool kvm_arm_rme_enabled(void);
 int kvm_arm_rme_vcpu_init(CPUState *cs);
@@ -243,6 +244,10 @@ static inline bool kvm_arm_sve_supported(void)
     return false;
 }

+static inline void kvm_arm_rme_init_guest_ram(hwaddr base, size_t size)
+{
+}
+
 /*
  * These functions should never actually be called without KVM support.
  */
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 84ea6a807a..9f522e332b 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -26,6 +26,7 @@
 #include "qemu/config-file.h"
 #include "qemu/option.h"
 #include "qemu/units.h"
+#include "kvm_arm.h"

 /* Kernel boot protocol is specified in the kernel docs
  * Documentation/arm/Booting and Documentation/arm64/booting.txt
@@ -850,6 +851,7 @@ static uint64_t load_aarch64_image(const char *filename, hwaddr mem_base,
 {
     hwaddr kernel_load_offset = KERNEL64_LOAD_ADDR;
     uint64_t kernel_size = 0;
+    uint64_t page_size;
     uint8_t *buffer;
     int size;

@@ -916,6 +918,15 @@ static uint64_t load_aarch64_image(const char *filename, hwaddr mem_base,
     *entry = mem_base + kernel_load_offset;
     rom_add_blob_fixed_as(filename, buffer, size, *entry, as);

+    /*
+     * Register the kernel BSS as realm resource, so the kernel can use it right
+     * away. Align up to skip the last page, which still contains kernel
+     * data.
+     */
+    page_size = qemu_real_host_page_size();
+    kvm_arm_rme_init_guest_ram(QEMU_ALIGN_UP(*entry + size, page_size),
+                               QEMU_ALIGN_DOWN(kernel_size - size, page_size));
+
     g_free(buffer);

     return kernel_size;
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index bee6694d6d..b2ad10ef6d 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -203,6 +203,16 @@ int kvm_arm_rme_init(MachineState *ms)
     return 0;
 }

+/*
+ * kvm_arm_rme_init_guest_ram - Initialize a Realm IPA range
+ */
+void kvm_arm_rme_init_guest_ram(hwaddr base, size_t size)
+{
+    if (rme_guest) {
+        rme_add_ram_region(base, size, /* populate */ false);
+    }
+}
+
 int kvm_arm_rme_vcpu_init(CPUState *cs)
 {
     ARMCPU *cpu = ARM_CPU(cs);
-- 
2.44.0
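The alignment arithmetic in the hw/arm/boot.c hunk is easier to follow with concrete numbers. The following standalone sketch mirrors that computation; the BssRange type and the helper names are hypothetical, and the local align helpers are assumed to behave like QEMU_ALIGN_DOWN and QEMU_ALIGN_UP (round to a multiple of the page size).

```c
#include <assert.h>
#include <stdint.h>

/* Round n down or up to a multiple of m (m need not be a power of two). */
static uint64_t align_down(uint64_t n, uint64_t m)
{
    return n / m * m;
}

static uint64_t align_up(uint64_t n, uint64_t m)
{
    return align_down(n + m - 1, m);
}

/* Hypothetical container for the IPA range registered for the BSS. */
typedef struct {
    uint64_t base;
    uint64_t len;
} BssRange;

/*
 * entry:       guest address where the kernel blob is loaded
 * blob_size:   bytes actually present in the Image file
 * kernel_size: total size declared by the Image header, BSS included
 *
 * The base is rounded up so the page still holding the tail of the blob is
 * skipped; the length is rounded down, as in the patch.
 */
static BssRange bss_ipa_range(uint64_t entry, uint64_t blob_size,
                              uint64_t kernel_size, uint64_t page_size)
{
    BssRange r;

    assert(page_size != 0 && kernel_size >= blob_size);
    r.base = align_up(entry + blob_size, page_size);
    r.len = align_down(kernel_size - blob_size, page_size);
    return r;
}
```

For example, with entry 0x40200000, a 0x1234-byte blob, a declared size of 0x5000 and 4KB pages, the range starts at 0x40202000 and spans 0x3000 bytes.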
[PATCH v2 15/22] target/arm/kvm-rme: Add measurement algorithm property
This option selects which measurement algorithm to use for attestation.
Supported values are SHA256 and SHA512. Default to SHA512 arbitrarily.

SHA512 is generally faster on 64-bit architectures. On a few arm64 CPUs
I tested SHA256 is much faster, but that's most likely because they only
support acceleration via FEAT_SHA256 (Armv8.0) and not FEAT_SHA512
(Armv8.2). Future CPUs supporting RME are likely to also support
FEAT_SHA512.

Cc: Eric Blake
Cc: Markus Armbruster
Cc: Daniel P. Berrangé
Cc: Eduardo Habkost
Signed-off-by: Jean-Philippe Brucker
---
v1->v2: use enum, pick default
---
 qapi/qom.json        | 18 +-
 target/arm/kvm-rme.c | 39 ++-
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 91654aa267..84dce666b2 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -931,18 +931,34 @@
   'data': { '*cpu-affinity': ['uint16'],
             '*node-affinity': ['uint16'] } }

+##
+# @RmeGuestMeasurementAlgo:
+#
+# @sha256: Use the SHA256 algorithm
+# @sha512: Use the SHA512 algorithm
+#
+# Algorithm to use for realm measurements
+#
+# Since: FIXME
+##
+{ 'enum': 'RmeGuestMeasurementAlgo',
+  'data': ['sha256', 'sha512'] }
+
 ##
 # @RmeGuestProperties:
 #
 # Properties for rme-guest objects.
 #
+# @measurement-algo: Realm measurement algorithm (default: sha512)
+#
 # @personalization-value: Realm personalization value, as a 64-byte hex string
 #     (default: 0)
 #
 # Since: FIXME
 ##
 { 'struct': 'RmeGuestProperties',
-  'data': { '*personalization-value': 'str' } }
+  'data': { '*personalization-value': 'str',
+            '*measurement-algo': 'RmeGuestMeasurementAlgo' } }

 ##
 # @ObjectType:
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index cb5c3f7a22..8f39e54aaa 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -23,13 +23,14 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)

 #define RME_PAGE_SIZE qemu_real_host_page_size()

-#define RME_MAX_CFG 1
+#define RME_MAX_CFG 2

 struct RmeGuest {
     ConfidentialGuestSupport parent_obj;
     Notifier rom_load_notifier;
     GSList *ram_regions;
     uint8_t *personalization_value;
+    RmeGuestMeasurementAlgo measurement_algo;
 };

 typedef struct {
@@ -73,6 +74,19 @@ static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp)
         memcpy(args.rpv, guest->personalization_value, KVM_CAP_ARM_RME_RPV_SIZE);
         cfg_str = "personalization value";
         break;
+    case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
+        switch (guest->measurement_algo) {
+        case RME_GUEST_MEASUREMENT_ALGO_SHA256:
+            args.hash_algo = KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256;
+            break;
+        case RME_GUEST_MEASUREMENT_ALGO_SHA512:
+            args.hash_algo = KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512;
+            break;
+        default:
+            g_assert_not_reached();
+        }
+        cfg_str = "hash algorithm";
+        break;
     default:
         g_assert_not_reached();
     }
@@ -338,12 +352,34 @@ static void rme_set_rpv(Object *obj, const char *value, Error **errp)
     }
 }

+static int rme_get_measurement_algo(Object *obj, Error **errp)
+{
+    RmeGuest *guest = RME_GUEST(obj);
+
+    return guest->measurement_algo;
+}
+
+static void rme_set_measurement_algo(Object *obj, int algo, Error **errp)
+{
+    RmeGuest *guest = RME_GUEST(obj);
+
+    guest->measurement_algo = algo;
+}
+
 static void rme_guest_class_init(ObjectClass *oc, void *data)
 {
     object_class_property_add_str(oc, "personalization-value", rme_get_rpv,
                                   rme_set_rpv);
     object_class_property_set_description(oc, "personalization-value",
             "Realm personalization value (512-bit hexadecimal number)");
+
+    object_class_property_add_enum(oc, "measurement-algo",
+                                   "RmeGuestMeasurementAlgo",
+                                   &RmeGuestMeasurementAlgo_lookup,
+                                   rme_get_measurement_algo,
+                                   rme_set_measurement_algo);
+    object_class_property_set_description(oc, "measurement-algo",
+            "Realm measurement algorithm ('sha256', 'sha512')");
 }

 static void rme_guest_instance_init(Object *obj)
@@ -353,6 +389,7 @@ static void rme_guest_instance_init(Object *obj)
         exit(1);
     }
     rme_guest = RME_GUEST(obj);
+    rme_guest->measurement_algo = RME_GUEST_MEASUREMENT_ALGO_SHA512;
 }

 static const TypeInfo rme_guest_info = {
-- 
2.44.0
[PATCH v2 04/22] target/arm/kvm-rme: Initialize realm
The machine code calls kvm_arm_rme_vm_type() to get the VM flag and KVM
calls kvm_arm_rme_init() to issue KVM hypercalls:

* create the realm descriptor,
* load images into Realm RAM (in another patch),
* finalize the REC (vCPU) after the registers are reset,
* activate the realm at the end, at which point the realm is sealed.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2:
* Use g_assert_not_reached() in stubs
* Init from kvm_arch_init() rather than hw/arm/virt
* Cache rme_guest
---
 target/arm/kvm_arm.h |  16 +++
 target/arm/kvm-rme.c | 101 +++
 target/arm/kvm.c     |   7 ++-
 3 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index cfaa0d9bc7..8e2d90c265 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -203,6 +203,8 @@ int kvm_arm_vgic_probe(void);
 void kvm_arm_pmu_init(ARMCPU *cpu);
 void kvm_arm_pmu_set_irq(ARMCPU *cpu, int irq);

+int kvm_arm_vcpu_finalize(ARMCPU *cpu, int feature);
+
 /**
  * kvm_arm_pvtime_init:
  * @cpu: ARMCPU
@@ -214,6 +216,11 @@ void kvm_arm_pvtime_init(ARMCPU *cpu, uint64_t ipa);

 int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level);

+int kvm_arm_rme_init(MachineState *ms);
+int kvm_arm_rme_vm_type(MachineState *ms);
+
+bool kvm_arm_rme_enabled(void);
+
 #else

 /*
@@ -283,6 +290,15 @@ static inline uint32_t kvm_arm_sve_get_vls(ARMCPU *cpu)
     g_assert_not_reached();
 }

+static inline int kvm_arm_rme_init(MachineState *ms)
+{
+    g_assert_not_reached();
+}
+
+static inline int kvm_arm_rme_vm_type(MachineState *ms)
+{
+    g_assert_not_reached();
+}
 #endif

 #endif
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 960dd75608..23ac2d32d4 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -23,14 +23,115 @@ struct RmeGuest {
     ConfidentialGuestSupport parent_obj;
 };

+static RmeGuest *rme_guest;
+
+bool kvm_arm_rme_enabled(void)
+{
+    return !!rme_guest;
+}
+
+static int rme_create_rd(Error **errp)
+{
+    int ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+                                KVM_CAP_ARM_RME_CREATE_RD);
+
+    if (ret) {
+        error_setg_errno(errp, -ret, "RME: failed to create Realm Descriptor");
+    }
+    return ret;
+}
+
+static void rme_vm_state_change(void *opaque, bool running, RunState state)
+{
+    int ret;
+    CPUState *cs;
+
+    if (!running) {
+        return;
+    }
+
+    ret = rme_create_rd(&error_abort);
+    if (ret) {
+        return;
+    }
+
+    /*
+     * Now that do_cpu_reset() initialized the boot PC and
+     * kvm_cpu_synchronize_post_reset() registered it, we can finalize the REC.
+     */
+    CPU_FOREACH(cs) {
+        ret = kvm_arm_vcpu_finalize(ARM_CPU(cs), KVM_ARM_VCPU_REC);
+        if (ret) {
+            error_report("RME: failed to finalize vCPU: %s", strerror(-ret));
+            exit(1);
+        }
+    }
+
+    ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+                            KVM_CAP_ARM_RME_ACTIVATE_REALM);
+    if (ret) {
+        error_report("RME: failed to activate realm: %s", strerror(-ret));
+        exit(1);
+    }
+}
+
+int kvm_arm_rme_init(MachineState *ms)
+{
+    static Error *rme_mig_blocker;
+    ConfidentialGuestSupport *cgs = ms->cgs;
+
+    if (!rme_guest) {
+        return 0;
+    }
+
+    if (!cgs) {
+        error_report("missing -machine confidential-guest-support parameter");
+        return -EINVAL;
+    }
+
+    if (!kvm_check_extension(kvm_state, KVM_CAP_ARM_RME)) {
+        return -ENODEV;
+    }
+
+    error_setg(&rme_mig_blocker, "RME: migration is not implemented");
+    migrate_add_blocker(&rme_mig_blocker, &error_fatal);
+
+    /*
+     * The realm activation is done last, when the VM starts, after all images
+     * have been loaded and all vcpus finalized.
+     */
+    qemu_add_vm_change_state_handler(rme_vm_state_change, NULL);
+
+    cgs->ready = true;
+    return 0;
+}
+
+int kvm_arm_rme_vm_type(MachineState *ms)
+{
+    if (rme_guest) {
+        return KVM_VM_TYPE_ARM_REALM;
+    }
+    return 0;
+}
+
 static void rme_guest_class_init(ObjectClass *oc, void *data)
 {
 }

+static void rme_guest_instance_init(Object *obj)
+{
+    if (rme_guest) {
+        error_report("a single instance of RmeGuest is supported");
+        exit(1);
+    }
+    rme_guest = RME_GUEST(obj);
+}
+
 static const TypeInfo rme_guest_info = {
     .parent = TYPE_CONFIDENTIAL_GUEST_SUPPORT,
     .name = TYPE_RME_GUEST,
     .instance_size = sizeof(struct RmeGuest),
+    .instance_init = rme_guest_instance_init,
     .class_init = rme_guest_class_init,
     .interfaces = (InterfaceInfo[]) {
         { TYPE_USER_CREATABLE },
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index a5673241e5..b00077c1a5 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -93,7 +93,7 @@ static int kvm_arm_vcpu_init(ARMCPU *cpu)
  *
  * Returns: 0 if success else < 0 error co
[PATCH v2 09/22] target/arm/kvm-rme: Initialize vCPU
The target code calls kvm_arm_vcpu_init() to mark the vCPU as part of a
Realm. For a Realm vCPU, only x0-x7 can be set at runtime. Before boot,
the PC can also be set, and is ignored at runtime. KVM also accepts a
few system register changes during initial configuration, as returned by
KVM_GET_REG_LIST.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: only do the GP regs, since they are sync'd explicitly. Other
  registers use the existing reglist facility.
---
 target/arm/cpu.h     |  3 +++
 target/arm/kvm_arm.h |  1 +
 target/arm/kvm-rme.c | 10
 target/arm/kvm.c     | 61
 4 files changed, 75 insertions(+)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index bc0c84873f..d3ff1b4a31 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -945,6 +945,9 @@ struct ArchCPU {
     OnOffAuto kvm_steal_time;
 #endif /* CONFIG_KVM */

+    /* Realm Management Extension */
+    bool kvm_rme;
+
     /* Uniprocessor system with MP extensions */
     bool mp_is_up;

diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 8e2d90c265..4386b0 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -220,6 +220,7 @@ int kvm_arm_rme_init(MachineState *ms);
 int kvm_arm_rme_vm_type(MachineState *ms);

 bool kvm_arm_rme_enabled(void);
+int kvm_arm_rme_vcpu_init(CPUState *cs);

 #else

diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 23ac2d32d4..aa9c3b5551 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -106,6 +106,16 @@ int kvm_arm_rme_init(MachineState *ms)
     return 0;
 }

+int kvm_arm_rme_vcpu_init(CPUState *cs)
+{
+    ARMCPU *cpu = ARM_CPU(cs);
+
+    if (rme_guest) {
+        cpu->kvm_rme = true;
+    }
+    return 0;
+}
+
 int kvm_arm_rme_vm_type(MachineState *ms)
 {
     if (rme_guest) {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 3504276822..3a2233ec73 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1920,6 +1920,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
         return ret;
     }

+    ret = kvm_arm_rme_vcpu_init(cs);
+    if (ret) {
+        return ret;
+    }
+
     if (cpu_isar_feature(aa64_sve, cpu)) {
         ret = kvm_arm_sve_set_vls(cpu);
         if (ret) {
@@ -2056,6 +2061,35 @@ static int kvm_arch_put_sve(CPUState *cs)
     return 0;
 }

+static int kvm_arm_rme_put_core_regs(CPUState *cs)
+{
+    int i, ret;
+    struct kvm_one_reg reg;
+    ARMCPU *cpu = ARM_CPU(cs);
+    CPUARMState *env = &cpu->env;
+
+    /*
+     * The RME ABI only allows us to set 8 GPRs and the PC
+     */
+    for (i = 0; i < 8; i++) {
+        reg.id = AARCH64_CORE_REG(regs.regs[i]);
+        reg.addr = (uintptr_t) &env->xregs[i];
+        ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    reg.id = AARCH64_CORE_REG(regs.pc);
+    reg.addr = (uintptr_t) &env->pc;
+    ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
 static int kvm_arm_put_core_regs(CPUState *cs, int level)
 {
     uint64_t val;
@@ -2066,6 +2100,10 @@ static int kvm_arm_put_core_regs(CPUState *cs, int level)
     ARMCPU *cpu = ARM_CPU(cs);
     CPUARMState *env = &cpu->env;

+    if (cpu->kvm_rme) {
+        return kvm_arm_rme_put_core_regs(cs);
+    }
+
     /* If we are in AArch32 mode then we need to copy the AArch32 regs to the
      * AArch64 registers before pushing them out to 64-bit KVM.
      */
@@ -2253,6 +2291,25 @@ static int kvm_arch_get_sve(CPUState *cs)
     return 0;
 }

+static int kvm_arm_rme_get_core_regs(CPUState *cs)
+{
+    int i, ret;
+    struct kvm_one_reg reg;
+    ARMCPU *cpu = ARM_CPU(cs);
+    CPUARMState *env = &cpu->env;
+
+    for (i = 0; i < 8; i++) {
+        reg.id = AARCH64_CORE_REG(regs.regs[i]);
+        reg.addr = (uintptr_t) &env->xregs[i];
+        ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 static int kvm_arm_get_core_regs(CPUState *cs)
 {
     uint64_t val;
@@ -2263,6 +2320,10 @@ static int kvm_arm_get_core_regs(CPUState *cs)
     ARMCPU *cpu = ARM_CPU(cs);
     CPUARMState *env = &cpu->env;

+    if (cpu->kvm_rme) {
+        return kvm_arm_rme_get_core_regs(cs);
+    }
+
     for (i = 0; i < 31; i++) {
         ret = kvm_get_one_reg(cs, AARCH64_CORE_REG(regs.regs[i]),
                               &env->xregs[i]);
-- 
2.44.0
[PATCH v2 16/22] target/arm/cpu: Set number of breakpoints and watchpoints in KVM
Add "num-breakpoints" and "num-watchpoints" CPU parameters to configure
the debug features that KVM presents to the guest. The KVM vCPU
configuration is modified by calling SET_ONE_REG on the ID register.

This is needed for Realm VMs, whose parameters include breakpoints and
watchpoints, and influence the Realm Initial Measurement.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: new
---
 target/arm/cpu.h          |  4 ++
 target/arm/kvm_arm.h      |  2 +
 target/arm/arm-qmp-cmds.c |  1 +
 target/arm/cpu64.c        | 77 +++
 target/arm/kvm.c          | 56 +++-
 5 files changed, 139 insertions(+), 1 deletion(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index d3ff1b4a31..24080da2b7 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1089,6 +1089,10 @@ struct ArchCPU {

     /* Generic timer counter frequency, in Hz */
     uint64_t gt_cntfrq_hz;
+
+    /* Allows to override the default configuration */
+    uint8_t num_bps;
+    uint8_t num_wps;
 };

 typedef struct ARMCPUInfo {
diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 4b787dd628..b040686eab 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -16,6 +16,8 @@
 #define KVM_ARM_VGIC_V2 (1 << 0)
 #define KVM_ARM_VGIC_V3 (1 << 1)

+#define KVM_REG_ARM_ID_AA64DFR0_EL1 ARM64_SYS_REG(3, 0, 0, 5, 0)
+
 /**
  * kvm_arm_register_device:
  * @mr: memory region for this device
diff --git a/target/arm/arm-qmp-cmds.c b/target/arm/arm-qmp-cmds.c
index 3cc8cc738b..0f574bb1dd 100644
--- a/target/arm/arm-qmp-cmds.c
+++ b/target/arm/arm-qmp-cmds.c
@@ -95,6 +95,7 @@ static const char *cpu_model_advertised_features[] = {
     "sve1408", "sve1536", "sve1664", "sve1792", "sve1920", "sve2048",
     "kvm-no-adjvtime", "kvm-steal-time",
     "pauth", "pauth-impdef", "pauth-qarma3",
+    "num-breakpoints", "num-watchpoints",
     NULL
 };

diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 985b1efe16..9ca74eb019 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -571,6 +571,82 @@ void aarch64_add_pauth_properties(Object *obj)
     }
 }

+#if defined(CONFIG_KVM)
+static void arm_cpu_get_num_wps(Object *obj, Visitor *v, const char *name,
+                                void *opaque, Error **errp)
+{
+    uint8_t val;
+    ARMCPU *cpu = ARM_CPU(obj);
+
+    val = cpu->num_wps;
+    if (val == 0) {
+        val = FIELD_EX64(cpu->isar.id_aa64dfr0, ID_AA64DFR0, WRPS) + 1;
+    }
+
+    visit_type_uint8(v, name, &val, errp);
+}
+
+static void arm_cpu_set_num_wps(Object *obj, Visitor *v, const char *name,
+                                void *opaque, Error **errp)
+{
+    uint8_t val;
+    ARMCPU *cpu = ARM_CPU(obj);
+    uint8_t max_wps = FIELD_EX64(cpu->isar.id_aa64dfr0, ID_AA64DFR0, WRPS) + 1;
+
+    if (!visit_type_uint8(v, name, &val, errp)) {
+        return;
+    }
+
+    if (val < 2 || val > max_wps) {
+        error_setg(errp, "invalid number of watchpoints");
+        return;
+    }
+
+    cpu->num_wps = val;
+}
+
+static void arm_cpu_get_num_bps(Object *obj, Visitor *v, const char *name,
+                                void *opaque, Error **errp)
+{
+    uint8_t val;
+    ARMCPU *cpu = ARM_CPU(obj);
+
+    val = cpu->num_bps;
+    if (val == 0) {
+        val = FIELD_EX64(cpu->isar.id_aa64dfr0, ID_AA64DFR0, BRPS) + 1;
+    }
+
+    visit_type_uint8(v, name, &val, errp);
+}
+
+static void arm_cpu_set_num_bps(Object *obj, Visitor *v, const char *name,
+                                void *opaque, Error **errp)
+{
+    uint8_t val;
+    ARMCPU *cpu = ARM_CPU(obj);
+    uint8_t max_bps = FIELD_EX64(cpu->isar.id_aa64dfr0, ID_AA64DFR0, BRPS) + 1;
+
+    if (!visit_type_uint8(v, name, &val, errp)) {
+        return;
+    }
+
+    if (val < 2 || val > max_bps) {
+        error_setg(errp, "invalid number of breakpoints");
+        return;
+    }
+
+    cpu->num_bps = val;
+}
+
+static void aarch64_add_kvm_writable_properties(Object *obj)
+{
+    object_property_add(obj, "num-breakpoints", "uint8", arm_cpu_get_num_bps,
+                        arm_cpu_set_num_bps, NULL, NULL);
+    object_property_add(obj, "num-watchpoints", "uint8", arm_cpu_get_num_wps,
+                        arm_cpu_set_num_wps, NULL, NULL);
+}
+#endif /* CONFIG_KVM */
+
 void arm_cpu_lpa2_finalize(ARMCPU *cpu, Error **errp)
 {
     uint64_t t;
@@ -713,6 +789,7 @@ static void aarch64_host_initfn(Object *obj)
     if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64)) {
         aarch64_add_sve_properties(obj);
         aarch64_add_pauth_properties(obj);
+        aarch64_add_kvm_writable_properties(obj);
     }
 #elif defined(CONFIG_HVF)
     ARMCPU *cpu = ARM_CPU(obj);
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 6d368bf442..623980a25b
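The getters and setters in the cpu64.c hunk rely on the ID register convention that a field stores "count minus one". The standalone sketch below shows only that field arithmetic, with the same bounds check as the setters (at least 2, at most what the host reports). The helper names are hypothetical, the BRPs and WRPs bit positions ([15:12] and [23:20] of ID_AA64DFR0_EL1) are taken from the Arm architecture rather than from this patch, and how the truncated kvm.c hunk actually writes the register is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DFR0_BRPS_SHIFT 12      /* ID_AA64DFR0_EL1.BRPs, bits [15:12] */
#define DFR0_WRPS_SHIFT 20      /* ID_AA64DFR0_EL1.WRPs, bits [23:20] */
#define DFR0_FIELD_MASK 0xfULL

/* Number of breakpoints or watchpoints encoded at 'shift' (field + 1). */
static unsigned dfr0_count(uint64_t dfr0, unsigned shift)
{
    return (unsigned)((dfr0 >> shift) & DFR0_FIELD_MASK) + 1;
}

/*
 * Hypothetical helper: lower the advertised count to 'count'. Rejects
 * values below 2 or above the count currently encoded in the register,
 * and leaves the register untouched in that case.
 */
static bool dfr0_set_count(uint64_t *dfr0, unsigned shift, unsigned count)
{
    assert(dfr0 != NULL);
    if (count < 2 || count > dfr0_count(*dfr0, shift)) {
        return false;
    }
    *dfr0 &= ~(DFR0_FIELD_MASK << shift);
    *dfr0 |= (uint64_t)(count - 1) << shift;
    return true;
}
```

A host value with BRPs = 5 therefore advertises six breakpoints, and asking for four rewrites the field to 3 while leaving the watchpoint field alone.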
[PATCH v2 11/22] hw/core/loader: Add ROM loader notifier
Add a function to register a notifier that is invoked after a ROM gets
loaded into guest memory. It will be used by Arm confidential guest
support, in order to register all blobs loaded into memory with KVM, so
that their content is part of the initial VM measurement and contributes
to the guest attestation.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: new
---
 include/hw/loader.h | 15 +++
 hw/core/loader.c    | 15 +++
 2 files changed, 30 insertions(+)

diff --git a/include/hw/loader.h b/include/hw/loader.h
index 8685e27334..79fab25dd9 100644
--- a/include/hw/loader.h
+++ b/include/hw/loader.h
@@ -356,6 +356,21 @@ void hmp_info_roms(Monitor *mon, const QDict *qdict);
 ssize_t rom_add_vga(const char *file);
 ssize_t rom_add_option(const char *file, int32_t bootindex);

+typedef struct RomLoaderNotify {
+    /* Parameters passed to rom_add_blob() */
+    hwaddr addr;
+    size_t len;
+    size_t max_len;
+} RomLoaderNotify;
+
+/**
+ * rom_add_load_notifier - Add a notifier for loaded images
+ *
+ * Add a notifier that will be invoked with a RomLoaderNotify structure for each
+ * blob loaded into guest memory, after the blob is loaded.
+ */
+void rom_add_load_notifier(Notifier *notifier);
+
 /* This is the usual maximum in uboot, so if a uImage overflows this, it would
  * overflow on real hardware too. */
 #define UBOOT_MAX_GUNZIP_BYTES (64 << 20)
diff --git a/hw/core/loader.c b/hw/core/loader.c
index b8e52f3fb0..4bd236cf89 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -67,6 +67,8 @@
 #include <zlib.h>

 static int roms_loaded;
+static NotifierList rom_loader_notifier =
+    NOTIFIER_LIST_INITIALIZER(rom_loader_notifier);

 /* return the size or -1 if error */
 int64_t get_image_size(const char *filename)
@@ -1209,6 +1211,11 @@ MemoryRegion *rom_add_blob(const char *name, const void *blob, size_t len,
     return mr;
 }

+void rom_add_load_notifier(Notifier *notifier)
+{
+    notifier_list_add(&rom_loader_notifier, notifier);
+}
+
 /* This function is specific for elf program because we don't need to allocate
  * all the rom. We just allocate the first part and the rest is just zeros. This
  * is why romsize and datasize are different. Also, this function takes its own
@@ -1250,6 +1257,7 @@ ssize_t rom_add_option(const char *file, int32_t bootindex)
 static void rom_reset(void *unused)
 {
     Rom *rom;
+    RomLoaderNotify notify;

     QTAILQ_FOREACH(rom, &roms, next) {
         if (rom->fw_file) {
@@ -1298,6 +1306,13 @@ static void rom_reset(void *unused)
         cpu_flush_icache_range(rom->addr, rom->datasize);

         trace_loader_write_rom(rom->name, rom->addr, rom->datasize, rom->isrom);
+
+        notify = (RomLoaderNotify) {
+            .addr = rom->addr,
+            .len = rom->datasize,
+            .max_len = rom->romsize,
+        };
+        notifier_list_notify(&rom_loader_notifier, &notify);
     }
 }

-- 
2.44.0
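The notifier pattern used by this patch can be sketched without any QEMU headers. Everything prefixed Demo below is hypothetical and only imitates the shape of Notifier, NotifierList and RomLoaderNotify; in particular, the real QEMU notifier list is not a hand-rolled singly linked list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for QEMU's Notifier and NotifierList. */
typedef struct DemoNotifier DemoNotifier;
struct DemoNotifier {
    void (*notify)(DemoNotifier *n, void *data);
    DemoNotifier *next;
};

typedef struct {
    DemoNotifier *head;
} DemoNotifierList;

/* Same fields as the RomLoaderNotify structure added by the patch. */
typedef struct {
    uint64_t addr;
    size_t len;
    size_t max_len;
} DemoRomLoaderNotify;

static void demo_list_add(DemoNotifierList *list, DemoNotifier *n)
{
    assert(list != NULL && n != NULL && n->notify != NULL);
    n->next = list->head;
    list->head = n;
}

/* Invoke every registered callback with the same payload. */
static void demo_list_notify(DemoNotifierList *list, void *data)
{
    for (DemoNotifier *n = list->head; n != NULL; n = n->next) {
        n->notify(n, data);
    }
}

/* Example consumer: counts blobs and bytes, as a measurer might. */
typedef struct {
    DemoNotifier notifier;   /* must stay the first member for the cast */
    size_t total_len;
    unsigned blobs;
} DemoMeasurer;

static void demo_measure(DemoNotifier *n, void *data)
{
    DemoMeasurer *m = (DemoMeasurer *)n;
    const DemoRomLoaderNotify *rom = data;

    m->total_len += rom->len;
    m->blobs++;
}
```

The loader side only needs to fill one payload per blob and call the notify function; consumers such as the RME code stay decoupled from the loader internals.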
[PATCH v2 21/22] hw/arm/virt: Move virt_flash_create() to machvirt_init()
For confidential VMs we'll want to skip flash device creation.
Unfortunately, in virt_instance_init() the machine->cgs member has not
yet been initialized, so we cannot check whether confidential guest is
enabled. Move virt_flash_create() to machvirt_init(), where we can
access the machine->cgs member.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: new
---
 hw/arm/virt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index eca9a96b5a..bed19d0b79 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2071,6 +2071,8 @@ static void machvirt_init(MachineState *machine)
     unsigned int smp_cpus = machine->smp.cpus;
     unsigned int max_cpus = machine->smp.max_cpus;

+    virt_flash_create(vms);
+
     possible_cpus = mc->possible_cpu_arch_ids(machine);

     /*
@@ -3229,8 +3231,6 @@ static void virt_instance_init(Object *obj)

     vms->irqmap = a15irqmap;

-    virt_flash_create(vms);
-
     vms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
     vms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
 }
-- 
2.44.0
[PATCH v2 12/22] target/arm/kvm-rme: Populate Realm memory
Collect the images copied into guest RAM into a sorted list, and issue
POPULATE_REALM KVM ioctls once we've created the Realm Descriptor. The
images are part of the Realm Initial Measurement.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: Use a ROM loader notifier
---
 target/arm/kvm-rme.c | 97
 1 file changed, 97 insertions(+)

diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index aa9c3b5551..bee6694d6d 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -9,9 +9,11 @@
 #include "exec/confidential-guest-support.h"
 #include "hw/boards.h"
 #include "hw/core/cpu.h"
+#include "hw/loader.h"
 #include "kvm_arm.h"
 #include "migration/blocker.h"
 #include "qapi/error.h"
+#include "qemu/error-report.h"
 #include "qom/object_interfaces.h"
 #include "sysemu/kvm.h"
 #include "sysemu/runstate.h"
@@ -19,10 +21,21 @@
 #define TYPE_RME_GUEST "rme-guest"
 OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
+#define RME_PAGE_SIZE qemu_real_host_page_size()
+
 struct RmeGuest {
     ConfidentialGuestSupport parent_obj;
+    Notifier rom_load_notifier;
+    GSList *ram_regions;
 };
 
+typedef struct {
+    hwaddr base;
+    hwaddr len;
+    /* Populate guest RAM with data, or only initialize the IPA range */
+    bool populate;
+} RmeRamRegion;
+
 static RmeGuest *rme_guest;
 
 bool kvm_arm_rme_enabled(void)
@@ -41,6 +54,41 @@ static int rme_create_rd(Error **errp)
     return ret;
 }
 
+static void rme_populate_realm(gpointer data, gpointer unused)
+{
+    int ret;
+    const RmeRamRegion *region = data;
+
+    if (region->populate) {
+        struct kvm_cap_arm_rme_populate_realm_args populate_args = {
+            .populate_ipa_base = region->base,
+            .populate_ipa_size = region->len,
+            .flags = KVM_ARM_RME_POPULATE_FLAGS_MEASURE,
+        };
+        ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+                                KVM_CAP_ARM_RME_POPULATE_REALM,
+                                (intptr_t)&populate_args);
+        if (ret) {
+            error_report("RME: failed to populate realm (0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx"): %s",
+                         region->base, region->len, strerror(-ret));
+            exit(1);
+        }
+    } else {
+        struct kvm_cap_arm_rme_init_ipa_args init_args = {
+            .init_ipa_base = region->base,
+            .init_ipa_size = region->len,
+        };
+        ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+                                KVM_CAP_ARM_RME_INIT_IPA_REALM,
+                                (intptr_t)&init_args);
+        if (ret) {
+            error_report("RME: failed to initialize GPA range (0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx"): %s",
+                         region->base, region->len, strerror(-ret));
+            exit(1);
+        }
+    }
+}
+
 static void rme_vm_state_change(void *opaque, bool running, RunState state)
 {
     int ret;
@@ -55,6 +103,9 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state)
         return;
     }
 
+    g_slist_foreach(rme_guest->ram_regions, rme_populate_realm, NULL);
+    g_slist_free_full(g_steal_pointer(&rme_guest->ram_regions), g_free);
+
     /*
      * Now that do_cpu_reset() initialized the boot PC and
      * kvm_cpu_synchronize_post_reset() registered it, we can finalize the REC.
@@ -75,6 +126,49 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state)
     }
 }
 
+static gint rme_compare_ram_regions(gconstpointer a, gconstpointer b)
+{
+    const RmeRamRegion *ra = a;
+    const RmeRamRegion *rb = b;
+
+    g_assert(ra->base != rb->base);
+    return ra->base < rb->base ? -1 : 1;
+}
+
+static void rme_add_ram_region(hwaddr base, hwaddr len, bool populate)
+{
+    RmeRamRegion *region;
+
+    region = g_new0(RmeRamRegion, 1);
+    region->base = QEMU_ALIGN_DOWN(base, RME_PAGE_SIZE);
+    region->len = QEMU_ALIGN_UP(len, RME_PAGE_SIZE);
+    region->populate = populate;
+
+    /*
+     * The Realm Initial Measurement (RIM) depends on the order in which we
+     * initialize and populate the RAM regions. To help a verifier
+     * independently calculate the RIM, sort regions by GPA.
+     */
+    rme_guest->ram_regions = g_slist_insert_sorted(rme_guest->ram_regions,
+                                                   region,
+                                                   rme_compare_ram_regions);
+}
+
+static void rme_rom_load_notify(Notifier *notifier, void *data)
+{
+    RomLoaderNotify *rom = data;
+
+    if (rom->addr == -1) {
+        /*
+         * These blobs (ACPI tables) are not loaded into guest RAM at reset.
+         * Instead the firmware will load them via fw_cfg and measure them
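To illustrate the sorted-region bookkeeping described in this patch, here is a minimal standalone sketch in plain C. It does not use GLib or any QEMU API: `Region`, `region_insert_sorted()` and the fixed 4 KiB `PAGE_SIZE` are names and values assumed for illustration only (the patch itself uses the host page size and a GSList). Like the patch, it rounds the base down and the length up independently.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption for this sketch: a fixed 4 KiB page. */
#define PAGE_SIZE 4096ULL

/* Hypothetical stand-in for the patch's RmeRamRegion plus list linkage. */
typedef struct Region {
    uint64_t base;
    uint64_t len;
    bool populate;
    struct Region *next;
} Region;

static uint64_t align_down(uint64_t v, uint64_t a)
{
    return v & ~(a - 1);
}

static uint64_t align_up(uint64_t v, uint64_t a)
{
    return (v + a - 1) & ~(a - 1);
}

/*
 * Insert a page-aligned region so that the list stays ordered by base
 * address. Populating in address order gives a verifier a deterministic
 * sequence to replay when recomputing the measurement.
 */
static Region *region_insert_sorted(Region *head, uint64_t base,
                                    uint64_t len, bool populate)
{
    Region *r = calloc(1, sizeof(*r));
    Region **link = &head;

    if (!r) {
        abort();
    }
    r->base = align_down(base, PAGE_SIZE);
    r->len = align_up(len, PAGE_SIZE);
    r->populate = populate;

    /* Walk to the first entry whose base is not below the new one. */
    while (*link && (*link)->base < r->base) {
        link = &(*link)->next;
    }
    r->next = *link;
    *link = r;
    return head;
}
```

Regions can be added in any order; walking the list from the head then visits them by increasing base address, which is the order in which the populate or init-IPA calls would be issued.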
[PATCH v2 17/22] target/arm/cpu: Set number of PMU counters in KVM
Add a "num-pmu-counters" CPU parameter to configure the number of
counters that KVM presents to the guest. This is needed for Realm VMs,
whose parameters include the number of PMU counters and influence the
Realm Initial Measurement.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: new
---
 target/arm/cpu.h          |  3 +++
 target/arm/kvm_arm.h      |  1 +
 target/arm/arm-qmp-cmds.c |  2 +-
 target/arm/cpu64.c        | 41 +++
 target/arm/kvm.c          | 34 +++-
 5 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 24080da2b7..84f3a67dab 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1093,6 +1093,7 @@ struct ArchCPU {
     /* Allows to override the default configuration */
     uint8_t num_bps;
     uint8_t num_wps;
+    int8_t num_pmu_ctrs;
 };
 
 typedef struct ARMCPUInfo {
@@ -2312,6 +2313,8 @@ FIELD(MFAR, FPA, 12, 40)
 FIELD(MFAR, NSE, 62, 1)
 FIELD(MFAR, NS, 63, 1)
 
+FIELD(PMCR, N, 11, 5)
+
 QEMU_BUILD_BUG_ON(ARRAY_SIZE(((ARMCPU *)0)->ccsidr) <= R_V7M_CSSELR_INDEX_MASK);
 
 /* If adding a feature bit which corresponds to a Linux ELF
diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index b040686eab..62e39e7184 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -17,6 +17,7 @@
 #define KVM_ARM_VGIC_V3 (1 << 1)
 
 #define KVM_REG_ARM_ID_AA64DFR0_EL1 ARM64_SYS_REG(3, 0, 0, 5, 0)
+#define KVM_REG_ARM_PMCR_EL0 ARM64_SYS_REG(3, 3, 9, 12, 0)
 
 /**
  * kvm_arm_register_device:
diff --git a/target/arm/arm-qmp-cmds.c b/target/arm/arm-qmp-cmds.c
index 0f574bb1dd..985d4270b8 100644
--- a/target/arm/arm-qmp-cmds.c
+++ b/target/arm/arm-qmp-cmds.c
@@ -95,7 +95,7 @@ static const char *cpu_model_advertised_features[] = {
     "sve1408", "sve1536", "sve1664", "sve1792", "sve1920", "sve2048",
     "kvm-no-adjvtime", "kvm-steal-time",
     "pauth", "pauth-impdef", "pauth-qarma3",
-    "num-breakpoints", "num-watchpoints",
+    "num-breakpoints", "num-watchpoints", "num-pmu-counters",
     NULL
 };
 
diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 9ca74eb019..6c2b922d93 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -638,12 +638,53 @@ static void arm_cpu_set_num_bps(Object *obj, Visitor *v, const char *name,
     cpu->num_bps = val;
 }
 
+static void arm_cpu_get_num_pmu_ctrs(Object *obj, Visitor *v, const char *name,
+                                     void *opaque, Error **errp)
+{
+    uint8_t val;
+    ARMCPU *cpu = ARM_CPU(obj);
+
+    if (cpu->num_pmu_ctrs == -1) {
+        val = FIELD_EX64(cpu->isar.reset_pmcr_el0, PMCR, N);
+    } else {
+        val = cpu->num_pmu_ctrs;
+    }
+
+    visit_type_uint8(v, name, &val, errp);
+}
+
+static void arm_cpu_set_num_pmu_ctrs(Object *obj, Visitor *v, const char *name,
+                                     void *opaque, Error **errp)
+{
+    uint8_t val;
+    ARMCPU *cpu = ARM_CPU(obj);
+    uint8_t max_ctrs = FIELD_EX64(cpu->isar.reset_pmcr_el0, PMCR, N);
+
+    if (!visit_type_uint8(v, name, &val, errp)) {
+        return;
+    }
+
+    if (val > max_ctrs) {
+        error_setg(errp, "invalid number of PMU counters");
+        return;
+    }
+
+    cpu->num_pmu_ctrs = val;
+}
+
 static void aarch64_add_kvm_writable_properties(Object *obj)
 {
+    ARMCPU *cpu = ARM_CPU(obj);
+
     object_property_add(obj, "num-breakpoints", "uint8", arm_cpu_get_num_bps,
                         arm_cpu_set_num_bps, NULL, NULL);
     object_property_add(obj, "num-watchpoints", "uint8", arm_cpu_get_num_wps,
                         arm_cpu_set_num_wps, NULL, NULL);
+
+    cpu->num_pmu_ctrs = -1;
+    object_property_add(obj, "num-pmu-counters", "uint8",
+                        arm_cpu_get_num_pmu_ctrs, arm_cpu_set_num_pmu_ctrs,
+                        NULL, NULL);
 }
 #endif /* CONFIG_KVM */
 
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 623980a25b..9855cadb1b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -418,7 +418,7 @@ static bool kvm_arm_get_host_cpu_features(ARMHostCPUFeatures *ahcf)
     if (pmu_supported) {
         /* PMCR_EL0 is only accessible if the vCPU has feature PMU_V3 */
         err |= read_sys_reg64(fdarray[2], &ahcf->isar.reset_pmcr_el0,
-                              ARM64_SYS_REG(3, 3, 9, 12, 0));
+                              KVM_REG_ARM_PMCR_EL0);
     }
 
     if (sve_supported) {
@@ -919,9 +919,41 @@ static void kvm_arm_configure_aa64dfr0(ARMCPU *cpu)
     }
 }
 
+static void kvm_arm_configure_pmcr(ARMCPU *cpu)
+{
+    int ret;
+    uint64_t val, newval;
+    CPUState *cs = CPU(cpu);
+
+    if (cpu->num_pmu_ctrs == -1) {
+        return;
+    }
+
+    newval
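As a rough illustration of what the FIELD(PMCR, N, 11, 5) definition means for the property setter, the sketch below extracts and replaces the N field (bits [15:11]) of a PMCR_EL0 value and rejects requests above the host's count. It is standalone C, not QEMU code: `pmcr_get_n()`, `pmcr_set_n()` and `pmcr_request_counters()` are hypothetical helper names, and the remainder of the truncated kvm_arm_configure_pmcr() above is not reproduced or implied here.

```c
#include <stdint.h>

/* PMCR_EL0.N occupies bits [15:11], matching FIELD(PMCR, N, 11, 5). */
#define PMCR_N_SHIFT  11
#define PMCR_N_LENGTH 5
#define PMCR_N_MASK   (((UINT64_C(1) << PMCR_N_LENGTH) - 1) << PMCR_N_SHIFT)

/* Extract the number of event counters from a PMCR_EL0 value. */
static unsigned pmcr_get_n(uint64_t pmcr)
{
    return (unsigned)((pmcr & PMCR_N_MASK) >> PMCR_N_SHIFT);
}

/* Return pmcr with the N field replaced, leaving all other bits intact. */
static uint64_t pmcr_set_n(uint64_t pmcr, unsigned n)
{
    return (pmcr & ~PMCR_N_MASK) |
           (((uint64_t)n << PMCR_N_SHIFT) & PMCR_N_MASK);
}

/*
 * Hypothetical helper: validate a requested counter count against the
 * host value, as the property setter does, and compute the new register
 * value. Returns 0 on success, -1 if the request exceeds the host count.
 */
static int pmcr_request_counters(uint64_t host_pmcr, unsigned requested,
                                 uint64_t *out)
{
    if (requested > pmcr_get_n(host_pmcr)) {
        return -1;
    }
    *out = pmcr_set_n(host_pmcr, requested);
    return 0;
}
```

A caller would read the host's reset PMCR_EL0 once, then call pmcr_request_counters() with the user's value; a lower count than the host's is accepted, a higher one is refused.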
[PATCH v2 20/22] target/arm/kvm-rme: Enable guest memfd
Request that RAM block uses the KVM guest memfd call to allocate guest
memory. With RME, guest memory is not accessible by the host, and using
guest memfd ensures that the host kernel is aware of this and doesn't
attempt to access guest pages.

Done in a separate patch because ms->require_guest_memfd is not yet
merged.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: new
---
 target/arm/kvm-rme.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 8f39e54aaa..71cc1d4147 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -263,6 +263,7 @@ int kvm_arm_rme_init(MachineState *ms)
     rme_guest->rom_load_notifier.notify = rme_rom_load_notify;
     rom_add_load_notifier(&rme_guest->rom_load_notifier);
 
+    ms->require_guest_memfd = true;
     cgs->ready = true;
     return 0;
 }
--
2.44.0
[PATCH v2 22/22] hw/arm/virt: Use RAM instead of flash for confidential guest firmware
The flash device that holds firmware code relies on read-only stage-2
mappings. Read accesses behave as RAM and write accesses as MMIO. Since
the RMM does not support read-only mappings we cannot use the flash
device as-is.

That isn't a problem because the firmware does not want to disclose any
information to the host, hence will not store its variables in clear
persistent memory. We can therefore replace the flash device with RAM,
and load the firmware there.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: new
---
 include/hw/arm/boot.h |  9 +
 hw/arm/boot.c         | 34 ++---
 hw/arm/virt.c         | 44 +++
 3 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/include/hw/arm/boot.h b/include/hw/arm/boot.h
index 80c492d742..d91cfc6942 100644
--- a/include/hw/arm/boot.h
+++ b/include/hw/arm/boot.h
@@ -112,6 +112,10 @@ struct arm_boot_info {
      */
     bool firmware_loaded;
 
+    /* Used when loading firmware into RAM */
+    hwaddr firmware_base;
+    hwaddr firmware_max_size;
+
     /* Address at which board specific loader/setup code exists. If enabled,
      * this code-blob will run before anything else. It must return to the
      * caller via the link register. There is no stack set up. Enabled by
@@ -132,6 +136,11 @@ struct arm_boot_info {
     bool secure_board_setup;
 
     arm_endianness endianness;
+
+    /*
+     * Confidential guest boot loads everything into RAM so it can be measured.
+     */
+    bool confidential;
 };
 
 /**
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 9f522e332b..26c6334d52 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -1154,7 +1154,31 @@ static void arm_setup_direct_kernel_boot(ARMCPU *cpu,
     }
 }
 
-static void arm_setup_firmware_boot(ARMCPU *cpu, struct arm_boot_info *info)
+static void arm_setup_confidential_firmware_boot(ARMCPU *cpu,
+                                                 struct arm_boot_info *info,
+                                                 const char *firmware_filename)
+{
+    ssize_t fw_size;
+    const char *fname;
+    AddressSpace *as = arm_boot_address_space(cpu, info);
+
+    fname = qemu_find_file(QEMU_FILE_TYPE_BIOS, firmware_filename);
+    if (!fname) {
+        error_report("Could not find firmware image '%s'", firmware_filename);
+        exit(1);
+    }
+
+    fw_size = load_image_targphys_as(firmware_filename,
+                                     info->firmware_base,
+                                     info->firmware_max_size, as);
+    if (fw_size <= 0) {
+        error_report("could not load firmware '%s'", firmware_filename);
+        exit(1);
+    }
+}
+
+static void arm_setup_firmware_boot(ARMCPU *cpu, struct arm_boot_info *info,
+                                    const char *firmware_filename)
 {
     /* Set up for booting firmware (which might load a kernel via fw_cfg) */
 
@@ -1205,6 +1229,10 @@ static void arm_setup_firmware_boot(ARMCPU *cpu, struct arm_boot_info *info)
         }
     }
 
+    if (info->confidential) {
+        arm_setup_confidential_firmware_boot(cpu, info, firmware_filename);
+    }
+
     /*
      * We will start from address 0 (typically a boot ROM image) in the
      * same way as hardware. Leave env->boot_info NULL, so that
@@ -1243,9 +1271,9 @@ void arm_load_kernel(ARMCPU *cpu, MachineState *ms, struct arm_boot_info *info)
     info->dtb_filename = ms->dtb;
     info->dtb_limit = 0;
 
-    /* Load the kernel. */
+    /* Load the kernel and/or firmware. */
     if (!info->kernel_filename || info->firmware_loaded) {
-        arm_setup_firmware_boot(cpu, info);
+        arm_setup_firmware_boot(cpu, info, ms->firmware);
     } else {
         arm_setup_direct_kernel_boot(cpu, info);
     }
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index bed19d0b79..4a6281fc89 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1178,6 +1178,10 @@ static PFlashCFI01 *virt_flash_create1(VirtMachineState *vms,
 
 static void virt_flash_create(VirtMachineState *vms)
 {
+    if (virt_machine_is_confidential(vms)) {
+        return;
+    }
+
     vms->flash[0] = virt_flash_create1(vms, "virt.flash0", "pflash0");
     vms->flash[1] = virt_flash_create1(vms, "virt.flash1", "pflash1");
 }
@@ -1213,6 +1217,10 @@ static void virt_flash_map(VirtMachineState *vms,
     hwaddr flashsize = vms->memmap[VIRT_FLASH].size / 2;
     hwaddr flashbase = vms->memmap[VIRT_FLASH].base;
 
+    if (virt_machine_is_confidential(vms)) {
+        return;
+    }
+
     virt_flash_map1(vms->flash[0], flashbase, flashsize,
                     secure_sysmem);
     virt_flash_map1(vms->flash[1], flashbase + flashsize, flashsize,
@@ -1228,6 +1236,10 @@ static void virt_flash_fdt(VirtMachineState *vms,
     MachineState *ms = MACHINE(vms);
     char *nodename;
 
+    if (virt_machine_is_confidential(vms)) {
+        ret
[PATCH v2 00/22] arm: Run CCA VMs with KVM
These patches enable launching a confidential guest with QEMU KVM on Arm.
The KVM changes for CCA have now been posted as v2 [1]. Launching a
confidential VM requires two additional command-line parameters:

  -M confidential-guest-support=rme0
  -object rme-guest,id=rme0

Since the RFC [2] I tried to address all review comments, and added a
few features:

* Enabled support for guest memfd by Xiaoyao Li and Chao Peng [3].
  Guest memfd is mandatory for CCA.

* Support firmware boot (edk2).

* Use CPU command-line arguments for Realm parameters. SVE vector
  length uses the existing sve -cpu parameters, while breakpoints,
  watchpoints and PMU counters use new CPU parameters.

The full series based on the memfd patches is at:
https://git.codelinaro.org/linaro/dcap/qemu.git branch cca/v2

Please find instructions for building and running the whole CCA stack at:
https://linaro.atlassian.net/wiki/spaces/QEMU/pages/29051027459/Building+an+RME+stack+for+QEMU

[1] https://lore.kernel.org/kvm/20240412084056.1733704-1-steven.pr...@arm.com/
[2] https://lore.kernel.org/all/20230127150727.612594-1-jean-phili...@linaro.org/
[3] https://lore.kernel.org/qemu-devel/20240322181116.1228416-1-pbonz...@redhat.com/

Jean-Philippe Brucker (22):
  kvm: Merge kvm_check_extension() and kvm_vm_check_extension()
  target/arm: Add confidential guest support
  target/arm/kvm: Return immediately on error in kvm_arch_init()
  target/arm/kvm-rme: Initialize realm
  hw/arm/virt: Add support for Arm RME
  hw/arm/virt: Disable DTB randomness for confidential VMs
  hw/arm/virt: Reserve one bit of guest-physical address for RME
  target/arm/kvm: Split kvm_arch_get/put_registers
  target/arm/kvm-rme: Initialize vCPU
  target/arm/kvm: Create scratch VM as Realm if necessary
  hw/core/loader: Add ROM loader notifier
  target/arm/kvm-rme: Populate Realm memory
  hw/arm/boot: Register Linux BSS section for confidential guests
  target/arm/kvm-rme: Add Realm Personalization Value parameter
  target/arm/kvm-rme: Add measurement algorithm property
  target/arm/cpu: Set number of breakpoints and watchpoints in KVM
  target/arm/cpu: Set number of PMU counters in KVM
  target/arm/kvm: Disable Realm reboot
  target/arm/cpu: Inform about reading confidential CPU registers
  target/arm/kvm-rme: Enable guest memfd
  hw/arm/virt: Move virt_flash_create() to machvirt_init()
  hw/arm/virt: Use RAM instead of flash for confidential guest firmware

 docs/system/arm/virt.rst                   |   9 +-
 docs/system/confidential-guest-support.rst |   1 +
 qapi/qom.json                              |  34 +-
 include/hw/arm/boot.h                      |   9 +
 include/hw/arm/virt.h                      |   2 +-
 include/hw/loader.h                        |  15 +
 include/sysemu/kvm.h                       |   2 -
 include/sysemu/kvm_int.h                   |   1 +
 target/arm/cpu.h                           |  10 +
 target/arm/kvm_arm.h                       |  25 ++
 accel/kvm/kvm-all.c                        |  34 +-
 hw/arm/boot.c                              |  45 ++-
 hw/arm/virt.c                              | 118 --
 hw/core/loader.c                           |  15 +
 target/arm/arm-qmp-cmds.c                  |   1 +
 target/arm/cpu.c                           |   5 +
 target/arm/cpu64.c                         | 118 ++
 target/arm/kvm-rme.c                       | 413 +
 target/arm/kvm.c                           | 200 +-
 target/i386/kvm/kvm.c                      |   6 +-
 target/ppc/kvm.c                           |  36 +-
 target/arm/meson.build                     |   7 +-
 22 files changed, 1023 insertions(+), 83 deletions(-)
 create mode 100644 target/arm/kvm-rme.c

--
2.44.0
[PATCH v2 03/22] target/arm/kvm: Return immediately on error in kvm_arch_init()
Returning an error to kvm_init() is fatal anyway, no need to continue
the initialization.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: new
---
 target/arm/kvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 3371ffa401..a5673241e5 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -566,7 +566,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         !kvm_check_extension(s, KVM_CAP_ARM_IRQ_LINE_LAYOUT_2)) {
         error_report("Using more than 256 vcpus requires a host kernel "
                      "with KVM_CAP_ARM_IRQ_LINE_LAYOUT_2");
-        ret = -EINVAL;
+        return -EINVAL;
     }
 
     if (kvm_check_extension(s, KVM_CAP_ARM_NISV_TO_USER)) {
@@ -595,6 +595,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
             if (ret < 0) {
                 error_report("Enabling of Eager Page Split failed: %s",
                              strerror(-ret));
+                return ret;
             }
         }
     }
--
2.44.0
[PATCH v2 01/22] kvm: Merge kvm_check_extension() and kvm_vm_check_extension()
The KVM_CHECK_EXTENSION ioctl can be issued either on the global fd
(/dev/kvm), or on the VM fd obtained with KVM_CREATE_VM. For most
extensions, KVM returns the same value with either method, but for some
of them it can refine the returned value depending on the VM type. The
KVM documentation [1] advises to use the VM fd:

  Based on their initialization different VMs may have different
  capabilities. It is thus encouraged to use the vm ioctl to query for
  capabilities (available with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)

Ongoing work on Arm confidential VMs confirms this, as some capabilities
become unavailable to confidential VMs, requiring changes in QEMU to use
kvm_vm_check_extension() instead of kvm_check_extension() [2]. Rather
than changing each check one by one, change kvm_check_extension() to
always issue the ioctl on the VM fd when available, and remove
kvm_vm_check_extension().

Fall back to the global fd when the VM check is unavailable:

* Ancient kernels do not support KVM_CHECK_EXTENSION on the VM fd, since
  it was added by commit 92b591a4c46b ("KVM: Allow KVM_CHECK_EXTENSION
  on the vm fd") in Linux 3.17 [3]. Support for Linux 3.16 ended in June
  2020, but there may still be old images around.

* A couple of calls must be issued before the VM fd is available, since
  they determine the VM type: KVM_CAP_MIPS_VZ and KVM_CAP_ARM_VM_IPA_SIZE

Does any user actually depend on the check being done on the global fd
instead of the VM fd? I surveyed all cases where KVM presently returns
different values depending on the query method. Luckily QEMU already
calls kvm_vm_check_extension() for most of those. Only three of them are
ambiguous, because currently done on the global fd:

* KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_VCPU_ID on Arm, changes value if the
  user requests a vGIC different from the default. But QEMU queries this
  before vGIC configuration, so the reported value will be the same.

* KVM_CAP_SW_TLB on PPC. When issued on the global fd, returns false if
  the kvm-hv module is loaded; when issued on the VM fd, returns false
  only if the VM type is HV instead of PR. If this returns false, then
  QEMU will fail to initialize a BOOKE206 MMU model.

  So this patch supposedly improves things, as it allows to run this
  type of vCPU even when both KVM modules are loaded.

* KVM_CAP_PPC_SECURE_GUEST. Similarly, doing this check on a VM fd
  refines the returned value, and ensures that SVM is actually
  supported. Since QEMU follows the check with kvm_vm_enable_cap(), this
  patch should only provide better error reporting.

[1] https://www.kernel.org/doc/html/latest/virt/kvm/api.html#kvm-check-extension
[2] https://lore.kernel.org/kvm/875ybi0ytc@redhat.com/
[3] https://github.com/torvalds/linux/commit/92b591a4c46b

Cc: Marcelo Tosatti
Cc: Nicholas Piggin
Cc: Daniel Henrique Barboza
Cc: qemu-...@nongnu.org
Suggested-by: Cornelia Huck
Signed-off-by: Jean-Philippe Brucker
---
v1: https://lore.kernel.org/qemu-devel/20230421163822.839167-1-jean-phili...@linaro.org/
v1->v2: Initialize check_extension_vm using kvm_vm_ioctl() as suggested
---
 include/sysemu/kvm.h     |  2 --
 include/sysemu/kvm_int.h |  1 +
 accel/kvm/kvm-all.c      | 34 +++---
 target/arm/kvm.c         |  2 +-
 target/i386/kvm/kvm.c    |  6 +++---
 target/ppc/kvm.c         | 36 ++--
 6 files changed, 38 insertions(+), 43 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index c6f34d4794..df97077434 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -404,8 +404,6 @@ bool kvm_arch_stop_on_emulation_error(CPUState *cpu);
 
 int kvm_check_extension(KVMState *s, unsigned int extension);
 
-int kvm_vm_check_extension(KVMState *s, unsigned int extension);
-
 #define kvm_vm_enable_cap(s, capability, cap_flags, ...)             \
     ({                                                               \
         struct kvm_enable_cap cap = {                                \
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index cad763e240..fa4c9aeb96 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -123,6 +123,7 @@ struct KVMState
     uint16_t xen_gnttab_max_frames;
     uint16_t xen_evtchn_max_pirq;
     char *device;
+    bool check_extension_vm;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index e08dd04164..3d9fbc8a98 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1128,7 +1128,11 @@ int kvm_check_extension(KVMState *s, unsigned int extension)
 {
     int ret;
 
-    ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+    if (!s->check_extension_vm) {
+        ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+    } else {
+        ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+    }
     if (ret < 0) {
         ret = 0;
     }
@@ -1136,19 +1140,6 @@ int
[PATCH v2 14/22] target/arm/kvm-rme: Add Realm Personalization Value parameter
The Realm Personalization Value (RPV) is provided by the user to
distinguish Realms that have the same initial measurement.

The user provides up to 64 hexadecimal bytes. They are stored into the
RPV in the same order, zero-padded on the right.

Cc: Eric Blake
Cc: Markus Armbruster
Cc: Daniel P. Berrangé
Cc: Eduardo Habkost
Signed-off-by: Jean-Philippe Brucker
---
v1->v2: Move parsing early, store as-is rather than reverted
---
 qapi/qom.json        |  15 +-
 target/arm/kvm-rme.c | 111 +++
 2 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 623ec8071f..91654aa267 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -931,6 +931,18 @@
   'data': { '*cpu-affinity': ['uint16'],
             '*node-affinity': ['uint16'] } }
 
+##
+# @RmeGuestProperties:
+#
+# Properties for rme-guest objects.
+#
+# @personalization-value: Realm personalization value, as a 64-byte hex string
+#     (default: 0)
+#
+# Since: FIXME
+##
+{ 'struct': 'RmeGuestProperties',
+  'data': { '*personalization-value': 'str' } }
 
 ##
 # @ObjectType:
@@ -1066,7 +1078,8 @@
       'tls-creds-x509': 'TlsCredsX509Properties',
       'tls-cipher-suites': 'TlsCredsProperties',
       'x-remote-object': 'RemoteObjectProperties',
-      'x-vfio-user-server': 'VfioUserServerProperties'
+      'x-vfio-user-server': 'VfioUserServerProperties',
+      'rme-guest': 'RmeGuestProperties'
   } }
 
 ##
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index b2ad10ef6d..cb5c3f7a22 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -23,10 +23,13 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
 #define RME_PAGE_SIZE qemu_real_host_page_size()
 
+#define RME_MAX_CFG 1
+
 struct RmeGuest {
     ConfidentialGuestSupport parent_obj;
     Notifier rom_load_notifier;
     GSList *ram_regions;
+    uint8_t *personalization_value;
 };
 
 typedef struct {
@@ -54,6 +57,48 @@ static int rme_create_rd(Error **errp)
     return ret;
 }
 
+static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp)
+{
+    int ret;
+    const char *cfg_str;
+    struct kvm_cap_arm_rme_config_item args = {
+        .cfg = cfg,
+    };
+
+    switch (cfg) {
+    case KVM_CAP_ARM_RME_CFG_RPV:
+        if (!guest->personalization_value) {
+            return 0;
+        }
+        memcpy(args.rpv, guest->personalization_value, KVM_CAP_ARM_RME_RPV_SIZE);
+        cfg_str = "personalization value";
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+                            KVM_CAP_ARM_RME_CONFIG_REALM, (intptr_t)&args);
+    if (ret) {
+        error_setg_errno(errp, -ret, "RME: failed to configure %s", cfg_str);
+    }
+    return ret;
+}
+
+static int rme_configure(void)
+{
+    int ret;
+    int cfg;
+
+    for (cfg = 0; cfg < RME_MAX_CFG; cfg++) {
+        ret = rme_configure_one(rme_guest, cfg, &error_abort);
+        if (ret) {
+            return ret;
+        }
+    }
+    return 0;
+}
+
 static void rme_populate_realm(gpointer data, gpointer unused)
 {
     int ret;
@@ -98,6 +143,11 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state)
         return;
     }
 
+    ret = rme_configure();
+    if (ret) {
+        return;
+    }
+
     ret = rme_create_rd(&error_abort);
     if (ret) {
         return;
@@ -231,8 +281,69 @@ int kvm_arm_rme_vm_type(MachineState *ms)
     return 0;
 }
 
+static char *rme_get_rpv(Object *obj, Error **errp)
+{
+    RmeGuest *guest = RME_GUEST(obj);
+    GString *s;
+    int i;
+
+    if (!guest->personalization_value) {
+        return NULL;
+    }
+
+    s = g_string_sized_new(KVM_CAP_ARM_RME_RPV_SIZE * 2 + 1);
+
+    for (i = 0; i < KVM_CAP_ARM_RME_RPV_SIZE; i++) {
+        g_string_append_printf(s, "%02x", guest->personalization_value[i]);
+    }
+
+    return g_string_free(s, /* free_segment */ false);
+}
+
+static void rme_set_rpv(Object *obj, const char *value, Error **errp)
+{
+    RmeGuest *guest = RME_GUEST(obj);
+    size_t len = strlen(value);
+    uint8_t *out;
+    int i = 1;
+    int ret;
+
+    g_free(guest->personalization_value);
+    guest->personalization_value = out = g_malloc0(KVM_CAP_ARM_RME_RPV_SIZE);
+
+    /* Two chars per byte */
+    if (len > KVM_CAP_ARM_RME_RPV_SIZE * 2) {
+        error_setg(errp, "Realm Personalization Value is too large");
+        return;
+    }
+
+    /* First byte may have a single char */
+    if (len % 2) {
+        ret = sscanf(value, "%1hhx", out++);
+    } else {
+        ret = sscanf(value, "%2hhx", out++);
+        i++;
+    }
+    if (ret != 1) {
+        error_setg(errp, "Invalid Realm Personalization Value");
+        return;
+    }
+
+    for (; i < len; i += 2) {
+        ret = sscanf(value + i,
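The parsing rule stated in the commit message (up to 64 bytes, stored in the order given, zero-padded on the right, with a lone leading digit allowed when the string has odd length) can be sketched in standalone C as follows. This is not the patch's implementation: it replaces sscanf() with a manual digit conversion, and `RPV_SIZE`, `hex_digit()` and `parse_rpv()` are names assumed for illustration. Rejecting an empty string is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: the RPV is 64 bytes, as the commit says. */
#define RPV_SIZE 64

/* Convert one hexadecimal character to its value, or -1 if invalid. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') {
        return c - '0';
    }
    if (c >= 'a' && c <= 'f') {
        return c - 'a' + 10;
    }
    if (c >= 'A' && c <= 'F') {
        return c - 'A' + 10;
    }
    return -1;
}

/*
 * Hypothetical helper: parse a hex string into a zero-padded RPV buffer.
 * Bytes are stored in the order given. When the string has an odd number
 * of digits, the first byte is built from a single digit.
 * Returns 0 on success, -1 on an empty, oversized or malformed string.
 */
static int parse_rpv(const char *hex, uint8_t out[RPV_SIZE])
{
    size_t len = strlen(hex);
    size_t i = 0, o = 0;

    memset(out, 0, RPV_SIZE);
    if (len == 0 || len > RPV_SIZE * 2) {
        return -1;
    }

    if (len % 2) {
        int d = hex_digit(hex[0]);
        if (d < 0) {
            return -1;
        }
        out[o++] = (uint8_t)d;
        i = 1;
    }

    for (; i < len; i += 2) {
        int hi = hex_digit(hex[i]);
        int lo = hex_digit(hex[i + 1]);
        if (hi < 0 || lo < 0) {
            return -1;
        }
        out[o++] = (uint8_t)((hi << 4) | lo);
    }
    return 0;
}
```

With this rule, the string "abc" yields the bytes 0x0a, 0xbc followed by 62 zero bytes, so a short value occupies the start of the buffer rather than the end.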
[PATCH v2 05/22] hw/arm/virt: Add support for Arm RME
When confidential-guest-support is enabled for the virt machine, call
the RME init function, and add the RME flag to the VM type.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2:
* Don't explicitly disable steal_time, it's now done through KVM
  capabilities
* Split patch
---
 hw/arm/virt.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index a9a913aead..07ad31876e 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -224,6 +224,11 @@ static const int a15irqmap[] = {
     [VIRT_PLATFORM_BUS] = 112, /* ...to 112 + PLATFORM_BUS_NUM_IRQS -1 */
 };
 
+static bool virt_machine_is_confidential(VirtMachineState *vms)
+{
+    return MACHINE(vms)->cgs;
+}
+
 static void create_randomness(MachineState *ms, const char *node)
 {
     struct {
@@ -2111,10 +2116,11 @@ static void machvirt_init(MachineState *machine)
      * if the guest has EL2 then we will use SMC as the conduit,
      * and otherwise we will use HVC (for backwards compatibility and
      * because if we're using KVM then we must use HVC).
+     * Realm guests must also use SMC.
      */
     if (vms->secure && firmware_loaded) {
         vms->psci_conduit = QEMU_PSCI_CONDUIT_DISABLED;
-    } else if (vms->virt) {
+    } else if (vms->virt || virt_machine_is_confidential(vms)) {
         vms->psci_conduit = QEMU_PSCI_CONDUIT_SMC;
     } else {
         vms->psci_conduit = QEMU_PSCI_CONDUIT_HVC;
@@ -2917,6 +2923,7 @@ static HotplugHandler *virt_machine_get_hotplug_handler(MachineState *machine,
 static int virt_kvm_type(MachineState *ms, const char *type_str)
 {
     VirtMachineState *vms = VIRT_MACHINE(ms);
+    int rme_vm_type = kvm_arm_rme_vm_type(ms);
     int max_vm_pa_size, requested_pa_size;
     bool fixed_ipa;
 
@@ -2946,7 +2953,11 @@ static int virt_kvm_type(MachineState *ms, const char *type_str)
      * the implicit legacy 40b IPA setting, in which case the kvm_type
      * must be 0.
      */
-    return fixed_ipa ? 0 : requested_pa_size;
+    if (fixed_ipa) {
+        return 0;
+    }
+
+    return requested_pa_size | rme_vm_type;
 }
 
 static void virt_machine_class_init(ObjectClass *oc, void *data)
--
2.44.0
[PATCH v2 07/22] hw/arm/virt: Reserve one bit of guest-physical address for RME
When RME is enabled, the upper GPA bit is used to distinguish protected
from unprotected addresses. Reserve it when setting up the guest memory
map.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: separate patch
---
 hw/arm/virt.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index f300f100b5..eca9a96b5a 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2939,14 +2939,24 @@ static int virt_kvm_type(MachineState *ms, const char *type_str)
     VirtMachineState *vms = VIRT_MACHINE(ms);
     int rme_vm_type = kvm_arm_rme_vm_type(ms);
     int max_vm_pa_size, requested_pa_size;
+    int rme_reserve_bit = 0;
     bool fixed_ipa;
 
-    max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa);
+    if (rme_vm_type) {
+        /*
+         * With RME, the upper GPA bit differentiates Realm from NS memory.
+         * Reserve the upper bit to ensure that highmem devices will fit.
+         */
+        rme_reserve_bit = 1;
+    }
+
+    max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa) -
+                     rme_reserve_bit;
 
     /* we freeze the memory map to compute the highest gpa */
     virt_set_memmap(vms, max_vm_pa_size);
 
-    requested_pa_size = 64 - clz64(vms->highest_gpa);
+    requested_pa_size = 64 - clz64(vms->highest_gpa) + rme_reserve_bit;
 
     /*
      * KVM requires the IPA size to be at least 32 bits.
--
2.44.0
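The arithmetic in this patch can be checked with a small standalone sketch. The memory map is laid out with one bit less than the host maximum, and the requested IPA size is then one bit larger than what the highest guest address needs. The helper names `usable_ipa_bits()` and `requested_ipa_bits()` are assumptions for this example, `clz64()` is reimplemented locally, and the 32-bit minimum follows the context comment in the hunk above.

```c
#include <stdint.h>

/* Count leading zero bits of a 64-bit value (64 for zero). */
static int clz64(uint64_t v)
{
    int n = 0;

    if (v == 0) {
        return 64;
    }
    while (!(v >> 63)) {
        v <<= 1;
        n++;
    }
    return n;
}

/* Address bits left for the memory map once the top bit is reserved. */
static int usable_ipa_bits(int max_ipa_bits, int reserve_bit)
{
    return max_ipa_bits - reserve_bit;
}

/*
 * IPA size to request: the bits needed for the highest guest address,
 * plus the reserved bit, with the 32-bit minimum mentioned in the
 * context comment.
 */
static int requested_ipa_bits(uint64_t highest_gpa, int reserve_bit)
{
    int bits = 64 - clz64(highest_gpa) + reserve_bit;

    return bits < 32 ? 32 : bits;
}
```

For instance, a memory map whose highest address needs 40 bits results in a 41-bit request for a Realm, and on a host limited to 48 bits the map itself may only use 47.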
[PATCH v2 10/22] target/arm/kvm: Create scratch VM as Realm if necessary
Some ID registers have a different value for a Realm VM, for example
ID_AA64DFR0_EL1 contains the number of breakpoints/watchpoints
implemented by RMM instead of the hardware.

Even though RMM is in charge of setting up most Realm registers, KVM
still provides GET_ONE_REG interface on a Realm VM to probe the VM's
capabilities.

KVM only reports the maximum IPA it supports, but RMM may support
smaller sizes. If the VM creation fails with the value returned by KVM,
then retry with the smaller working address. This needs a better
solution.

Signed-off-by: Jean-Philippe Brucker
---
 target/arm/kvm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 3a2233ec73..6d368bf442 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -104,6 +104,7 @@ bool kvm_arm_create_scratch_host_vcpu(const uint32_t *cpus_to_try,
 {
     int ret = 0, kvmfd = -1, vmfd = -1, cpufd = -1;
     int max_vm_pa_size;
+    int vm_type;
 
     kvmfd = qemu_open_old("/dev/kvm", O_RDWR);
     if (kvmfd < 0) {
@@ -113,8 +114,9 @@ bool kvm_arm_create_scratch_host_vcpu(const uint32_t *cpus_to_try,
     if (max_vm_pa_size < 0) {
         max_vm_pa_size = 0;
     }
+    vm_type = kvm_arm_rme_vm_type(MACHINE(qdev_get_machine()));
     do {
-        vmfd = ioctl(kvmfd, KVM_CREATE_VM, max_vm_pa_size);
+        vmfd = ioctl(kvmfd, KVM_CREATE_VM, max_vm_pa_size | vm_type);
     } while (vmfd == -1 && errno == EINTR);
     if (vmfd < 0) {
         goto err;
--
2.44.0
[PATCH v2 08/22] target/arm/kvm: Split kvm_arch_get/put_registers
The confidential guest support in KVM limits the number of registers
that we can read and write. Split the get/put_registers function to
prepare for it.

Signed-off-by: Jean-Philippe Brucker
---
 target/arm/kvm.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index b00077c1a5..3504276822 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2056,7 +2056,7 @@ static int kvm_arch_put_sve(CPUState *cs)
     return 0;
 }
 
-int kvm_arch_put_registers(CPUState *cs, int level)
+static int kvm_arm_put_core_regs(CPUState *cs, int level)
 {
     uint64_t val;
     uint32_t fpr;
@@ -2159,6 +2159,19 @@ int kvm_arch_put_registers(CPUState *cs, int level)
         return ret;
     }
 
+    return 0;
+}
+
+int kvm_arch_put_registers(CPUState *cs, int level)
+{
+    int ret;
+    ARMCPU *cpu = ARM_CPU(cs);
+
+    ret = kvm_arm_put_core_regs(cs, level);
+    if (ret) {
+        return ret;
+    }
+
     write_cpustate_to_list(cpu, true);
 
     if (!write_list_to_kvmstate(cpu, level)) {
@@ -2240,7 +2253,7 @@ static int kvm_arch_get_sve(CPUState *cs)
     return 0;
 }
 
-int kvm_arch_get_registers(CPUState *cs)
+static int kvm_arm_get_core_regs(CPUState *cs)
 {
     uint64_t val;
     unsigned int el;
@@ -2343,6 +2356,19 @@ int kvm_arch_get_registers(CPUState *cs)
     }
     vfp_set_fpcr(env, fpr);
 
+    return 0;
+}
+
+int kvm_arch_get_registers(CPUState *cs)
+{
+    int ret;
+    ARMCPU *cpu = ARM_CPU(cs);
+
+    ret = kvm_arm_get_core_regs(cs);
+    if (ret) {
+        return ret;
+    }
+
     ret = kvm_get_vcpu_events(cpu);
     if (ret) {
         return ret;
--
2.44.0
[PATCH v2 02/22] target/arm: Add confidential guest support
Add a new RmeGuest object, inheriting from ConfidentialGuestSupport, to support the Arm Realm Management Extension (RME). It is instantiated by passing on the command-line: -M virt,confidential-guest-support= -object guest-rme,id=[,options...] This is only the skeleton. Support will be added in following patches. Cc: Eric Blake Cc: Markus Armbruster Cc: Daniel P. Berrangé Cc: Eduardo Habkost Reviewed-by: Philippe Mathieu-Daudé Reviewed-by: Richard Henderson Signed-off-by: Jean-Philippe Brucker --- docs/system/confidential-guest-support.rst | 1 + qapi/qom.json | 3 +- target/arm/kvm-rme.c | 46 ++ target/arm/meson.build | 7 +++- 4 files changed, 55 insertions(+), 2 deletions(-) create mode 100644 target/arm/kvm-rme.c diff --git a/docs/system/confidential-guest-support.rst b/docs/system/confidential-guest-support.rst index 0c490dbda2..acf46d8856 100644 --- a/docs/system/confidential-guest-support.rst +++ b/docs/system/confidential-guest-support.rst @@ -40,5 +40,6 @@ Currently supported confidential guest mechanisms are: * AMD Secure Encrypted Virtualization (SEV) (see :doc:`i386/amd-memory-encryption`) * POWER Protected Execution Facility (PEF) (see :ref:`power-papr-protected-execution-facility-pef`) * s390x Protected Virtualization (PV) (see :doc:`s390x/protvirt`) +* Arm Realm Management Extension (RME) Other mechanisms may be supported in future. 
diff --git a/qapi/qom.json b/qapi/qom.json index 85e6b4f84a..623ec8071f 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -996,7 +996,8 @@ 'tls-creds-x509', 'tls-cipher-suites', { 'name': 'x-remote-object', 'features': [ 'unstable' ] }, -{ 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] } +{ 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }, +'rme-guest' ] } ## diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c new file mode 100644 index 00..960dd75608 --- /dev/null +++ b/target/arm/kvm-rme.c @@ -0,0 +1,46 @@ +/* + * QEMU Arm RME support + * + * Copyright Linaro 2024 + */ + +#include "qemu/osdep.h" + +#include "exec/confidential-guest-support.h" +#include "hw/boards.h" +#include "hw/core/cpu.h" +#include "kvm_arm.h" +#include "migration/blocker.h" +#include "qapi/error.h" +#include "qom/object_interfaces.h" +#include "sysemu/kvm.h" +#include "sysemu/runstate.h" + +#define TYPE_RME_GUEST "rme-guest" +OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST) + +struct RmeGuest { +ConfidentialGuestSupport parent_obj; +}; + +static void rme_guest_class_init(ObjectClass *oc, void *data) +{ +} + +static const TypeInfo rme_guest_info = { +.parent = TYPE_CONFIDENTIAL_GUEST_SUPPORT, +.name = TYPE_RME_GUEST, +.instance_size = sizeof(struct RmeGuest), +.class_init = rme_guest_class_init, +.interfaces = (InterfaceInfo[]) { +{ TYPE_USER_CREATABLE }, +{ } +} +}; + +static void rme_register_types(void) +{ +type_register_static(&rme_guest_info); +} + +type_init(rme_register_types); diff --git a/target/arm/meson.build b/target/arm/meson.build index 2e10464dbb..c610c078f7 100644 --- a/target/arm/meson.build +++ b/target/arm/meson.build @@ -8,7 +8,12 @@ arm_ss.add(files( )) arm_ss.add(zlib) -arm_ss.add(when: 'CONFIG_KVM', if_true: files('hyp_gdbstub.c', 'kvm.c'), if_false: files('kvm-stub.c')) +arm_ss.add(when: 'CONFIG_KVM', + if_true: files( +'hyp_gdbstub.c', +'kvm.c', +'kvm-rme.c'), + if_false: files('kvm-stub.c')) arm_ss.add(when: 'CONFIG_HVF', if_true: 
files('hyp_gdbstub.c')) arm_ss.add(when: 'TARGET_AARCH64', if_true: files( -- 2.44.0
Re: [PATCH v2] virtio-iommu: Use qemu_real_host_page_mask as default page_size_mask
On Wed, Feb 21, 2024 at 11:41:57AM +0100, Eric Auger wrote: > Hi, > > On 2/13/24 13:00, Michael S. Tsirkin wrote: > > On Tue, Feb 13, 2024 at 12:24:22PM +0100, Eric Auger wrote: > >> Hi Michael, > >> On 2/13/24 12:09, Michael S. Tsirkin wrote: > >>> On Tue, Feb 13, 2024 at 11:32:13AM +0100, Eric Auger wrote: > Do you have an other concern? > >>> I also worry a bit about migrating between hosts with different > >>> page sizes. Not with kvm I am guessing but with tcg it does work I think? > >> I have never tried but is it a valid use case? Adding Peter in CC. > >>> Is this just for vfio and vdpa? Can we limit this to these setups > >>> maybe? > >> I am afraid we know the actual use case too later. If the VFIO device is > >> hotplugged we have started working with 4kB granule. > >> > >> The other way is to introduce a min_granule option as done for aw-bits. > >> But it is heavier. > >> > >> Thanks > >> > >> Eric > > Let's say, if you are changing the default then we definitely want > > a way to get the cmpatible behaviour for tcg. > > So the compat machinery should be user-accessible too and documented. > > I guess I need to add a new option to guarantee the machine compat. > > I was thinking about an enum GranuleMode property taking the following > values, 4KB, 64KB, host > Jean, do you think there is a rationale offering something richer? 16KB seems to be gaining popularity, we should include that (I think it's the only granule supported by Apple IOMMU?). Hopefully that will be enough. Thanks, Jean > > Obviously being able to set the exact page_size_mask + host mode would > be better but this does not really fit into any std property type. > > Thanks > > Eric > > >
Re: [PATCH v3 0/3] VIRTIO-IOMMU: Introduce an aw-bits option
On Thu, Feb 08, 2024 at 11:10:16AM +0100, Eric Auger wrote: > In [1] and [2] we attempted to fix a case where a VFIO-PCI device > protected with a virtio-iommu is assigned to an x86 guest. On x86 > the physical IOMMU may have an address width (gaw) of 39 or 48 bits > whereas the virtio-iommu exposes a 64b input address space by default. > Hence the guest may try to use the full 64b space and DMA MAP > failures may be encountered. To work around this issue we endeavoured > to pass usable host IOVA regions (excluding the out of range space) from > VFIO to the virtio-iommu device so that the virtio-iommu driver can > query those latter during the probe request and let the guest iommu > kernel subsystem carve them out. > > However if there are several devices in the same iommu group, > only the reserved regions of the first one are taken into > account by the iommu subsystem of the guest. This generally > works on baremetal because devices are not going to > expose different reserved regions. However in our case, this > may prevent from taking into account the host iommu geometry. > > So the simplest solution to this problem looks to introduce an > input address width option, aw-bits, which matches what is > done on the intel-iommu. By default, from now on it is set > to 39 bits with pc_q35 and 48 with arm virt. This replaces the > previous default value of 64b. So we need to introduce a compat > for machines older than 9.0 to behave similarly. We use > hw_compat_8_2 to acheive that goal. For the series: Reviewed-by: Jean-Philippe Brucker > > Outstanding series [2] remains useful to let resv regions beeing > communicated on time before the probe request. 
> > [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space > https://lore.kernel.org/all/20231019134651.842175-1-eric.au...@redhat.com/ > - This is merged - > > [2] [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for > hotplugged devices > https://lore.kernel.org/all/20240117080414.316890-1-eric.au...@redhat.com/ > - This is pending for review on the ML - > > This series can be found at: > https://github.com/eauger/qemu/tree/virtio-iommu-aw-bits-v3 > previous > https://github.com/eauger/qemu/tree/virtio-iommu-aw-bits-v2 > > Applied on top of [3] > [PATCH v2] virtio-iommu: Use qemu_real_host_page_mask as default > page_size_mask > https://lore.kernel.org/all/20240117132039.332273-1-eric.au...@redhat.com/ > > History: > v2 -> v3: > - Collected Zhenzhong and Cédric's R-b + Yanghang's T-b > - use &error_abort instead of NULL error handle > on object_property_get_uint() call (Cédric) > - use VTD_HOST_AW_39BIT (Cédric) > > v1 -> v2 > - Limit aw to 48b on ARM > - Check aw is within [32,64] > - Use hw_compat_8_2 > > > Eric Auger (3): > virtio-iommu: Add an option to define the input range width > virtio-iommu: Trace domain range limits as unsigned int > hw: Set virtio-iommu aw-bits default value on pc_q35 and arm virt > > include/hw/virtio/virtio-iommu.h | 1 + > hw/arm/virt.c| 6 ++ > hw/core/machine.c| 5 - > hw/i386/pc.c | 6 ++ > hw/virtio/virtio-iommu.c | 7 ++- > hw/virtio/trace-events | 2 +- > 6 files changed, 24 insertions(+), 3 deletions(-) > > -- > 2.41.0 >
Re: [PATCH v2 1/3] virtio-iommu: Add an option to define the input range width
On Thu, Feb 08, 2024 at 09:16:35AM +0100, Eric Auger wrote: > >> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c > >> index ec2ba11d1d..7870bdbeee 100644 > >> --- a/hw/virtio/virtio-iommu.c > >> +++ b/hw/virtio/virtio-iommu.c > >> @@ -1314,7 +1314,11 @@ static void virtio_iommu_device_realize(DeviceState > >> *dev, Error **errp) > >> */ > >> s->config.bypass = s->boot_bypass; > >> s->config.page_size_mask = qemu_real_host_page_mask(); > >> -s->config.input_range.end = UINT64_MAX; > >> +if (s->aw_bits < 32 || s->aw_bits > 64) { > > I'm wondering if we should lower this to 16 bits, just to support all > > possible host SMMU configurations (the smallest address space configurable > > with T0SZ is 25-bit, or 16-bit with the STT extension). > Is it a valid use case case to assign host devices protected by > virtio-iommu with a physical SMMU featuring Small Translation Table? Probably not, I'm guessing STT is for tiny embedded implementations where QEMU or even virtualization is not a use case. But because smaller mobile platforms now implement SMMUv3, using smaller IOVA spaces and thus fewer page tables can be beneficial. One use case I have in mind is android with pKVM, each app has its own VM, and devices can be partitioned into lots of address spaces with PASID, so you can save a lot of memory and table-walk time by shrinking those address space. But that particular case will use crosvm so isn't relevant here, it's only an example. Mainly I was concerned that if the Linux driver decides to allow configuring smaller address spaces (maybe a linux cmdline option), then using a architectural limit here would be a safe bet that things can still work. But we can always change it in a later version, or implement finer controls (ideally the guest driver would configure the VA size in ATTACH). > It leaves 64kB IOVA space only. Besides in the spec, it is wriiten the > min T0SZ can even be 12. 
> > "The minimum valid value is 16 unless all of the following also hold, in > which case the minimum permitted > value is 12: > – SMMUv3.1 or later is supported. > – SMMU_IDR5.VAX indicates support for 52-bit Vas. > – The corresponding CD.TGx selects a 64KB granule. > " Yes that's confusing because va_size = 64 - T0SZ, so T0SZ=12 actually describes the largest address size, 52. > > At the moment I would prefer to stick to the limit suggested by Alex > which looks also sensible for other archs whereas 16 doesn't. Agreed, it should be sufficient. Thanks, Jean
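To make the T0SZ arithmetic above concrete, here is a minimal sketch of the relation va_size = 64 - T0SZ. The helper names are made up for illustration and are not part of QEMU or of the SMMU driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: input address size in bits, va_size = 64 - T0SZ. */
static unsigned va_bits_from_t0sz(unsigned t0sz)
{
    assert(t0sz < 64);
    return 64 - t0sz;
}

/* Hypothetical helper: highest input address for a size in [1, 64] bits. */
static uint64_t va_limit(unsigned va_bits)
{
    assert(va_bits >= 1 && va_bits <= 64);
    return va_bits == 64 ? UINT64_MAX : (UINT64_C(1) << va_bits) - 1;
}
```

With this, T0SZ=12 gives the largest size (52 bits) and T0SZ=16 gives 48 bits, while a 16-bit input range tops out at 0xffff, which is the 64kB IOVA space mentioned above.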
Re: [PATCH v2 1/3] virtio-iommu: Add an option to define the input range width
Hi Eric, On Thu, Feb 01, 2024 at 05:32:22PM +0100, Eric Auger wrote: > aw-bits is a new option that allows to set the bit width of > the input address range. This value will be used as a default for > the device config input_range.end. By default it is set to 64 bits > which is the current value. > > Signed-off-by: Eric Auger > > --- > > v1 -> v2: > - Check the aw-bits value is within [32,64] > --- > include/hw/virtio/virtio-iommu.h | 1 + > hw/virtio/virtio-iommu.c | 7 ++- > 2 files changed, 7 insertions(+), 1 deletion(-) > > diff --git a/include/hw/virtio/virtio-iommu.h > b/include/hw/virtio/virtio-iommu.h > index 781ebaea8f..5fbe4677c2 100644 > --- a/include/hw/virtio/virtio-iommu.h > +++ b/include/hw/virtio/virtio-iommu.h > @@ -66,6 +66,7 @@ struct VirtIOIOMMU { > bool boot_bypass; > Notifier machine_done; > bool granule_frozen; > +uint8_t aw_bits; > }; > > #endif > diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c > index ec2ba11d1d..7870bdbeee 100644 > --- a/hw/virtio/virtio-iommu.c > +++ b/hw/virtio/virtio-iommu.c > @@ -1314,7 +1314,11 @@ static void virtio_iommu_device_realize(DeviceState > *dev, Error **errp) > */ > s->config.bypass = s->boot_bypass; > s->config.page_size_mask = qemu_real_host_page_mask(); > -s->config.input_range.end = UINT64_MAX; > +if (s->aw_bits < 32 || s->aw_bits > 64) { I'm wondering if we should lower this to 16 bits, just to support all possible host SMMU configurations (the smallest address space configurable with T0SZ is 25-bit, or 16-bit with the STT extension). Thanks, Jean > +error_setg(errp, "aw-bits must be within [32,64]"); > +} > +s->config.input_range.end = > +s->aw_bits == 64 ? 
UINT64_MAX : BIT_ULL(s->aw_bits) - 1; > s->config.domain_range.end = UINT32_MAX; > s->config.probe_size = VIOMMU_PROBE_SIZE; > > @@ -1525,6 +1529,7 @@ static Property virtio_iommu_properties[] = { > DEFINE_PROP_LINK("primary-bus", VirtIOIOMMU, primary_bus, > TYPE_PCI_BUS, PCIBus *), > DEFINE_PROP_BOOL("boot-bypass", VirtIOIOMMU, boot_bypass, true), > +DEFINE_PROP_UINT8("aw-bits", VirtIOIOMMU, aw_bits, 64), > DEFINE_PROP_END_OF_LIST(), > }; > > -- > 2.41.0 >
Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
On Mon, Jan 29, 2024 at 05:38:55PM +0100, Eric Auger wrote: > > There may be a separate argument for clearing bypass. With a coldplugged > > VFIO device the flow is: > > > > 1. Map the whole guest address space in VFIO to implement boot-bypass. > >This allocates all guest pages, which takes a while and is wasteful. > >I've actually crashed a host that way, when spawning a guest with too > >much RAM. > interesting > > 2. Start the VM > > 3. When the virtio-iommu driver attaches a (non-identity) domain to the > >assigned endpoint, then unmap the whole address space in VFIO, and most > >pages are given back to the host. > > > > We can't disable boot-bypass because the BIOS needs it. But instead the > > flow could be: > > > > 1. Start the VM, with only the virtual endpoints. Nothing to pin. > > 2. The virtio-iommu driver disables bypass during boot > We needed this boot-bypass mode for booting with virtio-blk-scsi > protected with virtio-iommu for instance. > That was needed because we don't have any virtio-iommu driver in edk2 as > opposed to intel iommu driver, right? Yes. What I had in mind is the x86 SeaBIOS which doesn't have any IOMMU driver and accesses the default SATA device: $ qemu-system-x86_64 -M q35 -device virtio-iommu,boot-bypass=off qemu: virtio_iommu_translate sid=250 is not known!! qemu: no buffer available in event queue to report event qemu: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address But it's the same problem with edk2. Also a guest OS without a virtio-iommu driver needs boot-bypass. Once firmware boot is complete, the OS with a virtio-iommu driver normally can turn bypass off in the config space, it's not useful anymore. If it needs to put some endpoints in bypass, then it can attach them to a bypass domain. > > 3. Hotplug the VFIO device. With bypass disabled there is no need to pin > >the whole guest address space, unless the guest explicitly asks for an > >identity domain. 
> > > > However, I don't know if this is a realistic scenario that will actually > > be used. > > > > By the way, do you have an easy way to reproduce the issue described here? > > I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux > > just allocates 32-bit IOVAs. > I don't have a simple generic reproducer. It happens when assigning this > device: > Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2) > > I have not encountered that issue with another device yet. > I see on guest side in dmesg: > [    6.849292] ice :00:05.0: Using 64-bit DMA addresses > > That's emitted in dma-iommu.c iommu_dma_alloc_iova(). > Looks like the guest first tries to allocate an iova in the 32-bit AS > and if this fails use the whole dma_limit. > Seems the 32b IOVA alloc failed here ;-) Interesting, are you running some demanding workload and a lot of CPUs? That's a lot of IOVAs used up, I'm curious about what kind of DMA pattern does that. Thanks, Jean
Re: [PATCH 0/3] VIRTIO-IOMMU: Introduce an aw-bits option
On Mon, Jan 29, 2024 at 03:07:41PM +0100, Eric Auger wrote: > Hi Jean-Philippe, > > On 1/29/24 13:23, Jean-Philippe Brucker wrote: > > Hi Eric, > > > > On Tue, Jan 23, 2024 at 07:15:54PM +0100, Eric Auger wrote: > >> In [1] and [2] we attempted to fix a case where a VFIO-PCI device > >> protected with a virtio-iommu is assigned to an x86 guest. On x86 > >> the physical IOMMU may have an address width (gaw) of 39 or 48 bits > >> whereas the virtio-iommu exposes a 64b input address space by default. > >> Hence the guest may try to use the full 64b space and DMA MAP > >> failures may be encountered. To work around this issue we endeavoured > >> to pass usable host IOVA regions (excluding the out of range space) from > >> VFIO to the virtio-iommu device so that the virtio-iommu driver can > >> query those latter during the probe request and let the guest iommu > >> kernel subsystem carve them out. > >> > >> However if there are several devices in the same iommu group, > >> only the reserved regions of the first one are taken into > >> account by the iommu subsystem of the guest. This generally > >> works on baremetal because devices are not going to > >> expose different reserved regions. However in our case, this > >> may prevent from taking into account the host iommu geometry. > >> > >> So the simplest solution to this problem looks to introduce an > >> input address width option, aw-bits, which matches what is > >> done on the intel-iommu. By default, from now on it is set > >> to 39 bits with pc_q35 and 64b with arm virt. > > Doesn't Arm have the same problem? The TTB0 page tables limit what can be > > mapped to 48-bit, or 52-bit when SMMU_IDR5.VAX==1 and granule is 64kB. > > A Linux host driver could configure smaller VA sizes: > > * SMMUv2 limits the VA to SMMU_IDR2.UBS (upstream bus size) which > > can go as low as 32-bit (I'm assuming we don't care about 32-bit hosts). > Yes I think we can ignore that use case. 
> > * SMMUv3 currently limits the VA to CONFIG_ARM64_VA_BITS, which > > could be as low as 36 bits (but realistically 39, since 36 depends on > > 16kB pages and CONFIG_EXPERT). > Further reading "3.4.1 Input address size and Virtual Address size" ooks > indeed SMMU_IDR5.VAX gives info on the physical SMMU actual > implementation max (which matches intel iommu gaw). I missed that. Now I > am confused about should we limit VAS to 39 to accomodate of the worst > case host SW configuration or shall we use 48 instead? I don't know what's best either. 48 should be fine if hosts normally enable VA_BITS_48 (I see debian has it [1], not sure how to find the others). [1] https://salsa.debian.org/kernel-team/linux/-/blob/master/debian/config/arm64/config?ref_type=heads#L18 > If we set such a low 39b value, won't it prevent some guests from > properly working? It's not that low, since it gives each endpoint a private 512GB address space, but yes there might be special cases that reach the limit. Maybe assign a multi-queue NIC to a 256-vCPU guest, and if you want per-vCPU DMA pools, then with a 39-bit address space you only get 2GB per vCPU. With 48-bit you get 1TB which should be plenty. 52-bit private IOVA space doesn't seem useful, I doubt we'll ever need to support that on the MAP/UNMAP interface. So I guess 48-bit can be the default, and users with special setups can override aw-bits. Thanks, Jean
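The per-vCPU figures above follow from dividing the address space evenly between vCPUs. A small sketch of that arithmetic, with a hypothetical helper name and an even split assumed:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes of IOVA space per vCPU under an even split. */
static uint64_t iova_bytes_per_vcpu(unsigned aw_bits, unsigned nr_vcpus)
{
    assert(aw_bits >= 1 && aw_bits < 64 && nr_vcpus > 0);
    return (UINT64_C(1) << aw_bits) / nr_vcpus;
}
```

For 256 vCPUs this gives 2^31 bytes (2GB) with a 39-bit space and 2^40 bytes (1TB) with a 48-bit space; a single endpoint with 39 bits gets 2^39 bytes (512GB).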
Re: [PATCH 0/3] VIRTIO-IOMMU: Introduce an aw-bits option
Hi Eric, On Tue, Jan 23, 2024 at 07:15:54PM +0100, Eric Auger wrote: > In [1] and [2] we attempted to fix a case where a VFIO-PCI device > protected with a virtio-iommu is assigned to an x86 guest. On x86 > the physical IOMMU may have an address width (gaw) of 39 or 48 bits > whereas the virtio-iommu exposes a 64b input address space by default. > Hence the guest may try to use the full 64b space and DMA MAP > failures may be encountered. To work around this issue we endeavoured > to pass usable host IOVA regions (excluding the out of range space) from > VFIO to the virtio-iommu device so that the virtio-iommu driver can > query those latter during the probe request and let the guest iommu > kernel subsystem carve them out. > > However if there are several devices in the same iommu group, > only the reserved regions of the first one are taken into > account by the iommu subsystem of the guest. This generally > works on baremetal because devices are not going to > expose different reserved regions. However in our case, this > may prevent from taking into account the host iommu geometry. > > So the simplest solution to this problem looks to introduce an > input address width option, aw-bits, which matches what is > done on the intel-iommu. By default, from now on it is set > to 39 bits with pc_q35 and 64b with arm virt. Doesn't Arm have the same problem? The TTB0 page tables limit what can be mapped to 48-bit, or 52-bit when SMMU_IDR5.VAX==1 and granule is 64kB. A Linux host driver could configure smaller VA sizes: * SMMUv2 limits the VA to SMMU_IDR2.UBS (upstream bus size) which can go as low as 32-bit (I'm assuming we don't care about 32-bit hosts). * SMMUv3 currently limits the VA to CONFIG_ARM64_VA_BITS, which could be as low as 36 bits (but realistically 39, since 36 depends on 16kB pages and CONFIG_EXPERT). But 64-bit definitely can't work for VFIO, and I suppose isn't useful for virtual devices, so maybe 39 is also a reasonable default on Arm. 
Thanks, Jean > This replaces the > previous default value of 64b. So we need to introduce a compat > for pc_q35 machines older than 9.0 to behave similarly.
Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
Hi, On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote: > Hi Zhenzhong, > On 1/18/24 08:10, Duan, Zhenzhong wrote: > > Hi Eric, > > > >> -Original Message- > >> From: Eric Auger > >> Cc: m...@redhat.com; c...@redhat.com > >> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling > >> for hotplugged devices > >> > >> In [1] we attempted to fix a case where a VFIO-PCI device protected > >> with a virtio-iommu was assigned to an x86 guest. On x86 the physical > >> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the > >> virtio-iommu used to expose a 64b address space by default. > >> Hence the guest was trying to use the full 64b space and we hit > >> DMA MAP failures. To work around this issue we managed to pass > >> usable IOVA regions (excluding the out of range space) from VFIO > >> to the virtio-iommu device. This was made feasible by introducing > >> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions(). > >> This latter gets called when the IOMMU MR is enabled which > >> causes the vfio_listener_region_add() to be called. > >> > >> However with VFIO-PCI hotplug, this technique fails due to the > >> race between the call to the callback in the add memory listener > >> and the virtio-iommu probe request. Indeed the probe request gets > >> called before the attach to the domain. So in that case the usable > >> regions are communicated after the probe request and fail to be > >> conveyed to the guest. To be honest the problem was hinted by > >> Jean-Philippe in [1] and I should have been more careful at > >> listening to him and testing with hotplug :-( > > It looks the global virtio_iommu_config.bypass is never cleared in guest. > > When guest virtio_iommu driver enable IOMMU, should it clear this > > bypass attribute? > > If it could be cleared in viommu_probe(), then qemu will call > > virtio_iommu_set_config() then virtio_iommu_switch_address_space_all() > > to enable IOMMU MR. 
Then both coldplugged and hotplugged devices will work. > > this field is iommu wide while the probe applies on a one device.In > general I would prefer not to be dependent on the MR enablement. We know > that the device is likely to be protected and we can collect its > requirements beforehand. > > > > > Intel iommu has a similar bit in register GCMD_REG.TE, when guest > > intel_iommu driver probe set it, on qemu side, > > vtd_address_space_refresh_all() > > is called to enable IOMMU MRs. > interesting. > > Would be curious to get Jean Philippe's pov. I'd rather not rely on this, it's hard to justify a driver change based only on QEMU internals. And QEMU can't count on the driver always clearing bypass. There could be situations where the guest can't afford to do it, like if an endpoint is owned by the firmware and has to keep running. There may be a separate argument for clearing bypass. With a coldplugged VFIO device the flow is: 1. Map the whole guest address space in VFIO to implement boot-bypass. This allocates all guest pages, which takes a while and is wasteful. I've actually crashed a host that way, when spawning a guest with too much RAM. 2. Start the VM 3. When the virtio-iommu driver attaches a (non-identity) domain to the assigned endpoint, then unmap the whole address space in VFIO, and most pages are given back to the host. We can't disable boot-bypass because the BIOS needs it. But instead the flow could be: 1. Start the VM, with only the virtual endpoints. Nothing to pin. 2. The virtio-iommu driver disables bypass during boot 3. Hotplug the VFIO device. With bypass disabled there is no need to pin the whole guest address space, unless the guest explicitly asks for an identity domain. However, I don't know if this is a realistic scenario that will actually be used. By the way, do you have an easy way to reproduce the issue described here? I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux just allocates 32-bit IOVAs. 
> > > >> For coldplugged device the technique works because we make sure all > >> the IOMMU MR are enabled once on the machine init done: 94df5b2180 > >> ("virtio-iommu: Fix 64kB host page size VFIO device assignment") > >> for granule freeze. But I would be keen to get rid of this trick. > >> > >> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU > >> MR to have been enabled and the corresponding vfio_listener_region_add() > >> to be executed. Instead this series proposes to replace the usage of this > >> API by the recently introduced PCIIOMMUOps: ba7d12eb8c ("hw/pci: > >> modify > >> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be > >> called earlier, once the usable IOVA regions have been collected by > >> VFIO, without the need for the IOMMU MR to be enabled. > >> > >> This looks cleaner. In the short term this may also be used for > >> passing the page size mask, which would allow to get rid of the > >> hacky transient IOMMU MR
Re: [PATCH] virtio-iommu: Use qemu_real_host_page_mask as default page_size_mask
On Thu, Dec 21, 2023 at 08:45:05AM -0500, Eric Auger wrote: > We used to set default page_size_mask to qemu_target_page_mask() but > with VFIO assignment it makes more sense to use the actual host page mask > instead. > > So from now on qemu_real_host_page_mask() will be used as a default. > To be able to migrate older code, we increase the vmstat version_id > to 3 and if an older incoming v2 stream is detected we set the previous > default value. > > The new default is well adapted to configs where host and guest have > the same page size. This allows to fix hotplugging VFIO devices on a > 64kB guest and a 64kB host. This test case has been failing before > and even crashing qemu with hw_error("vfio: DMA mapping failed, > unable to continue") in VFIO common). Indeed the hot-attached VFIO > device would call memory_region_iommu_set_page_size_mask with 64kB > mask whereas after the granule was frozen to 4kB on machine init done. I guess TARGET_PAGE_MASK is always 4kB on arm64 CPUs, since it's the smallest supported and the guest configures its page size at runtime. Even if QEMU's software IOMMU can deal with any page size, VFIO can't so passing the host page size seems more accurate than forcing a value of 4kB. > Now this works. However the new default will prevent 4kB guest on > 64kB host because the granule will be set to 64kB which would be > larger than the guest page size. In that situation, the virtio-iommu > driver fails the viommu_domain_finalise() with > "granule 0x1 larger than system page zie 0x1000". "size" (it could matter if someone searches for this message later) > > The current limitation of global granule in the virtio-iommu > should be removed and turned into per domain granule. But > until we get this upgraded, this new default is probably > better because I don't think anyone is currently interested in > running a 4kB page size guest with virtio-iommu on a 64kB host. 
> However supporting 64kB guest on 64kB host with virtio-iommu and > VFIO looks a more important feature. > > Signed-off-by: Eric Auger So to summarize the configurations that work for hotplug (tested with QEMU system emulation with SMMU + QEMU VMM with virtio-iommu):

Host | Guest | virtio-net | IGB passthrough
4k   | 4k    | Y          | Y
64k  | 64k   | Y          | N -> Y (fixed by this patch)
64k  | 4k    | Y -> N     | N
4k   | 64k   | Y          | Y

The change is a reasonable trade-off in my opinion. It fixes the more common 64k on 64k case, and for 4k on 64k, the error is now contained to the guest and made clear ("granule 0x1 larger than system page size 0x1000") instead of crashing the VMM. A guest OS now discovers that the host needs DMA buffers aligned on 64k and could actually support this case (but Linux won't because it can't control the origin of all DMA buffers). Later, support for page tables will enable 4k on 64k for all devices. Tested-by: Jean-Philippe Brucker Reviewed-by: Jean-Philippe Brucker > --- > hw/virtio/virtio-iommu.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c > index 9d463efc52..b77e3644ea 100644 > --- a/hw/virtio/virtio-iommu.c > +++ b/hw/virtio/virtio-iommu.c > @@ -1313,7 +1313,7 @@ static void virtio_iommu_device_realize(DeviceState > *dev, Error **errp) > * in vfio realize > */ > s->config.bypass = s->boot_bypass; > -s->config.page_size_mask = qemu_target_page_mask(); > +s->config.page_size_mask = qemu_real_host_page_mask(); > s->config.input_range.end = UINT64_MAX; > s->config.domain_range.end = UINT32_MAX; > s->config.probe_size = VIOMMU_PROBE_SIZE; > @@ -1491,13 +1491,16 @@ static int iommu_post_load(void *opaque, int > version_id) > * still correct. 
> */ > virtio_iommu_switch_address_space_all(s); > +if (version_id <= 2) { > +s->config.page_size_mask = qemu_target_page_mask(); > +} > return 0; > } > > static const VMStateDescription vmstate_virtio_iommu_device = { > .name = "virtio-iommu-device", > .minimum_version_id = 2, > -.version_id = 2, > +.version_id = 3, > .post_load = iommu_post_load, > .fields = (VMStateField[]) { > VMSTATE_GTREE_DIRECT_KEY_V(domains, VirtIOIOMMU, 2, > -- > 2.27.0 >
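The host/guest combinations discussed in this thread come down to one comparison: the granule advertised in page_size_mask must not exceed the guest page size. A rough sketch of that check follows. The function names are hypothetical, and it assumes the mask has all bits set from the smallest supported page size upwards, so the granule is its lowest set bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: smallest page size in the mask (lowest set bit). */
static uint64_t mask_to_granule(uint64_t page_size_mask)
{
    assert(page_size_mask != 0);
    return page_size_mask & (~page_size_mask + 1);
}

/* Hypothetical helper: can a guest with this page size use the granule? */
static bool granule_fits_guest(uint64_t page_size_mask,
                               uint64_t guest_page_size)
{
    return mask_to_granule(page_size_mask) <= guest_page_size;
}
```

Under these assumptions a 64kB host mask yields a 0x10000 granule, which a 4kB guest (0x1000) cannot use, while a 4kB host mask works for both 4kB and 64kB guests.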
[PATCH] target/arm/helper: Propagate MDCR_EL2.HPMN into PMCR_EL0.N
MDCR_EL2.HPMN allows an hypervisor to limit the number of PMU counters available to EL1 and EL0 (to keep the others to itself). QEMU already implements this split correctly, except for PMCR_EL0.N reads: the number of counters read by EL1 or EL0 should be the one configured in MDCR_EL2.HPMN. Signed-off-by: Jean-Philippe Brucker --- target/arm/helper.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/target/arm/helper.c b/target/arm/helper.c index ff1970981e..bec293bc93 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -1475,6 +1475,22 @@ static void pmcr_write(CPUARMState *env, const ARMCPRegInfo *ri, pmu_op_finish(env); } +static uint64_t pmcr_read(CPUARMState *env, const ARMCPRegInfo *ri) +{ +uint64_t pmcr = env->cp15.c9_pmcr; + +/* + * If EL2 is implemented and enabled for the current security state, reads + * of PMCR.N from EL1 or EL0 return the value of MDCR_EL2.HPMN or HDCR.HPMN. + */ +if (arm_current_el(env) <= 1 && arm_is_el2_enabled(env)) { +pmcr &= ~PMCRN_MASK; +pmcr |= (env->cp15.mdcr_el2 & MDCR_HPMN) << PMCRN_SHIFT; +} + +return pmcr; +} + static void pmswinc_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value) { @@ -7137,8 +7153,9 @@ static void define_pmu_regs(ARMCPU *cpu) .fgt = FGT_PMCR_EL0, .type = ARM_CP_IO | ARM_CP_ALIAS, .fieldoffset = offsetoflow32(CPUARMState, cp15.c9_pmcr), -.accessfn = pmreg_access, .writefn = pmcr_write, -.raw_writefn = raw_write, +.accessfn = pmreg_access, +.readfn = pmcr_read, .raw_readfn = raw_read, +.writefn = pmcr_write, .raw_writefn = raw_write, }; ARMCPRegInfo pmcr64 = { .name = "PMCR_EL0", .state = ARM_CP_STATE_AA64, @@ -7148,6 +7165,7 @@ static void define_pmu_regs(ARMCPU *cpu) .type = ARM_CP_IO, .fieldoffset = offsetof(CPUARMState, cp15.c9_pmcr), .resetvalue = cpu->isar.reset_pmcr_el0, +.readfn = pmcr_read, .raw_readfn = raw_read, .writefn = pmcr_write, .raw_writefn = raw_write, }; -- 2.43.0
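The read-side substitution can be illustrated outside of QEMU with a small standalone sketch. The macro and function names below are local to the example; the field positions (PMCR.N in bits [15:11], MDCR_EL2.HPMN in bits [4:0]) follow the Arm architecture, and the current EL and EL2-enabled state are passed in as plain parameters rather than read from CPU state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Example-local definitions, not the QEMU macros. */
#define EX_PMCR_N_SHIFT   11
#define EX_PMCR_N_MASK    (UINT64_C(0x1f) << EX_PMCR_N_SHIFT)
#define EX_MDCR_HPMN_MASK UINT64_C(0x1f)

/* Value of PMCR as seen from the given exception level. */
static uint64_t pmcr_visible(uint64_t pmcr, uint64_t mdcr_el2,
                             unsigned cur_el, bool el2_enabled)
{
    assert(cur_el <= 3);
    if (cur_el <= 1 && el2_enabled) {
        /* EL1 and EL0 see HPMN in place of the real counter count. */
        pmcr &= ~EX_PMCR_N_MASK;
        pmcr |= (mdcr_el2 & EX_MDCR_HPMN_MASK) << EX_PMCR_N_SHIFT;
    }
    return pmcr;
}
```

For example, with six counters implemented and HPMN set to 2, EL1 reads N=2 while EL2 still reads N=6; the other PMCR bits are left untouched.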
Re: [PATCH] hw/arm/virt: fix GIC maintenance IRQ registration
On Fri, Nov 10, 2023 at 10:19:30AM +, Peter Maydell wrote: > On Fri, 10 Nov 2023 at 09:07, Jean-Philippe Brucker > wrote: > > > > Since commit 9036e917f8 ("{include/}hw/arm: refactor virt PPI logic"), > > GIC maintenance IRQ registration fails on arm64: > > > > [0.979743] kvm [1]: Cannot register interrupt 9 > > > > That commit re-defined VIRTUAL_PMU_IRQ to be an INTID but missed a case > > where the maintenance IRQ is actually referred to by its PPI index. Just > > like commit fa68ecb330db ("hw/arm/virt: fix PMU IRQ registration"), use > > INTID_TO_PPI(). A search of "GIC_FDT_IRQ_TYPE_PPI" indicates that there > > shouldn't be more similar issues. > > > > Fixes: 9036e917f8 ("{include/}hw/arm: refactor virt PPI logic") > > Signed-off-by: Jean-Philippe Brucker > > Isn't this already fixed by commit fa68ecb330dbd ? No, that commit fixed the PMU interrupt (I copied most of its commit message and referenced it), but the GIC maintenance interrupt still needed to be fixed. Thanks, Jean
[PATCH] hw/arm/virt: fix GIC maintenance IRQ registration
Since commit 9036e917f8 ("{include/}hw/arm: refactor virt PPI logic"),
GIC maintenance IRQ registration fails on arm64:

[0.979743] kvm [1]: Cannot register interrupt 9

That commit re-defined VIRTUAL_PMU_IRQ to be an INTID but missed a case
where the maintenance IRQ is actually referred to by its PPI index. Just
like commit fa68ecb330db ("hw/arm/virt: fix PMU IRQ registration"), use
INTID_TO_PPI(). A search of "GIC_FDT_IRQ_TYPE_PPI" indicates that there
shouldn't be more similar issues.

Fixes: 9036e917f8 ("{include/}hw/arm: refactor virt PPI logic")
Signed-off-by: Jean-Philippe Brucker
---
 hw/arm/virt.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 783d71a1b3..f5e685b060 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -591,7 +591,8 @@ static void fdt_add_gic_node(VirtMachineState *vms)
 
         if (vms->virt) {
             qemu_fdt_setprop_cells(ms->fdt, nodename, "interrupts",
-                                   GIC_FDT_IRQ_TYPE_PPI, ARCH_GIC_MAINT_IRQ,
+                                   GIC_FDT_IRQ_TYPE_PPI,
+                                   INTID_TO_PPI(ARCH_GIC_MAINT_IRQ),
                                    GIC_FDT_IRQ_FLAGS_LEVEL_HI);
         }
     } else {
@@ -615,7 +616,8 @@ static void fdt_add_gic_node(VirtMachineState *vms)
                                       2, vms->memmap[VIRT_GIC_VCPU].base,
                                       2, vms->memmap[VIRT_GIC_VCPU].size);
         qemu_fdt_setprop_cells(ms->fdt, nodename, "interrupts",
-                               GIC_FDT_IRQ_TYPE_PPI, ARCH_GIC_MAINT_IRQ,
+                               GIC_FDT_IRQ_TYPE_PPI,
+                               INTID_TO_PPI(ARCH_GIC_MAINT_IRQ),
                                GIC_FDT_IRQ_FLAGS_LEVEL_HI);
     }
 }
-- 
2.42.0
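The INTID/PPI distinction behind this fix can be sketched in a few lines. The helpers below are hypothetical stand-ins, not QEMU's macros; they assume the usual GIC numbering in which PPIs occupy INTIDs 16 to 31, so that the device tree "interrupts" cell for a PPI carries the index relative to 16.

```c
#include <assert.h>

/* Assumption: PPIs are INTIDs 16..31, as in the GIC architecture. */
#define FIRST_PPI_INTID 16
#define NUM_PPIS        16

/* Hypothetical helper: convert a GIC INTID to the PPI index used in the DT. */
static int intid_to_ppi(int intid)
{
    assert(intid >= FIRST_PPI_INTID && intid < FIRST_PPI_INTID + NUM_PPIS);
    return intid - FIRST_PPI_INTID;
}

/* Hypothetical inverse: convert a PPI index back to its INTID. */
static int ppi_to_intid(int ppi)
{
    assert(ppi >= 0 && ppi < NUM_PPIS);
    return ppi + FIRST_PPI_INTID;
}
```

Under that numbering an interrupt defined as INTID 25 has to be written to the device tree as PPI 9; passing the INTID through unconverted describes a different interrupt to the guest.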
Re: [PATCH v2 09/12] util/reserved-region: Add new ReservedRegion helpers
list;
> +                }
> +            } else if (range_contains_range(range_iter, r)) {
> +                /* new region is included in the current region */
> +                if (range_lob(range_iter) == range_lob(r)) {
> +                    /* adjacent on the left side, derives into 2 regions */
> +                    range_set_bounds(range_iter, range_upb(r) + 1,
> +                                     range_upb(range_iter));
> +                    return g_list_insert_before(list, l, reg);
> +                } else if (range_upb(range_iter) == range_upb(r)) {
> +                    /* adjacent on the right side, derives into 2 regions */
> +                    range_set_bounds(range_iter, range_lob(range_iter),
> +                                     range_lob(r) - 1);
> +                    l = l->next;
> +                } else {
> +                    uint64_t lob = range_lob(range_iter);
> +                    /*
> +                     * the new range is in the middle of an existing one,
> +                     * split this latter into 3 regs instead
> +                     */
> +                    range_set_bounds(range_iter, range_upb(r) + 1,
> +                                     range_upb(range_iter));
> +                    new_reg = g_new0(ReservedRegion, 1);
> +                    new_reg->type = resv_iter->type;
> +                    range_set_bounds(&new_reg->range,
> +                                     lob, range_lob(r) - 1);
> +                    list = g_list_insert_before(list, l, new_reg);
> +                    return g_list_insert_before(list, l, reg);
> +                }
> +            } else if (range_lob(r) < range_lob(range_iter)) {
> +                range_set_bounds(range_iter, range_upb(r) + 1,
> +                                 range_upb(range_iter));
> +                return g_list_insert_before(list, l, reg);
> +            } else { /* intersection on the upper range */
> +                range_set_bounds(range_iter, range_lob(range_iter),
> +                                 range_lob(r) - 1);
> +                l = l->next;
> +            }
> +        } /* overlap */
> +    }
> +    return g_list_append(list, reg);

Looks correct overall

Reviewed-by: Jean-Philippe Brucker

> +}
> +
> diff --git a/util/meson.build b/util/meson.build
> index c4827fd70a..eb677b40c2 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -51,6 +51,7 @@ util_ss.add(files('qdist.c'))
>  util_ss.add(files('qht.c'))
>  util_ss.add(files('qsp.c'))
>  util_ss.add(files('range.c'))
> +util_ss.add(files('reserved-region.c'))
>  util_ss.add(files('stats64.c'))
>  util_ss.add(files('systemd.c'))
>  util_ss.add(files('transactions.c'))
> --
> 2.41.0
>
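The "new region is included in the current region" branches quoted above all reduce to carving one inclusive range out of another. A minimal sketch of that step, using a hypothetical Span type and span_carve() helper rather than QEMU's Range API and GList handling:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a range with inclusive bounds. */
typedef struct {
    uint64_t lob;   /* lower bound, inclusive */
    uint64_t upb;   /* upper bound, inclusive */
} Span;

/*
 * Carve `cut` out of `old`, where `old` fully contains `cut`.
 * Writes the leftover pieces of `old` (zero, one or two of them) to out[]
 * in ascending order and returns how many were written.
 */
static int span_carve(Span old, Span cut, Span out[2])
{
    int n = 0;

    assert(cut.lob <= cut.upb);
    assert(old.lob <= cut.lob && cut.upb <= old.upb);

    if (cut.lob > old.lob) {            /* something is left below the cut */
        out[n++] = (Span){ old.lob, cut.lob - 1 };
    }
    if (cut.upb < old.upb) {            /* something is left above the cut */
        out[n++] = (Span){ cut.upb + 1, old.upb };
    }
    return n;
}
```

Two leftovers correspond to the "split into 3 regs" case, one leftover to the two "adjacent" cases. The guards before each subtraction and addition keep the arithmetic from wrapping when a bound sits at 0 or UINT64_MAX.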
Re: [PATCH v2 07/12] virtio-iommu: Implement set_iova_ranges() callback
On Wed, Sep 13, 2023 at 10:01:42AM +0200, Eric Auger wrote:
> The implementation populates the array of per IOMMUDevice
> host reserved regions.
>
> It is forbidden to have conflicting sets of host IOVA ranges
> to be applied onto the same IOMMU MR (implied by different
> host devices).
>
> Signed-off-by: Eric Auger
>
> ---
>
> v1 -> v2:
> - Forbid conflicting sets of host resv regions
> ---
>  include/hw/virtio/virtio-iommu.h |  2 ++
>  hw/virtio/virtio-iommu.c         | 48
>  2 files changed, 50 insertions(+)
>
> diff --git a/include/hw/virtio/virtio-iommu.h
> b/include/hw/virtio/virtio-iommu.h
> index 70b8ace34d..31b69c8261 100644
> --- a/include/hw/virtio/virtio-iommu.h
> +++ b/include/hw/virtio/virtio-iommu.h
> @@ -40,6 +40,8 @@ typedef struct IOMMUDevice {
>      MemoryRegion root;          /* The root container of the device */
>      MemoryRegion bypass_mr;     /* The alias of shared memory MR */
>      GList *resv_regions;
> +    Range *host_resv_regions;
> +    uint32_t nr_host_resv_regions;
>  } IOMMUDevice;
>
>  typedef struct IOMMUPciBus {
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index ea359b586a..ed2df5116f 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -20,6 +20,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/log.h"
>  #include "qemu/iov.h"
> +#include "qemu/range.h"
>  #include "exec/target_page.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/virtio/virtio.h"
> @@ -1158,6 +1159,52 @@ static int
> virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>      return 0;
>  }
>
> +static int virtio_iommu_set_iova_ranges(IOMMUMemoryRegion *mr,
> +                                        uint32_t nr_ranges,
> +                                        struct Range *iova_ranges,
> +                                        Error **errp)
> +{
> +    IOMMUDevice *sdev = container_of(mr, IOMMUDevice, iommu_mr);
> +    uint32_t nr_host_resv_regions;
> +    Range *host_resv_regions;
> +    int ret = -EINVAL;
> +
> +    if (!nr_ranges) {
> +        return 0;
> +    }
> +
> +    if (sdev->host_resv_regions) {
> +        range_inverse_array(nr_ranges, iova_ranges,
> +                            &nr_host_resv_regions, &host_resv_regions,
> +                            0, UINT64_MAX);
> +        if (nr_host_resv_regions != sdev->nr_host_resv_regions) {
> +            goto error;
> +        }
> +        for (int i = 0; i < nr_host_resv_regions; i++) {
> +            Range *new = &host_resv_regions[i];
> +            Range *existing = &sdev->host_resv_regions[i];
> +
> +            if (!range_contains_range(existing, new)) {
> +                goto error;
> +            }
> +        }
> +        ret = 0;
> +        goto out;
> +    }
> +
> +    range_inverse_array(nr_ranges, iova_ranges,
> +                        &sdev->nr_host_resv_regions,
> &sdev->host_resv_regions,
> +                        0, UINT64_MAX);

Can set_iova_ranges() only be called for the first time before the guest
has had a chance to issue a probe request? Maybe we could add a
sanity-check that the guest hasn't issued a probe request yet, since we
can't notify about updated reserved regions.

I'm probably misremembering because I thought Linux set up IOMMU contexts
(including probe requests) before enabling DMA master in PCI, which causes
QEMU VFIO to issue these calls. I'll double check.

Thanks,
Jean

> +
> +    return 0;
> +error:
> +    error_setg(errp, "IOMMU mr=%s Conflicting host reserved regions set!",
> +               mr->parent_obj.name);
> +out:
> +    g_free(host_resv_regions);
> +    return ret;
> +}
> +
>  static void virtio_iommu_system_reset(void *opaque)
>  {
>      VirtIOIOMMU *s = opaque;
> @@ -1453,6 +1500,7 @@ static void
> virtio_iommu_memory_region_class_init(ObjectClass *klass,
>      imrc->replay = virtio_iommu_replay;
>      imrc->notify_flag_changed = virtio_iommu_notify_flag_changed;
>      imrc->iommu_set_page_size_mask = virtio_iommu_set_page_size_mask;
> +    imrc->iommu_set_iova_ranges = virtio_iommu_set_iova_ranges;
>  }
>
>  static const TypeInfo virtio_iommu_info = {
> --
> 2.41.0
>
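The quoted callback derives host reserved regions by inverting the usable IOVA ranges over the whole 64-bit space. As a rough illustration of that idea only, the sketch below uses a hypothetical Span type and span_invert() function; it is not the signature or implementation of range_inverse_array(), and it assumes the input is sorted and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a range with inclusive bounds. */
typedef struct {
    uint64_t lob;
    uint64_t upb;
} Span;

/*
 * Write the gaps between the `n` usable spans in `in`, clipped to the
 * window [low, high], to `out` and return how many were written.
 * `in` must be sorted by address and non-overlapping.
 * `out` must have room for n + 1 entries.
 */
static size_t span_invert(const Span *in, size_t n,
                          uint64_t low, uint64_t high, Span *out)
{
    size_t count = 0;
    uint64_t next = low;    /* lowest address not yet accounted for */
    int done = 0;           /* set once a span reaches `high` */

    for (size_t i = 0; i < n && !done; i++) {
        assert(in[i].lob <= in[i].upb);
        if (in[i].upb < next) {
            continue;                   /* entirely below the window */
        }
        if (in[i].lob > high) {
            break;                      /* entirely above the window */
        }
        if (in[i].lob > next) {
            out[count++] = (Span){ next, in[i].lob - 1 };
        }
        if (in[i].upb >= high) {
            done = 1;                   /* nothing can follow */
        } else {
            next = in[i].upb + 1;       /* cannot overflow: upb < high */
        }
    }
    if (!done) {
        out[count++] = (Span){ next, high };
    }
    return count;
}
```

For usable ranges 0x1000-0x1fff and 0x8000-0xffff over [0, UINT64_MAX], this yields three reserved gaps: 0-0xfff, 0x2000-0x7fff and 0x10000-UINT64_MAX. Two host devices with different usable ranges would produce different gap lists, which is the conflict the quoted patch rejects.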
Re: [PATCH v2 05/12] virtio-iommu: Introduce per IOMMUDevice reserved regions
Hi Eric,

On Wed, Sep 13, 2023 at 10:01:40AM +0200, Eric Auger wrote:
> For the time being the per device reserved regions are
> just a duplicate of IOMMU wide reserved regions. Subsequent
> patches will combine those with host reserved regions, if any.
>
> Signed-off-by: Eric Auger
> ---
>  include/hw/virtio/virtio-iommu.h |  1 +
>  hw/virtio/virtio-iommu.c         | 42 ++--
>  2 files changed, 35 insertions(+), 8 deletions(-)
>
> diff --git a/include/hw/virtio/virtio-iommu.h
> b/include/hw/virtio/virtio-iommu.h
> index eea4564782..70b8ace34d 100644
> --- a/include/hw/virtio/virtio-iommu.h
> +++ b/include/hw/virtio/virtio-iommu.h
> @@ -39,6 +39,7 @@ typedef struct IOMMUDevice {
>      AddressSpace as;
>      MemoryRegion root;          /* The root container of the device */
>      MemoryRegion bypass_mr;     /* The alias of shared memory MR */
> +    GList *resv_regions;
>  } IOMMUDevice;
>
>  typedef struct IOMMUPciBus {
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 979cdb5648..ea359b586a 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -624,22 +624,48 @@ static int virtio_iommu_unmap(VirtIOIOMMU *s,
>      return ret;
>  }
>
> +static int consolidate_resv_regions(IOMMUDevice *sdev)
> +{
> +    VirtIOIOMMU *s = sdev->viommu;
> +    int i;
> +
> +    for (i = 0; i < s->nr_prop_resv_regions; i++) {
> +        ReservedRegion *reg = g_new0(ReservedRegion, 1);
> +
> +        *reg = s->prop_resv_regions[i];
> +        sdev->resv_regions = g_list_append(sdev->resv_regions, reg);
> +    }
> +    return 0;
> +}
> +
>  static ssize_t virtio_iommu_fill_resv_mem_prop(VirtIOIOMMU *s, uint32_t ep,
>                                                 uint8_t *buf, size_t free)
>  {
>      struct virtio_iommu_probe_resv_mem prop = {};
>      size_t size = sizeof(prop), length = size - sizeof(prop.head), total;
> -    int i;
> +    IOMMUDevice *sdev;
> +    GList *l;
> +    int ret;
>
> -    total = size * s->nr_prop_resv_regions;
> +    sdev = container_of(virtio_iommu_mr(s, ep), IOMMUDevice, iommu_mr);
> +    if (!sdev) {
> +        return -EINVAL;
> +    }
>
> +    ret = consolidate_resv_regions(sdev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    total = size * g_list_length(sdev->resv_regions);
>      if (total > free) {
>          return -ENOSPC;
>      }
>
> -    for (i = 0; i < s->nr_prop_resv_regions; i++) {
> -        unsigned subtype = s->prop_resv_regions[i].type;
> -        Range *range = &s->prop_resv_regions[i].range;
> +    for (l = sdev->resv_regions; l; l = l->next) {
> +        ReservedRegion *reg = l->data;
> +        unsigned subtype = reg->type;
> +        Range *range = &reg->range;
>
>          assert(subtype == VIRTIO_IOMMU_RESV_MEM_T_RESERVED ||
>                 subtype == VIRTIO_IOMMU_RESV_MEM_T_MSI);
> @@ -857,7 +883,7 @@ static IOMMUTLBEntry
> virtio_iommu_translate(IOMMUMemoryRegion *mr, hwaddr addr,
>      bool bypass_allowed;
>      int granule;
>      bool found;
> -    int i;
> +    GList *l;
>
>      interval.low = addr;
>      interval.high = addr + 1;
> @@ -895,8 +921,8 @@ static IOMMUTLBEntry
> virtio_iommu_translate(IOMMUMemoryRegion *mr, hwaddr addr,
>          goto unlock;
>      }
>
> -    for (i = 0; i < s->nr_prop_resv_regions; i++) {
> -        ReservedRegion *reg = &s->prop_resv_regions[i];
> +    for (l = sdev->resv_regions; l; l = l->next) {
> +        ReservedRegion *reg = l->data;

This means translate() now only takes reserved regions into account after
the guest issues a probe request, which only happens if the guest actually
supports the probe feature. It may be better to build the list earlier
(like when creating the IOMMUDevice), and complete it in
set_iova_ranges(). I guess both could call consolidate(), which would
rebuild the whole list, for example.

Thanks,
Jean

>
>          if (range_contains(&reg->range, addr)) {
>              switch (reg->type) {
> --
> 2.41.0
>
Re: [PATCH v3 0/6] target/arm: Fixes for RME
On Thu, Aug 10, 2023 at 02:16:56PM +0100, Peter Maydell wrote:
> This didn't build for the linux-user targets. I squashed
> this into patch 6:
>
> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> index 7df1f7600b1..d906d2b1caa 100644
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -2169,9 +2169,11 @@ static void arm_cpu_realizefn(DeviceState *dev,
> Error **errp)
>          set_feature(env, ARM_FEATURE_VBAR);
>      }
>
> -    if (cpu_isar_feature(aa64_rme, cpu)) {
> +#ifndef CONFIG_USER_ONLY
> +    if (tcg_enabled() && cpu_isar_feature(aa64_rme, cpu)) {
>          arm_register_el_change_hook(cpu, &gt_rme_post_el_change, 0);
>      }
> +#endif
>
>      register_cp_regs_for_features(cpu);
>      arm_cpu_register_gdb_regs_for_features(cpu);
>
> With that, I've applied the series to target-arm-for-8.2.

Thank you, sorry about the build error, I'll add linux-user to my tests

Thanks,
Jean
[PATCH v3 5/6] target/arm/helper: Check SCR_EL3.{NSE, NS} encoding for AT instructions
The AT instruction is UNDEFINED if the {NSE,NS} configuration is
invalid. Add a function to check this on all AT instructions that apply
to an EL lower than 3.

Suggested-by: Peter Maydell
Signed-off-by: Jean-Philippe Brucker
---
 target/arm/helper.c | 38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index fbb03c364b..dbfe9f2f5e 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3616,6 +3616,22 @@ static void ats1h_write(CPUARMState *env, const ARMCPRegInfo *ri,
 #endif /* CONFIG_TCG */
 }
 
+static CPAccessResult at_e012_access(CPUARMState *env, const ARMCPRegInfo *ri,
+                                     bool isread)
+{
+    /*
+     * R_NYXTL: instruction is UNDEFINED if it applies to an Exception level
+     * lower than EL3 and the combination SCR_EL3.{NSE,NS} is reserved. This can
+     * only happen when executing at EL3 because that combination also causes an
+     * illegal exception return. We don't need to check FEAT_RME either, because
+     * scr_write() ensures that the NSE bit is not set otherwise.
+     */
+    if ((env->cp15.scr_el3 & (SCR_NSE | SCR_NS)) == SCR_NSE) {
+        return CP_ACCESS_TRAP;
+    }
+    return CP_ACCESS_OK;
+}
+
 static CPAccessResult at_s1e2_access(CPUARMState *env, const ARMCPRegInfo *ri,
                                      bool isread)
 {
@@ -3623,7 +3639,7 @@ static CPAccessResult at_s1e2_access(CPUARMState *env, const ARMCPRegInfo *ri,
         !(env->cp15.scr_el3 & (SCR_NS | SCR_EEL2))) {
         return CP_ACCESS_TRAP;
     }
-    return CP_ACCESS_OK;
+    return at_e012_access(env, ri, isread);
 }
 
 static void ats_write64(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -5505,38 +5521,38 @@ static const ARMCPRegInfo v8_cp_reginfo[] = {
       .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 0,
       .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
       .fgt = FGT_ATS1E1R,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
     { .name = "AT_S1E1W", .state = ARM_CP_STATE_AA64,
       .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 1,
       .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
       .fgt = FGT_ATS1E1W,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
     { .name = "AT_S1E0R", .state = ARM_CP_STATE_AA64,
       .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 2,
       .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
       .fgt = FGT_ATS1E0R,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
     { .name = "AT_S1E0W", .state = ARM_CP_STATE_AA64,
       .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 3,
       .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
       .fgt = FGT_ATS1E0W,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
     { .name = "AT_S12E1R", .state = ARM_CP_STATE_AA64,
       .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 4,
       .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
     { .name = "AT_S12E1W", .state = ARM_CP_STATE_AA64,
       .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 5,
       .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
     { .name = "AT_S12E0R", .state = ARM_CP_STATE_AA64,
       .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 6,
       .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
     { .name = "AT_S12E0W", .state = ARM_CP_STATE_AA64,
       .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 7,
       .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
     /* AT S1E2* are elsewhere as they UNDEF from EL3 if EL2 is not present */
     { .name = "AT_S1E3R", .state = ARM_CP_STATE_AA64,
       .opc0 = 1, .opc1 = 6, .crn = 7, .crm = 8, .opc2 = 0,
@@ -8078,12 +8094,12 @@ static const ARMCPRegInfo ats1e1_reginfo[] = {
       .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 9, .opc2 = 0,
       .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
       .fgt = FGT_ATS1E1RP,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
     { .name = "AT_S1E1WP", .state = ARM_CP_STATE_AA64,
       .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 9, .opc2 = 1,
       .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
       .fgt = FGT_ATS1E1WP,
-      .writefn = ats_write64 },
+      .accessfn = at_e012_access, .writefn = ats_write64 },
 };
 
 static const ARMCPRegInfo ats1cp_reginfo[] = {
-- 
2.41.0
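To make the reserved encoding concrete, here is a small standalone decode of SCR_EL3.{NSE,NS}. The enum and function names are hypothetical, and the bit positions (NS at bit 0, NSE at bit 62) are taken from the Arm architecture as I understand it rather than from this patch, so treat this as a sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions in SCR_EL3. */
#define SCR_NS   (1ULL << 0)
#define SCR_NSE  (1ULL << 62)

/* Hypothetical enum for the state targeted by instructions below EL3. */
typedef enum {
    SS_SECURE,
    SS_NONSECURE,
    SS_REALM,
    SS_RESERVED,    /* {NSE,NS} == {1,0}: no valid target state */
} TargetState;

static TargetState scr_el3_target_state(uint64_t scr_el3)
{
    switch (scr_el3 & (SCR_NSE | SCR_NS)) {
    case 0:
        return SS_SECURE;
    case SCR_NS:
        return SS_NONSECURE;
    case SCR_NSE | SCR_NS:
        return SS_REALM;
    default:            /* SCR_NSE alone */
        return SS_RESERVED;
    }
}

/* An AT instruction aimed below EL3 is UNDEFINED for the reserved case. */
static int at_below_el3_is_undefined(uint64_t scr_el3)
{
    return scr_el3_target_state(scr_el3) == SS_RESERVED;
}
```

Only one of the four combinations is rejected, which is why the check in the patch is a single comparison against SCR_NSE.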
[PATCH v3 2/6] target/arm/helper: Fix tlbmask and tlbbits for TLBI VAE2*
When HCR_EL2.E2H is enabled, TLB entries are formed using the EL2&0
translation regime, instead of the EL2 translation regime. The TLBI VAE2*
instructions invalidate the regime that corresponds to the current value
of HCR_EL2.E2H.

At the moment we only invalidate the EL2 translation regime. This causes
problems with RMM, which issues TLBI VAE2IS instructions with
HCR_EL2.E2H enabled. Update vae2_tlbmask() to take HCR_EL2.E2H into
account.

Add vae2_tlbbits() as well, since the top-byte-ignore configuration is
different between the EL2&0 and EL2 regime.

Signed-off-by: Jean-Philippe Brucker
Reviewed-by: Peter Maydell
---
 target/arm/helper.c | 50 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index 2959d27543..a4c2c1bde5 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -4663,6 +4663,21 @@ static int vae1_tlbmask(CPUARMState *env)
     return mask;
 }
 
+static int vae2_tlbmask(CPUARMState *env)
+{
+    uint64_t hcr = arm_hcr_el2_eff(env);
+    uint16_t mask;
+
+    if (hcr & HCR_E2H) {
+        mask = ARMMMUIdxBit_E20_2 |
+               ARMMMUIdxBit_E20_2_PAN |
+               ARMMMUIdxBit_E20_0;
+    } else {
+        mask = ARMMMUIdxBit_E2;
+    }
+    return mask;
+}
+
 /* Return 56 if TBI is enabled, 64 otherwise. */
 static int tlbbits_for_regime(CPUARMState *env, ARMMMUIdx mmu_idx,
                               uint64_t addr)
@@ -4689,6 +4704,25 @@ static int vae1_tlbbits(CPUARMState *env, uint64_t addr)
     return tlbbits_for_regime(env, mmu_idx, addr);
 }
 
+static int vae2_tlbbits(CPUARMState *env, uint64_t addr)
+{
+    uint64_t hcr = arm_hcr_el2_eff(env);
+    ARMMMUIdx mmu_idx;
+
+    /*
+     * Only the regime of the mmu_idx below is significant.
+     * Regime EL2&0 has two ranges with separate TBI configuration, while EL2
+     * only has one.
+     */
+    if (hcr & HCR_E2H) {
+        mmu_idx = ARMMMUIdx_E20_2;
+    } else {
+        mmu_idx = ARMMMUIdx_E2;
+    }
+
+    return tlbbits_for_regime(env, mmu_idx, addr);
+}
+
 static void tlbi_aa64_vmalle1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                       uint64_t value)
 {
@@ -4781,10 +4815,11 @@ static void tlbi_aa64_vae2_write(CPUARMState *env, const ARMCPRegInfo *ri,
      * flush-last-level-only.
      */
     CPUState *cs = env_cpu(env);
-    int mask = e2_tlbmask(env);
+    int mask = vae2_tlbmask(env);
     uint64_t pageaddr = sextract64(value << 12, 0, 56);
+    int bits = vae2_tlbbits(env, pageaddr);
 
-    tlb_flush_page_by_mmuidx(cs, pageaddr, mask);
+    tlb_flush_page_bits_by_mmuidx(cs, pageaddr, mask, bits);
 }
 
 static void tlbi_aa64_vae3_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -4838,11 +4873,11 @@ static void tlbi_aa64_vae2is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                                    uint64_t value)
 {
     CPUState *cs = env_cpu(env);
+    int mask = vae2_tlbmask(env);
     uint64_t pageaddr = sextract64(value << 12, 0, 56);
-    int bits = tlbbits_for_regime(env, ARMMMUIdx_E2, pageaddr);
+    int bits = vae2_tlbbits(env, pageaddr);
 
-    tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr,
-                                                  ARMMMUIdxBit_E2, bits);
+    tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, mask, bits);
 }
 
 static void tlbi_aa64_vae3is_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -5014,11 +5049,6 @@ static void tlbi_aa64_rvae1is_write(CPUARMState *env,
     do_rvae_write(env, value, vae1_tlbmask(env), true);
 }
 
-static int vae2_tlbmask(CPUARMState *env)
-{
-    return ARMMMUIdxBit_E2;
-}
-
 static void tlbi_aa64_rvae2_write(CPUARMState *env,
                                   const ARMCPRegInfo *ri,
                                   uint64_t value)
-- 
2.41.0
[PATCH v3 0/6] target/arm: Fixes for RME
A few patches to fix RME support and allow booting a realm guest, based on
"[PATCH v2 00/15] target/arm/ptw: Cleanups and a few bugfixes"
https://lore.kernel.org/all/20230807141514.19075-1-peter.mayd...@linaro.org/

Since v2:

* Updated the comment in patch 5. I also removed the check for FEAT_RME,
  because as pointed out in "target/arm: Catch illegal-exception-return
  from EL3 with bad NSE/NS", the SCR_NSE bit can only be set with FEAT_RME
  enabled. Because of this additional change, I didn't add the
  Reviewed-by.

* Added an EL-change hook to patch 6, to update the timer IRQ when
  changing the security state. I was wondering whether the el_change
  function should filter security state changes, since we only need to
  update IRQ state when switching between Root and Secure/NonSecure. But
  with a small syscall benchmark exercising EL0-EL1 switch with FEAT_RME
  enabled, I couldn't see any difference with and without the el_change
  hook, so I kept it simple.

* Also added the .raw_write callback for CNTHCTL_EL2.

v2: https://lore.kernel.org/all/20230802170157.401491-1-jean-phili...@linaro.org/

Jean-Philippe Brucker (6):
  target/arm/ptw: Load stage-2 tables from realm physical space
  target/arm/helper: Fix tlbmask and tlbbits for TLBI VAE2*
  target/arm: Skip granule protection checks for AT instructions
  target/arm: Pass security space rather than flag for AT instructions
  target/arm/helper: Check SCR_EL3.{NSE,NS} encoding for AT instructions
  target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK

 target/arm/cpu.h        |   4 +
 target/arm/internals.h  |  25 +++---
 target/arm/cpu.c        |   4 +
 target/arm/helper.c     | 184 ++--
 target/arm/ptw.c        |  39 ++---
 target/arm/trace-events |   7 +-
 6 files changed, 188 insertions(+), 75 deletions(-)

-- 
2.41.0
[PATCH v3 6/6] target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK
When FEAT_RME is implemented, these bits override the value of
CNT[VP]_CTL_EL0.IMASK in Realm and Root state. Move the IRQ state update
into a new gt_update_irq() function and test those bits every time we
recompute the IRQ state.

Since we're removing the IRQ state from some trace events, add a new
trace event for gt_update_irq().

Signed-off-by: Jean-Philippe Brucker
---
 target/arm/cpu.h        |  4 +++
 target/arm/cpu.c        |  4 +++
 target/arm/helper.c     | 65 ++---
 target/arm/trace-events |  7 +++--
 4 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index bcd65a63ca..855a76ae81 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1115,6 +1115,7 @@ struct ArchCPU {
 };
 
 unsigned int gt_cntfrq_period_ns(ARMCPU *cpu);
+void gt_rme_post_el_change(ARMCPU *cpu, void *opaque);
 
 void arm_cpu_post_init(Object *obj);
 
@@ -1743,6 +1744,9 @@ static inline void xpsr_write(CPUARMState *env, uint32_t val, uint32_t mask)
 #define HSTR_TTEE (1 << 16)
 #define HSTR_TJDBX (1 << 17)
 
+#define CNTHCTL_CNTVMASK (1 << 18)
+#define CNTHCTL_CNTPMASK (1 << 19)
+
 /* Return the current FPSCR value. */
 uint32_t vfp_get_fpscr(CPUARMState *env);
 void vfp_set_fpscr(CPUARMState *env, uint32_t val);
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 93c28d50e5..7df1f7600b 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -2169,6 +2169,10 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
         set_feature(env, ARM_FEATURE_VBAR);
     }
 
+    if (cpu_isar_feature(aa64_rme, cpu)) {
+        arm_register_el_change_hook(cpu, &gt_rme_post_el_change, 0);
+    }
+
     register_cp_regs_for_features(cpu);
     arm_cpu_register_gdb_regs_for_features(cpu);
 
diff --git a/target/arm/helper.c b/target/arm/helper.c
index dbfe9f2f5e..86ce6a52bb 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -2608,6 +2608,39 @@ static uint64_t gt_get_countervalue(CPUARMState *env)
     return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) / gt_cntfrq_period_ns(cpu);
 }
 
+static void gt_update_irq(ARMCPU *cpu, int timeridx)
+{
+    CPUARMState *env = &cpu->env;
+    uint64_t cnthctl = env->cp15.cnthctl_el2;
+    ARMSecuritySpace ss = arm_security_space(env);
+    /* ISTATUS && !IMASK */
+    int irqstate = (env->cp15.c14_timer[timeridx].ctl & 6) == 4;
+
+    /*
+     * If bit CNTHCTL_EL2.CNT[VP]MASK is set, it overrides IMASK.
+     * It is RES0 in Secure and NonSecure state.
+     */
+    if ((ss == ARMSS_Root || ss == ARMSS_Realm) &&
+        ((timeridx == GTIMER_VIRT && (cnthctl & CNTHCTL_CNTVMASK)) ||
+         (timeridx == GTIMER_PHYS && (cnthctl & CNTHCTL_CNTPMASK)))) {
+        irqstate = 0;
+    }
+
+    qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
+    trace_arm_gt_update_irq(timeridx, irqstate);
+}
+
+void gt_rme_post_el_change(ARMCPU *cpu, void *ignored)
+{
+    /*
+     * Changing security state between Root and Secure/NonSecure, which may
+     * happen when switching EL, can change the effective value of CNTHCTL_EL2
+     * mask bits. Update the IRQ state accordingly.
+     */
+    gt_update_irq(cpu, GTIMER_VIRT);
+    gt_update_irq(cpu, GTIMER_PHYS);
+}
+
 static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 {
     ARMGenericTimer *gt = &cpu->env.cp15.c14_timer[timeridx];
@@ -2623,13 +2656,9 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
         /* Note that this must be unsigned 64 bit arithmetic: */
         int istatus = count - offset >= gt->cval;
         uint64_t nexttick;
-        int irqstate;
 
         gt->ctl = deposit32(gt->ctl, 2, 1, istatus);
 
-        irqstate = (istatus && !(gt->ctl & 2));
-        qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
-
         if (istatus) {
             /* Next transition is when count rolls back over to zero */
             nexttick = UINT64_MAX;
@@ -2648,14 +2677,14 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
         } else {
             timer_mod(cpu->gt_timer[timeridx], nexttick);
         }
-        trace_arm_gt_recalc(timeridx, irqstate, nexttick);
+        trace_arm_gt_recalc(timeridx, nexttick);
     } else {
         /* Timer disabled: ISTATUS and timer output always clear */
         gt->ctl &= ~4;
-        qemu_set_irq(cpu->gt_timer_outputs[timeridx], 0);
         timer_del(cpu->gt_timer[timeridx]);
         trace_arm_gt_recalc_disabled(timeridx);
     }
+    gt_update_irq(cpu, timeridx);
 }
 
 static void gt_timer_reset(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -2759,10 +2788,8 @@ static void gt_ctl_write(CPUARMState *env, const ARMCPRegInfo *ri,
          * IMASK toggled: don't need to recalculate,
          * just set the interrupt line based on ISTATUS
          */
-        int irqstate = (oldval & 4) && !(value & 2);
-
-        trace_arm_g
[PATCH v3 3/6] target/arm: Skip granule protection checks for AT instructions
GPC checks are not performed on the output address for AT instructions,
as stated by ARM DDI 0487J in D8.12.2:

  When populating PAR_EL1 with the result of an address translation
  instruction, granule protection checks are not performed on the final
  output address of a successful translation.

Rename get_phys_addr_with_secure(), since it's only used to handle AT
instructions.

Signed-off-by: Jean-Philippe Brucker
Reviewed-by: Peter Maydell
---
 target/arm/internals.h | 25 ++---
 target/arm/helper.c    |  8 ++--
 target/arm/ptw.c       | 11 ++-
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index 0f01bc32a8..fc90c364f7 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1190,12 +1190,11 @@ typedef struct GetPhysAddrResult {
 } GetPhysAddrResult;
 
 /**
- * get_phys_addr_with_secure: get the physical address for a virtual address
+ * get_phys_addr: get the physical address for a virtual address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
@@ -1212,26 +1211,30 @@ typedef struct GetPhysAddrResult {
  *  * for PSMAv5 based systems we don't bother to return a full FSR format
  *    value.
  */
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-                               MMUAccessType access_type,
-                               ARMMMUIdx mmu_idx, bool is_secure,
-                               GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr(CPUARMState *env, target_ulong address,
+                   MMUAccessType access_type, ARMMMUIdx mmu_idx,
+                   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
     __attribute__((nonnull));
 
 /**
- * get_phys_addr: get the physical address for a virtual address
+ * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
+ *                                  address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
+ * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similarly, but use the security regime of @mmu_idx.
+ * Similar to get_phys_addr, but use the given security regime and don't perform
+ * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr(CPUARMState *env, target_ulong address,
-                   MMUAccessType access_type, ARMMMUIdx mmu_idx,
-                   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+                                     MMUAccessType access_type,
+                                     ARMMMUIdx mmu_idx, bool is_secure,
+                                     GetPhysAddrResult *result,
+                                     ARMMMUFaultInfo *fi)
     __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index a4c2c1bde5..427de6bd2a 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3365,8 +3365,12 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
     ARMMMUFaultInfo fi = {};
     GetPhysAddrResult res = {};
 
-    ret = get_phys_addr_with_secure(env, value, access_type, mmu_idx,
-                                    is_secure, &res, &fi);
+    /*
+     * I_MXTJT: Granule protection checks are not performed on the final address
+     * of a successful translation.
+     */
+    ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
+                                          is_secure, &res, &fi);
 
     /*
      * ATS operations only do S1 or S1+S2 translations, so we never
diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 063adbd84a..33179f3471 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -3418,16 +3418,17 @@ static bool get_phys_addr_gpc(CPUARMState *env, S1Translate *ptw,
     return false;
 }
 
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-                               MMUAccessType access_type, ARMMMUIdx mmu_idx,
-                               bool is_secure, GetPhysAddrResult *result,
-                               ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+                                     MMUAccessType access_type,
+                                     ARMMMUIdx mmu_idx, bool is_secure,
+                                     GetPhysAddrResult *result
[PATCH v3 4/6] target/arm: Pass security space rather than flag for AT instructions
At the moment we only handle Secure and Nonsecure security spaces for
the AT instructions. Add support for Realm and Root.

For AArch64, arm_security_space() gives the desired space. ARM DDI0487J
says (R_NYXTL):

  If EL3 is implemented, then when an address translation instruction
  that applies to an Exception level lower than EL3 is executed, the
  Effective value of SCR_EL3.{NSE, NS} determines the target Security
  state that the instruction applies to.

For AArch32, some instructions can access NonSecure space from Secure,
so we still need to pass the state explicitly to do_ats_write().

Signed-off-by: Jean-Philippe Brucker
Reviewed-by: Peter Maydell
---
 target/arm/internals.h | 18 +-
 target/arm/helper.c    | 27 ---
 target/arm/ptw.c       | 12 ++--
 3 files changed, 27 insertions(+), 30 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index fc90c364f7..cf13bb94f5 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1217,24 +1217,24 @@ bool get_phys_addr(CPUARMState *env, target_ulong address,
     __attribute__((nonnull));
 
 /**
- * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
- *                                  address
+ * get_phys_addr_with_space_nogpc: get the physical address for a virtual
+ *                                 address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
+ * @space: security space for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similar to get_phys_addr, but use the given security regime and don't perform
+ * Similar to get_phys_addr, but use the given security space and don't perform
  * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
-                                     MMUAccessType access_type,
-                                     ARMMMUIdx mmu_idx, bool is_secure,
-                                     GetPhysAddrResult *result,
-                                     ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_space_nogpc(CPUARMState *env, target_ulong address,
+                                    MMUAccessType access_type,
+                                    ARMMMUIdx mmu_idx, ARMSecuritySpace space,
+                                    GetPhysAddrResult *result,
+                                    ARMMMUFaultInfo *fi)
     __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 427de6bd2a..fbb03c364b 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3357,7 +3357,7 @@ static int par_el1_shareability(GetPhysAddrResult *res)
 
 static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
                              MMUAccessType access_type, ARMMMUIdx mmu_idx,
-                             bool is_secure)
+                             ARMSecuritySpace ss)
 {
     bool ret;
     uint64_t par64;
@@ -3369,8 +3369,8 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
      * I_MXTJT: Granule protection checks are not performed on the final address
      * of a successful translation.
      */
-    ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
-                                          is_secure, &res, &fi);
+    ret = get_phys_addr_with_space_nogpc(env, value, access_type, mmu_idx, ss,
+                                         &res, &fi);
 
     /*
      * ATS operations only do S1 or S1+S2 translations, so we never
@@ -3535,7 +3535,7 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
     uint64_t par64;
     ARMMMUIdx mmu_idx;
     int el = arm_current_el(env);
-    bool secure = arm_is_secure_below_el3(env);
+    ARMSecuritySpace ss = arm_security_space(env);
 
     switch (ri->opc2 & 6) {
     case 0:
@@ -3543,10 +3543,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
         switch (el) {
         case 3:
             mmu_idx = ARMMMUIdx_E3;
-            secure = true;
             break;
         case 2:
-            g_assert(!secure);  /* ARMv8.4-SecEL2 is 64-bit only */
+            g_assert(ss != ARMSS_Secure);  /* ARMv8.4-SecEL2 is 64-bit only */
             /* fall through */
         case 1:
             if (ri->crm == 9 && (env->uncached_cpsr & CPSR_PAN)) {
@@ -3564,10 +3563,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
         switch (el) {
         case 3:
             mmu_idx = ARMMMUIdx_E10_0;
-            secure = true;
             break;
         case 2:
-            g_assert(!secure);  /* ARMv
[PATCH v3 1/6] target/arm/ptw: Load stage-2 tables from realm physical space
In realm state, stage-2 translation tables are fetched from the realm
physical address space (R_PGRQD).

Signed-off-by: Jean-Philippe Brucker
Reviewed-by: Peter Maydell
---
 target/arm/ptw.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index d1de934702..063adbd84a 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -157,22 +157,32 @@ static ARMMMUIdx ptw_idx_for_stage_2(CPUARMState *env, ARMMMUIdx stage2idx)
 
     /*
      * We're OK to check the current state of the CPU here because
-     * (1) we always invalidate all TLBs when the SCR_EL3.NS bit changes
+     * (1) we always invalidate all TLBs when the SCR_EL3.NS or SCR_EL3.NSE bit
+     * changes.
      * (2) there's no way to do a lookup that cares about Stage 2 for a
      * different security state to the current one for AArch64, and AArch32
      * never has a secure EL2. (AArch32 ATS12NSO[UP][RW] allow EL3 to do
      * an NS stage 1+2 lookup while the NS bit is 0.)
      */
-    if (!arm_is_secure_below_el3(env) || !arm_el_is_aa64(env, 3)) {
+    if (!arm_el_is_aa64(env, 3)) {
         return ARMMMUIdx_Phys_NS;
     }
 
-    if (stage2idx == ARMMMUIdx_Stage2_S) {
-        s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW);
-    } else {
-        s2walk_secure = !(env->cp15.vtcr_el2 & VTCR_NSW);
-    }
-    return s2walk_secure ? ARMMMUIdx_Phys_S : ARMMMUIdx_Phys_NS;
+    switch (arm_security_space_below_el3(env)) {
+    case ARMSS_NonSecure:
+        return ARMMMUIdx_Phys_NS;
+    case ARMSS_Realm:
+        return ARMMMUIdx_Phys_Realm;
+    case ARMSS_Secure:
+        if (stage2idx == ARMMMUIdx_Stage2_S) {
+            s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW);
+        } else {
+            s2walk_secure = !(env->cp15.vtcr_el2 & VTCR_NSW);
+        }
+        return s2walk_secure ? ARMMMUIdx_Phys_S : ARMMMUIdx_Phys_NS;
+    default:
+        g_assert_not_reached();
+    }
 }
 
 static bool regime_translation_big_endian(CPUARMState *env, ARMMMUIdx mmu_idx)
-- 
2.41.0
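For readers following along outside the QEMU tree, the selection logic in this patch can be modelled as a small standalone function. This is only a sketch under stated assumptions: the enum names and the walk_secure parameter are hypothetical stand-ins for QEMU's ARMSecuritySpace, ARMMMUIdx_Phys_* and the VSTCR_EL2.SW / VTCR_EL2.NSW test, and it is not the code from the patch.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for ARMSecuritySpace and ARMMMUIdx_Phys_* */
typedef enum { SS_SECURE, SS_NONSECURE, SS_ROOT, SS_REALM } SecuritySpace;
typedef enum { PHYS_S, PHYS_NS, PHYS_ROOT, PHYS_REALM } PhysIdx;

/*
 * Choose the physical address space used to fetch stage-2 tables.
 * below_el3:   security space of the ELs below EL3
 * el3_is_aa64: false models an AArch32 EL3, which never has a secure EL2
 * walk_secure: Secure state only; models !(VSTCR_EL2.SW) or !(VTCR_EL2.NSW)
 */
static PhysIdx stage2_walk_space(SecuritySpace below_el3, bool el3_is_aa64,
                                 bool walk_secure)
{
    if (!el3_is_aa64) {
        return PHYS_NS;
    }

    switch (below_el3) {
    case SS_NONSECURE:
        return PHYS_NS;
    case SS_REALM:
        /* Realm stage-2 tables always come from the realm space */
        return PHYS_REALM;
    case SS_SECURE:
        return walk_secure ? PHYS_S : PHYS_NS;
    default:
        /* Root is not a valid space below EL3 */
        abort();
    }
}
```

The point of the switch is that Realm has no equivalent of the Secure-state SW/NSW choice: the walk space follows the security space directly.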
Re: [PATCH v2 5/6] target/arm/helper: Check SCR_EL3.{NSE,NS} encoding for AT instructions
On Mon, Aug 07, 2023 at 10:54:05AM +0100, Peter Maydell wrote:
> On Fri, 4 Aug 2023 at 19:08, Peter Maydell wrote:
> >
> > On Wed, 2 Aug 2023 at 18:02, Jean-Philippe Brucker
> > wrote:
> > >
> > > The AT instruction is UNDEFINED if the {NSE,NS} configuration is
> > > invalid. Add a function to check this on all AT instructions that apply
> > > to an EL lower than 3.
> > >
> > > Suggested-by: Peter Maydell
> > > Signed-off-by: Jean-Philippe Brucker
> > > ---
> > >  target/arm/helper.c | 36 +---
> > >  1 file changed, 25 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/target/arm/helper.c b/target/arm/helper.c
> > > index fbb03c364b..77dd80ad28 100644
> > > --- a/target/arm/helper.c
> > > +++ b/target/arm/helper.c
> > > @@ -3616,6 +3616,20 @@ static void ats1h_write(CPUARMState *env, const ARMCPRegInfo *ri,
> > >  #endif /* CONFIG_TCG */
> > >  }
> > >
> > > +static CPAccessResult at_e012_access(CPUARMState *env, const ARMCPRegInfo *ri,
> > > +                                     bool isread)
> > > +{
> > > +    /*
> > > +     * R_NYXTL: instruction is UNDEFINED if it applies to an Exception level
> > > +     * lower than EL3 and the combination SCR_EL3.{NSE,NS} is reserved.
> > > +     */
> > > +    if (cpu_isar_feature(aa64_rme, env_archcpu(env)) &&
> > > +        (env->cp15.scr_el3 & (SCR_NSE | SCR_NS)) == SCR_NSE) {
> > > +        return CP_ACCESS_TRAP;
> > > +    }
> >
> > The AArch64.AT() pseudocode and the text in the individual
> > AT insn descriptions ("When FEAT_RME is implemented, if the Effective
> > value of SCR_EL3.{NSE, NS} is a reserved value, this instruction is
> > UNDEFINED at EL3") say that this check needs an "arm_current_el(env) == 3"
> > condition too.
>
> It's been pointed out to me that since trying to return from
> EL3 with SCR_EL3.{NSE,NS} == {1,0} is an illegal exception return,
> it's not actually possible to try to execute these insns in this
> state from any other EL than EL3. So we don't actually need
> to check for EL3 here.
>
> QEMU's implementation of exception return is missing that
> check for illegal-exception-return on bad {NSE,NS}, though.

I can add a patch to check that exception return condition, and add a
comment here explaining that this can only happen when executing at EL3.

Thanks,
Jean
[PATCH v2 0/6] target/arm: Fixes for RME
A few patches to fix RME support and allow booting a realm guest, based on
https://lore.kernel.org/qemu-devel/20230714154648.327466-1-peter.mayd...@linaro.org/

Since v1 I fixed patches 1, 2 and 6 following Peter's comments, and added
patch 5. Patch 6 now factors the timer IRQ update into a new function,
which is a bit invasive but seems cleaner.

v1: https://lore.kernel.org/qemu-devel/20230719153018.1456180-2-jean-phili...@linaro.org/

Jean-Philippe Brucker (6):
  target/arm/ptw: Load stage-2 tables from realm physical space
  target/arm/helper: Fix tlbmask and tlbbits for TLBI VAE2*
  target/arm: Skip granule protection checks for AT instructions
  target/arm: Pass security space rather than flag for AT instructions
  target/arm/helper: Check SCR_EL3.{NSE,NS} encoding for AT instructions
  target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK

 target/arm/cpu.h        |   3 +
 target/arm/internals.h  |  25 +++---
 target/arm/helper.c     | 171 +---
 target/arm/ptw.c        |  39 +
 target/arm/trace-events |   7 +-
 5 files changed, 170 insertions(+), 75 deletions(-)

-- 
2.41.0
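The timer IRQ refactor that patch 6 describes comes down to computing the output line from ISTATUS, IMASK and the CNTHCTL_EL2.CNT[VP]MASK override in one place. A minimal sketch of that computation follows, assuming hypothetical type and function names (TimerIdx, SecuritySpace, timer_irq_level); the bit positions follow the patch (CNTVMASK bit 18, CNTPMASK bit 19) and the architectural CNTx_CTL layout (IMASK bit 1, ISTATUS bit 2). It is an illustration, not the QEMU function.

```c
#include <stdbool.h>
#include <stdint.h>

#define CTL_IMASK        (1u << 1)   /* CNTx_CTL.IMASK */
#define CTL_ISTATUS      (1u << 2)   /* CNTx_CTL.ISTATUS */
#define CNTHCTL_CNTVMASK (1u << 18)
#define CNTHCTL_CNTPMASK (1u << 19)

/* Hypothetical stand-ins for QEMU's GTIMER_* and ARMSecuritySpace */
typedef enum { TIMER_PHYS, TIMER_VIRT } TimerIdx;
typedef enum { SS_SECURE, SS_NONSECURE, SS_ROOT, SS_REALM } SecuritySpace;

/* Return the level of the timer output line. */
static bool timer_irq_level(uint32_t ctl, uint64_t cnthctl,
                            SecuritySpace ss, TimerIdx idx)
{
    /* ISTATUS && !IMASK */
    bool level = (ctl & (CTL_ISTATUS | CTL_IMASK)) == CTL_ISTATUS;

    /*
     * CNT[VP]MASK masks the interrupt regardless of IMASK, but only in
     * Root and Realm state; the bits are RES0 in Secure and NonSecure.
     */
    if ((ss == SS_ROOT || ss == SS_REALM) &&
        ((idx == TIMER_VIRT && (cnthctl & CNTHCTL_CNTVMASK)) ||
         (idx == TIMER_PHYS && (cnthctl & CNTHCTL_CNTPMASK)))) {
        level = false;
    }
    return level;
}
```

Keeping the computation in a single function means every path that changes ISTATUS, IMASK or CNTHCTL_EL2 can simply recompute the line instead of carrying its own copy of the expression.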
[PATCH v2 5/6] target/arm/helper: Check SCR_EL3.{NSE, NS} encoding for AT instructions
The AT instruction is UNDEFINED if the {NSE,NS} configuration is invalid. Add a function to check this on all AT instructions that apply to an EL lower than 3. Suggested-by: Peter Maydell Signed-off-by: Jean-Philippe Brucker --- target/arm/helper.c | 36 +--- 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/target/arm/helper.c b/target/arm/helper.c index fbb03c364b..77dd80ad28 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -3616,6 +3616,20 @@ static void ats1h_write(CPUARMState *env, const ARMCPRegInfo *ri, #endif /* CONFIG_TCG */ } +static CPAccessResult at_e012_access(CPUARMState *env, const ARMCPRegInfo *ri, + bool isread) +{ +/* + * R_NYXTL: instruction is UNDEFINED if it applies to an Exception level + * lower than EL3 and the combination SCR_EL3.{NSE,NS} is reserved. + */ +if (cpu_isar_feature(aa64_rme, env_archcpu(env)) && +(env->cp15.scr_el3 & (SCR_NSE | SCR_NS)) == SCR_NSE) { +return CP_ACCESS_TRAP; +} +return CP_ACCESS_OK; +} + static CPAccessResult at_s1e2_access(CPUARMState *env, const ARMCPRegInfo *ri, bool isread) { @@ -3623,7 +3637,7 @@ static CPAccessResult at_s1e2_access(CPUARMState *env, const ARMCPRegInfo *ri, !(env->cp15.scr_el3 & (SCR_NS | SCR_EEL2))) { return CP_ACCESS_TRAP; } -return CP_ACCESS_OK; +return at_e012_access(env, ri, isread); } static void ats_write64(CPUARMState *env, const ARMCPRegInfo *ri, @@ -5505,38 +5519,38 @@ static const ARMCPRegInfo v8_cp_reginfo[] = { .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 0, .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, .fgt = FGT_ATS1E1R, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, { .name = "AT_S1E1W", .state = ARM_CP_STATE_AA64, .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 1, .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, .fgt = FGT_ATS1E1W, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, { .name = "AT_S1E0R", .state = ARM_CP_STATE_AA64, .opc0 = 1, .opc1 = 
0, .crn = 7, .crm = 8, .opc2 = 2, .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, .fgt = FGT_ATS1E0R, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, { .name = "AT_S1E0W", .state = ARM_CP_STATE_AA64, .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 3, .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, .fgt = FGT_ATS1E0W, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, { .name = "AT_S12E1R", .state = ARM_CP_STATE_AA64, .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 4, .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, { .name = "AT_S12E1W", .state = ARM_CP_STATE_AA64, .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 5, .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, { .name = "AT_S12E0R", .state = ARM_CP_STATE_AA64, .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 6, .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, { .name = "AT_S12E0W", .state = ARM_CP_STATE_AA64, .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 7, .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, /* AT S1E2* are elsewhere as they UNDEF from EL3 if EL2 is not present */ { .name = "AT_S1E3R", .state = ARM_CP_STATE_AA64, .opc0 = 1, .opc1 = 6, .crn = 7, .crm = 8, .opc2 = 0, @@ -8078,12 +8092,12 @@ static const ARMCPRegInfo ats1e1_reginfo[] = { .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 9, .opc2 = 0, .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, .fgt = FGT_ATS1E1RP, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, { .name = "AT_S1E1WP", .state = ARM_CP_STATE_AA64, .opc0 = 1, .opc1 = 0, 
.crn = 7, .crm = 9, .opc2 = 1, .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC, .fgt = FGT_ATS1E1WP, - .writefn = ats_write64 }, + .accessfn = at_e012_access, .writefn = ats_write64 }, }; static const ARMCPRegInfo ats1cp_reginfo[] = { -- 2.41.0
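The encoding test at the heart of at_e012_access() can be tried in isolation. The sketch below assumes the architectural bit positions SCR_EL3.NS = bit 0 and SCR_EL3.NSE = bit 62; the function name and the have_rme parameter are hypothetical and stand in for the cpu_isar_feature(aa64_rme, ...) test in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCR_NS  (1ULL << 0)
#define SCR_NSE (1ULL << 62)

/*
 * SCR_EL3.{NSE,NS} encodings with FEAT_RME:
 *   {0,0} Secure, {0,1} NonSecure, {1,0} reserved, {1,1} Realm.
 * Returns true when the reserved combination is programmed.
 */
static bool scr_nse_ns_is_reserved(uint64_t scr_el3, bool have_rme)
{
    return have_rme && (scr_el3 & (SCR_NSE | SCR_NS)) == SCR_NSE;
}
```

Masking both bits and comparing against SCR_NSE alone is what singles out {1,0}; testing SCR_NSE by itself would also match the valid Realm encoding {1,1}.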
[PATCH v2 1/6] target/arm/ptw: Load stage-2 tables from realm physical space
In realm state, stage-2 translation tables are fetched from the realm
physical address space (R_PGRQD).

Signed-off-by: Jean-Philippe Brucker
---
 target/arm/ptw.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index d1de934702..063adbd84a 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -157,22 +157,32 @@ static ARMMMUIdx ptw_idx_for_stage_2(CPUARMState *env, ARMMMUIdx stage2idx)
 
     /*
      * We're OK to check the current state of the CPU here because
-     * (1) we always invalidate all TLBs when the SCR_EL3.NS bit changes
+     * (1) we always invalidate all TLBs when the SCR_EL3.NS or SCR_EL3.NSE bit
+     * changes.
      * (2) there's no way to do a lookup that cares about Stage 2 for a
      * different security state to the current one for AArch64, and AArch32
      * never has a secure EL2. (AArch32 ATS12NSO[UP][RW] allow EL3 to do
      * an NS stage 1+2 lookup while the NS bit is 0.)
      */
-    if (!arm_is_secure_below_el3(env) || !arm_el_is_aa64(env, 3)) {
+    if (!arm_el_is_aa64(env, 3)) {
         return ARMMMUIdx_Phys_NS;
     }
 
-    if (stage2idx == ARMMMUIdx_Stage2_S) {
-        s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW);
-    } else {
-        s2walk_secure = !(env->cp15.vtcr_el2 & VTCR_NSW);
-    }
-    return s2walk_secure ? ARMMMUIdx_Phys_S : ARMMMUIdx_Phys_NS;
+    switch (arm_security_space_below_el3(env)) {
+    case ARMSS_NonSecure:
+        return ARMMMUIdx_Phys_NS;
+    case ARMSS_Realm:
+        return ARMMMUIdx_Phys_Realm;
+    case ARMSS_Secure:
+        if (stage2idx == ARMMMUIdx_Stage2_S) {
+            s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW);
+        } else {
+            s2walk_secure = !(env->cp15.vtcr_el2 & VTCR_NSW);
+        }
+        return s2walk_secure ? ARMMMUIdx_Phys_S : ARMMMUIdx_Phys_NS;
+    default:
+        g_assert_not_reached();
+    }
 }
 
 static bool regime_translation_big_endian(CPUARMState *env, ARMMMUIdx mmu_idx)
-- 
2.41.0
[PATCH v2 3/6] target/arm: Skip granule protection checks for AT instructions
GPC checks are not performed on the output address for AT instructions,
as stated by ARM DDI 0487J in D8.12.2:

  When populating PAR_EL1 with the result of an address translation
  instruction, granule protection checks are not performed on the final
  output address of a successful translation.

Rename get_phys_addr_with_secure(), since it's only used to handle AT
instructions.

Signed-off-by: Jean-Philippe Brucker
Reviewed-by: Peter Maydell
---
 target/arm/internals.h | 25 ++---
 target/arm/helper.c    |  8 ++--
 target/arm/ptw.c       | 11 ++-
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index 0f01bc32a8..fc90c364f7 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1190,12 +1190,11 @@ typedef struct GetPhysAddrResult {
 } GetPhysAddrResult;
 
 /**
- * get_phys_addr_with_secure: get the physical address for a virtual address
+ * get_phys_addr: get the physical address for a virtual address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
@@ -1212,26 +1211,30 @@ typedef struct GetPhysAddrResult {
  *  * for PSMAv5 based systems we don't bother to return a full FSR format
  *    value.
  */
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-                               MMUAccessType access_type,
-                               ARMMMUIdx mmu_idx, bool is_secure,
-                               GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr(CPUARMState *env, target_ulong address,
+                   MMUAccessType access_type, ARMMMUIdx mmu_idx,
+                   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
     __attribute__((nonnull));
 
 /**
- * get_phys_addr: get the physical address for a virtual address
+ * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
+ *                                  address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
+ * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similarly, but use the security regime of @mmu_idx.
+ * Similar to get_phys_addr, but use the given security regime and don't perform
+ * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr(CPUARMState *env, target_ulong address,
-                   MMUAccessType access_type, ARMMMUIdx mmu_idx,
-                   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+                                     MMUAccessType access_type,
+                                     ARMMMUIdx mmu_idx, bool is_secure,
+                                     GetPhysAddrResult *result,
+                                     ARMMMUFaultInfo *fi)
     __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index a4c2c1bde5..427de6bd2a 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3365,8 +3365,12 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
     ARMMMUFaultInfo fi = {};
     GetPhysAddrResult res = {};
 
-    ret = get_phys_addr_with_secure(env, value, access_type, mmu_idx,
-                                    is_secure, &res, &fi);
+    /*
+     * I_MXTJT: Granule protection checks are not performed on the final address
+     * of a successful translation.
+     */
+    ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
+                                          is_secure, &res, &fi);
 
     /*
      * ATS operations only do S1 or S1+S2 translations, so we never
diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 063adbd84a..33179f3471 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -3418,16 +3418,17 @@ static bool get_phys_addr_gpc(CPUARMState *env, S1Translate *ptw,
     return false;
 }
 
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-                               MMUAccessType access_type, ARMMMUIdx mmu_idx,
-                               bool is_secure, GetPhysAddrResult *result,
-                               ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+                                     MMUAccessType access_type,
+                                     ARMMMUIdx mmu_idx, bool is_secure,
+                                     GetPhysAddrResult *result
[PATCH v2 6/6] target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK
When FEAT_RME is implemented, these bits override the value of
CNT[VP]_CTL_EL0.IMASK in Realm and Root state. Move the IRQ state update
into a new gt_update_irq() function and test those bits every time we
recompute the IRQ state.

Since we're removing the IRQ state from some trace events, add a new
trace event for gt_update_irq().

Signed-off-by: Jean-Philippe Brucker
---
 target/arm/cpu.h        |  3 +++
 target/arm/helper.c     | 54 -
 target/arm/trace-events |  7 +++---
 3 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index bcd65a63ca..bedc7ec6dc 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1743,6 +1743,9 @@ static inline void xpsr_write(CPUARMState *env, uint32_t val, uint32_t mask)
 #define HSTR_TTEE (1 << 16)
 #define HSTR_TJDBX (1 << 17)
 
+#define CNTHCTL_CNTVMASK (1 << 18)
+#define CNTHCTL_CNTPMASK (1 << 19)
+
 /* Return the current FPSCR value. */
 uint32_t vfp_get_fpscr(CPUARMState *env);
 void vfp_set_fpscr(CPUARMState *env, uint32_t val);
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 77dd80ad28..68e915ddda 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -2608,6 +2608,28 @@ static uint64_t gt_get_countervalue(CPUARMState *env)
     return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) / gt_cntfrq_period_ns(cpu);
 }
 
+static void gt_update_irq(ARMCPU *cpu, int timeridx)
+{
+    CPUARMState *env = &cpu->env;
+    uint64_t cnthctl = env->cp15.cnthctl_el2;
+    ARMSecuritySpace ss = arm_security_space(env);
+    /* ISTATUS && !IMASK */
+    int irqstate = (env->cp15.c14_timer[timeridx].ctl & 6) == 4;
+
+    /*
+     * If bit CNTHCTL_EL2.CNT[VP]MASK is set, it overrides IMASK.
+     * It is RES0 in Secure and NonSecure state.
+     */
+    if ((ss == ARMSS_Root || ss == ARMSS_Realm) &&
+        ((timeridx == GTIMER_VIRT && (cnthctl & CNTHCTL_CNTVMASK)) ||
+         (timeridx == GTIMER_PHYS && (cnthctl & CNTHCTL_CNTPMASK)))) {
+        irqstate = 0;
+    }
+
+    qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
+    trace_arm_gt_update_irq(timeridx, irqstate);
+}
+
 static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 {
     ARMGenericTimer *gt = &cpu->env.cp15.c14_timer[timeridx];
@@ -2623,13 +2645,9 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
         /* Note that this must be unsigned 64 bit arithmetic: */
         int istatus = count - offset >= gt->cval;
         uint64_t nexttick;
-        int irqstate;
 
         gt->ctl = deposit32(gt->ctl, 2, 1, istatus);
 
-        irqstate = (istatus && !(gt->ctl & 2));
-        qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
-
         if (istatus) {
             /* Next transition is when count rolls back over to zero */
             nexttick = UINT64_MAX;
@@ -2648,14 +2666,14 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
         } else {
             timer_mod(cpu->gt_timer[timeridx], nexttick);
         }
-        trace_arm_gt_recalc(timeridx, irqstate, nexttick);
+        trace_arm_gt_recalc(timeridx, nexttick);
     } else {
         /* Timer disabled: ISTATUS and timer output always clear */
         gt->ctl &= ~4;
-        qemu_set_irq(cpu->gt_timer_outputs[timeridx], 0);
         timer_del(cpu->gt_timer[timeridx]);
         trace_arm_gt_recalc_disabled(timeridx);
     }
+    gt_update_irq(cpu, timeridx);
 }
 
 static void gt_timer_reset(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -2759,10 +2777,8 @@ static void gt_ctl_write(CPUARMState *env, const ARMCPRegInfo *ri,
          * IMASK toggled: don't need to recalculate,
          * just set the interrupt line based on ISTATUS
          */
-        int irqstate = (oldval & 4) && !(value & 2);
-
-        trace_arm_gt_imask_toggle(timeridx, irqstate);
-        qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
+        trace_arm_gt_imask_toggle(timeridx);
+        gt_update_irq(cpu, timeridx);
     }
 }
 
@@ -2888,6 +2904,21 @@ static void gt_virt_ctl_write(CPUARMState *env, const ARMCPRegInfo *ri,
     gt_ctl_write(env, ri, GTIMER_VIRT, value);
 }
 
+static void gt_cnthctl_write(CPUARMState *env, const ARMCPRegInfo *ri,
+                             uint64_t value)
+{
+    ARMCPU *cpu = env_archcpu(env);
+    uint32_t oldval = env->cp15.cnthctl_el2;
+
+    raw_write(env, ri, value);
+
+    if ((oldval ^ value) & CNTHCTL_CNTVMASK) {
+        gt_update_irq(cpu, GTIMER_VIRT);
+    } else if ((oldval ^ value) & CNTHCTL_CNTPMASK) {
+        gt_update_irq(cpu, GTIMER_PHYS);
+    }
+}
+
 static void gt_cntvoff_write(CPUARMState *env, const ARMCPRegInfo *ri,
                              uint64_t value)
 {
@@ -6200,7 +6231,8 @@ static const ARMCPRegInfo el2_cp_reginfo[] = {
      * reset values as IMPDEF. We choose to reset to 3 to comply with
      * both ARMv7
[PATCH v2 4/6] target/arm: Pass security space rather than flag for AT instructions
At the moment we only handle Secure and Nonsecure security spaces for
the AT instructions. Add support for Realm and Root.

For AArch64, arm_security_space() gives the desired space. ARM DDI0487J
says (R_NYXTL):

  If EL3 is implemented, then when an address translation instruction
  that applies to an Exception level lower than EL3 is executed, the
  Effective value of SCR_EL3.{NSE, NS} determines the target Security
  state that the instruction applies to.

For AArch32, some instructions can access NonSecure space from Secure,
so we still need to pass the state explicitly to do_ats_write().

Signed-off-by: Jean-Philippe Brucker
Reviewed-by: Peter Maydell
---
 target/arm/internals.h | 18 +-
 target/arm/helper.c    | 27 ---
 target/arm/ptw.c       | 12 ++--
 3 files changed, 27 insertions(+), 30 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index fc90c364f7..cf13bb94f5 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1217,24 +1217,24 @@ bool get_phys_addr(CPUARMState *env, target_ulong address,
     __attribute__((nonnull));
 
 /**
- * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
- *                                  address
+ * get_phys_addr_with_space_nogpc: get the physical address for a virtual
+ *                                 address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
+ * @space: security space for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similar to get_phys_addr, but use the given security regime and don't perform
+ * Similar to get_phys_addr, but use the given security space and don't perform
  * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
-                                     MMUAccessType access_type,
-                                     ARMMMUIdx mmu_idx, bool is_secure,
-                                     GetPhysAddrResult *result,
-                                     ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_space_nogpc(CPUARMState *env, target_ulong address,
+                                    MMUAccessType access_type,
+                                    ARMMMUIdx mmu_idx, ARMSecuritySpace space,
+                                    GetPhysAddrResult *result,
+                                    ARMMMUFaultInfo *fi)
     __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 427de6bd2a..fbb03c364b 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3357,7 +3357,7 @@ static int par_el1_shareability(GetPhysAddrResult *res)
 
 static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
                              MMUAccessType access_type, ARMMMUIdx mmu_idx,
-                             bool is_secure)
+                             ARMSecuritySpace ss)
 {
     bool ret;
     uint64_t par64;
@@ -3369,8 +3369,8 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
      * I_MXTJT: Granule protection checks are not performed on the final address
      * of a successful translation.
      */
-    ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
-                                          is_secure, &res, &fi);
+    ret = get_phys_addr_with_space_nogpc(env, value, access_type, mmu_idx, ss,
+                                         &res, &fi);
 
     /*
      * ATS operations only do S1 or S1+S2 translations, so we never
@@ -3535,7 +3535,7 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
     uint64_t par64;
     ARMMMUIdx mmu_idx;
     int el = arm_current_el(env);
-    bool secure = arm_is_secure_below_el3(env);
+    ARMSecuritySpace ss = arm_security_space(env);
 
     switch (ri->opc2 & 6) {
     case 0:
@@ -3543,10 +3543,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
         switch (el) {
         case 3:
             mmu_idx = ARMMMUIdx_E3;
-            secure = true;
             break;
         case 2:
-            g_assert(!secure); /* ARMv8.4-SecEL2 is 64-bit only */
+            g_assert(ss != ARMSS_Secure); /* ARMv8.4-SecEL2 is 64-bit only */
             /* fall through */
         case 1:
             if (ri->crm == 9 && (env->uncached_cpsr & CPSR_PAN)) {
@@ -3564,10 +3563,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
         switch (el) {
         case 3:
             mmu_idx = ARMMMUIdx_E10_0;
-            secure = true;
             break;
         case 2:
-            g_assert(!secure); /* ARMv
[PATCH v2 2/6] target/arm/helper: Fix tlbmask and tlbbits for TLBI VAE2*
When HCR_EL2.E2H is enabled, TLB entries are formed using the EL2&0 translation regime, instead of the EL2 translation regime. The TLB VAE2* instructions invalidate the regime that corresponds to the current value of HCR_EL2.E2H. At the moment we only invalidate the EL2 translation regime. This causes problems with RMM, which issues TLBI VAE2IS instructions with HCR_EL2.E2H enabled. Update vae2_tlbmask() to take HCR_EL2.E2H into account. Add vae2_tlbbits() as well, since the top-byte-ignore configuration is different between the EL2&0 and EL2 regime. Signed-off-by: Jean-Philippe Brucker --- target/arm/helper.c | 50 - 1 file changed, 40 insertions(+), 10 deletions(-) diff --git a/target/arm/helper.c b/target/arm/helper.c index 2959d27543..a4c2c1bde5 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -4663,6 +4663,21 @@ static int vae1_tlbmask(CPUARMState *env) return mask; } +static int vae2_tlbmask(CPUARMState *env) +{ +uint64_t hcr = arm_hcr_el2_eff(env); +uint16_t mask; + +if (hcr & HCR_E2H) { +mask = ARMMMUIdxBit_E20_2 | + ARMMMUIdxBit_E20_2_PAN | + ARMMMUIdxBit_E20_0; +} else { +mask = ARMMMUIdxBit_E2; +} +return mask; +} + /* Return 56 if TBI is enabled, 64 otherwise. */ static int tlbbits_for_regime(CPUARMState *env, ARMMMUIdx mmu_idx, uint64_t addr) @@ -4689,6 +4704,25 @@ static int vae1_tlbbits(CPUARMState *env, uint64_t addr) return tlbbits_for_regime(env, mmu_idx, addr); } +static int vae2_tlbbits(CPUARMState *env, uint64_t addr) +{ +uint64_t hcr = arm_hcr_el2_eff(env); +ARMMMUIdx mmu_idx; + +/* + * Only the regime of the mmu_idx below is significant. + * Regime EL2&0 has two ranges with separate TBI configuration, while EL2 + * only has one. 
+ */ +if (hcr & HCR_E2H) { +mmu_idx = ARMMMUIdx_E20_2; +} else { +mmu_idx = ARMMMUIdx_E2; +} + +return tlbbits_for_regime(env, mmu_idx, addr); +} + static void tlbi_aa64_vmalle1is_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value) { @@ -4781,10 +4815,11 @@ static void tlbi_aa64_vae2_write(CPUARMState *env, const ARMCPRegInfo *ri, * flush-last-level-only. */ CPUState *cs = env_cpu(env); -int mask = e2_tlbmask(env); +int mask = vae2_tlbmask(env); uint64_t pageaddr = sextract64(value << 12, 0, 56); +int bits = vae2_tlbbits(env, pageaddr); -tlb_flush_page_by_mmuidx(cs, pageaddr, mask); +tlb_flush_page_bits_by_mmuidx(cs, pageaddr, mask, bits); } static void tlbi_aa64_vae3_write(CPUARMState *env, const ARMCPRegInfo *ri, @@ -4838,11 +4873,11 @@ static void tlbi_aa64_vae2is_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value) { CPUState *cs = env_cpu(env); +int mask = vae2_tlbmask(env); uint64_t pageaddr = sextract64(value << 12, 0, 56); -int bits = tlbbits_for_regime(env, ARMMMUIdx_E2, pageaddr); +int bits = vae2_tlbbits(env, pageaddr); -tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, - ARMMMUIdxBit_E2, bits); +tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, mask, bits); } static void tlbi_aa64_vae3is_write(CPUARMState *env, const ARMCPRegInfo *ri, @@ -5014,11 +5049,6 @@ static void tlbi_aa64_rvae1is_write(CPUARMState *env, do_rvae_write(env, value, vae1_tlbmask(env), true); } -static int vae2_tlbmask(CPUARMState *env) -{ -return ARMMMUIdxBit_E2; -} - static void tlbi_aa64_rvae2_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value) -- 2.41.0
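To make the effect of HCR_EL2.E2H on the invalidation mask concrete, here is a small standalone model of vae2_tlbmask(). It assumes HCR_EL2.E2H is bit 34; the IDXBIT_* values are hypothetical placeholders (QEMU derives the real bits from its ARMMMUIdx enumeration), so treat this as a sketch of the selection, not the actual helper.

```c
#include <stdint.h>

#define HCR_E2H (1ULL << 34)

/* Hypothetical bit assignments standing in for QEMU's ARMMMUIdxBit_* */
enum {
    IDXBIT_E2        = 1 << 0,
    IDXBIT_E20_0     = 1 << 1,
    IDXBIT_E20_2     = 1 << 2,
    IDXBIT_E20_2_PAN = 1 << 3,
};

/*
 * Select which TLB indexes a TLBI VAE2* operation must flush.
 * With E2H set, EL2 runs in the EL2&0 regime, so all three EL2&0
 * indexes are targeted; otherwise only the plain EL2 regime is.
 */
static uint16_t vae2_mask(uint64_t hcr_el2_eff)
{
    if (hcr_el2_eff & HCR_E2H) {
        return IDXBIT_E20_2 | IDXBIT_E20_2_PAN | IDXBIT_E20_0;
    }
    return IDXBIT_E2;
}
```

The bug described in the commit message corresponds to always returning IDXBIT_E2 here: with E2H enabled, the entries that actually exist live under the EL2&0 indexes and would be left stale.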
Re: [PATCH 3/5] target/arm: Skip granule protection checks for AT instructions
On Thu, Jul 20, 2023 at 05:39:56PM +0100, Peter Maydell wrote: > On Wed, 19 Jul 2023 at 16:56, Jean-Philippe Brucker > wrote: > > > > GPC checks are not performed on the output address for AT instructions, > > as stated by ARM DDI 0487J in D8.12.2: > > > > When populating PAR_EL1 with the result of an address translation > > instruction, granule protection checks are not performed on the final > > output address of a successful translation. > > > > Rename get_phys_addr_with_secure(), since it's only used to handle AT > > instructions. > > > > Signed-off-by: Jean-Philippe Brucker > > --- > > This incidentally fixes a problem with AT S1E1 instructions which can > > output an IPA and should definitely not cause a GPC. > > --- > > target/arm/internals.h | 25 ++--- > > target/arm/helper.c| 8 ++-- > > target/arm/ptw.c | 11 ++- > > 3 files changed, 26 insertions(+), 18 deletions(-) > > > > diff --git a/target/arm/internals.h b/target/arm/internals.h > > index 0f01bc32a8..fc90c364f7 100644 > > --- a/target/arm/internals.h > > +++ b/target/arm/internals.h > > @@ -1190,12 +1190,11 @@ typedef struct GetPhysAddrResult { > > } GetPhysAddrResult; > > > > /** > > - * get_phys_addr_with_secure: get the physical address for a virtual > > address > > + * get_phys_addr: get the physical address for a virtual address > > * @env: CPUARMState > > * @address: virtual address to get physical address for > > * @access_type: 0 for read, 1 for write, 2 for execute > > * @mmu_idx: MMU index indicating required translation regime > > - * @is_secure: security state for the access > > * @result: set on translation success. > > * @fi: set to fault info if the translation fails > > * > > @@ -1212,26 +1211,30 @@ typedef struct GetPhysAddrResult { > > * * for PSMAv5 based systems we don't bother to return a full FSR format > > *value. 
> > */ > > -bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address, > > - MMUAccessType access_type, > > - ARMMMUIdx mmu_idx, bool is_secure, > > - GetPhysAddrResult *result, ARMMMUFaultInfo > > *fi) > > +bool get_phys_addr(CPUARMState *env, target_ulong address, > > + MMUAccessType access_type, ARMMMUIdx mmu_idx, > > + GetPhysAddrResult *result, ARMMMUFaultInfo *fi) > > __attribute__((nonnull)); > > > What is going on in this bit of the patch ? We already > have a prototype for get_phys_addr() with a doc comment. > Is this git's diff-output producing something silly > for a change that is not logically touching get_phys_addr()'s > prototype and comment at all? I swapped the two prototypes in order to make the new comment for get_phys_addr_with_secure_nogpc() more clear. Tried to convey that get_phys_addr() is the normal function and get_phys_addr_with_secure_nogpc() is special. So git is working as expected and this is a style change, I can switch them back if you prefer. Thanks, Jean > > > /** > > - * get_phys_addr: get the physical address for a virtual address > > + * get_phys_addr_with_secure_nogpc: get the physical address for a virtual > > + * address > > * @env: CPUARMState > > * @address: virtual address to get physical address for > > * @access_type: 0 for read, 1 for write, 2 for execute > > * @mmu_idx: MMU index indicating required translation regime > > + * @is_secure: security state for the access > > * @result: set on translation success. > > * @fi: set to fault info if the translation fails > > * > > - * Similarly, but use the security regime of @mmu_idx. > > + * Similar to get_phys_addr, but use the given security regime and don't > > perform > > + * a Granule Protection Check on the resulting address. 
> > */ > > -bool get_phys_addr(CPUARMState *env, target_ulong address, > > - MMUAccessType access_type, ARMMMUIdx mmu_idx, > > - GetPhysAddrResult *result, ARMMMUFaultInfo *fi) > > +bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong > > address, > > + MMUAccessType access_type, > > + ARMMMUIdx mmu_idx, bool is_secure, > > + GetPhysAddrResult *result, > > + ARMMMUFaultInfo *fi) > > __attribute__((nonnull)); > > thanks > -- PMM
Re: [PATCH 2/5] target/arm/helper: Fix vae2_tlbmask()
On Thu, Jul 20, 2023 at 05:35:49PM +0100, Peter Maydell wrote: > On Wed, 19 Jul 2023 at 16:56, Jean-Philippe Brucker > wrote: > > > > When HCR_EL2.E2H is enabled, TLB entries are formed using the EL2&0 > > translation regime, instead of the EL2 translation regime. The TLB VAE2* > > instructions invalidate the regime that corresponds to the current value > > of HCR_EL2.E2H. > > > > At the moment we only invalidate the EL2 translation regime. This causes > > problems with RMM, which issues TLBI VAE2IS instructions with > > HCR_EL2.E2H enabled. Update vae2_tlbmask() to take HCR_EL2.E2H into > > account. > > > > Signed-off-by: Jean-Philippe Brucker > > --- > > target/arm/helper.c | 26 ++ > > 1 file changed, 18 insertions(+), 8 deletions(-) > > > > diff --git a/target/arm/helper.c b/target/arm/helper.c > > index e1b3db6f5f..07a9ac70f5 100644 > > --- a/target/arm/helper.c > > +++ b/target/arm/helper.c > > @@ -4663,6 +4663,21 @@ static int vae1_tlbmask(CPUARMState *env) > > return mask; > > } > > > > +static int vae2_tlbmask(CPUARMState *env) > > +{ > > +uint64_t hcr = arm_hcr_el2_eff(env); > > +uint16_t mask; > > + > > +if (hcr & HCR_E2H) { > > +mask = ARMMMUIdxBit_E20_2 | > > + ARMMMUIdxBit_E20_2_PAN | > > + ARMMMUIdxBit_E20_0; > > +} else { > > +mask = ARMMMUIdxBit_E2; > > +} > > +return mask; > > +} > > + > > /* Return 56 if TBI is enabled, 64 otherwise. */ > > static int tlbbits_for_regime(CPUARMState *env, ARMMMUIdx mmu_idx, > >uint64_t addr) > > > @@ -4838,11 +4853,11 @@ static void tlbi_aa64_vae2is_write(CPUARMState > > *env, const ARMCPRegInfo *ri, > > uint64_t value) > > { > > CPUState *cs = env_cpu(env); > > +int mask = vae2_tlbmask(env); > > uint64_t pageaddr = sextract64(value << 12, 0, 56); > > int bits = tlbbits_for_regime(env, ARMMMUIdx_E2, pageaddr); > > Shouldn't the argument to tlbbits_for_regime() also change > if we're dealing with the EL2&0 regime rather than EL2 ? 
Yes, it affects the result, since the EL2&0 regime has two VA ranges. Thanks, Jean > > > > > -tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, > > - ARMMMUIdxBit_E2, bits); > > +tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, mask, > > bits); > > } > > thanks > -- PMM
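To make the review point concrete, here is a minimal standalone sketch of how both the flush mask and the regime used for the address-bits lookup could follow HCR_EL2.E2H. It is not QEMU code: the `regime` enum, `REGIME_BIT()`, `vae2_mask()` and `vae2_regime()` are hypothetical stand-ins for the ARMMMUIdx machinery, and the HCR_EL2 value is passed in directly rather than read through arm_hcr_el2_eff().

```c
#include <stdint.h>

/* Hypothetical stand-ins for QEMU's ARMMMUIdx values; not the real definitions. */
enum regime { REGIME_E2, REGIME_E20_0, REGIME_E20_2, REGIME_E20_2_PAN };
#define REGIME_BIT(r) (1u << (r))

/* HCR_EL2.E2H is bit 34 of the register. */
#define E2H_BIT (UINT64_C(1) << 34)

/* Set of regimes that a TLBI VAE2* has to invalidate for this HCR_EL2 value. */
static unsigned vae2_mask(uint64_t hcr)
{
    if (hcr & E2H_BIT) {
        return REGIME_BIT(REGIME_E20_2) |
               REGIME_BIT(REGIME_E20_2_PAN) |
               REGIME_BIT(REGIME_E20_0);
    }
    return REGIME_BIT(REGIME_E2);
}

/*
 * Regime whose settings decide how many address bits are significant.
 * EL2&0 has two VA ranges and EL2 has one, so the choice matters.
 */
static enum regime vae2_regime(uint64_t hcr)
{
    return (hcr & E2H_BIT) ? REGIME_E20_2 : REGIME_E2;
}
```

In the handler, the value returned by a helper like `vae2_regime()` would take the place of the hard-coded ARMMMUIdx_E2 argument to tlbbits_for_regime(); whether E20_2 is the right representative index for that call is an assumption of this sketch.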
Re: [PATCH 0/5] target/arm: Fixes for RME
On Thu, Jul 20, 2023 at 01:05:58PM +0100, Peter Maydell wrote: > On Wed, 19 Jul 2023 at 16:56, Jean-Philippe Brucker > wrote: > > > > With these patches I'm able to boot a Realm guest under > > "-cpu max,x-rme=on". They are based on Peter's series which fixes > > handling of NSTable: > > https://lore.kernel.org/qemu-devel/20230714154648.327466-1-peter.mayd...@linaro.org/ > > Thanks for testing this -- this is a lot closer to > working out of the box than I thought we might be :-) > I'm tempted to try to put these fixes (and my ptw patchset) > into 8.1, but OTOH I suspect work on Realm guests will probably > still want to use a bleeding-edge QEMU for other bugs we're > going to discover over the next few months, so IDK. We'll > see how code review goes on those, I guess. > > > Running a Realm guest requires components at EL3 and R-EL2. Some rough > > support for TF-A and RMM is available here: > > https://jpbrucker.net/git/tf-a/log/?h=qemu-rme > > https://jpbrucker.net/git/rmm/log/?h=qemu-rme > > I'll clean this up before sending it out. > > > > I also need to manually disable FEAT_SME in QEMU in order to boot this, > > Do you mean you needed to do something more invasive than > '-cpu max,x-rme=on,sme=off' ? Ah no, I hadn't noticed there was a sme=off switch, that's much better Thanks, Jean > > > otherwise the Linux host fails to boot because hyp-stub accesses to SME > > regs are trapped to EL3, which doesn't support RME+SME at the moment. > > The right fix is probably in TF-A but I haven't investigated yet. > > thanks > -- PMM
Re: [PATCH for-8.1] virtio-iommu: Standardize granule extraction and formatting
On Tue, Jul 18, 2023 at 08:21:36PM +0200, Eric Auger wrote: > At several locations we compute the granule from the config > page_size_mask using ctz() and then format it in traces using > BIT(). As the page_size_mask is 64b we should use ctz64 and > BIT_ULL() for formatting. We failed to be consistent. > > Note the page_size_mask is guaranteed to be non-null. The spec > mandates the device to set at least one bit, so ctz64 cannot > return 64. This is guaranteed by the fact the device > initializes the page_size_mask to qemu_target_page_mask() > and then the page_size_mask is further constrained by > virtio_iommu_set_page_size_mask() callback which can't > result in a new mask being null. So if Coverity complains > around those ctz64/BIT_ULL with CID 1517772 this is a false > positive > > Signed-off-by: Eric Auger > Fixes: 94df5b2180 ("virtio-iommu: Fix 64kB host page size VFIO device > assignment") Reviewed-by: Jean-Philippe Brucker > --- > hw/virtio/virtio-iommu.c | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c > index 201127c488..c6ee4d7a3c 100644 > --- a/hw/virtio/virtio-iommu.c > +++ b/hw/virtio/virtio-iommu.c > @@ -852,17 +852,19 @@ static IOMMUTLBEntry > virtio_iommu_translate(IOMMUMemoryRegion *mr, hwaddr addr, > VirtIOIOMMUEndpoint *ep; > uint32_t sid, flags; > bool bypass_allowed; > +int granule; > bool found; > int i; > > interval.low = addr; > interval.high = addr + 1; > +granule = ctz64(s->config.page_size_mask); > > IOMMUTLBEntry entry = { > .target_as = &address_space_memory, > .iova = addr, > .translated_addr = addr, > -.addr_mask = (1 << ctz32(s->config.page_size_mask)) - 1, > +.addr_mask = BIT_ULL(granule) - 1, > .perm = IOMMU_NONE, > }; > > @@ -1115,7 +1117,7 @@ static int > virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr, > if (s->granule_frozen) { > int cur_granule = ctz64(cur_mask); > > -if (!(BIT(cur_granule) & new_mask)) { > +if (!(BIT_ULL(cur_granule) & new_mask)) { >
error_setg(errp, "virtio-iommu %s does not support frozen > granule 0x%llx", > mr->parent_obj.name, BIT_ULL(cur_granule)); > return -1; > @@ -1161,7 +1163,7 @@ static void virtio_iommu_freeze_granule(Notifier > *notifier, void *data) > } > s->granule_frozen = true; > granule = ctz64(s->config.page_size_mask); > -trace_virtio_iommu_freeze_granule(BIT(granule)); > +trace_virtio_iommu_freeze_granule(BIT_ULL(granule)); > } > > static void virtio_iommu_device_realize(DeviceState *dev, Error **errp) > -- > 2.38.1 >
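As a standalone illustration of why the patch above moves to 64-bit helpers: a 32-bit count only sees the low half of the 64-bit page_size_mask, and `1 << n` is computed as an int. The sketch below does the whole computation in 64 bits. It uses the GCC/Clang builtin `__builtin_ctzll()` rather than QEMU's ctz64() and BIT_ULL(), and the function names `granule_shift()` and `granule_addr_mask()` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Index of the lowest set bit of a 64-bit page size mask.
 * The caller must pass a non-zero mask, as the commit message argues
 * is always the case for the device.
 */
static int granule_shift(uint64_t page_size_mask)
{
    assert(page_size_mask != 0);
    return __builtin_ctzll(page_size_mask);
}

/* Address mask covering one granule, computed entirely in 64 bits. */
static uint64_t granule_addr_mask(uint64_t page_size_mask)
{
    return (UINT64_C(1) << granule_shift(page_size_mask)) - 1;
}
```

With a mask whose lowest set bit is at position 32 or above, the 64-bit version still yields the expected granule, which is the case the 32-bit form could not represent.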
[PATCH 0/5] target/arm: Fixes for RME
With these patches I'm able to boot a Realm guest under "-cpu max,x-rme=on". They are based on Peter's series which fixes handling of NSTable: https://lore.kernel.org/qemu-devel/20230714154648.327466-1-peter.mayd...@linaro.org/ Running a Realm guest requires components at EL3 and R-EL2. Some rough support for TF-A and RMM is available here: https://jpbrucker.net/git/tf-a/log/?h=qemu-rme https://jpbrucker.net/git/rmm/log/?h=qemu-rme I'll clean this up before sending it out. I also need to manually disable FEAT_SME in QEMU in order to boot this, otherwise the Linux host fails to boot because hyp-stub accesses to SME regs are trapped to EL3, which doesn't support RME+SME at the moment. The right fix is probably in TF-A but I haven't investigated yet. Jean-Philippe Brucker (5): target/arm/ptw: Load stage-2 tables from realm physical space target/arm/helper: Fix vae2_tlbmask() target/arm: Skip granule protection checks for AT instructions target/arm: Pass security space rather than flag for AT instructions target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK target/arm/internals.h | 25 -- target/arm/helper.c| 78 -- target/arm/ptw.c | 19 ++ 3 files changed, 79 insertions(+), 43 deletions(-) -- 2.41.0
[PATCH 4/5] target/arm: Pass security space rather than flag for AT instructions
At the moment we only handle Secure and Nonsecure security spaces for the AT instructions. Add support for Realm and Root. For AArch64, arm_security_space() gives the desired space. ARM DDI0487J says (R_NYXTL): If EL3 is implemented, then when an address translation instruction that applies to an Exception level lower than EL3 is executed, the Effective value of SCR_EL3.{NSE, NS} determines the target Security state that the instruction applies to. For AArch32, some instructions can access NonSecure space from Secure, so we still need to pass the state explicitly to do_ats_write(). Signed-off-by: Jean-Philippe Brucker --- I haven't tested AT instructions in Realm/Root space yet, but it looks like the patch is needed. RMM doesn't issue AT instructions like KVM does in non-secure state (which triggered the bug in the previous patch). --- target/arm/internals.h | 18 +- target/arm/helper.c| 27 --- target/arm/ptw.c | 12 ++-- 3 files changed, 27 insertions(+), 30 deletions(-) diff --git a/target/arm/internals.h b/target/arm/internals.h index fc90c364f7..cf13bb94f5 100644 --- a/target/arm/internals.h +++ b/target/arm/internals.h @@ -1217,24 +1217,24 @@ bool get_phys_addr(CPUARMState *env, target_ulong address, __attribute__((nonnull)); /** - * get_phys_addr_with_secure_nogpc: get the physical address for a virtual - * address + * get_phys_addr_with_space_nogpc: get the physical address for a virtual + * address * @env: CPUARMState * @address: virtual address to get physical address for * @access_type: 0 for read, 1 for write, 2 for execute * @mmu_idx: MMU index indicating required translation regime - * @is_secure: security state for the access + * @space: security space for the access * @result: set on translation success. 
* @fi: set to fault info if the translation fails * - * Similar to get_phys_addr, but use the given security regime and don't perform + * Similar to get_phys_addr, but use the given security space and don't perform * a Granule Protection Check on the resulting address. */ -bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address, - MMUAccessType access_type, - ARMMMUIdx mmu_idx, bool is_secure, - GetPhysAddrResult *result, - ARMMMUFaultInfo *fi) +bool get_phys_addr_with_space_nogpc(CPUARMState *env, target_ulong address, +MMUAccessType access_type, +ARMMMUIdx mmu_idx, ARMSecuritySpace space, +GetPhysAddrResult *result, +ARMMMUFaultInfo *fi) __attribute__((nonnull)); bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address, diff --git a/target/arm/helper.c b/target/arm/helper.c index 3ee2bb5fe1..2017b11795 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -3357,7 +3357,7 @@ static int par_el1_shareability(GetPhysAddrResult *res) static uint64_t do_ats_write(CPUARMState *env, uint64_t value, MMUAccessType access_type, ARMMMUIdx mmu_idx, - bool is_secure) + ARMSecuritySpace ss) { bool ret; uint64_t par64; @@ -3369,8 +3369,8 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value, * I_MXTJT: Granule protection checks are not performed on the final address * of a successful translation. 
 */ -ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx, - is_secure, &res, &fi); +ret = get_phys_addr_with_space_nogpc(env, value, access_type, mmu_idx, ss, + &res, &fi); /* * ATS operations only do S1 or S1+S2 translations, so we never @@ -3535,7 +3535,7 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value) uint64_t par64; ARMMMUIdx mmu_idx; int el = arm_current_el(env); -bool secure = arm_is_secure_below_el3(env); +ARMSecuritySpace ss = arm_security_space(env); switch (ri->opc2 & 6) { case 0: @@ -3543,10 +3543,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value) switch (el) { case 3: mmu_idx = ARMMMUIdx_E3; -secure = true; break; case 2: -g_assert(!secure); /* ARMv8.4-SecEL2 is 64-bit only */ +g_assert(ss != ARMSS_Secure); /* ARMv8.4-SecEL2 is 64-bit only */ /* fall through */ case 1: if (ri->crm == 9 && (env->uncached_cpsr & CPSR_PAN)) { @@ -3564,10 +3563,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t v
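The rule quoted from the Arm ARM (R_NYXTL) in this commit message, that the Effective value of SCR_EL3.{NSE, NS} selects the target security state below EL3, can be sketched roughly as follows. This is a standalone illustration and not QEMU's arm_security_space(): the `space` enum and `space_below_el3()` are hypothetical names, and the {NSE=1, NS=0} encoding is reported as reserved here, which is an assumption of this sketch rather than a statement about how any emulator treats it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a security-space type; not QEMU's ARMSecuritySpace. */
enum space { SPACE_SECURE, SPACE_NONSECURE, SPACE_REALM, SPACE_RESERVED };

#define SCR_NS  (UINT64_C(1) << 0)   /* SCR_EL3.NS  */
#define SCR_NSE (UINT64_C(1) << 62)  /* SCR_EL3.NSE */

/* Security space targeted by an instruction that applies below EL3. */
static enum space space_below_el3(uint64_t scr_el3)
{
    bool ns = (scr_el3 & SCR_NS) != 0;
    bool nse = (scr_el3 & SCR_NSE) != 0;

    if (nse) {
        return ns ? SPACE_REALM : SPACE_RESERVED;
    }
    return ns ? SPACE_NONSECURE : SPACE_SECURE;
}
```

A single boolean cannot carry the Realm case, which is the motivation the patch gives for passing a security space instead of is_secure.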
[PATCH 5/5] target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK
When FEAT_RME is implemented, these bits override the value of CNT[VP]_CTL_EL0.IMASK in Realm and Root state. Signed-off-by: Jean-Philippe Brucker --- target/arm/helper.c | 21 +++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/target/arm/helper.c b/target/arm/helper.c index 2017b11795..5b173a827f 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -2608,6 +2608,23 @@ static uint64_t gt_get_countervalue(CPUARMState *env) return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) / gt_cntfrq_period_ns(cpu); } +static bool gt_is_masked(CPUARMState *env, int timeridx) +{ +ARMSecuritySpace ss = arm_security_space(env); + +/* + * If bits CNTHCTL_EL2.CNT[VP]MASK are set, they override + * CNT[VP]_CTL_EL0.IMASK. They are RES0 in Secure and NonSecure state. + */ +if ((ss == ARMSS_Root || ss == ARMSS_Realm) && +((timeridx == GTIMER_VIRT && extract64(env->cp15.cnthctl_el2, 18, 1)) || + (timeridx == GTIMER_PHYS && extract64(env->cp15.cnthctl_el2, 19, 1)))) { +return true; +} + +return env->cp15.c14_timer[timeridx].ctl & 2; +} + static void gt_recalc_timer(ARMCPU *cpu, int timeridx) { ARMGenericTimer *gt = &cpu->env.cp15.c14_timer[timeridx]; @@ -2627,7 +2644,7 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx) gt->ctl = deposit32(gt->ctl, 2, 1, istatus); -irqstate = (istatus && !(gt->ctl & 2)); +irqstate = (istatus && !gt_is_masked(&cpu->env, timeridx)); qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate); if (istatus) { @@ -2759,7 +2776,7 @@ static void gt_ctl_write(CPUARMState *env, const ARMCPRegInfo *ri, * IMASK toggled: don't need to recalculate, * just set the interrupt line based on ISTATUS */ -int irqstate = (oldval & 4) && !(value & 2); +int irqstate = (oldval & 4) && !gt_is_masked(env, timeridx); trace_arm_gt_imask_toggle(timeridx, irqstate); qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate); -- 2.41.0
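The override described in this patch can be exercised in isolation with a small sketch like the one below. It mirrors the bit positions used in the diff (CNTHCTL_EL2 bits 18 and 19, and bit 1 of the timer control register for IMASK), but everything else is a stand-in: the `space` and `timer` enums and `timer_is_masked()` are hypothetical and take plain register values instead of a CPUARMState.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; QEMU's ARMSecuritySpace and GTIMER_* constants differ. */
enum space { SPACE_SECURE, SPACE_NONSECURE, SPACE_ROOT, SPACE_REALM };
enum timer { TIMER_PHYS, TIMER_VIRT };

#define CNTHCTL_CNTVMASK (UINT64_C(1) << 18)
#define CNTHCTL_CNTPMASK (UINT64_C(1) << 19)
#define CTL_IMASK        (UINT32_C(1) << 1)

/*
 * Returns true when the timer interrupt is masked, either by the
 * per-timer IMASK bit or, in Root and Realm state only, by the
 * matching CNTHCTL_EL2 override bit.
 */
static bool timer_is_masked(enum space ss, enum timer t,
                            uint64_t cnthctl_el2, uint32_t ctl)
{
    if (ss == SPACE_ROOT || ss == SPACE_REALM) {
        uint64_t override = (t == TIMER_VIRT) ? CNTHCTL_CNTVMASK
                                              : CNTHCTL_CNTPMASK;
        if (cnthctl_el2 & override) {
            return true;
        }
    }
    return (ctl & CTL_IMASK) != 0;
}
```

The override bit for one timer leaves the other timer alone, and in Secure or NonSecure state only IMASK counts, matching the comment in the patch that the bits are RES0 there.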
[PATCH 2/5] target/arm/helper: Fix vae2_tlbmask()
When HCR_EL2.E2H is enabled, TLB entries are formed using the EL2&0 translation regime, instead of the EL2 translation regime. The TLB VAE2* instructions invalidate the regime that corresponds to the current value of HCR_EL2.E2H. At the moment we only invalidate the EL2 translation regime. This causes problems with RMM, which issues TLBI VAE2IS instructions with HCR_EL2.E2H enabled. Update vae2_tlbmask() to take HCR_EL2.E2H into account. Signed-off-by: Jean-Philippe Brucker --- target/arm/helper.c | 26 ++ 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/target/arm/helper.c b/target/arm/helper.c index e1b3db6f5f..07a9ac70f5 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -4663,6 +4663,21 @@ static int vae1_tlbmask(CPUARMState *env) return mask; } +static int vae2_tlbmask(CPUARMState *env) +{ +uint64_t hcr = arm_hcr_el2_eff(env); +uint16_t mask; + +if (hcr & HCR_E2H) { +mask = ARMMMUIdxBit_E20_2 | + ARMMMUIdxBit_E20_2_PAN | + ARMMMUIdxBit_E20_0; +} else { +mask = ARMMMUIdxBit_E2; +} +return mask; +} + /* Return 56 if TBI is enabled, 64 otherwise. */ static int tlbbits_for_regime(CPUARMState *env, ARMMMUIdx mmu_idx, uint64_t addr) @@ -4781,7 +4796,7 @@ static void tlbi_aa64_vae2_write(CPUARMState *env, const ARMCPRegInfo *ri, * flush-last-level-only. 
*/ CPUState *cs = env_cpu(env); -int mask = e2_tlbmask(env); +int mask = vae2_tlbmask(env); uint64_t pageaddr = sextract64(value << 12, 0, 56); tlb_flush_page_by_mmuidx(cs, pageaddr, mask); @@ -4838,11 +4853,11 @@ static void tlbi_aa64_vae2is_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value) { CPUState *cs = env_cpu(env); +int mask = vae2_tlbmask(env); uint64_t pageaddr = sextract64(value << 12, 0, 56); int bits = tlbbits_for_regime(env, ARMMMUIdx_E2, pageaddr); -tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, - ARMMMUIdxBit_E2, bits); +tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, mask, bits); } static void tlbi_aa64_vae3is_write(CPUARMState *env, const ARMCPRegInfo *ri, @@ -5014,11 +5029,6 @@ static void tlbi_aa64_rvae1is_write(CPUARMState *env, do_rvae_write(env, value, vae1_tlbmask(env), true); } -static int vae2_tlbmask(CPUARMState *env) -{ -return ARMMMUIdxBit_E2; -} - static void tlbi_aa64_rvae2_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value) -- 2.41.0
[PATCH 3/5] target/arm: Skip granule protection checks for AT instructions
GPC checks are not performed on the output address for AT instructions, as stated by ARM DDI 0487J in D8.12.2: When populating PAR_EL1 with the result of an address translation instruction, granule protection checks are not performed on the final output address of a successful translation. Rename get_phys_addr_with_secure(), since it's only used to handle AT instructions. Signed-off-by: Jean-Philippe Brucker --- This incidentally fixes a problem with AT S1E1 instructions which can output an IPA and should definitely not cause a GPC. --- target/arm/internals.h | 25 ++--- target/arm/helper.c| 8 ++-- target/arm/ptw.c | 11 ++- 3 files changed, 26 insertions(+), 18 deletions(-) diff --git a/target/arm/internals.h b/target/arm/internals.h index 0f01bc32a8..fc90c364f7 100644 --- a/target/arm/internals.h +++ b/target/arm/internals.h @@ -1190,12 +1190,11 @@ typedef struct GetPhysAddrResult { } GetPhysAddrResult; /** - * get_phys_addr_with_secure: get the physical address for a virtual address + * get_phys_addr: get the physical address for a virtual address * @env: CPUARMState * @address: virtual address to get physical address for * @access_type: 0 for read, 1 for write, 2 for execute * @mmu_idx: MMU index indicating required translation regime - * @is_secure: security state for the access * @result: set on translation success. * @fi: set to fault info if the translation fails * @@ -1212,26 +1211,30 @@ typedef struct GetPhysAddrResult { * * for PSMAv5 based systems we don't bother to return a full FSR format *value. 
 */ -bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address, - MMUAccessType access_type, - ARMMMUIdx mmu_idx, bool is_secure, - GetPhysAddrResult *result, ARMMMUFaultInfo *fi) +bool get_phys_addr(CPUARMState *env, target_ulong address, + MMUAccessType access_type, ARMMMUIdx mmu_idx, + GetPhysAddrResult *result, ARMMMUFaultInfo *fi) __attribute__((nonnull)); /** - * get_phys_addr: get the physical address for a virtual address + * get_phys_addr_with_secure_nogpc: get the physical address for a virtual + * address * @env: CPUARMState * @address: virtual address to get physical address for * @access_type: 0 for read, 1 for write, 2 for execute * @mmu_idx: MMU index indicating required translation regime + * @is_secure: security state for the access * @result: set on translation success. * @fi: set to fault info if the translation fails * - * Similarly, but use the security regime of @mmu_idx. + * Similar to get_phys_addr, but use the given security regime and don't perform + * a Granule Protection Check on the resulting address. */ -bool get_phys_addr(CPUARMState *env, target_ulong address, - MMUAccessType access_type, ARMMMUIdx mmu_idx, - GetPhysAddrResult *result, ARMMMUFaultInfo *fi) +bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address, + MMUAccessType access_type, + ARMMMUIdx mmu_idx, bool is_secure, + GetPhysAddrResult *result, + ARMMMUFaultInfo *fi) __attribute__((nonnull)); bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address, diff --git a/target/arm/helper.c b/target/arm/helper.c index 07a9ac70f5..3ee2bb5fe1 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -3365,8 +3365,12 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value, ARMMMUFaultInfo fi = {}; GetPhysAddrResult res = {}; -ret = get_phys_addr_with_secure(env, value, access_type, mmu_idx, -is_secure, &res, &fi); +/* + * I_MXTJT: Granule protection checks are not performed on the final address + * of a successful translation.
+ */ +ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx, + is_secure, &res, &fi); /* * ATS operations only do S1 or S1+S2 translations, so we never diff --git a/target/arm/ptw.c b/target/arm/ptw.c index 6318e13b98..1aef2b8cef 100644 --- a/target/arm/ptw.c +++ b/target/arm/ptw.c @@ -3412,16 +3412,17 @@ static bool get_phys_addr_gpc(CPUARMState *env, S1Translate *ptw, return false; } -bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address, - MMUAccessType access_type, ARMMMUIdx mmu_idx, - bool is_secure, GetPhysAddrResult *result, - ARMMMUFaultInfo *fi) +bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address, + MMUAccessType access_type, + ARMMMUIdx mmu_idx, bool
[PATCH 1/5] target/arm/ptw: Load stage-2 tables from realm physical space
In realm state, stage-2 translation tables are fetched from the realm physical address space (R_PGRQD). Signed-off-by: Jean-Philippe Brucker --- target/arm/ptw.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/target/arm/ptw.c b/target/arm/ptw.c index d1de934702..6318e13b98 100644 --- a/target/arm/ptw.c +++ b/target/arm/ptw.c @@ -164,7 +164,11 @@ static ARMMMUIdx ptw_idx_for_stage_2(CPUARMState *env, ARMMMUIdx stage2idx) * an NS stage 1+2 lookup while the NS bit is 0.) */ if (!arm_is_secure_below_el3(env) || !arm_el_is_aa64(env, 3)) { -return ARMMMUIdx_Phys_NS; +if (arm_security_space_below_el3(env) == ARMSS_Realm) { +return ARMMMUIdx_Phys_Realm; +} else { +return ARMMMUIdx_Phys_NS; +} } if (stage2idx == ARMMMUIdx_Stage2_S) { s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW); -- 2.41.0
Re: [PATCH for-8.1 2/3] target/arm: Fix S1_ptw_translate() debug path
On Mon, Jul 10, 2023 at 04:21:29PM +0100, Peter Maydell wrote: > In commit XXX we rearranged the logic in S1_ptw_translate() so that > the debug-access "call get_phys_addr_*" codepath is used both when S1 > is doing ptw reads from stage 2 and when it is doing ptw reads from > physical memory. However, we didn't update the calculation of > s2ptw->in_space and s2ptw->in_secure to account for the "ptw reads > from physical memory" case. This meant that debug accesses when in > Secure state broke. > > Create a new function S2_security_space() which returns the > correct security space to use for the ptw load, and use it to > determine the correct .in_secure and .in_space fields for the > stage 2 lookup for the ptw load. > > Reported-by: Jean-Philippe Brucker > Fixes: fe4a5472ccd6 ("target/arm: Use get_phys_addr_with_struct in > S1_ptw_translate") > Signed-off-by: Peter Maydell Thanks, this fixes tf-a boot with semihosting Tested-by: Jean-Philippe Brucker > --- > target/arm/ptw.c | 37 - > 1 file changed, 32 insertions(+), 5 deletions(-) > > diff --git a/target/arm/ptw.c b/target/arm/ptw.c > index 21749375f97..c0b9cee5843 100644 > --- a/target/arm/ptw.c > +++ b/target/arm/ptw.c > @@ -485,11 +485,39 @@ static bool S2_attrs_are_device(uint64_t hcr, uint8_t > attrs) > } > } > > +static ARMSecuritySpace S2_security_space(ARMSecuritySpace s1_space, > + ARMMMUIdx s2_mmu_idx) > +{ > +/* > + * Return the security space to use for stage 2 when doing > + * the S1 page table descriptor load. > + */ > +if (regime_is_stage2(s2_mmu_idx)) { > +/* > + * The security space for ptw reads is almost always the same > + * as that of the security space of the stage 1 translation. > + * The only exception is when stage 1 is Secure; in that case > + * the ptw read might be to the Secure or the NonSecure space > + * (but never Realm or Root), and the s2_mmu_idx tells us which. > + * Root translations are always single-stage. 
> + */ > +if (s1_space == ARMSS_Secure) { > +return arm_secure_to_space(s2_mmu_idx == ARMMMUIdx_Stage2_S); > +} else { > +assert(s2_mmu_idx != ARMMMUIdx_Stage2_S); > +assert(s1_space != ARMSS_Root); > +return s1_space; > +} > +} else { > +/* ptw loads are from phys: the mmu idx itself says which space */ > +return arm_phys_to_space(s2_mmu_idx); > +} > +} > + > /* Translate a S1 pagetable walk through S2 if needed. */ > static bool S1_ptw_translate(CPUARMState *env, S1Translate *ptw, > hwaddr addr, ARMMMUFaultInfo *fi) > { > -ARMSecuritySpace space = ptw->in_space; > bool is_secure = ptw->in_secure; > ARMMMUIdx mmu_idx = ptw->in_mmu_idx; > ARMMMUIdx s2_mmu_idx = ptw->in_ptw_idx; > @@ -502,13 +530,12 @@ static bool S1_ptw_translate(CPUARMState *env, > S1Translate *ptw, > * From gdbstub, do not use softmmu so that we don't modify the > * state of the cpu at all, including softmmu tlb contents. > */ > +ARMSecuritySpace s2_space = S2_security_space(ptw->in_space, > s2_mmu_idx); > S1Translate s2ptw = { > .in_mmu_idx = s2_mmu_idx, > .in_ptw_idx = ptw_idx_for_stage_2(env, s2_mmu_idx), > -.in_secure = s2_mmu_idx == ARMMMUIdx_Stage2_S, > -.in_space = (s2_mmu_idx == ARMMMUIdx_Stage2_S ? ARMSS_Secure > - : space == ARMSS_Realm ? ARMSS_Realm > - : ARMSS_NonSecure), > +.in_secure = arm_space_is_secure(s2_space), > +.in_space = s2_space, > .in_debug = true, > }; > GetPhysAddrResult s2 = { }; > -- > 2.34.1 >
Re: [PATCH v2 0/2] VIRTIO-IOMMU/VFIO page size related fixes
On Wed, Jul 05, 2023 at 06:51:16PM +0200, Eric Auger wrote: > When assigning a host device and protecting it with the virtio-iommu we may > end up with qemu crashing with > > qemu-kvm: virtio-iommu page mask 0xf000 is incompatible > with mask 0x2001 > qemu: hardware error: vfio: DMA mapping failed, unable to continue > > This happens if the host features a 64kB page size and constrains > the physical IOMMU to use a 64kB page size. By default a 4kB granule is used > by the qemu virtio-iommu device, and the latter becomes aware of the 64kB > requirement too late, after the machine init, when the vfio device domain is > attached. virtio_iommu_set_page_size_mask() fails and this causes > vfio_listener_region_add() to end up with hw_error(). Currently the > granule is global to all domains. > > To work around this issue, even though the IOMMU MR may be bypassed, we > transiently enable it on machine init done to get vfio_listener_region_add > and virtio_iommu_set_page_size_mask called earlier, before the domain > attach. That way the page size requirement can be taken into account > before the guest gets started. > > Also take advantage of this series to clean up some traces > which may confuse the end user. For both patches: Reviewed-by: Jean-Philippe Brucker Tested-by: Jean-Philippe Brucker
Re: [PATCH] target/arm: Fix ptw parameters in S1_ptw_translate() for debug contexts
On Thu, Jul 06, 2023 at 04:42:02PM +0100, Peter Maydell wrote: > > > Do you have a repro case for this bug? Did it work > > > before commit fe4a5472ccd6 ? > > > > Yes I bisected to fe4a5472ccd6 by trying to run TF-A, following > > instructions here: > > https://github.com/ARM-software/arm-trusted-firmware/blob/master/docs/plat/qemu.rst > > > > Building TF-A (HEAD 8e31faa05): > > make -j CROSS_COMPILE=aarch64-linux-gnu- > > BL33=edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd > > PLAT=qemu DEBUG=1 LOG_LEVEL=40 all fip > > > > Installing it to QEMU runtime dir: > > ln -sf tf-a/build/qemu/debug/bl1.bin build/qemu-cca/run/ > > ln -sf tf-a/build/qemu/debug/bl2.bin build/qemu-cca/run/ > > ln -sf tf-a/build/qemu/debug/bl31.bin build/qemu-cca/run/ > > ln -sf edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd > > build/qemu-cca/run/bl33.bin > > Could you put the necessary binary blobs up somewhere, to save > me trying to rebuild TF-A ? Uploaded to: https://jpbrucker.net/tmp/2023-07-06-repro-qemu-tfa.tar.gz Thanks, Jean > > > > > > --- > > > > target/arm/ptw.c | 6 ++ > > > > 1 file changed, 2 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/target/arm/ptw.c b/target/arm/ptw.c > > > > index 9aaff1546a..e3a738c28e 100644 > > > > --- a/target/arm/ptw.c > > > > +++ b/target/arm/ptw.c > > > > @@ -465,10 +465,8 @@ static bool S1_ptw_translate(CPUARMState *env, > > > > S1Translate *ptw, > > > > S1Translate s2ptw = { > > > > .in_mmu_idx = s2_mmu_idx, > > > > .in_ptw_idx = ptw_idx_for_stage_2(env, s2_mmu_idx), > > > > -.in_secure = s2_mmu_idx == ARMMMUIdx_Stage2_S, > > > > -.in_space = (s2_mmu_idx == ARMMMUIdx_Stage2_S ? > > > > ARMSS_Secure > > > > - : space == ARMSS_Realm ? ARMSS_Realm > > > > - : ARMSS_NonSecure), > > > > +.in_secure = is_secure, > > > > +.in_space = space, > > > > > > If the problem is fe4a5472ccd6 then this seems an odd change to > > > be making, because in_secure and in_space were set that way > > > before that commit too... 
> > > > I think that commit merged both sides of the > > "regime_is_stage2(s2_mmu_idx)" test, but only kept testing for secure > > through ARMMMUIdx_Stage2_S, and removed the test through ARMMMUIdx_Phys_S > > Yes, I agree. I'm not sure your proposed fix is the right one, > though. I need to re-work through what I did in fcc0b0418fff > to remind myself of what the various fields in a S1Translate > struct are supposed to be, but I think .in_secure (and now > .in_space) are supposed to always match .in_mmu_idx, and > that's not necessarily the same as what the local is_secure > holds. (is_secure is the original ptw's in_secure, which > matches that ptw's .in_mmu_idx, not its .in_ptw_idx.) > So probably the right thing for the .in_secure check is > to change to "(s2_mmu_idx == ARMMMUIdx_Stage2_S || > s2_mmu_idx == ARMMMUIdx_Phys_S)". Less sure about .in_space, > because that conditional is a bit more complicated. > > thanks > -- PMM
Re: [PATCH] target/arm: Fix ptw parameters in S1_ptw_translate() for debug contexts
On Thu, Jul 06, 2023 at 03:28:32PM +0100, Peter Maydell wrote:
> On Thu, 6 Jul 2023 at 15:12, Jean-Philippe Brucker wrote:
> >
> > Arm TF-A fails to boot via semihosting following a recent change to the
> > MMU code. Semihosting attempts to read parameters passed by TF-A in
> > secure RAM via cpu_memory_rw_debug(). While performing the S1
> > translation, we call S1_ptw_translate() on the page table descriptor
> > address, with an MMU index of ARMMMUIdx_Phys_S. At the moment
> > S1_ptw_translate() doesn't interpret this as a secure access, and as a
> > result we attempt to read the page table descriptor from the non-secure
> > address space, which fails.
> >
> > Fixes: fe4a5472ccd6 ("target/arm: Use get_phys_addr_with_struct in S1_ptw_translate")
> > Signed-off-by: Jean-Philippe Brucker
> > ---
> > I'm not entirely sure why the semihosting parameters are accessed
> > through stage-1 translation rather than directly as physical addresses,
> > but I'm not familiar with semihosting.
>
> The semihosting ABI says the guest code should pass "a pointer
> to the parameter block". It doesn't say explicitly, but the
> straightforward interpretation is "a pointer that the guest
> itself could dereference to read/write the values", which means
> a virtual address, not a physical one. It would be pretty
> painful for the guest to have to figure out "what is the
> physaddr for this virtual address" to pass it to the semihosting
> call.
>
> Do you have a repro case for this bug? Did it work
> before commit fe4a5472ccd6 ?

Yes I bisected to fe4a5472ccd6 by trying to run TF-A, following
instructions here:
https://github.com/ARM-software/arm-trusted-firmware/blob/master/docs/plat/qemu.rst

Building TF-A (HEAD 8e31faa05):

  make -j CROSS_COMPILE=aarch64-linux-gnu- BL33=edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd PLAT=qemu DEBUG=1 LOG_LEVEL=40 all fip

Installing it to QEMU runtime dir:

  ln -sf tf-a/build/qemu/debug/bl1.bin build/qemu-cca/run/
  ln -sf tf-a/build/qemu/debug/bl2.bin build/qemu-cca/run/
  ln -sf tf-a/build/qemu/debug/bl31.bin build/qemu-cca/run/
  ln -sf edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd build/qemu-cca/run/bl33.bin

Running QEMU with HEAD=fe4a5472cc:

  qemu-system-aarch64 -M virt,secure=on -cpu max -m 1G -nographic -bios bl1.bin -semihosting-config enable=on,target=native -d guest_errors

  NOTICE: Booting Trusted Firmware
  NOTICE: BL1: v2.9(debug):v2.9.0-280-g8e31faa05
  NOTICE: BL1: Built : 16:23:20, Jul 6 2023
  INFO:BL1: RAM 0xe0ee000 - 0xe0f6000
  INFO:BL1: Loading BL2
  WARNING: Firmware Image Package header check failed.
  Invalid read at addr 0xE0EF900, size 8, region '(null)', reason: rejected
  WARNING: Failed to obtain reference to image id=1 (-2)
  ERROR: Failed to load BL2 firmware.

with HEAD=4a7d7702cd:

  ...
  INFO:BL1: Loading BL2
  WARNING: Firmware Image Package header check failed.
  INFO:Loading image id=1 at address 0xe06b000
  INFO:Image id=1 loaded: 0xe06b000 - 0xe073201
  NOTICE: BL1: Booting BL2
  INFO:Entry point address = 0xe06b000
  INFO:SPSR = 0x3c5
  ...

(Note that there is an issue with TF-A missing ENABLE_FEAT_FGT for qemu at
the moment, which prevents booting Linux with -cpu max. I'll send the fix
to TF-A after this, but this reproducer should at least boot edk2.)

> > ---
> >  target/arm/ptw.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/target/arm/ptw.c b/target/arm/ptw.c
> > index 9aaff1546a..e3a738c28e 100644
> > --- a/target/arm/ptw.c
> > +++ b/target/arm/ptw.c
> > @@ -465,10 +465,8 @@ static bool S1_ptw_translate(CPUARMState *env, S1Translate *ptw,
> >          S1Translate s2ptw = {
> >              .in_mmu_idx = s2_mmu_idx,
> >              .in_ptw_idx = ptw_idx_for_stage_2(env, s2_mmu_idx),
> > -            .in_secure = s2_mmu_idx == ARMMMUIdx_Stage2_S,
> > -            .in_space = (s2_mmu_idx == ARMMMUIdx_Stage2_S ? ARMSS_Secure
> > -                         : space == ARMSS_Realm ? ARMSS_Realm
> > -                         : ARMSS_NonSecure),
> > +            .in_secure = is_secure,
> > +            .in_space = space,
>
> If the problem is fe4a5472ccd6 then this seems an odd change to
> be making, because in_secure and in_space were set that way
> before that commit too...

I think that commit merged both sides of the
"regime_is_stage2(s2_mmu_idx)" test, but only kept testing for secure
through ARMMMUIdx_Stage2_S, and removed the test through ARMMMUIdx_Phys_S.

Thanks,
Jean
Re: [PATCH 2/2] virtio-iommu: Rework the trace in virtio_iommu_set_page_size_mask()
On Wed, Jul 05, 2023 at 03:16:31PM +0200, Eric Auger wrote:
> >>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> >>> index 1eaf81bab5..0d9f7196fe 100644
> >>> --- a/hw/virtio/virtio-iommu.c
> >>> +++ b/hw/virtio/virtio-iommu.c
> >>> @@ -1101,29 +1101,24 @@ static int virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
> >>>                                            new_mask);
> >>>
> >>>      if ((cur_mask & new_mask) == 0) {
> >>> -        error_setg(errp, "virtio-iommu page mask 0x%"PRIx64
> >>> -                   " is incompatible with mask 0x%"PRIx64, cur_mask, new_mask);
> >>> +        error_setg(errp, "virtio-iommu %s reports a page size mask 0x%"PRIx64
> >>> +                   " incompatible with currently supported mask 0x%"PRIx64,
> >>> +                   mr->parent_obj.name, new_mask, cur_mask);
> >>>          return -1;
> >>>      }
> >>>
> >>>      /*
> >>>       * Once the granule is frozen we can't change the mask anymore. If by
> >>>       * chance the hotplugged device supports the same granule, we can still
> >>> -     * accept it. Having a different masks is possible but the guest will use
> >>> -     * sub-optimal block sizes, so warn about it.
> >>> +     * accept it.
> >>>       */
> >>>      if (s->granule_frozen) {
> >>> -        int new_granule = ctz64(new_mask);
> >>>          int cur_granule = ctz64(cur_mask);
> >>>
> >>> -        if (new_granule != cur_granule) {
> >>> -            error_setg(errp, "virtio-iommu page mask 0x%"PRIx64
> >>> -                       " is incompatible with mask 0x%"PRIx64, cur_mask,
> >>> -                       new_mask);
> >>> +        if (!(BIT(cur_granule) & new_mask)) {
> > Sorry, I read this piece code again and got a question, if new_mask has finer
> > granularity than cur_granule, should we allow it to pass even though
> > BIT(cur_granule) is not set?
> I think this should work but this is not straightforward to test.
> virtio-iommu would use the current granule for map/unmap. In map/unmap
> notifiers, this is split into pow2 ranges and cascaded to VFIO through
> vfio_dma_map/unmap. The iova and size are aligned with the smaller
> supported granule.
>
> Jean, do you share this understanding or do I miss something.

Yes, I also think that would work. The guest would only issue mappings
with the larger granularity, which can be applied by VFIO with a finer
granule. However I doubt we're going to encounter this case, because
seeing a cur_granule larger than 4k here means that a VFIO device has
already been assigned with a large granule like 64k, and we're trying to
add a new device with 4k. This indicates two HW IOMMUs supporting
different granules in the same system, which seems unlikely.

Hopefully by the time we actually need this (if ever) we will support
per-endpoint probe properties, which allow informing the guest about
different hardware properties instead of relying on one global property in
the virtio config.

Thanks,
Jean
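
The two variants of the frozen-granule check discussed in this thread can be compared with a small standalone model. This is only a sketch: lowest_bit() stands in for ctz64(), the function names are made up for illustration, and the relaxed variant is the hypothetical behaviour from the question above, not something the patch implements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index of the lowest set bit, standing in for ctz64(). Mask must be non-zero. */
static unsigned lowest_bit(uint64_t mask)
{
    unsigned n = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        n++;
    }
    return n;
}

/* Check from the patch: the frozen granule must be one of the new sizes. */
static bool frozen_check_strict(uint64_t cur_mask, uint64_t new_mask)
{
    return (new_mask & (UINT64_C(1) << lowest_bit(cur_mask))) != 0;
}

/* Hypothetical relaxed check: also accept a new mask with a finer granule. */
static bool frozen_check_relaxed(uint64_t cur_mask, uint64_t new_mask)
{
    return frozen_check_strict(cur_mask, new_mask) ||
           lowest_bit(new_mask) < lowest_bit(cur_mask);
}
```

With a frozen 64k granule (64k | 512M) and a new device advertising 4k | 2M | 1G, the strict check rejects the device and the relaxed one accepts it. Note that in the real function the earlier (cur_mask & new_mask) == 0 test would also have to be relaxed for that combination to pass.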
[PATCH] target/arm: Fix ptw parameters in S1_ptw_translate() for debug contexts
Arm TF-A fails to boot via semihosting following a recent change to the
MMU code. Semihosting attempts to read parameters passed by TF-A in
secure RAM via cpu_memory_rw_debug(). While performing the S1
translation, we call S1_ptw_translate() on the page table descriptor
address, with an MMU index of ARMMMUIdx_Phys_S. At the moment
S1_ptw_translate() doesn't interpret this as a secure access, and as a
result we attempt to read the page table descriptor from the non-secure
address space, which fails.

Fixes: fe4a5472ccd6 ("target/arm: Use get_phys_addr_with_struct in S1_ptw_translate")
Signed-off-by: Jean-Philippe Brucker
---
I'm not entirely sure why the semihosting parameters are accessed
through stage-1 translation rather than directly as physical addresses,
but I'm not familiar with semihosting.
---
 target/arm/ptw.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 9aaff1546a..e3a738c28e 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -465,10 +465,8 @@ static bool S1_ptw_translate(CPUARMState *env, S1Translate *ptw,
         S1Translate s2ptw = {
             .in_mmu_idx = s2_mmu_idx,
             .in_ptw_idx = ptw_idx_for_stage_2(env, s2_mmu_idx),
-            .in_secure = s2_mmu_idx == ARMMMUIdx_Stage2_S,
-            .in_space = (s2_mmu_idx == ARMMMUIdx_Stage2_S ? ARMSS_Secure
-                         : space == ARMSS_Realm ? ARMSS_Realm
-                         : ARMSS_NonSecure),
+            .in_secure = is_secure,
+            .in_space = space,
             .in_debug = true,
         };
         GetPhysAddrResult s2 = { };
--
2.41.0
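
To see why the removed initialiser picks the wrong address space for ARMMMUIdx_Phys_S, the two expressions from the hunk can be modelled on their own. This is a sketch under assumptions: the enums and function names below are made-up stand-ins, not the QEMU definitions, and only the ternary logic is taken from the diff.

```c
/* Hypothetical stand-ins for the ARMSS_* security spaces. */
typedef enum { SS_NONSECURE, SS_SECURE, SS_REALM } ModelSpace;

/* Hypothetical stand-ins for the MMU indexes that can reach this code. */
typedef enum { IDX_STAGE2, IDX_STAGE2_S, IDX_PHYS_NS, IDX_PHYS_S } ModelIdx;

/* Removed logic: only the secure stage-2 index is treated as secure. */
static ModelSpace s2_space_before(ModelIdx s2_mmu_idx, ModelSpace space)
{
    return s2_mmu_idx == IDX_STAGE2_S ? SS_SECURE
         : space == SS_REALM ? SS_REALM
         : SS_NONSECURE;
}

/* New logic: inherit the space of the walk that is being performed. */
static ModelSpace s2_space_after(ModelIdx s2_mmu_idx, ModelSpace space)
{
    (void)s2_mmu_idx;
    return space;
}
```

For a secure walk with the physical secure index, the first function returns the non-secure space, which matches the failed read in the report, while the second keeps the secure space.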
Re: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device assignment
On Wed, Jul 05, 2023 at 10:13:11AM +0000, Duan, Zhenzhong wrote:
> >-----Original Message-----
> >From: Jean-Philippe Brucker
> >Sent: Wednesday, July 5, 2023 4:29 PM
> >Subject: Re: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device
> >assignment
> >
> >On Wed, Jul 05, 2023 at 04:52:09AM +0000, Duan, Zhenzhong wrote:
> >> Hi Eric,
> >>
> >> >-----Original Message-----
> >> >From: Eric Auger
> >> >Sent: Tuesday, July 4, 2023 7:15 PM
> >> >Subject: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO
> >> >device assignment
> >> >
> >> >When running on a 64kB page size host and protecting a VFIO device
> >> >with the virtio-iommu, qemu crashes with this kind of message:
> >> >
> >> >qemu-kvm: virtio-iommu page mask 0xfffffffffffff000 is incompatible
> >> >with mask 0x20010000
> >>
> >> Does 0x20010000 mean only 512MB and 64KB super page mapping is
> >> supported for host iommu hw? 4KB mapping not supported?
> >
> >It's not a restriction by the HW IOMMU, but the host kernel. An Arm SMMU
> >can implement 4KB, 16KB and/or 64KB granules, but the host kernel only
> >advertises through VFIO the granule corresponding to host PAGE_SIZE. This
> >restriction is done by arm_lpae_restrict_pgsizes() in order to choose a page
> >size when a device is driven by the host.
>
> Just curious why not advertises the Arm SMMU implemented granules to VFIO
> Eg:4KB, 16KB or 64KB granules?

That's possible, but the difficulty is setting up the page table
configuration afterwards. At the moment the host installs the HW page
tables early, when QEMU sets up the VFIO container. That initializes the
page size bitmap because configuring the HW page tables requires picking
one of the supported granules (setting TG0 in the SMMU Context
Descriptor).

If the guest could pick a granule via an ATTACH request, then QEMU would
need to tell the host kernel to install page tables with the desired
granule at that point. That would require a new interface in VFIO to
reconfigure a live container and replace the existing HW page tables
configuration (before ATTACH, the container must already be configured
with working page tables in order to implement boot-bypass, I think).

> But arm_lpae_restrict_pgsizes() restricted ones,
> Eg: for SZ_4K, (SZ_4K | SZ_2M | SZ_1G).
> (SZ_4K | SZ_2M | SZ_1G) looks not real hardware granules of Arm SMMU.

Yes, the granule here is 4K, and other bits only indicate huge page sizes,
so the user can try to optimize large mappings to use fewer TLB entries
where possible.

Thanks,
Jean
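
How such a page size bitmap splits into a granule and optional block sizes can be shown with a short standalone sketch. The helper names are invented for illustration and the SZ_* constants are redefined locally; nothing here is taken from the kernel or QEMU sources.

```c
#include <stdint.h>

#define SZ_4K   UINT64_C(0x00001000)
#define SZ_64K  UINT64_C(0x00010000)
#define SZ_2M   UINT64_C(0x00200000)
#define SZ_512M UINT64_C(0x20000000)
#define SZ_1G   UINT64_C(0x40000000)

/* Granule: the smallest size in the bitmap (lowest set bit), 0 if empty. */
static uint64_t pgsize_granule(uint64_t bitmap)
{
    return bitmap & (~bitmap + 1);
}

/* Remaining bits: block (huge page) sizes usable for large mappings. */
static uint64_t pgsize_blocks(uint64_t bitmap)
{
    return bitmap & ~pgsize_granule(bitmap);
}
```

For SZ_4K | SZ_2M | SZ_1G the granule is 4K and the blocks are 2M and 1G. For a 64kB host, SZ_64K | SZ_512M gives a 64K granule with a single 512M block size.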
Re: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device assignment
On Wed, Jul 05, 2023 at 04:52:09AM +0000, Duan, Zhenzhong wrote:
> Hi Eric,
>
> >-----Original Message-----
> >From: Eric Auger
> >Sent: Tuesday, July 4, 2023 7:15 PM
> >Subject: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device
> >assignment
> >
> >When running on a 64kB page size host and protecting a VFIO device
> >with the virtio-iommu, qemu crashes with this kind of message:
> >
> >qemu-kvm: virtio-iommu page mask 0xfffffffffffff000 is incompatible
> >with mask 0x20010000
>
> Does 0x20010000 mean only 512MB and 64KB super page mapping is
> supported for host iommu hw? 4KB mapping not supported?

It's not a restriction by the HW IOMMU, but the host kernel. An Arm SMMU
can implement 4KB, 16KB and/or 64KB granules, but the host kernel only
advertises through VFIO the granule corresponding to host PAGE_SIZE. This
restriction is done by arm_lpae_restrict_pgsizes() in order to choose a
page size when a device is driven by the host.

> There is a check in guest kernel side hint only 4KB is supported, with
> this patch we force viommu->pgsize_bitmap to 0x20010000
> and fail below check? Does this device work in guest?
> Please correct me if I understand wrong.

Right, a guest using 4KB pages under a host that uses 64KB is not
supported, because if the guest attempts to dma_map a 4K page, the IOMMU
cannot create a mapping small enough, the mapping would have to spill over
neighbouring guest memory.

One possible solution would be supporting multiple page granules. If we
added a page granule negotiation through VFIO and virtio-iommu then the
guest could pick the page size it wants. But this requires changes to
Linux UAPI so isn't a priority at the moment, because we're trying to
enable nesting which would support 64K-host/4K-guest as well.

See also the discussion on the patch that introduced the guest check
https://lore.kernel.org/linux-iommu/20200318114047.1518048-1-jean-phili...@linaro.org/

Thanks,
Jean
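
The "spill over" problem can be made concrete with a little arithmetic. The sketch below is a standalone illustration with invented helper names; it only assumes that a mapping must start and end on a granule boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [iova, iova + size) can be mapped exactly with pages of
 * 'granule' bytes. The granule must be a power of two.
 */
static bool can_map_exactly(uint64_t granule, uint64_t iova, uint64_t size)
{
    return size != 0 && ((iova | size) & (granule - 1)) == 0;
}

/* Bytes that would be exposed if the range were rounded out to the granule. */
static uint64_t rounded_span(uint64_t granule, uint64_t iova, uint64_t size)
{
    uint64_t start = iova & ~(granule - 1);
    uint64_t end = (iova + size + granule - 1) & ~(granule - 1);

    return end - start;
}
```

A 4K guest page at IOVA 0x1000 cannot be mapped exactly with a 64K granule. Rounding it out would cover 64K, that is 60K of neighbouring guest memory that the guest never asked to expose.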
Re: [PATCH v4 00/10] Add stage-2 translation for SMMUv3
On Tue, May 16, 2023 at 08:33:07PM +0000, Mostafa Saleh wrote:
> This patch series can be used to run Linux pKVM SMMUv3 patches (currently on the list)
> which controls stage-2 (from EL2) while providing a paravirtualized
> interface the host(EL1)
> https://lore.kernel.org/kvmarm/20230201125328.2186498-1-jean-phili...@linaro.org/

I've been using these patches for pKVM, and also tested the normal stage-2
flow with Linux and VFIO.

Tested-by: Jean-Philippe Brucker
Re: [PATCH] kvm: Merge kvm_check_extension() and kvm_vm_check_extension()
On Mon, Apr 24, 2023 at 03:01:54PM +0200, Cornelia Huck wrote:
> > @@ -2480,6 +2471,7 @@ static int kvm_init(MachineState *ms)
> >      }
> >
> >      s->vmfd = ret;
> > +    s->check_extension_vm = kvm_check_extension(s, KVM_CAP_CHECK_EXTENSION_VM);
>
> Hm, it's a bit strange to set s->check_extension_vm by calling a
> function that already checks for the value of
> s->check_extension_vm... would it be better to call kvm_ioctl() directly
> on this line?

Yes that's probably best. I'll use kvm_vm_ioctl() since the doc suggests to
check KVM_CAP_CHECK_EXTENSION_VM on the vm fd.

Thanks,
Jean

> I think it would be good if some ppc folks could give this a look, but
> in general, it looks fine to me.
[PATCH] kvm: Merge kvm_check_extension() and kvm_vm_check_extension()
The KVM_CHECK_EXTENSION ioctl can be issued either on the global fd
(/dev/kvm), or on the VM fd obtained with KVM_CREATE_VM. For most
extensions, KVM returns the same value with either method, but for some
of them it can refine the returned value depending on the VM type. The
KVM documentation [1] advises to use the VM fd:

  Based on their initialization different VMs may have different
  capabilities. It is thus encouraged to use the vm ioctl to query for
  capabilities (available with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)

Ongoing work on Arm confidential VMs confirms this, as some capabilities
become unavailable to confidential VMs, requiring changes in QEMU to use
kvm_vm_check_extension() instead of kvm_check_extension() [2]. Rather
than changing each check one by one, change kvm_check_extension() to
always issue the ioctl on the VM fd when available, and remove
kvm_vm_check_extension().

Fall back to the global fd when the VM check is unavailable:

* Ancient kernels do not support KVM_CHECK_EXTENSION on the VM fd, since
  it was added by commit 92b591a4c46b ("KVM: Allow KVM_CHECK_EXTENSION on
  the vm fd") in Linux 3.17 [3]. Support for Linux 3.16 ended only in
  June 2020, but there may still be old images around.

* A couple of calls must be issued before the VM fd is available, since
  they determine the VM type: KVM_CAP_MIPS_VZ and KVM_CAP_ARM_VM_IPA_SIZE

Does any user actually depend on the check being done on the global fd
instead of the VM fd? I surveyed all cases where KVM can return
different values depending on the query method. Luckily QEMU already
calls kvm_vm_check_extension() for most of those. Only three of them are
ambiguous, because currently done on the global fd:

* KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_VCPU_ID on Arm, changes value if the
  user requests a vGIC different from the default. But QEMU queries this
  before vGIC configuration, so the reported value will be the same.

* KVM_CAP_SW_TLB on PPC. When issued on the global fd, returns false if
  the kvm-hv module is loaded; when issued on the VM fd, returns false
  only if the VM type is HV instead of PR. If this returns false, then
  QEMU will fail to initialize a BOOKE206 MMU model.

  So this patch supposedly improves things, as it allows to run this
  type of vCPU even when both KVM modules are loaded.

* KVM_CAP_PPC_SECURE_GUEST. Similarly, doing this check on a VM fd
  refines the returned value, and ensures that SVM is actually
  supported. Since QEMU follows the check with kvm_vm_enable_cap(), this
  patch should only provide better error reporting.

[1] https://www.kernel.org/doc/html/latest/virt/kvm/api.html#kvm-check-extension
[2] https://lore.kernel.org/kvm/875ybi0ytc@redhat.com/
[3] https://github.com/torvalds/linux/commit/92b591a4c46b

Suggested-by: Cornelia Huck
Signed-off-by: Jean-Philippe Brucker
---
 include/sysemu/kvm.h | 2 --
 include/sysemu/kvm_int.h | 1 +
 accel/kvm/kvm-all.c | 26 +-
 target/i386/kvm/kvm.c| 6 +++---
 target/ppc/kvm.c | 34 +-
 5 files changed, 30 insertions(+), 39 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index c8281c07a7..d62054004e 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -438,8 +438,6 @@ bool kvm_arch_stop_on_emulation_error(CPUState *cpu);

 int kvm_check_extension(KVMState *s, unsigned int extension);

-int kvm_vm_check_extension(KVMState *s, unsigned int extension);
-
 #define kvm_vm_enable_cap(s, capability, cap_flags, ...)             \
     ({                                                               \
         struct kvm_enable_cap cap = {                                \
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a641c974ea..f6aa22ea09 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -122,6 +122,7 @@ struct KVMState
     uint32_t xen_caps;
     uint16_t xen_gnttab_max_frames;
     uint16_t xen_evtchn_max_pirq;
+    bool check_extension_vm;
 };

 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index cf3a88d90e..eca815e763 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1109,22 +1109,13 @@ int kvm_check_extension(KVMState *s, unsigned int extension)
 {
     int ret;

-    ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
-    if (ret < 0) {
-        ret = 0;
+    if (!s->check_extension_vm) {
+        ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+    } else {
+        ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, extension);
     }
-
-    return ret;
-}
-
-int kvm_vm_check_extension(KVMState *s, unsigned int extension)
-{
-    int ret;
-
-    ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, extension);
     if (ret < 0) {
-        /* VM wide version not implemented, use global one instead */
-        ret = kvm_check_extension(s, extension);
+        ret = 0;
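
The selection logic of the merged function can be exercised outside QEMU with a small model. This is a sketch only: the struct, the callback type and the demo backends are invented for illustration and replace the real ioctl calls, so it shows the control flow of the hunk above rather than any KVM behaviour.

```c
#include <stdbool.h>

/* Hypothetical query callback: capability value, or negative on error. */
typedef int (*cap_query_fn)(unsigned int extension);

typedef struct {
    cap_query_fn query_global;  /* models the ioctl on the global fd */
    cap_query_fn query_vm;      /* models the ioctl on the VM fd */
    bool check_extension_vm;    /* set once the VM fd supports the query */
} ModelKVMState;

/* Same flow as the reworked kvm_check_extension() in the diff. */
static int model_check_extension(const ModelKVMState *s, unsigned int extension)
{
    int ret;

    if (!s->check_extension_vm) {
        ret = s->query_global(extension);
    } else {
        ret = s->query_vm(extension);
    }
    /* Errors are reported as "extension not supported". */
    return ret < 0 ? 0 : ret;
}

/* Demo backends: the VM-level query refines capability 1, rejects others. */
static int demo_global(unsigned int extension)
{
    return extension == 1 ? 2 : 1;
}

static int demo_vm(unsigned int extension)
{
    return extension == 1 ? 1 : -1;
}
```

Before the flag is set, every query goes to the global backend. Once it is set, the VM backend answers and its errors collapse to 0, which is how callers see an unsupported extension.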
Re: virtio-iommu hotplug issue
On Thu, Apr 13, 2023 at 08:01:54PM +0900, Akihiko Odaki wrote:
> Yes, that's right. The guest can dynamically create and delete VFs. The
> device is emulated by QEMU: igb, an Intel NIC recently added to QEMU and
> projected to be released as part of QEMU 8.0.

Ah great, that's really useful, I'll add it to my tests

> > Yes, I think this is an issue in the virtio-iommu driver, which should be
> > sending a DETACH request when the VF is disabled, likely from
> > viommu_release_device(). I'll work on a fix unless you would like to do it
>
> It will be nice if you prepare a fix. I will test your patch with my
> workload if you share it with me.

I sent a fix:
https://lore.kernel.org/linux-iommu/20230414150744.562456-1-jean-phili...@linaro.org/

Thank you for reporting this, it must have been annoying to debug

Thanks,
Jean
Re: virtio-iommu hotplug issue
Hello,

On Thu, Apr 13, 2023 at 01:49:43PM +0900, Akihiko Odaki wrote:
> Hi,
>
> Recently I encountered a problem with the combination of Linux's
> virtio-iommu driver and QEMU when a SR-IOV virtual function gets disabled.
> I'd like to ask you what kind of solution is appropriate here and implement
> the solution if possible.
>
> A PCIe device implementing the SR-IOV specification exports a virtual
> function, and the guest can enable or disable it at runtime by writing to a
> configuration register. This effectively looks like a PCI device is
> hotplugged for the guest.

Just so I understand this better: the guest gets a whole PCIe device PF
that implements SR-IOV, and so the guest can dynamically create VFs? Out
of curiosity, is that a hardware device assigned to the guest with VFIO,
or a device emulated by QEMU?

> In such a case, the kernel assumes the endpoint is
> detached from the virtio-iommu domain, but QEMU actually does not detach it.
>
> This inconsistent view of the removed device sometimes prevents the VM from
> correctly performing the following procedure, for example:
> 1. Enable a VF.
> 2. Disable the VF.
> 3. Open a vfio container.
> 4. Open the group which the PF belongs to.
> 5. Add the group to the vfio container.
> 6. Map some memory region.
> 7. Close the group.
> 8. Close the vfio container.
> 9. Repeat 3-8
>
> When the VF gets disabled, the kernel assumes the endpoint is detached from
> the IOMMU domain, but QEMU actually doesn't detach it. Later, the domain
> will be reused in step 3-8.
>
> In step 7, the PF will be detached, and the kernel thinks there is no
> endpoint attached and the mapping the domain holds is cleared, but the VF
> endpoint is still attached and the mapping is kept intact.
>
> In step 9, the same domain will be reused again, and the kernel requests to
> create a new mapping, but it will conflict with the existing mapping and
> result in -EINVAL.
>
> This problem can be fixed by either of:
> - requesting the detachment of the endpoint from the guest when the PCI
> device is unplugged (the VF is disabled)

Yes, I think this is an issue in the virtio-iommu driver, which should be
sending a DETACH request when the VF is disabled, likely from
viommu_release_device(). I'll work on a fix unless you would like to do it

> - detecting that the PCI device is gone and automatically detach it on
> QEMU-side.
>
> It is not completely clear for me which solution is more appropriate as the
> virtio-iommu specification is written in a way independent of the endpoint
> mechanism and does not say what should be done when a PCI device is
> unplugged.

Yes, I'm not sure it's in scope for the specification, it's more about
software guidance

Thanks,
Jean
Re: [RFC PATCH 12/16] hw/arm/smmuv3: Add VMID to tlb tagging
Hi Mostafa,

On Sun, Feb 05, 2023 at 09:44:07AM +0000, Mostafa Saleh wrote:
> Allow TLB to be tagged with VMID.
>
> If stage-1 is only supported, VMID is set to -1 and ignored from STE
> and CMD_TLBI_NH* cmds.
>
> Signed-off-by: Mostafa Saleh
> ---
> hw/arm/smmu-common.c | 24 +++-
> hw/arm/smmu-internal.h | 2 ++
> hw/arm/smmuv3.c | 12 +---
> include/hw/arm/smmu-common.h | 5 +++--
> 4 files changed, 29 insertions(+), 14 deletions(-)
>
> diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
> index 541c427684..028a60949a 100644
> --- a/hw/arm/smmu-common.c
> +++ b/hw/arm/smmu-common.c
> @@ -56,10 +56,11 @@ static gboolean smmu_iotlb_key_equal(gconstpointer v1, gconstpointer v2)
>             (k1->level == k2->level) && (k1->tg == k2->tg);

I'm getting some aliasing in the TLB, because smmu_iotlb_key_equal() is
missing the VMID comparison. With that fixed my handful of tests pass

Thanks,
Jean
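
The aliasing is easy to reproduce with a standalone model of the key comparison. The struct layout below is an assumption made for illustration: only the level and tg comparisons appear in the quoted hunk, while the iova, asid and vmid fields and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical TLB key; not the QEMU structure. */
typedef struct {
    uint64_t iova;
    uint16_t asid;
    uint16_t vmid;
    uint8_t tg;
    uint8_t level;
} ModelIOTLBKey;

/* Key equality that also compares the VMID, so two VMs never alias. */
static bool model_key_equal(const ModelIOTLBKey *k1, const ModelIOTLBKey *k2)
{
    return (k1->asid == k2->asid) && (k1->iova == k2->iova) &&
           (k1->level == k2->level) && (k1->tg == k2->tg) &&
           (k1->vmid == k2->vmid);
}
```

Without the last comparison, two entries that differ only in VMID would compare equal, and a lookup for one VM could return a translation cached for another.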
[PATCH v2 0/2] hw/arm/smmu: Fixes for TTB1
Two small changes to support TTB1. Since [v1] I removed the unused
SMMU_MAX_VA_BITS and added tags, thanks!

[v1] https://lore.kernel.org/qemu-devel/20230210163731.970130-1-jean-phili...@linaro.org/

Jean-Philippe Brucker (2):
  hw/arm/smmu-common: Support 64-bit addresses
  hw/arm/smmu-common: Fix TTB1 handling

 include/hw/arm/smmu-common.h | 2 --
 hw/arm/smmu-common.c | 4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

--
2.39.0
[PATCH v2 1/2] hw/arm/smmu-common: Support 64-bit addresses
Addresses targeting the second translation table (TTB1) in the SMMU have
all upper bits set. Ensure the IOMMU region covers all 64 bits.

Reviewed-by: Richard Henderson
Signed-off-by: Jean-Philippe Brucker
---
 include/hw/arm/smmu-common.h | 2 --
 hw/arm/smmu-common.c | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index c5683af07d..9fcff26357 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -27,8 +27,6 @@
 #define SMMU_PCI_DEVFN_MAX    256
 #define SMMU_PCI_DEVFN(sid)   (sid & 0xFF)

-#define SMMU_MAX_VA_BITS      48
-
 /*
  * Page table walk error types
  */
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 733c964778..2b8c67b9a1 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -439,7 +439,7 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)

         memory_region_init_iommu(&sdev->iommu, sizeof(sdev->iommu),
                                  s->mrtypename,
-                                 OBJECT(s), name, 1ULL << SMMU_MAX_VA_BITS);
+                                 OBJECT(s), name, UINT64_MAX);
         address_space_init(&sdev->as,
                            MEMORY_REGION(&sdev->iommu), name);
         trace_smmu_add_mr(name);
--
2.39.0
[PATCH v2 2/2] hw/arm/smmu-common: Fix TTB1 handling
Addresses targeting the second translation table (TTB1) in the SMMU have
all upper bits set (except for the top byte when TBI is enabled). Fix
the TTB1 check.

Reported-by: Ola Hugosson
Reviewed-by: Eric Auger
Reviewed-by: Richard Henderson
Signed-off-by: Jean-Philippe Brucker
---
 hw/arm/smmu-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 2b8c67b9a1..0a5a60ca1e 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -249,7 +249,7 @@ SMMUTransTableInfo *select_tt(SMMUTransCfg *cfg, dma_addr_t iova)
         /* there is a ttbr0 region and we are in it (high bits all zero) */
         return &cfg->tt[0];
     } else if (cfg->tt[1].tsz &&
-           !extract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte)) {
+        sextract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte) == -1) {
         /* there is a ttbr1 region and we are in it (high bits all one) */
         return &cfg->tt[1];
     } else if (!cfg->tt[0].tsz) {
--
2.39.0
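
The difference between the old and new TTB1 tests can be checked with a standalone sketch. The two field helpers below are local reimplementations meant to have the same semantics as extract64() and sextract64() in the hunk; the function names, the 39-bit VA example (tsz = 25) and the test addresses are assumptions chosen for illustration. The signed helper relies on two's complement conversion and arithmetic right shift of negative values, which is implementation-defined in C but holds on mainstream compilers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Unsigned extract of 'length' bits starting at bit 'start'. */
static uint64_t field_extract(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && start + length <= 64);
    return (value >> start) & (~UINT64_C(0) >> (64 - length));
}

/* Same field, sign-extended from its top bit. */
static int64_t field_sextract(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && start + length <= 64);
    return ((int64_t)(value << (64 - length - start))) >> (64 - length);
}

/* Old test: true only when the upper bits are all zero. */
static bool in_ttb1_old(uint64_t iova, int tsz, int tbi_byte)
{
    return !field_extract(iova, 64 - tsz, tsz - tbi_byte);
}

/* New test: true only when the upper bits are all one. */
static bool in_ttb1_new(uint64_t iova, int tsz, int tbi_byte)
{
    return field_sextract(iova, 64 - tsz, tsz - tbi_byte) == -1;
}
```

With tsz = 25, the address 0xffffff8000001000 has bits 63:39 all set, so it belongs to TTB1: the old test rejects it and the new one accepts it. With TBI (tbi_byte = 8) the top byte is ignored, so 0x00ffff8000000000 is accepted as well.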
Re: [PATCH 2/2] hw/arm/smmu-common: Fix TTB1 handling
On Mon, Feb 13, 2023 at 05:30:03PM +0100, Eric Auger wrote:
> Hi Jean,
>
> On 2/10/23 17:37, Jean-Philippe Brucker wrote:
> > Addresses targeting the second translation table (TTB1) in the SMMU have
> > all upper bits set (except for the top byte when TBI is enabled). Fix
> > the TTB1 check.
> >
> > Reported-by: Ola Hugosson
> > Signed-off-by: Jean-Philippe Brucker
> > ---
> >  hw/arm/smmu-common.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
> > index 2b8c67b9a1..0a5a60ca1e 100644
> > --- a/hw/arm/smmu-common.c
> > +++ b/hw/arm/smmu-common.c
> > @@ -249,7 +249,7 @@ SMMUTransTableInfo *select_tt(SMMUTransCfg *cfg, dma_addr_t iova)
> >          /* there is a ttbr0 region and we are in it (high bits all zero) */
> >          return &cfg->tt[0];
> >      } else if (cfg->tt[1].tsz &&
> > -           !extract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte)) {
> > +        sextract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte) == -1) {
> >          /* there is a ttbr1 region and we are in it (high bits all one) */
> >          return &cfg->tt[1];
> >      } else if (!cfg->tt[0].tsz) {
>
> Reviewed-by: Eric Auger
>
> While reading the spec again, I noticed we do not support VAX. Is it
> something that we would need to support?

I guess it would be needed to support sharing page tables with the CPU, if
the CPU supports and the OS uses FEAT_LVA. But in order to share the
stage-1, Linux would need more complex features as well (ATS+PRI/Stall,
PASID). For a private DMA address space, I think 48 bits of VA is already
plenty.

Thanks,
Jean
[PATCH 2/2] hw/arm/smmu-common: Fix TTB1 handling
Addresses targeting the second translation table (TTB1) in the SMMU have
all upper bits set (except for the top byte when TBI is enabled). Fix
the TTB1 check.

Reported-by: Ola Hugosson
Signed-off-by: Jean-Philippe Brucker
---
 hw/arm/smmu-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 2b8c67b9a1..0a5a60ca1e 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -249,7 +249,7 @@ SMMUTransTableInfo *select_tt(SMMUTransCfg *cfg, dma_addr_t iova)
         /* there is a ttbr0 region and we are in it (high bits all zero) */
         return &cfg->tt[0];
     } else if (cfg->tt[1].tsz &&
-           !extract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte)) {
+        sextract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte) == -1) {
         /* there is a ttbr1 region and we are in it (high bits all one) */
         return &cfg->tt[1];
     } else if (!cfg->tt[0].tsz) {
--
2.39.0
[PATCH 0/2] hw/arm/smmu: Fixes for TTB1
Two small changes to support TTB1. Note that I had to modify the Linux driver in order to test this (see below), but other OSes might use TTB1. Jean-Philippe Brucker (2): hw/arm/smmu-common: Support 64-bit addresses hw/arm/smmu-common: Fix TTB1 handling hw/arm/smmu-common.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- 8< --- Linux hacks: diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 8d772ea8a583..bf0ff699b64b 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -276,6 +276,11 @@ #define CTXDESC_CD_0_TCR_IRGN0 GENMASK_ULL(9, 8) #define CTXDESC_CD_0_TCR_ORGN0 GENMASK_ULL(11, 10) #define CTXDESC_CD_0_TCR_SH0 GENMASK_ULL(13, 12) +#define CTXDESC_CD_0_TCR_T1SZ GENMASK_ULL(21, 16) +#define CTXDESC_CD_0_TCR_TG1 GENMASK_ULL(23, 22) +#define CTXDESC_CD_0_TCR_IRGN1 GENMASK_ULL(25, 24) +#define CTXDESC_CD_0_TCR_ORGN1 GENMASK_ULL(27, 26) +#define CTXDESC_CD_0_TCR_SH1 GENMASK_ULL(29, 28) #define CTXDESC_CD_0_TCR_EPD0 (1ULL << 14) #define CTXDESC_CD_0_TCR_EPD1 (1ULL << 30) @@ -293,6 +298,7 @@ #define CTXDESC_CD_0_ASID GENMASK_ULL(63, 48) #define CTXDESC_CD_1_TTB0_MASK GENMASK_ULL(51, 4) +#define CTXDESC_CD_2_TTB1_MASK GENMASK_ULL(51, 4) /* * When the SMMU only supports linear context descriptor tables, pick a diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index f2425b0f0cd6..3a4343e60a54 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1075,8 +1075,8 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid, * this substream's traffic */ } else { /* (1) and (2) */ - cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK); - cdptr[2] = 0; + cdptr[1] = 0; + cdptr[2] = cpu_to_le64(cd->ttbr & CTXDESC_CD_2_TTB1_MASK); cdptr[3] = cpu_to_le64(cd->mair); /* @@ -2108,13 +2108,13 @@ static int 
arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain, cfg->cd.asid= (u16)asid; cfg->cd.ttbr= pgtbl_cfg->arm_lpae_s1_cfg.ttbr; - cfg->cd.tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, tcr->tsz) | - FIELD_PREP(CTXDESC_CD_0_TCR_TG0, tcr->tg) | - FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, tcr->irgn) | - FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) | - FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) | + cfg->cd.tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T1SZ, tcr->tsz) | + FIELD_PREP(CTXDESC_CD_0_TCR_TG1, tcr->tg) | + FIELD_PREP(CTXDESC_CD_0_TCR_IRGN1, tcr->irgn) | + FIELD_PREP(CTXDESC_CD_0_TCR_ORGN1, tcr->orgn) | + FIELD_PREP(CTXDESC_CD_0_TCR_SH1, tcr->sh) | FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) | - CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64; + CTXDESC_CD_0_TCR_EPD0 | CTXDESC_CD_0_AA64; cfg->cd.mair= pgtbl_cfg->arm_lpae_s1_cfg.mair; /* @@ -2212,6 +2212,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain, .pgsize_bitmap = smmu->pgsize_bitmap, .ias= ias, .oas= oas, + .quirks = IO_PGTABLE_QUIRK_ARM_TTBR1, .coherent_walk = smmu->features & ARM_SMMU_FEAT_COHERENCY, .tlb= _smmu_flush_ops, .iommu_dev = smmu->dev, diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 38434203bf04..3fe154c9782d 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -677,6 +677,10 @@ static int dma_info_to_prot(enum dma_data_direction dir, bool coherent, } } +/* HACK */ +#define VA_SIZE39 +#define VA_MASK(~0ULL << VA_SIZE) + static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, size_t size, u64 dma_limit, struct device *dev) { @@ -706,7 +710,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift, true); - return (dma_addr_t)iova << shift; + return (dma_addr_t)iova << shift | VA_MASK; } static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie, @@ -714,6 +718,7 @@ static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie, { struct iova_dom
[PATCH 1/2] hw/arm/smmu-common: Support 64-bit addresses
Addresses targeting the second translation table (TTB1) in the SMMU have
all upper bits set. Ensure the IOMMU region covers all 64 bits.

Signed-off-by: Jean-Philippe Brucker
---
 hw/arm/smmu-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 733c964778..2b8c67b9a1 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -439,7 +439,7 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)

         memory_region_init_iommu(&sdev->iommu, sizeof(sdev->iommu),
                                  s->mrtypename,
-                                 OBJECT(s), name, 1ULL << SMMU_MAX_VA_BITS);
+                                 OBJECT(s), name, UINT64_MAX);
         address_space_init(&sdev->as,
                            MEMORY_REGION(&sdev->iommu), name);
         trace_smmu_add_mr(name);
--
2.39.0
Re: [RFC PATCH 08/16] target/arm/kvm-rme: Populate the realm with boot images
On Fri, Jan 27, 2023 at 01:54:23PM -1000, Richard Henderson wrote:
> >   static void rme_vm_state_change(void *opaque, bool running, RunState state)
> >   {
> >       int ret;
> > @@ -72,6 +115,9 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state)
> >           }
> >       }
> > +    g_slist_foreach(rme_images, rme_populate_realm, NULL);
> > +    g_slist_free_full(g_steal_pointer(&rme_images), g_free);
>
> I suppose this technically works because you clear the list, and thus the
> hook is called only on the first transition to RUNNING. On all subsequent
> transitions the list is empty.
>
> I see that i386 sev does this immediately during machine init, alongside the
> kernel setup. Since kvm_init has already been called, that seems workable,
> rather than queuing anything for later.

The problem I faced was that RME_POPULATE_REALM needs to be called after
rom_reset(), which copies all the blobs into guest memory, and that
happens at device reset time, after machine init and
kvm_cpu_synchronize_post_init().

> But I think ideally this would be handled generically in (approximately)
> kvm_cpu_synchronize_post_init, looping over all blobs. This would handle
> any usage of '-device loader,...', instead of the 4 specific things you
> handle in the next patch.

I'd definitely prefer something generic that hooks into the loader, I'll
look into that. I didn't do it right away because the arm64 Linux kernel
loading is special, requires reserving extra RAM in addition to the blob
(hence the two parameters to kvm_arm_rme_add_blob()). But we could just
have a special case for the extra memory needed by Linux and make the rest
generic.

Thanks,
Jean
Re: [RFC PATCH 06/16] target/arm/kvm-rme: Initialize vCPU
On Fri, Jan 27, 2023 at 12:37:12PM -1000, Richard Henderson wrote:
> On 1/27/23 05:07, Jean-Philippe Brucker wrote:
> > +static int kvm_arm_rme_get_core_regs(CPUState *cs)
> > +{
> > +    int i, ret;
> > +    struct kvm_one_reg reg;
> > +    ARMCPU *cpu = ARM_CPU(cs);
> > +    CPUARMState *env = &cpu->env;
> > +
> > +    for (i = 0; i < 8; i++) {
> > +        reg.id = AARCH64_CORE_REG(regs.regs[i]);
> > +        reg.addr = (uintptr_t) &env->xregs[i];
> > +        ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
> > +        if (ret) {
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
>
> Wow, this is quite the restriction.
>
> I get that this is just enough to seed the guest for boot, and take SMC
> traps, but I'm concerned that we can't do much with the machine underneath,
> when it comes to other things like "info registers" or gdbstub will be
> silently unusable. I would prefer if we can somehow make this loudly
> unusable.

For "info registers", which currently displays zero values for all regs, we
can instead return an error message in aarch64_cpu_dump_state().

For gdbstub, I suspect we should disable it entirely since it seems
fundamentally incompatible with confidential VMs, but I need to spend more
time on this.

Thanks,
Jean
Re: [RFC PATCH 03/16] target/arm/kvm-rme: Initialize realm
Hi Richard,

Thanks a lot for the review.

On Fri, Jan 27, 2023 at 10:37:12AM -1000, Richard Henderson wrote:
> At present I would expect exactly one object class to be present in the
> qemu-system-aarch64 binary that would pass the
> machine_check_confidential_guest_support test done by core code. But we are
> hoping to move toward a heterogeneous model where e.g. the TYPE_SEV_GUEST
> object might be discoverable within the same executable.

Yes, I'm not sure SEV can be supported on qemu-system-aarch64, but pKVM could
probably coexist with RME as another type of confidential guest support
(https://lwn.net/ml/linux-arm-kernel/20220519134204.5379-1-w...@kernel.org/)

Thanks,
Jean
Re: [RFC PATCH 04/16] hw/arm/virt: Add support for Arm RME
On Fri, Jan 27, 2023 at 11:07:35AM -1000, Richard Henderson wrote:
> > +    /*
> > +     * Since the devicetree is included in the initial measurement, it must
> > +     * not contain random data.
> > +     */
> > +    if (virt_machine_is_confidential(vms)) {
> > +        vms->dtb_randomness = false;
> > +    }
>
> This property is default off, and the only way it can be on is user
> argument. This should be an error, not a silent disable.

This one seems to default to true in virt_instance_init(), and I did need to
disable it in order to get deterministic measurements. Maybe I could throw an
error only when the user attempts to explicitly enable it.

> > +    if (virt_machine_is_confidential(vms)) {
> > +        /*
> > +         * The host cannot write into a confidential guest's memory until the
> > +         * guest shares it. Since the host writes the pvtime region before the
> > +         * guest gets a chance to set it up, disable pvtime.
> > +         */
> > +        steal_time = false;
> > +    }
>
> This property is default on since 5.2, so falls into a different category.
> Since 5.2 it is auto-on for 64-bit guests. Since it's auto-off for 32-bit
> guests, I don't see a problem with it being auto-off for RME guests.
>
> I do wonder if we should change it to an OnOffAuto property, just to catch
> silly usage.

I'll look into that.

Thanks,
Jean
[RFC PATCH 13/16] target/arm/kvm-rme: Add breakpoints and watchpoints parameters
Pass the num_bps and num_wps parameters to Realm creation. These parameters contribute to the initial Realm measurement. Signed-off-by: Jean-Philippe Brucker --- qapi/qom.json| 8 +++- target/arm/kvm-rme.c | 34 +- 2 files changed, 40 insertions(+), 2 deletions(-) diff --git a/qapi/qom.json b/qapi/qom.json index 94ecb87f6f..86ed386f26 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -866,12 +866,18 @@ # # @sve-vector-length: SVE vector length (default: 0, SVE disabled) # +# @num-breakpoints: Number of breakpoints (default: 0) +# +# @num-watchpoints: Number of watchpoints (default: 0) +# # Since: FIXME ## { 'struct': 'RmeGuestProperties', 'data': { '*measurement-algo': 'str', '*personalization-value': 'str', -'*sve-vector-length': 'uint32' } } +'*sve-vector-length': 'uint32', +'*num-breakpoints': 'uint32', +'*num-watchpoints': 'uint32' } } ## # @ObjectType: diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c index 0b2153a45c..3f39f1f7ad 100644 --- a/target/arm/kvm-rme.c +++ b/target/arm/kvm-rme.c @@ -22,7 +22,9 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST) #define RME_PAGE_SIZE qemu_real_host_page_size() -#define RME_MAX_CFG 3 +#define RME_MAX_BPS 0x10 +#define RME_MAX_WPS 0x10 +#define RME_MAX_CFG 4 typedef struct RmeGuest RmeGuest; @@ -31,6 +33,8 @@ struct RmeGuest { char *measurement_algo; char *personalization_value; uint32_t sve_vl; +uint32_t num_wps; +uint32_t num_bps; }; struct RmeImage { @@ -145,6 +149,14 @@ static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp) args.sve_vq = guest->sve_vl / 128; cfg_str = "SVE"; break; +case KVM_CAP_ARM_RME_CFG_DBG: +if (!guest->num_bps && !guest->num_wps) { +return 0; +} +args.num_brps = guest->num_bps; +args.num_wrps = guest->num_wps; +cfg_str = "debug parameters"; +break; default: g_assert_not_reached(); } @@ -362,6 +374,10 @@ static void rme_get_uint32(Object *obj, Visitor *v, const char *name, if (strcmp(name, "sve-vector-length") == 0) { value = guest->sve_vl; +} else if (strcmp(name, 
"num-breakpoints") == 0) { +value = guest->num_bps; +} else if (strcmp(name, "num-watchpoints") == 0) { +value = guest->num_wps; } else { g_assert_not_reached(); } @@ -388,6 +404,12 @@ static void rme_set_uint32(Object *obj, Visitor *v, const char *name, error_setg(errp, "invalid SVE vector length %"PRIu32, value); return; } +} else if (strcmp(name, "num-breakpoints") == 0) { +max_value = RME_MAX_BPS; +var = >num_bps; +} else if (strcmp(name, "num-watchpoints") == 0) { +max_value = RME_MAX_WPS; +var = >num_wps; } else { g_assert_not_reached(); } @@ -424,6 +446,16 @@ static void rme_guest_class_init(ObjectClass *oc, void *data) rme_set_uint32, NULL, NULL); object_class_property_set_description(oc, "sve-vector-length", "SVE vector length. 0 disables SVE (the default)"); + +object_class_property_add(oc, "num-breakpoints", "uint32", rme_get_uint32, + rme_set_uint32, NULL, NULL); +object_class_property_set_description(oc, "num-breakpoints", +"Number of breakpoints"); + +object_class_property_add(oc, "num-watchpoints", "uint32", rme_get_uint32, + rme_set_uint32, NULL, NULL); +object_class_property_set_description(oc, "num-watchpoints", +"Number of watchpoints"); } static const TypeInfo rme_guest_info = { -- 2.39.0
[RFC PATCH 14/16] target/arm/kvm-rme: Add PMU num counters parameters
Pass the num_cntrs parameter to Realm creation. These parameters contribute to the initial Realm measurement. Signed-off-by: Jean-Philippe Brucker --- qapi/qom.json| 5 - target/arm/kvm-rme.c | 21 - 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/qapi/qom.json b/qapi/qom.json index 86ed386f26..13c85abde9 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -870,6 +870,8 @@ # # @num-watchpoints: Number of watchpoints (default: 0) # +# @num-pmu-counters: Number of PMU counters (default: 0, PMU disabled) +# # Since: FIXME ## { 'struct': 'RmeGuestProperties', @@ -877,7 +879,8 @@ '*personalization-value': 'str', '*sve-vector-length': 'uint32', '*num-breakpoints': 'uint32', -'*num-watchpoints': 'uint32' } } +'*num-watchpoints': 'uint32', +'*num-pmu-counters': 'uint32' } } ## # @ObjectType: diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c index 3f39f1f7ad..1baed79d46 100644 --- a/target/arm/kvm-rme.c +++ b/target/arm/kvm-rme.c @@ -24,7 +24,8 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST) #define RME_MAX_BPS 0x10 #define RME_MAX_WPS 0x10 -#define RME_MAX_CFG 4 +#define RME_MAX_PMU_CTRS0x20 +#define RME_MAX_CFG 5 typedef struct RmeGuest RmeGuest; @@ -35,6 +36,7 @@ struct RmeGuest { uint32_t sve_vl; uint32_t num_wps; uint32_t num_bps; +uint32_t num_pmu_cntrs; }; struct RmeImage { @@ -157,6 +159,13 @@ static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp) args.num_wrps = guest->num_wps; cfg_str = "debug parameters"; break; +case KVM_CAP_ARM_RME_CFG_PMU: +if (!guest->num_pmu_cntrs) { +return 0; +} +args.num_pmu_cntrs = guest->num_pmu_cntrs; +cfg_str = "PMU"; +break; default: g_assert_not_reached(); } @@ -378,6 +387,8 @@ static void rme_get_uint32(Object *obj, Visitor *v, const char *name, value = guest->num_bps; } else if (strcmp(name, "num-watchpoints") == 0) { value = guest->num_wps; +} else if (strcmp(name, "num-pmu-counters") == 0) { +value = guest->num_pmu_cntrs; } else { g_assert_not_reached(); } @@ -410,6 +421,9 @@ static void 
rme_set_uint32(Object *obj, Visitor *v, const char *name, } else if (strcmp(name, "num-watchpoints") == 0) { max_value = RME_MAX_WPS; var = >num_wps; +} else if (strcmp(name, "num-pmu-counters") == 0) { +max_value = RME_MAX_PMU_CTRS; +var = >num_pmu_cntrs; } else { g_assert_not_reached(); } @@ -456,6 +470,11 @@ static void rme_guest_class_init(ObjectClass *oc, void *data) rme_set_uint32, NULL, NULL); object_class_property_set_description(oc, "num-watchpoints", "Number of watchpoints"); + +object_class_property_add(oc, "num-pmu-counters", "uint32", rme_get_uint32, + rme_set_uint32, NULL, NULL); +object_class_property_set_description(oc, "num-pmu-counters", +"Number of PMU counters"); } static const TypeInfo rme_guest_info = { -- 2.39.0
[RFC PATCH 08/16] target/arm/kvm-rme: Populate the realm with boot images
Initialize the GPA space and populate it with boot images (kernel, initrd, firmware, etc). Populating has to be done at VM start time, because the images are loaded during reset by rom_reset() Signed-off-by: Jean-Philippe Brucker --- target/arm/kvm_arm.h | 6 target/arm/kvm-rme.c | 79 2 files changed, 85 insertions(+) diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h index e4dc7fbb8d..cec6500603 100644 --- a/target/arm/kvm_arm.h +++ b/target/arm/kvm_arm.h @@ -371,6 +371,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level); int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp); int kvm_arm_rme_vm_type(MachineState *ms); +void kvm_arm_rme_add_blob(hwaddr start, hwaddr src_size, hwaddr dst_size); bool kvm_arm_rme_enabled(void); int kvm_arm_rme_vcpu_init(CPUState *cs); @@ -458,6 +459,11 @@ static inline int kvm_arm_rme_vm_type(MachineState *ms) { return 0; } + +static inline void kvm_arm_rme_add_blob(hwaddr start, hwaddr src_size, +hwaddr dst_size) +{ +} #endif static inline const char *gic_class_name(void) diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c index 3833b187f9..c8c019f78a 100644 --- a/target/arm/kvm-rme.c +++ b/target/arm/kvm-rme.c @@ -9,6 +9,7 @@ #include "exec/confidential-guest-support.h" #include "hw/boards.h" #include "hw/core/cpu.h" +#include "hw/loader.h" #include "kvm_arm.h" #include "migration/blocker.h" #include "qapi/error.h" @@ -19,12 +20,22 @@ #define TYPE_RME_GUEST "rme-guest" OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST) +#define RME_PAGE_SIZE qemu_real_host_page_size() + typedef struct RmeGuest RmeGuest; struct RmeGuest { ConfidentialGuestSupport parent_obj; }; +struct RmeImage { +hwaddr base; +hwaddr src_size; +hwaddr dst_size; +}; + +static GSList *rme_images; + static RmeGuest *cgs_to_rme(ConfidentialGuestSupport *cgs) { if (!cgs) { @@ -51,6 +62,38 @@ static int rme_create_rd(RmeGuest *guest, Error **errp) return ret; } +static void rme_populate_realm(gpointer data, gpointer user_data) +{ +int 
ret; +struct RmeImage *image = data; +struct kvm_cap_arm_rme_init_ipa_args init_args = { +.init_ipa_base = image->base, +.init_ipa_size = image->dst_size, +}; +struct kvm_cap_arm_rme_populate_realm_args populate_args = { +.populate_ipa_base = image->base, +.populate_ipa_size = image->src_size, +}; + +ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0, +KVM_CAP_ARM_RME_INIT_IPA_REALM, +(intptr_t)_args); +if (ret) { +error_setg_errno(_fatal, -ret, + "RME: failed to initialize GPA range (0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx")", + image->base, image->dst_size); +} + +ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0, +KVM_CAP_ARM_RME_POPULATE_REALM, +(intptr_t)_args); +if (ret) { +error_setg_errno(_fatal, -ret, + "RME: failed to populate realm (0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx")", + image->base, image->src_size); +} +} + static void rme_vm_state_change(void *opaque, bool running, RunState state) { int ret; @@ -72,6 +115,9 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state) } } +g_slist_foreach(rme_images, rme_populate_realm, NULL); +g_slist_free_full(g_steal_pointer(_images), g_free); + ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0, KVM_CAP_ARM_RME_ACTIVATE_REALM); if (ret) { @@ -118,6 +164,39 @@ int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp) return 0; } +/* + * kvm_arm_rme_add_blob - Initialize the Realm IPA range and set up the image. + * + * @src_size is the number of bytes of the source image, to be populated into + * Realm memory. + * @dst_size is the effective image size, which may be larger than @src_size. + * For a kernel @dst_size may include zero-initialized regions such as the BSS + * and initial page directory. 
+ */ +void kvm_arm_rme_add_blob(hwaddr base, hwaddr src_size, hwaddr dst_size) +{ +struct RmeImage *image; + +if (!kvm_arm_rme_enabled()) { +return; +} + +base = QEMU_ALIGN_DOWN(base, RME_PAGE_SIZE); +src_size = QEMU_ALIGN_UP(src_size, RME_PAGE_SIZE); +dst_size = QEMU_ALIGN_UP(dst_size, RME_PAGE_SIZE); + +image = g_malloc0(sizeof(*image)); +image->base = base; +image->src_size = src_size; +image->dst_size = dst_size; + +/* + * The ROM loader will only load the images during reset, so postpone the +
[RFC PATCH 01/16] NOMERGE: Add KVM Arm RME definitions to Linux headers
Copy the KVM definitions for Arm RME from the development branch. Don't merge, they will be added from the periodic Linux header sync. Signed-off-by: Jean-Philippe Brucker --- linux-headers/asm-arm64/kvm.h | 63 +++ linux-headers/linux/kvm.h | 21 +--- 2 files changed, 80 insertions(+), 4 deletions(-) diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h index 4bf2d7246e..8e04d6f7ff 100644 --- a/linux-headers/asm-arm64/kvm.h +++ b/linux-headers/asm-arm64/kvm.h @@ -108,6 +108,7 @@ struct kvm_regs { #define KVM_ARM_VCPU_SVE 4 /* enable SVE for this CPU */ #define KVM_ARM_VCPU_PTRAUTH_ADDRESS 5 /* VCPU uses address authentication */ #define KVM_ARM_VCPU_PTRAUTH_GENERIC 6 /* VCPU uses generic authentication */ +#define KVM_ARM_VCPU_REC 7 /* VCPU REC state as part of Realm */ struct kvm_vcpu_init { __u32 target; @@ -391,6 +392,68 @@ enum { #define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3 #define KVM_DEV_ARM_ITS_CTRL_RESET 4 +/* KVM_CAP_ARM_RME on VM fd */ +#define KVM_CAP_ARM_RME_CONFIG_REALM 0 +#define KVM_CAP_ARM_RME_CREATE_RD 1 +#define KVM_CAP_ARM_RME_INIT_IPA_REALM 2 +#define KVM_CAP_ARM_RME_POPULATE_REALM 3 +#define KVM_CAP_ARM_RME_ACTIVATE_REALM 4 + +#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA2560 +#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA5121 + +#define KVM_CAP_ARM_RME_RPV_SIZE 64 + +/* List of configuration items accepted for KVM_CAP_ARM_RME_CONFIG_REALM */ +#define KVM_CAP_ARM_RME_CFG_RPV0 +#define KVM_CAP_ARM_RME_CFG_HASH_ALGO 1 +#define KVM_CAP_ARM_RME_CFG_SVE2 +#define KVM_CAP_ARM_RME_CFG_DBG3 +#define KVM_CAP_ARM_RME_CFG_PMU4 + +struct kvm_cap_arm_rme_config_item { + __u32 cfg; + union { + /* cfg == KVM_CAP_ARM_RME_CFG_RPV */ + struct { + __u8rpv[KVM_CAP_ARM_RME_RPV_SIZE]; + }; + + /* cfg == KVM_CAP_ARM_RME_CFG_HASH_ALGO */ + struct { + __u32 hash_algo; + }; + + /* cfg == KVM_CAP_ARM_RME_CFG_SVE */ + struct { + __u32 sve_vq; + }; + + /* cfg == KVM_CAP_ARM_RME_CFG_DBG */ + struct { + __u32 num_brps; + __u32 num_wrps; + }; + + /* cfg == 
KVM_CAP_ARM_RME_CFG_PMU */ + struct { + __u32 num_pmu_cntrs; + }; + /* Fix the size of the union */ + __u8reserved[256]; + }; +}; + +struct kvm_cap_arm_rme_populate_realm_args { + __u64 populate_ipa_base; + __u64 populate_ipa_size; +}; + +struct kvm_cap_arm_rme_init_ipa_args { + __u64 init_ipa_base; + __u64 init_ipa_size; +}; + /* Device Control API on vcpu fd */ #define KVM_ARM_VCPU_PMU_V3_CTRL 0 #define KVM_ARM_VCPU_PMU_V3_IRQ 0 diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h index ebdafa576d..9d5affc98a 100644 --- a/linux-headers/linux/kvm.h +++ b/linux-headers/linux/kvm.h @@ -901,14 +901,25 @@ struct kvm_ppc_resize_hpt { #define KVM_S390_SIE_PAGE_OFFSET 1 /* - * On arm64, machine type can be used to request the physical - * address size for the VM. Bits[7-0] are reserved for the guest - * PA size shift (i.e, log2(PA_Size)). For backward compatibility, - * value 0 implies the default IPA size, 40bits. + * On arm64, machine type can be used to request both the machine type and + * the physical address size for the VM. + * + * Bits[11-8] are reserved for the ARM specific machine type. + * + * Bits[7-0] are reserved for the guest PA size shift (i.e, log2(PA_Size)). + * For backward compatibility, value 0 implies the default IPA size, 40bits. 
*/ +#define KVM_VM_TYPE_ARM_SHIFT 8 +#define KVM_VM_TYPE_ARM_MASK (0xfULL << KVM_VM_TYPE_ARM_SHIFT) +#define KVM_VM_TYPE_ARM(_type) \ + (((_type) << KVM_VM_TYPE_ARM_SHIFT) & KVM_VM_TYPE_ARM_MASK) +#define KVM_VM_TYPE_ARM_NORMAL KVM_VM_TYPE_ARM(0) +#define KVM_VM_TYPE_ARM_REALM KVM_VM_TYPE_ARM(1) + #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL #define KVM_VM_TYPE_ARM_IPA_SIZE(x)\ ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK) + /* * ioctls for /dev/kvm fds: */ @@ -1176,6 +1187,8 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_S390_ZPCI_OP 221 #define KVM_CAP_S390_CPU_TOPOLOGY 222 +#define KVM_CAP_ARM_RME 300 // FIXME: Large number to prevent conflicts + #ifdef KVM_CAP_IRQ_ROUTING struct kvm_irq_routing_irqchip { -- 2.39.0
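For illustration only (not part of the patch): a minimal sketch of how the VM type argument could be composed from the definitions in the hunk above. The macros are copied from the patch; the example_* helpers are hypothetical names introduced here and do not exist in QEMU or in the kernel headers.

```c
#include <stdint.h>

/* Definitions copied from the linux-headers/linux/kvm.h hunk above. */
#define KVM_VM_TYPE_ARM_SHIFT           8
#define KVM_VM_TYPE_ARM_MASK            (0xfULL << KVM_VM_TYPE_ARM_SHIFT)
#define KVM_VM_TYPE_ARM(_type) \
    (((_type) << KVM_VM_TYPE_ARM_SHIFT) & KVM_VM_TYPE_ARM_MASK)
#define KVM_VM_TYPE_ARM_NORMAL          KVM_VM_TYPE_ARM(0)
#define KVM_VM_TYPE_ARM_REALM           KVM_VM_TYPE_ARM(1)

#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK   0xffULL
#define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
    ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)

/*
 * Hypothetical helper: bits [7:0] carry log2 of the PA size, bits [11:8]
 * carry the Arm-specific machine type.
 */
static uint64_t example_vm_type(int is_realm, unsigned int ipa_bits)
{
    uint64_t type = KVM_VM_TYPE_ARM_IPA_SIZE(ipa_bits);

    if (is_realm) {
        type |= KVM_VM_TYPE_ARM_REALM;
    }
    return type;
}

/* Hypothetical helper: recover the IPA size shift from a VM type value. */
static unsigned int example_vm_type_ipa_bits(uint64_t type)
{
    return (unsigned int)KVM_VM_TYPE_ARM_IPA_SIZE(type);
}

/* Hypothetical helper: test whether a VM type value requests a Realm. */
static int example_vm_type_is_realm(uint64_t type)
{
    return (type & KVM_VM_TYPE_ARM_MASK) == KVM_VM_TYPE_ARM_REALM;
}
```

With these definitions, a Realm with a 48-bit IPA space encodes as 0x130, and a normal VM with the legacy value 0 in the low byte keeps the default 40-bit IPA size.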
[RFC PATCH 11/16] target/arm/kvm-rme: Add Realm Personalization Value parameter
The Realm Personalization Value (RPV) is provided by the user to distinguish Realms that have the same initial measurement. The user provides a 512-bit hexadecimal number. Signed-off-by: Jean-Philippe Brucker --- qapi/qom.json| 5 ++- target/arm/kvm-rme.c | 72 +++- 2 files changed, 75 insertions(+), 2 deletions(-) diff --git a/qapi/qom.json b/qapi/qom.json index 87fe7c31fe..a012281628 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -862,10 +862,13 @@ # # @measurement-algo: Realm measurement algorithm (default: RMM default) # +# @personalization-value: Realm personalization value (default: 0) +# # Since: FIXME ## { 'struct': 'RmeGuestProperties', - 'data': { '*measurement-algo': 'str' } } + 'data': { '*measurement-algo': 'str', +'*personalization-value': 'str' } } ## # @ObjectType: diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c index 3929b941ae..e974c27e5c 100644 --- a/target/arm/kvm-rme.c +++ b/target/arm/kvm-rme.c @@ -22,13 +22,14 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST) #define RME_PAGE_SIZE qemu_real_host_page_size() -#define RME_MAX_CFG 1 +#define RME_MAX_CFG 2 typedef struct RmeGuest RmeGuest; struct RmeGuest { ConfidentialGuestSupport parent_obj; char *measurement_algo; +char *personalization_value; }; struct RmeImage { @@ -65,6 +66,45 @@ static int rme_create_rd(RmeGuest *guest, Error **errp) return ret; } +static int rme_parse_rpv(uint8_t *out, const char *in, Error **errp) +{ +int ret; +size_t in_len = strlen(in); + +/* Two chars per byte */ +if (in_len > KVM_CAP_ARM_RME_RPV_SIZE * 2) { +error_setg(errp, "Realm Personalization Value is too large"); +return -E2BIG; +} + +/* + * Parse as big-endian hexadecimal number (most significant byte on the + * left), store little-endian, zero-padded on the right. + */ +while (in_len) { +/* + * Do the lower nibble first to catch invalid inputs such as '2z', and + * to handle the last char. 
+ */ +in_len--; +ret = sscanf(in + in_len, "%1hhx", out); +if (ret != 1) { +error_setg(errp, "Invalid Realm Personalization Value"); +return -EINVAL; +} +if (!in_len) { +break; +} +in_len--; +ret = sscanf(in + in_len, "%2hhx", out++); +if (ret != 1) { +error_setg(errp, "Invalid Realm Personalization Value"); +return -EINVAL; +} +} +return 0; +} + static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp) { int ret; @@ -87,6 +127,16 @@ static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp) } cfg_str = "hash algorithm"; break; +case KVM_CAP_ARM_RME_CFG_RPV: +if (!guest->personalization_value) { +return 0; +} +ret = rme_parse_rpv(args.rpv, guest->personalization_value, errp); +if (ret) { +return ret; +} +cfg_str = "personalization value"; +break; default: g_assert_not_reached(); } @@ -281,6 +331,21 @@ static void rme_set_measurement_algo(Object *obj, const char *value, guest->measurement_algo = g_strdup(value); } +static char *rme_get_rpv(Object *obj, Error **errp) +{ +RmeGuest *guest = RME_GUEST(obj); + +return g_strdup(guest->personalization_value); +} + +static void rme_set_rpv(Object *obj, const char *value, Error **errp) +{ +RmeGuest *guest = RME_GUEST(obj); + +g_free(guest->personalization_value); +guest->personalization_value = g_strdup(value); +} + static void rme_guest_class_init(ObjectClass *oc, void *data) { object_class_property_add_str(oc, "measurement-algo", @@ -288,6 +353,11 @@ static void rme_guest_class_init(ObjectClass *oc, void *data) rme_set_measurement_algo); object_class_property_set_description(oc, "measurement-algo", "Realm measurement algorithm ('sha256', 'sha512')"); + +object_class_property_add_str(oc, "personalization-value", rme_get_rpv, + rme_set_rpv); +object_class_property_set_description(oc, "personalization-value", +"Realm personalization value (512-bit hexadecimal number)"); } static const TypeInfo rme_guest_info = { -- 2.39.0
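For illustration only (not part of the patch): a standalone sketch of the byte order described in rme_parse_rpv(), where the string is read as a big-endian hexadecimal number and stored little-endian, zero-padded on the right. It uses a manual nibble conversion instead of sscanf(); EXAMPLE_RPV_SIZE, example_hex_nibble() and example_parse_rpv() are hypothetical names, and the error codes merely mirror those in the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_RPV_SIZE 64 /* same value as KVM_CAP_ARM_RME_RPV_SIZE */

/* Convert one hex character to its value, or return -1 if it is invalid. */
static int example_hex_nibble(char c)
{
    if (c >= '0' && c <= '9') {
        return c - '0';
    }
    if (c >= 'a' && c <= 'f') {
        return c - 'a' + 10;
    }
    if (c >= 'A' && c <= 'F') {
        return c - 'A' + 10;
    }
    return -1;
}

/*
 * Parse @in as a big-endian hex number and store it little-endian in @out.
 * The least significant byte (rightmost two characters) lands in out[0].
 */
static int example_parse_rpv(uint8_t out[EXAMPLE_RPV_SIZE], const char *in)
{
    size_t len = strlen(in);
    size_t i = 0;

    /* Two characters per byte */
    if (len > EXAMPLE_RPV_SIZE * 2) {
        return -E2BIG;
    }

    memset(out, 0, EXAMPLE_RPV_SIZE);

    while (len) {
        /* Lower nibble first; the upper nibble is 0 for an odd-length input */
        int lo = example_hex_nibble(in[--len]);
        int hi = len ? example_hex_nibble(in[--len]) : 0;

        if (lo < 0 || hi < 0) {
            return -EINVAL;
        }
        out[i++] = (uint8_t)((hi << 4) | lo);
    }
    return 0;
}
```

Under these assumptions, "0102" yields out[0] = 0x02 and out[1] = 0x01, and the odd-length input "abc" yields out[0] = 0xbc and out[1] = 0x0a.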
[RFC PATCH 10/16] target/arm/kvm-rme: Add measurement algorithm property
This option selects which measurement algorithm to use for attestation. Supported values are sha256 and sha512. Signed-off-by: Jean-Philippe Brucker --- qapi/qom.json| 14 - target/arm/kvm-rme.c | 71 2 files changed, 84 insertions(+), 1 deletion(-) diff --git a/qapi/qom.json b/qapi/qom.json index 7ca27bb86c..87fe7c31fe 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -855,6 +855,17 @@ 'data': { '*cpu-affinity': ['uint16'], '*node-affinity': ['uint16'] } } +## +# @RmeGuestProperties: +# +# Properties for rme-guest objects. +# +# @measurement-algo: Realm measurement algorithm (default: RMM default) +# +# Since: FIXME +## +{ 'struct': 'RmeGuestProperties', + 'data': { '*measurement-algo': 'str' } } ## # @ObjectType: @@ -985,7 +996,8 @@ 'tls-creds-x509': 'TlsCredsX509Properties', 'tls-cipher-suites': 'TlsCredsProperties', 'x-remote-object':'RemoteObjectProperties', - 'x-vfio-user-server': 'VfioUserServerProperties' + 'x-vfio-user-server': 'VfioUserServerProperties', + 'rme-guest': 'RmeGuestProperties' } } ## diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c index c8c019f78a..3929b941ae 100644 --- a/target/arm/kvm-rme.c +++ b/target/arm/kvm-rme.c @@ -22,10 +22,13 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST) #define RME_PAGE_SIZE qemu_real_host_page_size() +#define RME_MAX_CFG 1 + typedef struct RmeGuest RmeGuest; struct RmeGuest { ConfidentialGuestSupport parent_obj; +char *measurement_algo; }; struct RmeImage { @@ -62,6 +65,40 @@ static int rme_create_rd(RmeGuest *guest, Error **errp) return ret; } +static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp) +{ +int ret; +const char *cfg_str; +struct kvm_cap_arm_rme_config_item args = { +.cfg = cfg, +}; + +switch (cfg) { +case KVM_CAP_ARM_RME_CFG_HASH_ALGO: +if (!guest->measurement_algo) { +return 0; +} +if (!strcmp(guest->measurement_algo, "sha256")) { +args.hash_algo = KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256; +} else if (!strcmp(guest->measurement_algo, "sha512")) { +args.hash_algo = 
KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512; +} else { +g_assert_not_reached(); +} +cfg_str = "hash algorithm"; +break; +default: +g_assert_not_reached(); +} + +ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0, +KVM_CAP_ARM_RME_CONFIG_REALM, (intptr_t)); +if (ret) { +error_setg_errno(errp, -ret, "RME: failed to configure %s", cfg_str); +} +return ret; +} + static void rme_populate_realm(gpointer data, gpointer user_data) { int ret; @@ -128,6 +165,7 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state) int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp) { int ret; +int cfg; static Error *rme_mig_blocker; RmeGuest *guest = cgs_to_rme(cgs); @@ -146,6 +184,13 @@ int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp) return -ENODEV; } +for (cfg = 0; cfg < RME_MAX_CFG; cfg++) { +ret = rme_configure_one(guest, cfg, errp); +if (ret) { +return ret; +} +} + ret = rme_create_rd(guest, errp); if (ret) { return ret; @@ -215,8 +260,34 @@ int kvm_arm_rme_vm_type(MachineState *ms) return 0; } +static char *rme_get_measurement_algo(Object *obj, Error **errp) +{ +RmeGuest *guest = RME_GUEST(obj); + +return g_strdup(guest->measurement_algo); +} + +static void rme_set_measurement_algo(Object *obj, const char *value, + Error **errp) +{ +RmeGuest *guest = RME_GUEST(obj); + +if (strncmp(value, "sha256", 6) && +strncmp(value, "sha512", 6)) { +error_setg(errp, "invalid Realm measurement algorithm '%s'", value); +return; +} +g_free(guest->measurement_algo); +guest->measurement_algo = g_strdup(value); +} + static void rme_guest_class_init(ObjectClass *oc, void *data) { +object_class_property_add_str(oc, "measurement-algo", + rme_get_measurement_algo, + rme_set_measurement_algo); +object_class_property_set_description(oc, "measurement-algo", +"Realm measurement algorithm ('sha256', 'sha512')"); } static const TypeInfo rme_guest_info = { -- 2.39.0
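For illustration only (not part of the patch): as far as the hunks above show, rme_set_measurement_algo() compares only the first six characters with strncmp(), while rme_configure_one() matches the whole string with strcmp(). A value such as "sha256x" would therefore seem to pass the setter and later reach g_assert_not_reached(). A minimal sketch of an exact-match lookup that both sites could share is given below; the enum and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical values; they mirror KVM_CAP_ARM_RME_MEASUREMENT_ALGO_*. */
enum example_measurement_algo {
    EXAMPLE_ALGO_INVALID = -1,
    EXAMPLE_ALGO_SHA256 = 0,
    EXAMPLE_ALGO_SHA512 = 1,
};

/*
 * Map the property string to an algorithm, requiring an exact match so
 * that the setter and the configuration step agree on what is valid.
 */
static enum example_measurement_algo
example_parse_measurement_algo(const char *value)
{
    if (value == NULL) {
        return EXAMPLE_ALGO_INVALID;
    }
    if (strcmp(value, "sha256") == 0) {
        return EXAMPLE_ALGO_SHA256;
    }
    if (strcmp(value, "sha512") == 0) {
        return EXAMPLE_ALGO_SHA512;
    }
    return EXAMPLE_ALGO_INVALID;
}
```

The setter could then report an error whenever the lookup returns EXAMPLE_ALGO_INVALID, so the configuration step never sees an unknown string.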
[RFC PATCH 06/16] target/arm/kvm-rme: Initialize vCPU
The target code calls kvm_arm_vcpu_init() to mark the vCPU as part of a realm. RME support does not use the register lists, because the host can only set the boot PC and registers x0-x7. The rest is private to the Realm and saved/restored by the RMM. Signed-off-by: Jean-Philippe Brucker --- target/arm/cpu.h | 3 ++ target/arm/kvm_arm.h | 1 + target/arm/helper.c | 8 ++ target/arm/kvm-rme.c | 10 +++ target/arm/kvm.c | 12 target/arm/kvm64.c | 65 ++-- 6 files changed, 97 insertions(+), 2 deletions(-) diff --git a/target/arm/cpu.h b/target/arm/cpu.h index 9aeed3c848..7d8397985f 100644 --- a/target/arm/cpu.h +++ b/target/arm/cpu.h @@ -937,6 +937,9 @@ struct ArchCPU { /* KVM steal time */ OnOffAuto kvm_steal_time; +/* Realm Management Extension */ +bool kvm_rme; + /* Uniprocessor system with MP extensions */ bool mp_is_up; diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h index 00d3df8cac..e4dc7fbb8d 100644 --- a/target/arm/kvm_arm.h +++ b/target/arm/kvm_arm.h @@ -373,6 +373,7 @@ int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp); int kvm_arm_rme_vm_type(MachineState *ms); bool kvm_arm_rme_enabled(void); +int kvm_arm_rme_vcpu_init(CPUState *cs); #else diff --git a/target/arm/helper.c b/target/arm/helper.c index d8c8223ec3..52360ae2ff 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -126,6 +126,10 @@ bool write_cpustate_to_list(ARMCPU *cpu, bool kvm_sync) int i; bool ok = true; +if (cpu->kvm_rme) { +return ok; +} + for (i = 0; i < cpu->cpreg_array_len; i++) { uint32_t regidx = kvm_to_cpreg_id(cpu->cpreg_indexes[i]); const ARMCPRegInfo *ri; @@ -171,6 +175,10 @@ bool write_list_to_cpustate(ARMCPU *cpu) int i; bool ok = true; +if (cpu->kvm_rme) { +return ok; +} + for (i = 0; i < cpu->cpreg_array_len; i++) { uint32_t regidx = kvm_to_cpreg_id(cpu->cpreg_indexes[i]); uint64_t v = cpu->cpreg_values[i]; diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c index d7cdca1cbf..3833b187f9 100644 --- a/target/arm/kvm-rme.c +++ b/target/arm/kvm-rme.c 
@@ -118,6 +118,16 @@ int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp) return 0; } +int kvm_arm_rme_vcpu_init(CPUState *cs) +{ +ARMCPU *cpu = ARM_CPU(cs); + +if (kvm_arm_rme_enabled()) { +cpu->kvm_rme = true; +} +return 0; +} + int kvm_arm_rme_vm_type(MachineState *ms) { if (cgs_to_rme(ms->cgs)) { diff --git a/target/arm/kvm.c b/target/arm/kvm.c index f022c644d2..fcddead4fe 100644 --- a/target/arm/kvm.c +++ b/target/arm/kvm.c @@ -449,6 +449,10 @@ int kvm_arm_init_cpreg_list(ARMCPU *cpu) int i, ret, arraylen; CPUState *cs = CPU(cpu); +if (cpu->kvm_rme) { +return 0; +} + rl.n = 0; ret = kvm_vcpu_ioctl(cs, KVM_GET_REG_LIST, ); if (ret != -E2BIG) { @@ -521,6 +525,10 @@ bool write_kvmstate_to_list(ARMCPU *cpu) int i; bool ok = true; +if (cpu->kvm_rme) { +return ok; +} + for (i = 0; i < cpu->cpreg_array_len; i++) { struct kvm_one_reg r; uint64_t regidx = cpu->cpreg_indexes[i]; @@ -557,6 +565,10 @@ bool write_list_to_kvmstate(ARMCPU *cpu, int level) int i; bool ok = true; +if (cpu->kvm_rme) { +return ok; +} + for (i = 0; i < cpu->cpreg_array_len; i++) { struct kvm_one_reg r; uint64_t regidx = cpu->cpreg_indexes[i]; diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c index 55191496f3..b6320672b2 100644 --- a/target/arm/kvm64.c +++ b/target/arm/kvm64.c @@ -887,6 +887,11 @@ int kvm_arch_init_vcpu(CPUState *cs) return ret; } +ret = kvm_arm_rme_vcpu_init(cs); +if (ret) { +return ret; +} + if (cpu_isar_feature(aa64_sve, cpu)) { ret = kvm_arm_sve_set_vls(cs); if (ret) { @@ -1080,6 +1085,35 @@ static int kvm_arch_put_sve(CPUState *cs) return 0; } +static int kvm_arm_rme_put_core_regs(CPUState *cs, int level) +{ +int i, ret; +struct kvm_one_reg reg; +ARMCPU *cpu = ARM_CPU(cs); +CPUARMState *env = >env; + +/* + * The RME ABI only allows us to set 8 GPRs and the PC + */ +for (i = 0; i < 8; i++) { +reg.id = AARCH64_CORE_REG(regs.regs[i]); +reg.addr = (uintptr_t) >xregs[i]; +ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ); +if (ret) { +return ret; +} +} + +reg.id = 
AARCH64_CORE_REG(regs.pc); +reg.addr = (uintptr_t) >pc; +ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ); +if (ret) { +return ret; +} + +return 0; +} + static int kvm_arm_put_core_regs(CPUState *cs) { struct kvm_one_reg reg; @@ -1208,7 +1242,11 @@ int kvm_arch_put_registers(CPUState *cs, int level) int ret; ARMCPU
[RFC PATCH 04/16] hw/arm/virt: Add support for Arm RME
When confidential-guest-support is enabled for the virt machine, call
the RME init function, and add the RME flag to the VM type.

* The Realm differentiates non-secure from realm memory using the upper
  GPA bit. Reserve that bit when creating the memory map, to make sure
  that device MMIO located in high memory can still fit.

* pvtime is disabled for the moment. Since the hypervisor has to write
  into the shared pvtime page before scheduling a vcpu, it seems
  incompatible with confidential guests.

Signed-off-by: Jean-Philippe Brucker
---
 hw/arm/virt.c | 48
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index b871350856..df613e634a 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -210,6 +210,11 @@ static const char *valid_cpus[] = {
     ARM_CPU_TYPE_NAME("max"),
 };
 
+static bool virt_machine_is_confidential(VirtMachineState *vms)
+{
+    return MACHINE(vms)->cgs;
+}
+
 static bool cpu_type_valid(const char *cpu)
 {
     int i;
@@ -247,6 +252,14 @@ static void create_fdt(VirtMachineState *vms)
         exit(1);
     }
 
+    /*
+     * Since the devicetree is included in the initial measurement, it must
+     * not contain random data.
+     */
+    if (virt_machine_is_confidential(vms)) {
+        vms->dtb_randomness = false;
+    }
+
     ms->fdt = fdt;
 
     /* Header */
@@ -1924,6 +1937,15 @@ static void virt_cpu_post_init(VirtMachineState *vms, MemoryRegion *sysmem)
     steal_time = object_property_get_bool(OBJECT(first_cpu),
                                           "kvm-steal-time", NULL);
 
+    if (virt_machine_is_confidential(vms)) {
+        /*
+         * The host cannot write into a confidential guest's memory until the
+         * guest shares it. Since the host writes the pvtime region before the
+         * guest gets a chance to set it up, disable pvtime.
+         */
+        steal_time = false;
+    }
+
     if (kvm_enabled()) {
         hwaddr pvtime_reg_base = vms->memmap[VIRT_PVTIME].base;
         hwaddr pvtime_reg_size = vms->memmap[VIRT_PVTIME].size;
@@ -2053,10 +2075,11 @@ static void machvirt_init(MachineState *machine)
      * if the guest has EL2 then we will use SMC as the conduit,
      * and otherwise we will use HVC (for backwards compatibility and
      * because if we're using KVM then we must use HVC).
+     * Realm guests must also use SMC.
      */
     if (vms->secure && firmware_loaded) {
         vms->psci_conduit = QEMU_PSCI_CONDUIT_DISABLED;
-    } else if (vms->virt) {
+    } else if (vms->virt || virt_machine_is_confidential(vms)) {
         vms->psci_conduit = QEMU_PSCI_CONDUIT_SMC;
     } else {
         vms->psci_conduit = QEMU_PSCI_CONDUIT_HVC;
@@ -2102,6 +2125,8 @@ static void machvirt_init(MachineState *machine)
         exit(1);
     }
 
+    kvm_arm_rme_init(machine->cgs, &error_fatal);
+
     create_fdt(vms);
 
     assert(possible_cpus->len == max_cpus);
@@ -2854,15 +2879,26 @@ static HotplugHandler *virt_machine_get_hotplug_handler(MachineState *machine,
 static int virt_kvm_type(MachineState *ms, const char *type_str)
 {
     VirtMachineState *vms = VIRT_MACHINE(ms);
+    int rme_vm_type = kvm_arm_rme_vm_type(ms);
     int max_vm_pa_size, requested_pa_size;
+    int rme_reserve_bit = 0;
     bool fixed_ipa;
 
-    max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa);
+    if (rme_vm_type) {
+        /*
+         * With RME, the upper GPA bit differentiates Realm from NS memory.
+         * Reserve the upper bit to guarantee that highmem devices will fit.
+         */
+        rme_reserve_bit = 1;
+    }
+
+    max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa) -
+                     rme_reserve_bit;
 
     /* we freeze the memory map to compute the highest gpa */
     virt_set_memmap(vms, max_vm_pa_size);
 
-    requested_pa_size = 64 - clz64(vms->highest_gpa);
+    requested_pa_size = 64 - clz64(vms->highest_gpa) + rme_reserve_bit;
 
     /*
      * KVM requires the IPA size to be at least 32 bits.
@@ -2883,7 +2919,11 @@ static int virt_kvm_type(MachineState *ms, const char *type_str)
      * the implicit legacy 40b IPA setting, in which case the kvm_type
      * must be 0.
      */
-    return fixed_ipa ? 0 : requested_pa_size;
+    if (fixed_ipa) {
+        return 0;
+    }
+
+    return requested_pa_size | rme_vm_type;
 }
 
 static void virt_machine_class_init(ObjectClass *oc, void *data)
-- 
2.39.0
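For anyone following the IPA arithmetic in virt_kvm_type(), here is a minimal standalone sketch of the two adjustments. It is not QEMU code: the helper names (bits_needed, usable_map_bits, requested_ipa_bits) are made up for illustration, and the clamp to 32 bits is assumed from the "at least 32 bits" comment visible in the hunk context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bits needed to represent addr.
 * Equivalent to the patch's "64 - clz64(addr)" expression.
 */
static int bits_needed(uint64_t addr)
{
    int bits = 0;

    while (addr) {
        bits++;
        addr >>= 1;
    }
    return bits;
}

/*
 * Address bits available for laying out the memory map. A Realm gives
 * up the upper bit, which distinguishes Realm from non-secure memory.
 */
static int usable_map_bits(int max_ipa_bits, bool is_realm)
{
    assert(max_ipa_bits > 1);
    return max_ipa_bits - (is_realm ? 1 : 0);
}

/*
 * IPA size to request once the map is frozen: enough bits for the
 * highest GPA, plus the reserved bit for a Realm, and never below 32
 * (assumed minimum, per the comment in the hunk context).
 */
static int requested_ipa_bits(uint64_t highest_gpa, bool is_realm)
{
    int bits = bits_needed(highest_gpa) + (is_realm ? 1 : 0);

    return bits < 32 ? 32 : bits;
}
```

With a 48-bit maximum, a Realm's memory map is laid out within 47 bits, and a map whose highest GPA needs 40 bits turns into a 41-bit IPA request.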
[RFC PATCH 15/16] target/arm/kvm: Disable Realm reboot
A Realm cannot be reset; it must be recreated from scratch. The RMM
specification defines states of a Realm as NEW -> ACTIVE -> SYSTEM_OFF,
after which the Realm can only be destroyed. A PSCI_SYSTEM_RESET call,
which normally reboots the system, puts the Realm in SYSTEM_OFF state.

QEMU does not support recreating a VM. Normally, a reboot request by the
guest causes all devices to reset, which cannot work for a Realm.
Indeed, loading images into Realm memory and changing the PC is only
allowed for a Realm in NEW state. Resetting the images for a Realm in
SYSTEM_OFF state will cause QEMU to crash with a bus error.

Handle reboot requests by the guest more gracefully, by indicating to
runstate.c that the vCPUs of a Realm are not resettable, and that QEMU
should exit.

Signed-off-by: Jean-Philippe Brucker
---
 target/arm/kvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index d8655d9041..02b5e8009f 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1071,7 +1071,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 
 bool kvm_arch_cpu_check_are_resettable(void)
 {
-    return true;
+    /* A Realm cannot be reset */
+    return !kvm_arm_rme_enabled();
 }
 
 void kvm_arch_accel_class_init(ObjectClass *oc)
-- 
2.39.0
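The lifecycle described in the commit message can be sketched as a forward-only state machine. This is only an illustration of the NEW -> ACTIVE -> SYSTEM_OFF ordering and of the "images may only be loaded in NEW" rule; the enum and function names are hypothetical and do not come from the RMM specification, KVM or QEMU.

```c
#include <stdbool.h>

/* Hypothetical names for the three states listed in the commit message. */
enum realm_state {
    REALM_NEW,
    REALM_ACTIVE,
    REALM_SYSTEM_OFF,
};

/*
 * Only forward transitions exist. Nothing leads out of SYSTEM_OFF:
 * the Realm can only be destroyed from there, so a reset is impossible.
 */
static bool realm_transition_allowed(enum realm_state from,
                                     enum realm_state to)
{
    return (from == REALM_NEW && to == REALM_ACTIVE) ||
           (from == REALM_ACTIVE && to == REALM_SYSTEM_OFF);
}

/* Loading images and changing the PC is only permitted in NEW state. */
static bool realm_can_load_images(enum realm_state state)
{
    return state == REALM_NEW;
}
```

A device reset after a guest reboot request would amount to loading images in SYSTEM_OFF state, which this model rejects; that is the case the patch avoids by reporting the vCPUs as not resettable.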
[RFC PATCH 05/16] target/arm/kvm: Split kvm_arch_get/put_registers
The confidential guest support in KVM limits the number of registers
that we can read and write. Split the get/put_registers functions to
prepare for it.

Signed-off-by: Jean-Philippe Brucker
---
 target/arm/kvm64.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 1197253d12..55191496f3 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -1080,7 +1080,7 @@ static int kvm_arch_put_sve(CPUState *cs)
     return 0;
 }
 
-int kvm_arch_put_registers(CPUState *cs, int level)
+static int kvm_arm_put_core_regs(CPUState *cs)
 {
     struct kvm_one_reg reg;
     uint64_t val;
@@ -1200,6 +1200,19 @@ int kvm_arch_put_registers(CPUState *cs, int level)
         return ret;
     }
 
+    return 0;
+}
+
+int kvm_arch_put_registers(CPUState *cs, int level)
+{
+    int ret;
+    ARMCPU *cpu = ARM_CPU(cs);
+
+    ret = kvm_arm_put_core_regs(cs);
+    if (ret) {
+        return ret;
+    }
+
     write_cpustate_to_list(cpu, true);
 
     if (!write_list_to_kvmstate(cpu, level)) {
@@ -1293,7 +1306,7 @@ static int kvm_arch_get_sve(CPUState *cs)
     return 0;
 }
 
-int kvm_arch_get_registers(CPUState *cs)
+static int kvm_arm_get_core_regs(CPUState *cs)
 {
     struct kvm_one_reg reg;
     uint64_t val;
@@ -1413,6 +1426,19 @@ int kvm_arch_get_registers(CPUState *cs)
     }
     vfp_set_fpcr(env, fpr);
 
+    return 0;
+}
+
+int kvm_arch_get_registers(CPUState *cs)
+{
+    int ret;
+    ARMCPU *cpu = ARM_CPU(cs);
+
+    ret = kvm_arm_get_core_regs(cs);
+    if (ret) {
+        return ret;
+    }
+
     ret = kvm_get_vcpu_events(cpu);
     if (ret) {
         return ret;
-- 
2.39.0