
Re: [PATCH v2 02/22] target/arm: Add confidential guest support

2024-04-23 Thread Jean-Philippe Brucker

On Fri, Apr 19, 2024 at 05:25:12PM +0100, Daniel P. Berrangé wrote:
> On Fri, Apr 19, 2024 at 04:56:50PM +0100, Jean-Philippe Brucker wrote:
> > Add a new RmeGuest object, inheriting from ConfidentialGuestSupport, to
> > support the Arm Realm Management Extension (RME). It is instantiated by
> > passing on the command-line:
> > 
> >   -M virt,confidential-guest-support=<id>
> >   -object guest-rme,id=<id>[,options...]

Hm, the commit message is wrong, it should say "rme-guest".

> How about either "arm-rme" or merely 'rme' as the object name 

I don't feel strongly about the name, but picked this one to be consistent
with existing confidential-guest-support objects: sev-guest, pef-guest,
s390-pv-guest, and upcoming tdx-guest [1]

Thanks,
Jean

[1] 
https://lore.kernel.org/qemu-devel/20240229063726.610065-13-xiaoyao...@intel.com/

[PATCH v2 06/22] hw/arm/virt: Disable DTB randomness for confidential VMs

2024-04-19 Thread Jean-Philippe Brucker

The dtb-randomness feature, which adds random seeds to the DTB, isn't
really compatible with confidential VMs since it randomizes the Realm
Initial Measurement. Enabling it is not an error, but it prevents
attestation. It also isn't useful to a Realm, which doesn't trust host
input.

Currently the feature is automatically enabled, unless the user disables
it on the command-line. Change it to OnOffAuto, and automatically
disable it for confidential VMs, unless the user explicitly enables it.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: separate patch, use OnOffAuto
---
 docs/system/arm/virt.rst |  9 +
 include/hw/arm/virt.h|  2 +-
 hw/arm/virt.c| 41 +---
 3 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/docs/system/arm/virt.rst b/docs/system/arm/virt.rst
index 26fcba00b7..e4bbfec662 100644
--- a/docs/system/arm/virt.rst
+++ b/docs/system/arm/virt.rst
@@ -172,10 +172,11 @@ dtb-randomness
   rng-seed and kaslr-seed nodes (in both "/chosen" and
   "/secure-chosen") to use for features like the random number
   generator and address space randomisation. The default is
-  ``on``. You will want to disable it if your trusted boot chain
-  will verify the DTB it is passed, since this option causes the
-  DTB to be non-deterministic. It would be the responsibility of
-  the firmware to come up with a seed and pass it on if it wants to.
+  ``off`` for confidential VMs, and ``on`` otherwise. You will want
+  to disable it if your trusted boot chain will verify the DTB it is
+  passed, since this option causes the DTB to be non-deterministic.
+  It would be the responsibility of the firmware to come up with a
+  seed and pass it on if it wants to.
 
 dtb-kaslr-seed
   A deprecated synonym for dtb-randomness.
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index bb486d36b1..90a148dac2 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -150,7 +150,7 @@ struct VirtMachineState {
 bool virt;
 bool ras;
 bool mte;
-bool dtb_randomness;
+OnOffAuto dtb_randomness;
 OnOffAuto acpi;
 VirtGICType gic_version;
 VirtIOMMUType iommu;
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 07ad31876e..f300f100b5 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -259,6 +259,7 @@ static bool ns_el2_virt_timer_present(void)
 
 static void create_fdt(VirtMachineState *vms)
 {
+bool dtb_randomness = true;
 MachineState *ms = MACHINE(vms);
 int nb_numa_nodes = ms->numa_state->num_nodes;
 void *fdt = create_device_tree(&vms->fdt_size);
@@ -268,6 +269,16 @@ static void create_fdt(VirtMachineState *vms)
 exit(1);
 }
 
+/*
+ * Including random data in the DTB causes random initial measurement on CCA,
+ * so disable it for confidential VMs.
+ */
+if (vms->dtb_randomness == ON_OFF_AUTO_OFF ||
+(vms->dtb_randomness == ON_OFF_AUTO_AUTO &&
+ virt_machine_is_confidential(vms))) {
+dtb_randomness = false;
+}
+
 ms->fdt = fdt;
 
 /* Header */
@@ -278,13 +289,13 @@ static void create_fdt(VirtMachineState *vms)
 
 /* /chosen must exist for load_dtb to fill in necessary properties later */
 qemu_fdt_add_subnode(fdt, "/chosen");
-if (vms->dtb_randomness) {
+if (dtb_randomness) {
 create_randomness(ms, "/chosen");
 }
 
 if (vms->secure) {
 qemu_fdt_add_subnode(fdt, "/secure-chosen");
-if (vms->dtb_randomness) {
+if (dtb_randomness) {
 create_randomness(ms, "/secure-chosen");
 }
 }
@@ -2474,18 +2485,21 @@ static void virt_set_its(Object *obj, bool value, Error **errp)
 vms->its = value;
 }
 
-static bool virt_get_dtb_randomness(Object *obj, Error **errp)
+static void virt_get_dtb_randomness(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
 {
 VirtMachineState *vms = VIRT_MACHINE(obj);
+OnOffAuto dtb_randomness = vms->dtb_randomness;
 
-return vms->dtb_randomness;
+visit_type_OnOffAuto(v, name, &dtb_randomness, errp);
 }
 
-static void virt_set_dtb_randomness(Object *obj, bool value, Error **errp)
+static void virt_set_dtb_randomness(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
 {
 VirtMachineState *vms = VIRT_MACHINE(obj);
 
-vms->dtb_randomness = value;
+visit_type_OnOffAuto(v, name, &vms->dtb_randomness, errp);
 }
 
 static char *virt_get_oem_id(Object *obj, Error **errp)
@@ -3123,16 +3137,16 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
   "Set on/off to enable/disable "
   "ITS instantiation");
 
-object_class_property_add_bool(oc, "dtb-randomness",
-
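
The "auto" resolution described in the commit message can be sketched in isolation. This is only an illustration, not code from the patch: the enum is a local stand-in for QEMU's generated OnOffAuto type, and dtb_randomness_enabled() is a hypothetical helper name.

```c
#include <stdbool.h>

/* Local stand-in for QEMU's QAPI-generated OnOffAuto enum. */
typedef enum {
    ON_OFF_AUTO_AUTO,
    ON_OFF_AUTO_ON,
    ON_OFF_AUTO_OFF,
} OnOffAuto;

/*
 * Hypothetical helper: resolve the tri-state property to a boolean.
 * An explicit "on" or "off" always wins; "auto" means on for ordinary
 * VMs and off for confidential VMs.
 */
static bool dtb_randomness_enabled(OnOffAuto setting, bool confidential)
{
    switch (setting) {
    case ON_OFF_AUTO_ON:
        return true;
    case ON_OFF_AUTO_OFF:
        return false;
    case ON_OFF_AUTO_AUTO:
    default:
        return !confidential;
    }
}
```

With this shape, a user who explicitly passes dtb-randomness=on for a confidential VM still gets the seeds, which matches the "not an error, but it prevents attestation" behaviour described above.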

[PATCH v2 18/22] target/arm/kvm: Disable Realm reboot

2024-04-19 Thread Jean-Philippe Brucker

A realm cannot be reset; it must be recreated from scratch. The RMM
specification defines states of a Realm as NEW -> ACTIVE -> SYSTEM_OFF,
after which the Realm can only be destroyed. A PSCI_SYSTEM_RESET call,
which normally reboots the system, puts the Realm in SYSTEM_OFF state.

QEMU does not support recreating a VM. Normally, a reboot request by the
guest causes all devices to reset, which cannot work for a Realm.
Indeed, loading images into Realm memory and changing the PC is only
allowed for a Realm in NEW state. Resetting the images for a Realm in
SYSTEM_OFF state will cause QEMU to crash with a bus error.

Handle reboot requests by the guest more gracefully, by indicating to
runstate.c that the vCPUs of a Realm are not resettable, and that QEMU
should exit.

Reviewed-by: Richard Henderson 
Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/kvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 9855cadb1b..60c2ef9388 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1694,7 +1694,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 
 bool kvm_arch_cpu_check_are_resettable(void)
 {
-return true;
+/* A Realm cannot be reset */
+return !kvm_arm_rme_enabled();
 }
 
 static void kvm_arch_get_eager_split_size(Object *obj, Visitor *v,
-- 
2.44.0
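
To make the lifecycle argument in the commit message concrete, here is a toy model of the three Realm states. It is only a sketch under my own naming (RealmState, realm_activate() and friends are hypothetical and are not RMM or KVM interfaces); it shows why a reset request leaves nothing for QEMU to reload.

```c
#include <stdbool.h>

/* Toy model of the Realm states named in the commit message. */
typedef enum {
    REALM_NEW,
    REALM_ACTIVE,
    REALM_SYSTEM_OFF,
} RealmState;

/* Images and the boot PC may only be set up while the Realm is NEW. */
static bool realm_can_load_images(RealmState s)
{
    return s == REALM_NEW;
}

/* Activation seals the Realm: NEW -> ACTIVE, otherwise no change. */
static RealmState realm_activate(RealmState s)
{
    return s == REALM_NEW ? REALM_ACTIVE : s;
}

/* A guest reset request moves ACTIVE -> SYSTEM_OFF, never back to NEW. */
static RealmState realm_system_reset(RealmState s)
{
    return s == REALM_ACTIVE ? REALM_SYSTEM_OFF : s;
}
```

Since no transition leads back to NEW, the only way to "reboot" is to destroy and recreate the Realm, which is why the patch reports the vCPUs as not resettable.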

[PATCH v2 19/22] target/arm/cpu: Inform about reading confidential CPU registers

2024-04-19 Thread Jean-Philippe Brucker

The host cannot access registers of a Realm. Instead of showing all
registers as zero in "info registers", display a message about this
restriction.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: new
---
 target/arm/cpu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index ab8d007a86..18d1b88e2f 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1070,6 +1070,11 @@ static void aarch64_cpu_dump_state(CPUState *cs, FILE *f, int flags)
 const char *ns_status;
 bool sve;
 
+if (cpu->kvm_rme) {
+qemu_fprintf(f, "the CPU registers are confidential to the realm\n");
+return;
+}
+
 qemu_fprintf(f, " PC=%016" PRIx64 " ", env->pc);
 for (i = 0; i < 32; i++) {
 if (i == 31) {
-- 
2.44.0

[PATCH v2 13/22] hw/arm/boot: Register Linux BSS section for confidential guests

2024-04-19 Thread Jean-Philippe Brucker

Although the BSS section is not currently part of the kernel blob, it
needs to be registered as guest RAM for confidential guest support,
because the kernel needs to access it before it is able to set up its RAM
regions.

It would be tempting to simply add the BSS as part of the ROM blob (i.e.
pass kernel_size as max_len argument to rom_add_blob()) and let the ROM
loader notifier deal with the full image size generically, but that
would add zero-initialization of the BSS region by the loader, which
adds a significant overhead. For a 40MB kernel with a 17MB BSS, I
measured an average boot time regression of 2.8ms on a fast desktop
(5.7% of the QEMU setup time). On a slower host, the regression could be
much larger.

Instead, add a special case to initialize the kernel's BSS IPA range.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: new
---
 target/arm/kvm_arm.h |  5 +
 hw/arm/boot.c| 11 +++
 target/arm/kvm-rme.c | 10 ++
 3 files changed, 26 insertions(+)

diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 4386b0..4b787dd628 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -218,6 +218,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level);
 
 int kvm_arm_rme_init(MachineState *ms);
 int kvm_arm_rme_vm_type(MachineState *ms);
+void kvm_arm_rme_init_guest_ram(hwaddr base, size_t size);
 
 bool kvm_arm_rme_enabled(void);
 int kvm_arm_rme_vcpu_init(CPUState *cs);
@@ -243,6 +244,10 @@ static inline bool kvm_arm_sve_supported(void)
 return false;
 }
 
+static inline void kvm_arm_rme_init_guest_ram(hwaddr base, size_t size)
+{
+}
+
 /*
  * These functions should never actually be called without KVM support.
  */
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 84ea6a807a..9f522e332b 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -26,6 +26,7 @@
 #include "qemu/config-file.h"
 #include "qemu/option.h"
 #include "qemu/units.h"
+#include "kvm_arm.h"
 
 /* Kernel boot protocol is specified in the kernel docs
  * Documentation/arm/Booting and Documentation/arm64/booting.txt
@@ -850,6 +851,7 @@ static uint64_t load_aarch64_image(const char *filename, hwaddr mem_base,
 {
 hwaddr kernel_load_offset = KERNEL64_LOAD_ADDR;
 uint64_t kernel_size = 0;
+uint64_t page_size;
 uint8_t *buffer;
 int size;
 
@@ -916,6 +918,15 @@ static uint64_t load_aarch64_image(const char *filename, hwaddr mem_base,
 *entry = mem_base + kernel_load_offset;
 rom_add_blob_fixed_as(filename, buffer, size, *entry, as);
 
+/*
+ * Register the kernel BSS as realm resource, so the kernel can use it right
+ * away. Align up to skip the last page, which still contains kernel
+ * data.
+ */
+page_size = qemu_real_host_page_size();
+kvm_arm_rme_init_guest_ram(QEMU_ALIGN_UP(*entry + size, page_size),
+   QEMU_ALIGN_DOWN(kernel_size - size, page_size));
+
 g_free(buffer);
 
 return kernel_size;
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index bee6694d6d..b2ad10ef6d 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -203,6 +203,16 @@ int kvm_arm_rme_init(MachineState *ms)
 return 0;
 }
 
+/*
+ * kvm_arm_rme_init_guest_ram - Initialize a Realm IPA range
+ */
+void kvm_arm_rme_init_guest_ram(hwaddr base, size_t size)
+{
+if (rme_guest) {
+rme_add_ram_region(base, size, /* populate */ false);
+}
+}
+
 int kvm_arm_rme_vcpu_init(CPUState *cs)
 {
 ARMCPU *cpu = ARM_CPU(cs);
-- 
2.44.0
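
The page arithmetic used for the BSS range can be checked with a small stand-alone sketch. The helpers below are local stand-ins for QEMU_ALIGN_UP and QEMU_ALIGN_DOWN, and BssRange/bss_range() are hypothetical names; I am assuming that "size" in the patch is the number of bytes loaded from the file and "kernel_size" is the full image size including BSS.

```c
#include <stdint.h>

/* Local stand-ins for QEMU_ALIGN_DOWN / QEMU_ALIGN_UP. */
static uint64_t align_down(uint64_t n, uint64_t m)
{
    return n / m * m;
}

static uint64_t align_up(uint64_t n, uint64_t m)
{
    return align_down(n + m - 1, m);
}

typedef struct {
    uint64_t base;
    uint64_t len;
} BssRange;

/*
 * Hypothetical helper mirroring the two expressions in the patch:
 * the base skips the last, partially filled page of file data, and
 * the length is the BSS size rounded down to whole pages.
 */
static BssRange bss_range(uint64_t entry, uint64_t file_size,
                          uint64_t image_size, uint64_t page_size)
{
    BssRange r = {
        .base = align_up(entry + file_size, page_size),
        .len = align_down(image_size - file_size, page_size),
    };
    return r;
}
```

For example, with entry 0x40200000, 0x2800123 bytes of file data, an image size of 0x3900000 and 4KB pages, the range starts at 0x42a01000 and spans 0x10ff000 bytes.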

[PATCH v2 15/22] target/arm/kvm-rme: Add measurement algorithm property

2024-04-19 Thread Jean-Philippe Brucker

This option selects which measurement algorithm to use for attestation.
Supported values are SHA256 and SHA512. Default to SHA512 arbitrarily.

SHA512 is generally faster on 64-bit architectures. On a few arm64 CPUs
I tested, SHA256 is much faster, but that's most likely because they only
support acceleration via FEAT_SHA256 (Armv8.0) and not FEAT_SHA512
(Armv8.2). Future CPUs supporting RME are likely to also support
FEAT_SHA512.

Cc: Eric Blake 
Cc: Markus Armbruster 
Cc: Daniel P. Berrangé 
Cc: Eduardo Habkost 
Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: use enum, pick default
---
 qapi/qom.json| 18 +-
 target/arm/kvm-rme.c | 39 ++-
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 91654aa267..84dce666b2 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -931,18 +931,34 @@
   'data': { '*cpu-affinity': ['uint16'],
 '*node-affinity': ['uint16'] } }
 
+##
+# @RmeGuestMeasurementAlgo:
+#
+# @sha256: Use the SHA256 algorithm
+# @sha512: Use the SHA512 algorithm
+#
+# Algorithm to use for realm measurements
+#
+# Since: FIXME
+##
+{ 'enum': 'RmeGuestMeasurementAlgo',
+  'data': ['sha256', 'sha512'] }
+
 ##
 # @RmeGuestProperties:
 #
 # Properties for rme-guest objects.
 #
+# @measurement-algo: Realm measurement algorithm (default: sha512)
+#
 # @personalization-value: Realm personalization value, as a 64-byte hex string
 # (default: 0)
 #
 # Since: FIXME
 ##
 { 'struct': 'RmeGuestProperties',
-  'data': { '*personalization-value': 'str' } }
+  'data': { '*personalization-value': 'str',
+'*measurement-algo': 'RmeGuestMeasurementAlgo' } }
 
 ##
 # @ObjectType:
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index cb5c3f7a22..8f39e54aaa 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -23,13 +23,14 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
 #define RME_PAGE_SIZE qemu_real_host_page_size()
 
-#define RME_MAX_CFG 1
+#define RME_MAX_CFG 2
 
 struct RmeGuest {
 ConfidentialGuestSupport parent_obj;
 Notifier rom_load_notifier;
 GSList *ram_regions;
 uint8_t *personalization_value;
+RmeGuestMeasurementAlgo measurement_algo;
 };
 
 typedef struct {
@@ -73,6 +74,19 @@ static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp)
 memcpy(args.rpv, guest->personalization_value, KVM_CAP_ARM_RME_RPV_SIZE);
 cfg_str = "personalization value";
 break;
+case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
+switch (guest->measurement_algo) {
+case RME_GUEST_MEASUREMENT_ALGO_SHA256:
+args.hash_algo = KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256;
+break;
+case RME_GUEST_MEASUREMENT_ALGO_SHA512:
+args.hash_algo = KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512;
+break;
+default:
+g_assert_not_reached();
+}
+cfg_str = "hash algorithm";
+break;
 default:
 g_assert_not_reached();
 }
@@ -338,12 +352,34 @@ static void rme_set_rpv(Object *obj, const char *value, Error **errp)
 }
 }
 
+static int rme_get_measurement_algo(Object *obj, Error **errp)
+{
+RmeGuest *guest = RME_GUEST(obj);
+
+return guest->measurement_algo;
+}
+
+static void rme_set_measurement_algo(Object *obj, int algo, Error **errp)
+{
+RmeGuest *guest = RME_GUEST(obj);
+
+guest->measurement_algo = algo;
+}
+
 static void rme_guest_class_init(ObjectClass *oc, void *data)
 {
 object_class_property_add_str(oc, "personalization-value", rme_get_rpv,
   rme_set_rpv);
 object_class_property_set_description(oc, "personalization-value",
 "Realm personalization value (512-bit hexadecimal number)");
+
+object_class_property_add_enum(oc, "measurement-algo",
+   "RmeGuestMeasurementAlgo",
+   &RmeGuestMeasurementAlgo_lookup,
+   rme_get_measurement_algo,
+   rme_set_measurement_algo);
+object_class_property_set_description(oc, "measurement-algo",
+"Realm measurement algorithm ('sha256', 'sha512')");
 }
 
 static void rme_guest_instance_init(Object *obj)
@@ -353,6 +389,7 @@ static void rme_guest_instance_init(Object *obj)
 exit(1);
 }
 rme_guest = RME_GUEST(obj);
+rme_guest->measurement_algo = RME_GUEST_MEASUREMENT_ALGO_SHA512;
 }
 
 static const TypeInfo rme_guest_info = {
-- 
2.44.0

[PATCH v2 04/22] target/arm/kvm-rme: Initialize realm

2024-04-19 Thread Jean-Philippe Brucker

The machine code calls kvm_arm_rme_vm_type() to get the VM flag and KVM
calls kvm_arm_rme_init() to issue KVM hypercalls:

* create the realm descriptor,
* load images into Realm RAM (in another patch),
* finalize the REC (vCPU) after the registers are reset,
* activate the realm at the end, at which point the realm is sealed.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2:
* Use g_assert_not_reached() in stubs
* Init from kvm_arch_init() rather than hw/arm/virt
* Cache rme_guest
---
 target/arm/kvm_arm.h |  16 +++
 target/arm/kvm-rme.c | 101 +++
 target/arm/kvm.c |   7 ++-
 3 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index cfaa0d9bc7..8e2d90c265 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -203,6 +203,8 @@ int kvm_arm_vgic_probe(void);
 void kvm_arm_pmu_init(ARMCPU *cpu);
 void kvm_arm_pmu_set_irq(ARMCPU *cpu, int irq);
 
+int kvm_arm_vcpu_finalize(ARMCPU *cpu, int feature);
+
 /**
  * kvm_arm_pvtime_init:
  * @cpu: ARMCPU
@@ -214,6 +216,11 @@ void kvm_arm_pvtime_init(ARMCPU *cpu, uint64_t ipa);
 
 int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level);
 
+int kvm_arm_rme_init(MachineState *ms);
+int kvm_arm_rme_vm_type(MachineState *ms);
+
+bool kvm_arm_rme_enabled(void);
+
 #else
 
 /*
@@ -283,6 +290,15 @@ static inline uint32_t kvm_arm_sve_get_vls(ARMCPU *cpu)
 g_assert_not_reached();
 }
 
+static inline int kvm_arm_rme_init(MachineState *ms)
+{
+g_assert_not_reached();
+}
+
+static inline int kvm_arm_rme_vm_type(MachineState *ms)
+{
+g_assert_not_reached();
+}
 #endif
 
 #endif
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 960dd75608..23ac2d32d4 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -23,14 +23,115 @@ struct RmeGuest {
 ConfidentialGuestSupport parent_obj;
 };
 
+static RmeGuest *rme_guest;
+
+bool kvm_arm_rme_enabled(void)
+{
+return !!rme_guest;
+}
+
+static int rme_create_rd(Error **errp)
+{
+int ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+KVM_CAP_ARM_RME_CREATE_RD);
+
+if (ret) {
+error_setg_errno(errp, -ret, "RME: failed to create Realm Descriptor");
+}
+return ret;
+}
+
+static void rme_vm_state_change(void *opaque, bool running, RunState state)
+{
+int ret;
+CPUState *cs;
+
+if (!running) {
+return;
+}
+
+ret = rme_create_rd(&error_abort);
+if (ret) {
+return;
+}
+
+/*
+ * Now that do_cpu_reset() initialized the boot PC and
+ * kvm_cpu_synchronize_post_reset() registered it, we can finalize the REC.
+ */
+CPU_FOREACH(cs) {
+ret = kvm_arm_vcpu_finalize(ARM_CPU(cs), KVM_ARM_VCPU_REC);
+if (ret) {
+error_report("RME: failed to finalize vCPU: %s", strerror(-ret));
+exit(1);
+}
+}
+
+ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+KVM_CAP_ARM_RME_ACTIVATE_REALM);
+if (ret) {
+error_report("RME: failed to activate realm: %s", strerror(-ret));
+exit(1);
+}
+}
+
+int kvm_arm_rme_init(MachineState *ms)
+{
+static Error *rme_mig_blocker;
+ConfidentialGuestSupport *cgs = ms->cgs;
+
+if (!rme_guest) {
+return 0;
+}
+
+if (!cgs) {
+error_report("missing -machine confidential-guest-support parameter");
+return -EINVAL;
+}
+
+if (!kvm_check_extension(kvm_state, KVM_CAP_ARM_RME)) {
+return -ENODEV;
+}
+
+error_setg(&rme_mig_blocker, "RME: migration is not implemented");
+migrate_add_blocker(&rme_mig_blocker, &error_fatal);
+
+/*
+ * The realm activation is done last, when the VM starts, after all images
+ * have been loaded and all vcpus finalized.
+ */
+qemu_add_vm_change_state_handler(rme_vm_state_change, NULL);
+
+cgs->ready = true;
+return 0;
+}
+
+int kvm_arm_rme_vm_type(MachineState *ms)
+{
+if (rme_guest) {
+return KVM_VM_TYPE_ARM_REALM;
+}
+return 0;
+}
+
 static void rme_guest_class_init(ObjectClass *oc, void *data)
 {
 }
 
+static void rme_guest_instance_init(Object *obj)
+{
+if (rme_guest) {
+error_report("a single instance of RmeGuest is supported");
+exit(1);
+}
+rme_guest = RME_GUEST(obj);
+}
+
 static const TypeInfo rme_guest_info = {
 .parent = TYPE_CONFIDENTIAL_GUEST_SUPPORT,
 .name = TYPE_RME_GUEST,
 .instance_size = sizeof(struct RmeGuest),
+.instance_init = rme_guest_instance_init,
 .class_init = rme_guest_class_init,
 .interfaces = (InterfaceInfo[]) {
 { TYPE_USER_CREATABLE },
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index a5673241e5..b00077c1a5 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -93,7 +93,7 @@ static int kvm_arm_vcpu_init(ARMCPU *cpu)
  *
  * Returns: 0 if success else < 0 error co

[PATCH v2 09/22] target/arm/kvm-rme: Initialize vCPU

2024-04-19 Thread Jean-Philippe Brucker

The target code calls kvm_arm_vcpu_init() to mark the vCPU as part of a
Realm. For a Realm vCPU, only x0-x7 can be set at runtime. Before boot,
the PC can also be set, and is ignored at runtime. KVM also accepts a
few system register changes during initial configuration, as returned by
KVM_GET_REG_LIST.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: only do the GP regs, since they are sync'd explicitly. Other
  registers use the existing reglist facility.
---
 target/arm/cpu.h |  3 +++
 target/arm/kvm_arm.h |  1 +
 target/arm/kvm-rme.c | 10 
 target/arm/kvm.c | 61 
 4 files changed, 75 insertions(+)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index bc0c84873f..d3ff1b4a31 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -945,6 +945,9 @@ struct ArchCPU {
 OnOffAuto kvm_steal_time;
 #endif /* CONFIG_KVM */
 
+/* Realm Management Extension */
+bool kvm_rme;
+
 /* Uniprocessor system with MP extensions */
 bool mp_is_up;
 
diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 8e2d90c265..4386b0 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -220,6 +220,7 @@ int kvm_arm_rme_init(MachineState *ms);
 int kvm_arm_rme_vm_type(MachineState *ms);
 
 bool kvm_arm_rme_enabled(void);
+int kvm_arm_rme_vcpu_init(CPUState *cs);
 
 #else
 
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 23ac2d32d4..aa9c3b5551 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -106,6 +106,16 @@ int kvm_arm_rme_init(MachineState *ms)
 return 0;
 }
 
+int kvm_arm_rme_vcpu_init(CPUState *cs)
+{
+ARMCPU *cpu = ARM_CPU(cs);
+
+if (rme_guest) {
+cpu->kvm_rme = true;
+}
+return 0;
+}
+
 int kvm_arm_rme_vm_type(MachineState *ms)
 {
 if (rme_guest) {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 3504276822..3a2233ec73 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1920,6 +1920,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
 return ret;
 }
 
+ret = kvm_arm_rme_vcpu_init(cs);
+if (ret) {
+return ret;
+}
+
 if (cpu_isar_feature(aa64_sve, cpu)) {
 ret = kvm_arm_sve_set_vls(cpu);
 if (ret) {
@@ -2056,6 +2061,35 @@ static int kvm_arch_put_sve(CPUState *cs)
 return 0;
 }
 
+static int kvm_arm_rme_put_core_regs(CPUState *cs)
+{
+int i, ret;
+struct kvm_one_reg reg;
+ARMCPU *cpu = ARM_CPU(cs);
+CPUARMState *env = &cpu->env;
+
+/*
+ * The RME ABI only allows us to set 8 GPRs and the PC
+ */
+for (i = 0; i < 8; i++) {
+reg.id = AARCH64_CORE_REG(regs.regs[i]);
+reg.addr = (uintptr_t) &env->xregs[i];
+ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+if (ret) {
+return ret;
+}
+}
+
+reg.id = AARCH64_CORE_REG(regs.pc);
+reg.addr = (uintptr_t) &env->pc;
+ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+if (ret) {
+return ret;
+}
+
+return 0;
+}
+
 static int kvm_arm_put_core_regs(CPUState *cs, int level)
 {
 uint64_t val;
@@ -2066,6 +2100,10 @@ static int kvm_arm_put_core_regs(CPUState *cs, int level)
 ARMCPU *cpu = ARM_CPU(cs);
 CPUARMState *env = &cpu->env;
 
+if (cpu->kvm_rme) {
+return kvm_arm_rme_put_core_regs(cs);
+}
+
 /* If we are in AArch32 mode then we need to copy the AArch32 regs to the
  * AArch64 registers before pushing them out to 64-bit KVM.
  */
@@ -2253,6 +2291,25 @@ static int kvm_arch_get_sve(CPUState *cs)
 return 0;
 }
 
+static int kvm_arm_rme_get_core_regs(CPUState *cs)
+{
+int i, ret;
+struct kvm_one_reg reg;
+ARMCPU *cpu = ARM_CPU(cs);
+CPUARMState *env = &cpu->env;
+
+for (i = 0; i < 8; i++) {
+reg.id = AARCH64_CORE_REG(regs.regs[i]);
+reg.addr = (uintptr_t) &env->xregs[i];
+ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+if (ret) {
+return ret;
+}
+}
+
+return 0;
+}
+
 static int kvm_arm_get_core_regs(CPUState *cs)
 {
 uint64_t val;
@@ -2263,6 +2320,10 @@ static int kvm_arm_get_core_regs(CPUState *cs)
 ARMCPU *cpu = ARM_CPU(cs);
 CPUARMState *env = &cpu->env;
 
+if (cpu->kvm_rme) {
+return kvm_arm_rme_get_core_regs(cs);
+}
+
 for (i = 0; i < 31; i++) {
 ret = kvm_get_one_reg(cs, AARCH64_CORE_REG(regs.regs[i]),
   &env->xregs[i]);
-- 
2.44.0

[PATCH v2 16/22] target/arm/cpu: Set number of breakpoints and watchpoints in KVM

2024-04-19 Thread Jean-Philippe Brucker

Add "num-breakpoints" and "num-watchpoints" CPU parameters to configure
the debug features that KVM presents to the guest. The KVM vCPU
configuration is modified by calling SET_ONE_REG on the ID register.

This is needed for Realm VMs, whose parameters include breakpoints and
watchpoints, and influence the Realm Initial Measurement.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: new
---
 target/arm/cpu.h  |  4 ++
 target/arm/kvm_arm.h  |  2 +
 target/arm/arm-qmp-cmds.c |  1 +
 target/arm/cpu64.c| 77 +++
 target/arm/kvm.c  | 56 +++-
 5 files changed, 139 insertions(+), 1 deletion(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index d3ff1b4a31..24080da2b7 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1089,6 +1089,10 @@ struct ArchCPU {
 
 /* Generic timer counter frequency, in Hz */
 uint64_t gt_cntfrq_hz;
+
+/* Allow overriding the default configuration */
+uint8_t num_bps;
+uint8_t num_wps;
 };
 
 typedef struct ARMCPUInfo {
diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 4b787dd628..b040686eab 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -16,6 +16,8 @@
 #define KVM_ARM_VGIC_V2   (1 << 0)
 #define KVM_ARM_VGIC_V3   (1 << 1)
 
+#define KVM_REG_ARM_ID_AA64DFR0_EL1 ARM64_SYS_REG(3, 0, 0, 5, 0)
+
 /**
  * kvm_arm_register_device:
  * @mr: memory region for this device
diff --git a/target/arm/arm-qmp-cmds.c b/target/arm/arm-qmp-cmds.c
index 3cc8cc738b..0f574bb1dd 100644
--- a/target/arm/arm-qmp-cmds.c
+++ b/target/arm/arm-qmp-cmds.c
@@ -95,6 +95,7 @@ static const char *cpu_model_advertised_features[] = {
 "sve1408", "sve1536", "sve1664", "sve1792", "sve1920", "sve2048",
 "kvm-no-adjvtime", "kvm-steal-time",
 "pauth", "pauth-impdef", "pauth-qarma3",
+"num-breakpoints", "num-watchpoints",
 NULL
 };
 
diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 985b1efe16..9ca74eb019 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -571,6 +571,82 @@ void aarch64_add_pauth_properties(Object *obj)
 }
 }
 
+#if defined(CONFIG_KVM)
+static void arm_cpu_get_num_wps(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
+uint8_t val;
+ARMCPU *cpu = ARM_CPU(obj);
+
+val = cpu->num_wps;
+if (val == 0) {
+val = FIELD_EX64(cpu->isar.id_aa64dfr0, ID_AA64DFR0, WRPS) + 1;
+}
+
+visit_type_uint8(v, name, &val, errp);
+}
+
+static void arm_cpu_set_num_wps(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
+uint8_t val;
+ARMCPU *cpu = ARM_CPU(obj);
+uint8_t max_wps = FIELD_EX64(cpu->isar.id_aa64dfr0, ID_AA64DFR0, WRPS) + 1;
+
+if (!visit_type_uint8(v, name, &val, errp)) {
+return;
+}
+
+if (val < 2 || val > max_wps) {
+error_setg(errp, "invalid number of watchpoints");
+return;
+}
+
+cpu->num_wps = val;
+}
+
+static void arm_cpu_get_num_bps(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
+uint8_t val;
+ARMCPU *cpu = ARM_CPU(obj);
+
+val = cpu->num_bps;
+if (val == 0) {
+val = FIELD_EX64(cpu->isar.id_aa64dfr0, ID_AA64DFR0, BRPS) + 1;
+}
+
+visit_type_uint8(v, name, &val, errp);
+}
+
+static void arm_cpu_set_num_bps(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
+uint8_t val;
+ARMCPU *cpu = ARM_CPU(obj);
+uint8_t max_bps = FIELD_EX64(cpu->isar.id_aa64dfr0, ID_AA64DFR0, BRPS) + 1;
+
+if (!visit_type_uint8(v, name, &val, errp)) {
+return;
+}
+
+if (val < 2 || val > max_bps) {
+error_setg(errp, "invalid number of breakpoints");
+return;
+}
+
+cpu->num_bps = val;
+}
+
+static void aarch64_add_kvm_writable_properties(Object *obj)
+{
+object_property_add(obj, "num-breakpoints", "uint8", arm_cpu_get_num_bps,
+arm_cpu_set_num_bps, NULL, NULL);
+object_property_add(obj, "num-watchpoints", "uint8", arm_cpu_get_num_wps,
+arm_cpu_set_num_wps, NULL, NULL);
+}
+#endif /* CONFIG_KVM */
+
 void arm_cpu_lpa2_finalize(ARMCPU *cpu, Error **errp)
 {
 uint64_t t;
@@ -713,6 +789,7 @@ static void aarch64_host_initfn(Object *obj)
 if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64)) {
 aarch64_add_sve_properties(obj);
 aarch64_add_pauth_properties(obj);
+aarch64_add_kvm_writable_properties(obj);
 }
 #elif defined(CONFIG_HVF)
 ARMCPU *cpu = ARM_CPU(obj);
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 6d368bf442..623980a25b
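
The setters above rely on the ID register encoding "count minus one". A stand-alone sketch of that decoding and of the range check follows. It assumes the architectural layout of ID_AA64DFR0_EL1 (BRPs in bits [15:12], WRPs in bits [23:20]); the helper names are mine and are not part of the patch, which uses QEMU's FIELD_EX64() instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field positions in ID_AA64DFR0_EL1; each field is 4 bits wide. */
#define DFR0_BRPS_SHIFT 12
#define DFR0_WRPS_SHIFT 20
#define DFR0_FIELD_WIDTH 4

/* Hypothetical stand-in for FIELD_EX64(): extract an unsigned bitfield. */
static unsigned extract_field(uint64_t reg, unsigned shift, unsigned width)
{
    return (unsigned)((reg >> shift) & ((1ULL << width) - 1));
}

/* The fields hold "number minus one", hence the +1. */
static unsigned max_breakpoints(uint64_t dfr0)
{
    return extract_field(dfr0, DFR0_BRPS_SHIFT, DFR0_FIELD_WIDTH) + 1;
}

static unsigned max_watchpoints(uint64_t dfr0)
{
    return extract_field(dfr0, DFR0_WRPS_SHIFT, DFR0_FIELD_WIDTH) + 1;
}

/* Same bounds as the setters in the patch: at least 2, at most the host max. */
static bool debug_count_valid(unsigned requested, unsigned host_max)
{
    return requested >= 2 && requested <= host_max;
}
```

For a host register with BRPs=5 and WRPs=3, this yields at most 6 breakpoints and 4 watchpoints, so num-breakpoints=7 would be rejected.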

[PATCH v2 11/22] hw/core/loader: Add ROM loader notifier

2024-04-19 Thread Jean-Philippe Brucker

Add a function to register a notifier that is invoked after a ROM gets
loaded into guest memory.

It will be used by Arm confidential guest support, in order to register
all blobs loaded into memory with KVM, so that their content is part of
the initial VM measurement and contributes to the guest attestation.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: new
---
 include/hw/loader.h | 15 +++
 hw/core/loader.c| 15 +++
 2 files changed, 30 insertions(+)

diff --git a/include/hw/loader.h b/include/hw/loader.h
index 8685e27334..79fab25dd9 100644
--- a/include/hw/loader.h
+++ b/include/hw/loader.h
@@ -356,6 +356,21 @@ void hmp_info_roms(Monitor *mon, const QDict *qdict);
 ssize_t rom_add_vga(const char *file);
 ssize_t rom_add_option(const char *file, int32_t bootindex);
 
+typedef struct RomLoaderNotify {
+/* Parameters passed to rom_add_blob() */
+hwaddr addr;
+size_t len;
+size_t max_len;
+} RomLoaderNotify;
+
+/**
+ * rom_add_load_notifier - Add a notifier for loaded images
+ *
+ * Add a notifier that will be invoked with a RomLoaderNotify structure for each
+ * blob loaded into guest memory, after the blob is loaded.
+ */
+void rom_add_load_notifier(Notifier *notifier);
+
 /* This is the usual maximum in uboot, so if a uImage overflows this, it would
  * overflow on real hardware too. */
 #define UBOOT_MAX_GUNZIP_BYTES (64 << 20)
diff --git a/hw/core/loader.c b/hw/core/loader.c
index b8e52f3fb0..4bd236cf89 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -67,6 +67,8 @@
 #include 
 
 static int roms_loaded;
+static NotifierList rom_loader_notifier =
+NOTIFIER_LIST_INITIALIZER(rom_loader_notifier);
 
 /* return the size or -1 if error */
 int64_t get_image_size(const char *filename)
@@ -1209,6 +1211,11 @@ MemoryRegion *rom_add_blob(const char *name, const void *blob, size_t len,
 return mr;
 }
 
+void rom_add_load_notifier(Notifier *notifier)
+{
+notifier_list_add(&rom_loader_notifier, notifier);
+}
+
 /* This function is specific for elf program because we don't need to allocate
  * all the rom. We just allocate the first part and the rest is just zeros. This
  * is why romsize and datasize are different. Also, this function takes its own
@@ -1250,6 +1257,7 @@ ssize_t rom_add_option(const char *file, int32_t bootindex)
 static void rom_reset(void *unused)
 {
 Rom *rom;
+RomLoaderNotify notify;
 
 QTAILQ_FOREACH(rom, &roms, next) {
 if (rom->fw_file) {
@@ -1298,6 +1306,13 @@ static void rom_reset(void *unused)
 cpu_flush_icache_range(rom->addr, rom->datasize);
 
 trace_loader_write_rom(rom->name, rom->addr, rom->datasize, rom->isrom);
+
+notify = (RomLoaderNotify) {
+.addr = rom->addr,
+.len = rom->datasize,
+.max_len = rom->romsize,
+};
+notifier_list_notify(&rom_loader_notifier, &notify);
 }
 }
 
-- 
2.44.0
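
For readers unfamiliar with the notifier pattern, here is a minimal stand-alone sketch of how a consumer might observe load events. It does not use QEMU's Notifier/NotifierList API: Listener, LoadEvent, add_listener() and notify_all() are simplified, hypothetical stand-ins, with LoadEvent carrying the same three fields as RomLoaderNotify.

```c
#include <stddef.h>
#include <stdint.h>

/* Same payload as RomLoaderNotify in the patch, under a local name. */
typedef struct {
    uint64_t addr;
    size_t len;
    size_t max_len;
} LoadEvent;

/* Simplified stand-in for QEMU's Notifier: a callback plus a list link. */
typedef struct Listener Listener;
struct Listener {
    void (*notify)(Listener *l, const LoadEvent *ev);
    Listener *next;
};

static Listener *listeners;

/* Register a listener; it will see every event published afterwards. */
static void add_listener(Listener *l)
{
    l->next = listeners;
    listeners = l;
}

/* Publish one event to every registered listener. */
static void notify_all(const LoadEvent *ev)
{
    for (Listener *l = listeners; l; l = l->next) {
        l->notify(l, ev);
    }
}

/* Example consumer: embeds the listener and accumulates loaded bytes. */
typedef struct {
    Listener base;
    size_t total;
} ByteCounter;

static void count_bytes(Listener *l, const LoadEvent *ev)
{
    /* Valid because 'base' is the first member of ByteCounter. */
    ByteCounter *c = (ByteCounter *)l;
    c->total += ev->len;
}
```

The loader only knows about the generic callback, so confidential-guest code can record each blob's address and length without hw/core/loader.c depending on any Arm-specific code.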

[PATCH v2 21/22] hw/arm/virt: Move virt_flash_create() to machvirt_init()

2024-04-19 Thread Jean-Philippe Brucker

For confidential VMs we'll want to skip flash device creation.
Unfortunately, in virt_instance_init() the machine->cgs member has not
yet been initialized, so we cannot check whether confidential guest is
enabled. Move virt_flash_create() to machvirt_init(), where we can
access the machine->cgs member.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: new
---
 hw/arm/virt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index eca9a96b5a..bed19d0b79 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2071,6 +2071,8 @@ static void machvirt_init(MachineState *machine)
 unsigned int smp_cpus = machine->smp.cpus;
 unsigned int max_cpus = machine->smp.max_cpus;
 
+virt_flash_create(vms);
+
 possible_cpus = mc->possible_cpu_arch_ids(machine);
 
 /*
@@ -3229,8 +3231,6 @@ static void virt_instance_init(Object *obj)
 
 vms->irqmap = a15irqmap;
 
-virt_flash_create(vms);
-
 vms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
 vms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
 }
-- 
2.44.0

[PATCH v2 12/22] target/arm/kvm-rme: Populate Realm memory

2024-04-19 Thread Jean-Philippe Brucker

Collect the images copied into guest RAM into a sorted list, and issue
POPULATE_REALM KVM ioctls once we've created the Realm Descriptor. The
images are part of the Realm Initial Measurement.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: Use a ROM loader notifier
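
For readers who want to reproduce the ordering outside QEMU, the following is a minimal standalone sketch of the sorted region list. It mirrors the alignment done in rme_add_ram_region() below (base rounded down, length rounded up), but the names (Region, region_insert_sorted) and the fixed 4 KiB page size are assumptions made for the example; the patch itself uses GSList and qemu_real_host_page_size().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096ULL    /* assumption: 4 KiB pages */

typedef struct Region Region;
struct Region {
    uint64_t base;
    uint64_t len;
    bool populate;      /* copy data in, or only initialize the range */
    Region *next;
};

static uint64_t align_down(uint64_t v, uint64_t a)
{
    return v - v % a;
}

static uint64_t align_up(uint64_t v, uint64_t a)
{
    return align_down(v + a - 1, a);
}

/* Insert a page-aligned region, keeping the list sorted by base address. */
static Region *region_insert_sorted(Region *head, uint64_t base,
                                    uint64_t len, bool populate)
{
    Region *r = calloc(1, sizeof(*r));
    Region **link = &head;

    assert(r != NULL);
    r->base = align_down(base, SKETCH_PAGE_SIZE);
    r->len = align_up(len, SKETCH_PAGE_SIZE);
    r->populate = populate;

    /* Walk to the first entry whose base is not below the new one. */
    while (*link && (*link)->base < r->base) {
        link = &(*link)->next;
    }
    r->next = *link;
    *link = r;
    return head;
}

static void region_list_free(Region *head)
{
    while (head) {
        Region *next = head->next;

        free(head);
        head = next;
    }
}
```

Walking the resulting list front to back gives the same region order regardless of the order in which images were loaded, which is what a verifier recomputing the RIM relies on.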
---
 target/arm/kvm-rme.c | 97 
 1 file changed, 97 insertions(+)

diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index aa9c3b5551..bee6694d6d 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -9,9 +9,11 @@
 #include "exec/confidential-guest-support.h"
 #include "hw/boards.h"
 #include "hw/core/cpu.h"
+#include "hw/loader.h"
 #include "kvm_arm.h"
 #include "migration/blocker.h"
 #include "qapi/error.h"
+#include "qemu/error-report.h"
 #include "qom/object_interfaces.h"
 #include "sysemu/kvm.h"
 #include "sysemu/runstate.h"
@@ -19,10 +21,21 @@
 #define TYPE_RME_GUEST "rme-guest"
 OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
+#define RME_PAGE_SIZE qemu_real_host_page_size()
+
 struct RmeGuest {
 ConfidentialGuestSupport parent_obj;
+Notifier rom_load_notifier;
+GSList *ram_regions;
 };
 
+typedef struct {
+hwaddr base;
+hwaddr len;
+/* Populate guest RAM with data, or only initialize the IPA range */
+bool populate;
+} RmeRamRegion;
+
 static RmeGuest *rme_guest;
 
 bool kvm_arm_rme_enabled(void)
@@ -41,6 +54,41 @@ static int rme_create_rd(Error **errp)
 return ret;
 }
 
+static void rme_populate_realm(gpointer data, gpointer unused)
+{
+int ret;
+const RmeRamRegion *region = data;
+
+if (region->populate) {
+struct kvm_cap_arm_rme_populate_realm_args populate_args = {
+.populate_ipa_base = region->base,
+.populate_ipa_size = region->len,
+.flags = KVM_ARM_RME_POPULATE_FLAGS_MEASURE,
+};
+ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+KVM_CAP_ARM_RME_POPULATE_REALM,
+(intptr_t)&populate_args);
+if (ret) {
+error_report("RME: failed to populate realm (0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx"): %s",
+ region->base, region->len, strerror(-ret));
+exit(1);
+}
+} else {
+struct kvm_cap_arm_rme_init_ipa_args init_args = {
+.init_ipa_base = region->base,
+.init_ipa_size = region->len,
+};
+ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+KVM_CAP_ARM_RME_INIT_IPA_REALM,
+(intptr_t)&init_args);
+if (ret) {
+error_report("RME: failed to initialize GPA range (0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx"): %s",
+ region->base, region->len, strerror(-ret));
+exit(1);
+}
+}
+}
+
 static void rme_vm_state_change(void *opaque, bool running, RunState state)
 {
 int ret;
@@ -55,6 +103,9 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state)
 return;
 }
 
+g_slist_foreach(rme_guest->ram_regions, rme_populate_realm, NULL);
+g_slist_free_full(g_steal_pointer(&rme_guest->ram_regions), g_free);
+
 /*
  * Now that do_cpu_reset() initialized the boot PC and
  * kvm_cpu_synchronize_post_reset() registered it, we can finalize the REC.
@@ -75,6 +126,49 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state)
 }
 }
 
+static gint rme_compare_ram_regions(gconstpointer a, gconstpointer b)
+{
+const RmeRamRegion *ra = a;
+const RmeRamRegion *rb = b;
+
+g_assert(ra->base != rb->base);
+return ra->base < rb->base ? -1 : 1;
+}
+
+static void rme_add_ram_region(hwaddr base, hwaddr len, bool populate)
+{
+RmeRamRegion *region;
+
+region = g_new0(RmeRamRegion, 1);
+region->base = QEMU_ALIGN_DOWN(base, RME_PAGE_SIZE);
+region->len = QEMU_ALIGN_UP(len, RME_PAGE_SIZE);
+region->populate = populate;
+
+/*
+ * The Realm Initial Measurement (RIM) depends on the order in which we
+ * initialize and populate the RAM regions. To help a verifier
+ * independently calculate the RIM, sort regions by GPA.
+ */
+rme_guest->ram_regions = g_slist_insert_sorted(rme_guest->ram_regions,
+   region,
+   rme_compare_ram_regions);
+}
+
+static void rme_rom_load_notify(Notifier *notifier, void *data)
+{
+RomLoaderNotify *rom = data;
+
+if (rom->addr == -1) {
+/*
+ * These blobs (ACPI tables) are not loaded into guest RAM at reset.
+ * Instead the firmware will load them via fw_cfg and measure them
+

[PATCH v2 17/22] target/arm/cpu: Set number of PMU counters in KVM

2024-04-19 Thread Jean-Philippe Brucker

Add a "num-pmu-counters" CPU parameter to configure the number of
counters that KVM presents to the guest. This is needed for Realm VMs,
whose parameters include the number of PMU counters and influence the
Realm Initial Measurement.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: new
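
To make the register manipulation concrete, here is a small standalone sketch of reading and rewriting the N field of PMCR_EL0 (bits [15:11], matching the FIELD(PMCR, N, 11, 5) definition added below). The helper names and the pmcr_apply_num_counters() policy are assumptions for illustration only; they are not the body of kvm_arm_configure_pmcr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMCR_N_SHIFT  11
#define PMCR_N_LENGTH 5
#define PMCR_N_MASK   (((UINT64_C(1) << PMCR_N_LENGTH) - 1) << PMCR_N_SHIFT)

/* Extract the number of event counters from a PMCR_EL0 value. */
static uint8_t pmcr_get_n(uint64_t pmcr)
{
    return (uint8_t)((pmcr & PMCR_N_MASK) >> PMCR_N_SHIFT);
}

/* Return pmcr with the N field replaced by n. */
static uint64_t pmcr_set_n(uint64_t pmcr, uint8_t n)
{
    assert(n < (1u << PMCR_N_LENGTH));
    return (pmcr & ~PMCR_N_MASK) | ((uint64_t)n << PMCR_N_SHIFT);
}

/*
 * Hypothetical policy: -1 keeps the host default, anything above the
 * host maximum is rejected, otherwise the N field is rewritten.
 */
static bool pmcr_apply_num_counters(uint64_t host_pmcr, int requested,
                                    uint64_t *out)
{
    if (requested < 0) {
        *out = host_pmcr;
        return true;
    }
    if (requested > pmcr_get_n(host_pmcr)) {
        return false;
    }
    *out = pmcr_set_n(host_pmcr, (uint8_t)requested);
    return true;
}
```

The -1 sentinel corresponds to the default that the property setup in cpu64.c assigns to num_pmu_ctrs.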
---
 target/arm/cpu.h  |  3 +++
 target/arm/kvm_arm.h  |  1 +
 target/arm/arm-qmp-cmds.c |  2 +-
 target/arm/cpu64.c| 41 +++
 target/arm/kvm.c  | 34 +++-
 5 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 24080da2b7..84f3a67dab 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1093,6 +1093,7 @@ struct ArchCPU {
 /* Allows to override the default configuration */
 uint8_t num_bps;
 uint8_t num_wps;
+int8_t num_pmu_ctrs;
 };
 
 typedef struct ARMCPUInfo {
@@ -2312,6 +2313,8 @@ FIELD(MFAR, FPA, 12, 40)
 FIELD(MFAR, NSE, 62, 1)
 FIELD(MFAR, NS, 63, 1)
 
+FIELD(PMCR, N, 11, 5)
+
 QEMU_BUILD_BUG_ON(ARRAY_SIZE(((ARMCPU *)0)->ccsidr) <= R_V7M_CSSELR_INDEX_MASK);
 
 /* If adding a feature bit which corresponds to a Linux ELF
diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index b040686eab..62e39e7184 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -17,6 +17,7 @@
 #define KVM_ARM_VGIC_V3   (1 << 1)
 
 #define KVM_REG_ARM_ID_AA64DFR0_EL1 ARM64_SYS_REG(3, 0, 0, 5, 0)
+#define KVM_REG_ARM_PMCR_EL0 ARM64_SYS_REG(3, 3, 9, 12, 0)
 
 /**
  * kvm_arm_register_device:
diff --git a/target/arm/arm-qmp-cmds.c b/target/arm/arm-qmp-cmds.c
index 0f574bb1dd..985d4270b8 100644
--- a/target/arm/arm-qmp-cmds.c
+++ b/target/arm/arm-qmp-cmds.c
@@ -95,7 +95,7 @@ static const char *cpu_model_advertised_features[] = {
 "sve1408", "sve1536", "sve1664", "sve1792", "sve1920", "sve2048",
 "kvm-no-adjvtime", "kvm-steal-time",
 "pauth", "pauth-impdef", "pauth-qarma3",
-"num-breakpoints", "num-watchpoints",
+"num-breakpoints", "num-watchpoints", "num-pmu-counters",
 NULL
 };
 
diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 9ca74eb019..6c2b922d93 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -638,12 +638,53 @@ static void arm_cpu_set_num_bps(Object *obj, Visitor *v, const char *name,
 cpu->num_bps = val;
 }
 
+static void arm_cpu_get_num_pmu_ctrs(Object *obj, Visitor *v, const char *name,
+ void *opaque, Error **errp)
+{
+uint8_t val;
+ARMCPU *cpu = ARM_CPU(obj);
+
+if (cpu->num_pmu_ctrs == -1) {
+val = FIELD_EX64(cpu->isar.reset_pmcr_el0, PMCR, N);
+} else {
+val = cpu->num_pmu_ctrs;
+}
+
+visit_type_uint8(v, name, &val, errp);
+}
+
+static void arm_cpu_set_num_pmu_ctrs(Object *obj, Visitor *v, const char *name,
+ void *opaque, Error **errp)
+{
+uint8_t val;
+ARMCPU *cpu = ARM_CPU(obj);
+uint8_t max_ctrs = FIELD_EX64(cpu->isar.reset_pmcr_el0, PMCR, N);
+
+if (!visit_type_uint8(v, name, &val, errp)) {
+return;
+}
+
+if (val > max_ctrs) {
+error_setg(errp, "invalid number of PMU counters");
+return;
+}
+
+cpu->num_pmu_ctrs = val;
+}
+
 static void aarch64_add_kvm_writable_properties(Object *obj)
 {
+ARMCPU *cpu = ARM_CPU(obj);
+
 object_property_add(obj, "num-breakpoints", "uint8", arm_cpu_get_num_bps,
 arm_cpu_set_num_bps, NULL, NULL);
 object_property_add(obj, "num-watchpoints", "uint8", arm_cpu_get_num_wps,
 arm_cpu_set_num_wps, NULL, NULL);
+
+cpu->num_pmu_ctrs = -1;
+object_property_add(obj, "num-pmu-counters", "uint8",
+arm_cpu_get_num_pmu_ctrs, arm_cpu_set_num_pmu_ctrs,
+NULL, NULL);
 }
 #endif /* CONFIG_KVM */
 
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 623980a25b..9855cadb1b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -418,7 +418,7 @@ static bool kvm_arm_get_host_cpu_features(ARMHostCPUFeatures *ahcf)
 if (pmu_supported) {
 /* PMCR_EL0 is only accessible if the vCPU has feature PMU_V3 */
 err |= read_sys_reg64(fdarray[2], &ahcf->isar.reset_pmcr_el0,
-  ARM64_SYS_REG(3, 3, 9, 12, 0));
+  KVM_REG_ARM_PMCR_EL0);
 }
 
 if (sve_supported) {
@@ -919,9 +919,41 @@ static void kvm_arm_configure_aa64dfr0(ARMCPU *cpu)
 }
 }
 
+static void kvm_arm_configure_pmcr(ARMCPU *cpu)
+{
+int ret;
+uint64_t val, newval;
+CPUState *cs = CPU(cpu);
+
+if (cpu->num_pmu_ctrs == -1) {
+return;
+}
+
+newval

[PATCH v2 20/22] target/arm/kvm-rme: Enable guest memfd

2024-04-19 Thread Jean-Philippe Brucker

Request that RAM block uses the KVM guest memfd call to allocate guest
memory. With RME, guest memory is not accessible by the host, and using
guest memfd ensures that the host kernel is aware of this and doesn't
attempt to access guest pages.

Done in a separate patch because ms->require_guest_memfd is not yet
merged.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: new
---
 target/arm/kvm-rme.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 8f39e54aaa..71cc1d4147 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -263,6 +263,7 @@ int kvm_arm_rme_init(MachineState *ms)
 rme_guest->rom_load_notifier.notify = rme_rom_load_notify;
 rom_add_load_notifier(&rme_guest->rom_load_notifier);
 
+ms->require_guest_memfd = true;
 cgs->ready = true;
 return 0;
 }
-- 
2.44.0

[PATCH v2 22/22] hw/arm/virt: Use RAM instead of flash for confidential guest firmware

2024-04-19 Thread Jean-Philippe Brucker

The flash device that holds firmware code relies on read-only stage-2
mappings. Read accesses behave as RAM and write accesses as MMIO. Since
the RMM does not support read-only mappings we cannot use the flash
device as-is.

That isn't a problem because the firmware does not want to disclose any
information to the host, hence will not store its variables in clear
persistent memory. We can therefore replace the flash device with RAM,
and load the firmware there.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: new
---
 include/hw/arm/boot.h |  9 +
 hw/arm/boot.c | 34 ++---
 hw/arm/virt.c | 44 +++
 3 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/include/hw/arm/boot.h b/include/hw/arm/boot.h
index 80c492d742..d91cfc6942 100644
--- a/include/hw/arm/boot.h
+++ b/include/hw/arm/boot.h
@@ -112,6 +112,10 @@ struct arm_boot_info {
  */
 bool firmware_loaded;
 
+/* Used when loading firmware into RAM */
+hwaddr firmware_base;
+hwaddr firmware_max_size;
+
 /* Address at which board specific loader/setup code exists. If enabled,
  * this code-blob will run before anything else. It must return to the
  * caller via the link register. There is no stack set up. Enabled by
@@ -132,6 +136,11 @@ struct arm_boot_info {
 bool secure_board_setup;
 
 arm_endianness endianness;
+
+/*
+ * Confidential guest boot loads everything into RAM so it can be measured.
+ */
+bool confidential;
 };
 
 /**
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 9f522e332b..26c6334d52 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -1154,7 +1154,31 @@ static void arm_setup_direct_kernel_boot(ARMCPU *cpu,
 }
 }
 
-static void arm_setup_firmware_boot(ARMCPU *cpu, struct arm_boot_info *info)
+static void arm_setup_confidential_firmware_boot(ARMCPU *cpu,
+ struct arm_boot_info *info,
+ const char *firmware_filename)
+{
+ssize_t fw_size;
+const char *fname;
+AddressSpace *as = arm_boot_address_space(cpu, info);
+
+fname = qemu_find_file(QEMU_FILE_TYPE_BIOS, firmware_filename);
+if (!fname) {
+error_report("Could not find firmware image '%s'", firmware_filename);
+exit(1);
+}
+
+fw_size = load_image_targphys_as(firmware_filename,
+ info->firmware_base,
+ info->firmware_max_size, as);
+if (fw_size <= 0) {
+error_report("could not load firmware '%s'", firmware_filename);
+exit(1);
+}
+}
+
+static void arm_setup_firmware_boot(ARMCPU *cpu, struct arm_boot_info *info,
+const char *firmware_filename)
 {
 /* Set up for booting firmware (which might load a kernel via fw_cfg) */
 
@@ -1205,6 +1229,10 @@ static void arm_setup_firmware_boot(ARMCPU *cpu, struct arm_boot_info *info)
 }
 }
 
+if (info->confidential) {
+arm_setup_confidential_firmware_boot(cpu, info, firmware_filename);
+}
+
 /*
  * We will start from address 0 (typically a boot ROM image) in the
  * same way as hardware. Leave env->boot_info NULL, so that
@@ -1243,9 +1271,9 @@ void arm_load_kernel(ARMCPU *cpu, MachineState *ms, struct arm_boot_info *info)
 info->dtb_filename = ms->dtb;
 info->dtb_limit = 0;
 
-/* Load the kernel.  */
+/* Load the kernel and/or firmware. */
 if (!info->kernel_filename || info->firmware_loaded) {
-arm_setup_firmware_boot(cpu, info);
+arm_setup_firmware_boot(cpu, info, ms->firmware);
 } else {
 arm_setup_direct_kernel_boot(cpu, info);
 }
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index bed19d0b79..4a6281fc89 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1178,6 +1178,10 @@ static PFlashCFI01 *virt_flash_create1(VirtMachineState *vms,
 
 static void virt_flash_create(VirtMachineState *vms)
 {
+if (virt_machine_is_confidential(vms)) {
+return;
+}
+
 vms->flash[0] = virt_flash_create1(vms, "virt.flash0", "pflash0");
 vms->flash[1] = virt_flash_create1(vms, "virt.flash1", "pflash1");
 }
@@ -1213,6 +1217,10 @@ static void virt_flash_map(VirtMachineState *vms,
 hwaddr flashsize = vms->memmap[VIRT_FLASH].size / 2;
 hwaddr flashbase = vms->memmap[VIRT_FLASH].base;
 
+if (virt_machine_is_confidential(vms)) {
+return;
+}
+
 virt_flash_map1(vms->flash[0], flashbase, flashsize,
 secure_sysmem);
 virt_flash_map1(vms->flash[1], flashbase + flashsize, flashsize,
@@ -1228,6 +1236,10 @@ static void virt_flash_fdt(VirtMachineState *vms,
 MachineState *ms = MACHINE(vms);
 char *nodename;
 
+if (virt_machine_is_confidential(vms)) {
+ret

[PATCH v2 00/22] arm: Run CCA VMs with KVM

2024-04-19 Thread Jean-Philippe Brucker

These patches enable launching a confidential guest with QEMU KVM on
Arm. The KVM changes for CCA have now been posted as v2 [1]. Launching a
confidential VM requires two additional command-line parameters:

-M confidential-guest-support=rme0
-object rme-guest,id=rme0

Since the RFC [2] I tried to address all review comments, and added a
few features:

* Enabled support for guest memfd by Xiaoyao Li and Chao Peng [3].
  Guest memfd is mandatory for CCA.

* Support firmware boot (edk2).

* Use CPU command-line arguments for Realm parameters. SVE vector length
  uses the existing sve -cpu parameters, while breakpoints, watchpoints
  and PMU counters use new CPU parameters.

The full series based on the memfd patches is at:
https://git.codelinaro.org/linaro/dcap/qemu.git branch cca/v2

Please find instructions for building and running the whole CCA stack at:
https://linaro.atlassian.net/wiki/spaces/QEMU/pages/29051027459/Building+an+RME+stack+for+QEMU

[1] https://lore.kernel.org/kvm/20240412084056.1733704-1-steven.pr...@arm.com/
[2] https://lore.kernel.org/all/20230127150727.612594-1-jean-phili...@linaro.org/
[3] https://lore.kernel.org/qemu-devel/20240322181116.1228416-1-pbonz...@redhat.com/

Jean-Philippe Brucker (22):
  kvm: Merge kvm_check_extension() and kvm_vm_check_extension()
  target/arm: Add confidential guest support
  target/arm/kvm: Return immediately on error in kvm_arch_init()
  target/arm/kvm-rme: Initialize realm
  hw/arm/virt: Add support for Arm RME
  hw/arm/virt: Disable DTB randomness for confidential VMs
  hw/arm/virt: Reserve one bit of guest-physical address for RME
  target/arm/kvm: Split kvm_arch_get/put_registers
  target/arm/kvm-rme: Initialize vCPU
  target/arm/kvm: Create scratch VM as Realm if necessary
  hw/core/loader: Add ROM loader notifier
  target/arm/kvm-rme: Populate Realm memory
  hw/arm/boot: Register Linux BSS section for confidential guests
  target/arm/kvm-rme: Add Realm Personalization Value parameter
  target/arm/kvm-rme: Add measurement algorithm property
  target/arm/cpu: Set number of breakpoints and watchpoints in KVM
  target/arm/cpu: Set number of PMU counters in KVM
  target/arm/kvm: Disable Realm reboot
  target/arm/cpu: Inform about reading confidential CPU registers
  target/arm/kvm-rme: Enable guest memfd
  hw/arm/virt: Move virt_flash_create() to machvirt_init()
  hw/arm/virt: Use RAM instead of flash for confidential guest firmware

 docs/system/arm/virt.rst   |   9 +-
 docs/system/confidential-guest-support.rst |   1 +
 qapi/qom.json  |  34 +-
 include/hw/arm/boot.h  |   9 +
 include/hw/arm/virt.h  |   2 +-
 include/hw/loader.h|  15 +
 include/sysemu/kvm.h   |   2 -
 include/sysemu/kvm_int.h   |   1 +
 target/arm/cpu.h   |  10 +
 target/arm/kvm_arm.h   |  25 ++
 accel/kvm/kvm-all.c|  34 +-
 hw/arm/boot.c  |  45 ++-
 hw/arm/virt.c  | 118 --
 hw/core/loader.c   |  15 +
 target/arm/arm-qmp-cmds.c  |   1 +
 target/arm/cpu.c   |   5 +
 target/arm/cpu64.c | 118 ++
 target/arm/kvm-rme.c   | 413 +
 target/arm/kvm.c   | 200 +-
 target/i386/kvm/kvm.c  |   6 +-
 target/ppc/kvm.c   |  36 +-
 target/arm/meson.build |   7 +-
 22 files changed, 1023 insertions(+), 83 deletions(-)
 create mode 100644 target/arm/kvm-rme.c

-- 
2.44.0

[PATCH v2 03/22] target/arm/kvm: Return immediately on error in kvm_arch_init()

2024-04-19 Thread Jean-Philippe Brucker

Returning an error to kvm_init() is fatal anyway, no need to continue
the initialization.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: new
---
 target/arm/kvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 3371ffa401..a5673241e5 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -566,7 +566,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 !kvm_check_extension(s, KVM_CAP_ARM_IRQ_LINE_LAYOUT_2)) {
 error_report("Using more than 256 vcpus requires a host kernel "
  "with KVM_CAP_ARM_IRQ_LINE_LAYOUT_2");
-ret = -EINVAL;
+return -EINVAL;
 }
 
 if (kvm_check_extension(s, KVM_CAP_ARM_NISV_TO_USER)) {
@@ -595,6 +595,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 if (ret < 0) {
 error_report("Enabling of Eager Page Split failed: %s",
  strerror(-ret));
+return ret;
 }
 }
 }
-- 
2.44.0

[PATCH v2 01/22] kvm: Merge kvm_check_extension() and kvm_vm_check_extension()

2024-04-19 Thread Jean-Philippe Brucker

The KVM_CHECK_EXTENSION ioctl can be issued either on the global fd
(/dev/kvm), or on the VM fd obtained with KVM_CREATE_VM. For most
extensions, KVM returns the same value with either method, but for some
of them it can refine the returned value depending on the VM type. The
KVM documentation [1] advises to use the VM fd:

  Based on their initialization different VMs may have different
  capabilities. It is thus encouraged to use the vm ioctl to query for
  capabilities (available with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)

Ongoing work on Arm confidential VMs confirms this, as some capabilities
become unavailable to confidential VMs, requiring changes in QEMU to use
kvm_vm_check_extension() instead of kvm_check_extension() [2]. Rather
than changing each check one by one, change kvm_check_extension() to
always issue the ioctl on the VM fd when available, and remove
kvm_vm_check_extension().

Fall back to the global fd when the VM check is unavailable:

* Ancient kernels do not support KVM_CHECK_EXTENSION on the VM fd, since
  it was added by commit 92b591a4c46b ("KVM: Allow KVM_CHECK_EXTENSION
  on the vm fd") in Linux 3.17 [3]. Support for Linux 3.16 ended in June
  2020, but there may still be old images around.

* A couple of calls must be issued before the VM fd is available, since
  they determine the VM type: KVM_CAP_MIPS_VZ and KVM_CAP_ARM_VM_IPA_SIZE

Does any user actually depend on the check being done on the global fd
instead of the VM fd?  I surveyed all cases where KVM presently returns
different values depending on the query method. Luckily QEMU already
calls kvm_vm_check_extension() for most of those. Only three of them are
ambiguous, because they are currently issued on the global fd:

* KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_VCPU_ID on Arm, changes value if the
  user requests a vGIC different from the default. But QEMU queries this
  before vGIC configuration, so the reported value will be the same.

* KVM_CAP_SW_TLB on PPC. When issued on the global fd, returns false if
  the kvm-hv module is loaded; when issued on the VM fd, returns false
  only if the VM type is HV instead of PR. If this returns false, then
  QEMU will fail to initialize a BOOKE206 MMU model.

  So this patch supposedly improves things, as it allows running this
  type of vCPU even when both KVM modules are loaded.

* KVM_CAP_PPC_SECURE_GUEST. Similarly, doing this check on a VM fd
  refines the returned value, and ensures that SVM is actually
  supported. Since QEMU follows the check with kvm_vm_enable_cap(), this
  patch should only provide better error reporting.

[1] https://www.kernel.org/doc/html/latest/virt/kvm/api.html#kvm-check-extension
[2] https://lore.kernel.org/kvm/875ybi0ytc@redhat.com/
[3] https://github.com/torvalds/linux/commit/92b591a4c46b

Cc: Marcelo Tosatti 
Cc: Nicholas Piggin 
Cc: Daniel Henrique Barboza 
Cc: qemu-...@nongnu.org
Suggested-by: Cornelia Huck 
Signed-off-by: Jean-Philippe Brucker 
---
v1: https://lore.kernel.org/qemu-devel/20230421163822.839167-1-jean-phili...@linaro.org/
v1->v2: Initialize check_extension_vm using kvm_vm_ioctl() as suggested
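
The selection logic described above can be sketched without touching /dev/kvm. In this standalone example the two query paths are modelled as function pointers; KvmSketch, sketch_check_extension() and the stub backends are hypothetical names introduced for the illustration, and only the check_extension_vm flag and the "negative result means unsupported" rule are taken from the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*check_fn)(unsigned int extension);

typedef struct {
    check_fn global_check;      /* query issued on the /dev/kvm fd */
    check_fn vm_check;          /* query issued on the VM fd, NULL before VM creation */
    bool check_extension_vm;    /* set once the VM-fd query is known to work */
} KvmSketch;

/* Prefer the VM fd when available, fall back to the global fd otherwise. */
static int sketch_check_extension(const KvmSketch *s, unsigned int extension)
{
    int ret;

    assert(s->global_check != NULL);
    if (s->check_extension_vm && s->vm_check) {
        ret = s->vm_check(extension);
    } else {
        ret = s->global_check(extension);
    }
    /* Errors are reported as "extension not supported". */
    return ret < 0 ? 0 : ret;
}

/* Stub backends: the VM refines extension 1 to "unsupported". */
static int stub_global_check(unsigned int extension)
{
    return extension == 1 ? 1 : 0;
}

static int stub_vm_check(unsigned int extension)
{
    if (extension == 1) {
        return 0;
    }
    return extension == 2 ? -1 : 1;
}
```

Callers that must run before the VM exists, such as the KVM_CAP_ARM_VM_IPA_SIZE query, take the global path simply because the flag is not set yet.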
---
 include/sysemu/kvm.h |  2 --
 include/sysemu/kvm_int.h |  1 +
 accel/kvm/kvm-all.c  | 34 +++---
 target/arm/kvm.c |  2 +-
 target/i386/kvm/kvm.c|  6 +++---
 target/ppc/kvm.c | 36 ++--
 6 files changed, 38 insertions(+), 43 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index c6f34d4794..df97077434 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -404,8 +404,6 @@ bool kvm_arch_stop_on_emulation_error(CPUState *cpu);
 
 int kvm_check_extension(KVMState *s, unsigned int extension);
 
-int kvm_vm_check_extension(KVMState *s, unsigned int extension);
-
 #define kvm_vm_enable_cap(s, capability, cap_flags, ...) \
 ({   \
 struct kvm_enable_cap cap = {\
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index cad763e240..fa4c9aeb96 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -123,6 +123,7 @@ struct KVMState
 uint16_t xen_gnttab_max_frames;
 uint16_t xen_evtchn_max_pirq;
 char *device;
+bool check_extension_vm;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index e08dd04164..3d9fbc8a98 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1128,7 +1128,11 @@ int kvm_check_extension(KVMState *s, unsigned int extension)
 {
 int ret;
 
-ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+if (!s->check_extension_vm) {
+ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+} else {
+ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+}
 if (ret < 0) {
 ret = 0;
 }
@@ -1136,19 +1140,6 @@ int

[PATCH v2 14/22] target/arm/kvm-rme: Add Realm Personalization Value parameter

2024-04-19 Thread Jean-Philippe Brucker

The Realm Personalization Value (RPV) is provided by the user to
distinguish Realms that have the same initial measurement.

The user provides up to 64 hexadecimal bytes. They are stored into the
RPV in the same order, zero-padded on the right.

Cc: Eric Blake 
Cc: Markus Armbruster 
Cc: Daniel P. Berrangé 
Cc: Eduardo Habkost 
Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: Move parsing early, store as-is rather than reverted
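
As an illustration of the encoding described above (bytes stored in input order, zero-padded on the right, a leading single digit allowed when the string has odd length), here is a standalone parsing sketch. It is not the sscanf()-based code from this patch: rpv_parse() and hex_nibble() are hypothetical helpers, the 64-byte size is taken from the commit message, and the sketch is stricter than sscanf() in that it rejects anything but hex digits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RPV_SIZE 64  /* assumption: 64-byte RPV, per the commit message */

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') {
        return c - '0';
    }
    if (c >= 'a' && c <= 'f') {
        return c - 'a' + 10;
    }
    if (c >= 'A' && c <= 'F') {
        return c - 'A' + 10;
    }
    return -1;
}

/* Parse a hex string into out[]; returns 0 on success, -1 on bad input. */
static int rpv_parse(const char *value, uint8_t out[SKETCH_RPV_SIZE])
{
    size_t len = strlen(value);
    size_t i = 0, o = 0;

    memset(out, 0, SKETCH_RPV_SIZE);
    if (len == 0 || len > SKETCH_RPV_SIZE * 2) {
        return -1;
    }

    /* With an odd number of digits, the first byte has a single digit. */
    if (len % 2) {
        int lo = hex_nibble(value[i++]);

        if (lo < 0) {
            return -1;
        }
        out[o++] = (uint8_t)lo;
    }

    /* The remaining digits come in pairs, most significant nibble first. */
    while (i < len) {
        int hi = hex_nibble(value[i]);
        int lo = hex_nibble(value[i + 1]);

        if (hi < 0 || lo < 0) {
            return -1;
        }
        out[o++] = (uint8_t)((hi << 4) | lo);
        i += 2;
    }
    assert(o <= SKETCH_RPV_SIZE);
    return 0;
}
```

With this layout the string "abc" yields the bytes 0x0a, 0xbc followed by 62 zero bytes.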
---
 qapi/qom.json|  15 +-
 target/arm/kvm-rme.c | 111 +++
 2 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 623ec8071f..91654aa267 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -931,6 +931,18 @@
   'data': { '*cpu-affinity': ['uint16'],
 '*node-affinity': ['uint16'] } }
 
+##
+# @RmeGuestProperties:
+#
+# Properties for rme-guest objects.
+#
+# @personalization-value: Realm personalization value, as a 64-byte hex string
+# (default: 0)
+#
+# Since: FIXME
+##
+{ 'struct': 'RmeGuestProperties',
+  'data': { '*personalization-value': 'str' } }
 
 ##
 # @ObjectType:
@@ -1066,7 +1078,8 @@
   'tls-creds-x509': 'TlsCredsX509Properties',
   'tls-cipher-suites':  'TlsCredsProperties',
   'x-remote-object':'RemoteObjectProperties',
-  'x-vfio-user-server': 'VfioUserServerProperties'
+  'x-vfio-user-server': 'VfioUserServerProperties',
+  'rme-guest':  'RmeGuestProperties'
   } }
 
 ##
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index b2ad10ef6d..cb5c3f7a22 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -23,10 +23,13 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
 #define RME_PAGE_SIZE qemu_real_host_page_size()
 
+#define RME_MAX_CFG 1
+
 struct RmeGuest {
 ConfidentialGuestSupport parent_obj;
 Notifier rom_load_notifier;
 GSList *ram_regions;
+uint8_t *personalization_value;
 };
 
 typedef struct {
@@ -54,6 +57,48 @@ static int rme_create_rd(Error **errp)
 return ret;
 }
 
+static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp)
+{
+int ret;
+const char *cfg_str;
+struct kvm_cap_arm_rme_config_item args = {
+.cfg = cfg,
+};
+
+switch (cfg) {
+case KVM_CAP_ARM_RME_CFG_RPV:
+if (!guest->personalization_value) {
+return 0;
+}
+memcpy(args.rpv, guest->personalization_value, KVM_CAP_ARM_RME_RPV_SIZE);
+cfg_str = "personalization value";
+break;
+default:
+g_assert_not_reached();
+}
+
+ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+KVM_CAP_ARM_RME_CONFIG_REALM, (intptr_t)&args);
+if (ret) {
+error_setg_errno(errp, -ret, "RME: failed to configure %s", cfg_str);
+}
+return ret;
+}
+
+static int rme_configure(void)
+{
+int ret;
+int cfg;
+
+for (cfg = 0; cfg < RME_MAX_CFG; cfg++) {
+ret = rme_configure_one(rme_guest, cfg, &error_abort);
+if (ret) {
+return ret;
+}
+}
+return 0;
+}
+
 static void rme_populate_realm(gpointer data, gpointer unused)
 {
 int ret;
@@ -98,6 +143,11 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state)
 return;
 }
 
+ret = rme_configure();
+if (ret) {
+return;
+}
+
 ret = rme_create_rd(&error_abort);
 if (ret) {
 return;
@@ -231,8 +281,69 @@ int kvm_arm_rme_vm_type(MachineState *ms)
 return 0;
 }
 
+static char *rme_get_rpv(Object *obj, Error **errp)
+{
+RmeGuest *guest = RME_GUEST(obj);
+GString *s;
+int i;
+
+if (!guest->personalization_value) {
+return NULL;
+}
+
+s = g_string_sized_new(KVM_CAP_ARM_RME_RPV_SIZE * 2 + 1);
+
+for (i = 0; i < KVM_CAP_ARM_RME_RPV_SIZE; i++) {
+g_string_append_printf(s, "%02x", guest->personalization_value[i]);
+}
+
+return g_string_free(s, /* free_segment */ false);
+}
+
+static void rme_set_rpv(Object *obj, const char *value, Error **errp)
+{
+RmeGuest *guest = RME_GUEST(obj);
+size_t len = strlen(value);
+uint8_t *out;
+int i = 1;
+int ret;
+
+g_free(guest->personalization_value);
+guest->personalization_value = out = g_malloc0(KVM_CAP_ARM_RME_RPV_SIZE);
+
+/* Two chars per byte */
+if (len > KVM_CAP_ARM_RME_RPV_SIZE * 2) {
+error_setg(errp, "Realm Personalization Value is too large");
+return;
+}
+
+/* First byte may have a single char */
+if (len % 2) {
+ret = sscanf(value, "%1hhx", out++);
+} else {
+ret = sscanf(value, "%2hhx", out++);
+i++;
+}
+if (ret != 1) {
+error_setg(errp, "Invalid Realm Personalization Value");
+return;
+}
+
+for (; i < len; i += 2) {
+ret = sscanf(value + i,

[PATCH v2 05/22] hw/arm/virt: Add support for Arm RME

2024-04-19 Thread Jean-Philippe Brucker

When confidential-guest-support is enabled for the virt machine, call
the RME init function, and add the RME flag to the VM type.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2:
* Don't explicitly disable steal_time, it's now done through KVM capabilities
* Split patch
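
The two decisions this patch touches can be summarised in a standalone sketch. The enum, the function names and SKETCH_VM_TYPE_REALM are placeholders invented for the example (the real flag value comes from the KVM series and is not reproduced here); the branch order follows the diff below.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    CONDUIT_DISABLED,
    CONDUIT_SMC,
    CONDUIT_HVC,
} Conduit;

/* Placeholder flag, chosen to sit above the 8-bit IPA size field. */
#define SKETCH_VM_TYPE_REALM (1 << 8)

/* Pick the PSCI conduit: Realm guests use SMC, like guests with EL2. */
static Conduit choose_psci_conduit(bool secure, bool firmware_loaded,
                                   bool has_el2, bool confidential)
{
    if (secure && firmware_loaded) {
        return CONDUIT_DISABLED;
    }
    if (has_el2 || confidential) {
        return CONDUIT_SMC;
    }
    return CONDUIT_HVC;
}

/* Combine the IPA size with the VM type flag, as virt_kvm_type() does. */
static int compose_kvm_type(bool fixed_ipa, int requested_pa_size,
                            int rme_vm_type)
{
    if (fixed_ipa) {
        return 0;   /* legacy 40-bit IPA setting */
    }
    assert((requested_pa_size & ~0xff) == 0);
    return requested_pa_size | rme_vm_type;
}
```

Note that, as in the diff, the legacy fixed-IPA path returns 0 without the Realm flag.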
---
 hw/arm/virt.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index a9a913aead..07ad31876e 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -224,6 +224,11 @@ static const int a15irqmap[] = {
 [VIRT_PLATFORM_BUS] = 112, /* ...to 112 + PLATFORM_BUS_NUM_IRQS -1 */
 };
 
+static bool virt_machine_is_confidential(VirtMachineState *vms)
+{
+return MACHINE(vms)->cgs;
+}
+
 static void create_randomness(MachineState *ms, const char *node)
 {
 struct {
@@ -2111,10 +2116,11 @@ static void machvirt_init(MachineState *machine)
  * if the guest has EL2 then we will use SMC as the conduit,
  * and otherwise we will use HVC (for backwards compatibility and
  * because if we're using KVM then we must use HVC).
+ * Realm guests must also use SMC.
  */
 if (vms->secure && firmware_loaded) {
 vms->psci_conduit = QEMU_PSCI_CONDUIT_DISABLED;
-} else if (vms->virt) {
+} else if (vms->virt || virt_machine_is_confidential(vms)) {
 vms->psci_conduit = QEMU_PSCI_CONDUIT_SMC;
 } else {
 vms->psci_conduit = QEMU_PSCI_CONDUIT_HVC;
@@ -2917,6 +2923,7 @@ static HotplugHandler *virt_machine_get_hotplug_handler(MachineState *machine,
 static int virt_kvm_type(MachineState *ms, const char *type_str)
 {
 VirtMachineState *vms = VIRT_MACHINE(ms);
+int rme_vm_type = kvm_arm_rme_vm_type(ms);
 int max_vm_pa_size, requested_pa_size;
 bool fixed_ipa;
 
@@ -2946,7 +2953,11 @@ static int virt_kvm_type(MachineState *ms, const char *type_str)
  * the implicit legacy 40b IPA setting, in which case the kvm_type
  * must be 0.
  */
-return fixed_ipa ? 0 : requested_pa_size;
+if (fixed_ipa) {
+return 0;
+}
+
+return requested_pa_size | rme_vm_type;
 }
 
 static void virt_machine_class_init(ObjectClass *oc, void *data)
-- 
2.44.0

[PATCH v2 07/22] hw/arm/virt: Reserve one bit of guest-physical address for RME

2024-04-19 Thread Jean-Philippe Brucker

When RME is enabled, the upper GPA bit is used to distinguish protected
from unprotected addresses. Reserve it when setting up the guest memory
map.

Signed-off-by: Jean-Philippe Brucker 
---
v1->v2: separate patch
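
The arithmetic is easy to check in isolation. The sketch below uses hypothetical helper names and a portable bit count in place of clz64(); it shows how one bit is subtracted from the limit used to lay out the memory map and added back to the IPA size requested from KVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of bits needed to represent v (0 for v == 0). */
static int bits_needed(uint64_t v)
{
    int n = 0;

    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/* IPA bits available to the memory map once the Realm bit is set aside. */
static int memmap_ipa_limit(int max_vm_pa_size, bool realm)
{
    int reserve = realm ? 1 : 0;

    assert(max_vm_pa_size > reserve);
    return max_vm_pa_size - reserve;
}

/* IPA size to request from KVM for a given highest guest-physical address. */
static int requested_pa_size(uint64_t highest_gpa, bool realm)
{
    return bits_needed(highest_gpa) + (realm ? 1 : 0);
}
```

For example, a memory map that tops out just below 1 TiB needs 40 bits for a normal VM and 41 bits for a Realm.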
---
 hw/arm/virt.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index f300f100b5..eca9a96b5a 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2939,14 +2939,24 @@ static int virt_kvm_type(MachineState *ms, const char *type_str)
 VirtMachineState *vms = VIRT_MACHINE(ms);
 int rme_vm_type = kvm_arm_rme_vm_type(ms);
 int max_vm_pa_size, requested_pa_size;
+int rme_reserve_bit = 0;
 bool fixed_ipa;
 
-max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa);
+if (rme_vm_type) {
+/*
+ * With RME, the upper GPA bit differentiates Realm from NS memory.
+ * Reserve the upper bit to ensure that highmem devices will fit.
+ */
+rme_reserve_bit = 1;
+}
+
+max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa) -
+ rme_reserve_bit;
 
 /* we freeze the memory map to compute the highest gpa */
 virt_set_memmap(vms, max_vm_pa_size);
 
-requested_pa_size = 64 - clz64(vms->highest_gpa);
+requested_pa_size = 64 - clz64(vms->highest_gpa) + rme_reserve_bit;
 
 /*
  * KVM requires the IPA size to be at least 32 bits.
-- 
2.44.0

[PATCH v2 10/22] target/arm/kvm: Create scratch VM as Realm if necessary

2024-04-19 Thread Jean-Philippe Brucker

Some ID registers have a different value for a Realm VM, for example
ID_AA64DFR0_EL1 contains the number of breakpoints/watchpoints
implemented by RMM instead of the hardware.

Even though RMM is in charge of setting up most Realm registers, KVM
still provides GET_ONE_REG interface on a Realm VM to probe the VM's
capabilities.

KVM only reports the maximum IPA it supports, but RMM may support
smaller sizes. If the VM creation fails with the value returned by KVM,
then retry with the smaller working address. This needs a better
solution.

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/kvm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 3a2233ec73..6d368bf442 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -104,6 +104,7 @@ bool kvm_arm_create_scratch_host_vcpu(const uint32_t *cpus_to_try,
 {
 int ret = 0, kvmfd = -1, vmfd = -1, cpufd = -1;
 int max_vm_pa_size;
+int vm_type;
 
 kvmfd = qemu_open_old("/dev/kvm", O_RDWR);
 if (kvmfd < 0) {
@@ -113,8 +114,9 @@ bool kvm_arm_create_scratch_host_vcpu(const uint32_t *cpus_to_try,
 if (max_vm_pa_size < 0) {
 max_vm_pa_size = 0;
 }
+vm_type = kvm_arm_rme_vm_type(MACHINE(qdev_get_machine()));
 do {
-vmfd = ioctl(kvmfd, KVM_CREATE_VM, max_vm_pa_size);
+vmfd = ioctl(kvmfd, KVM_CREATE_VM, max_vm_pa_size | vm_type);
 } while (vmfd == -1 && errno == EINTR);
 if (vmfd < 0) {
 goto err;
-- 
2.44.0
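
As a rough sketch of what the KVM_CREATE_VM argument looks like after this change: on arm64 the low byte of the machine type carries the IPA size, and other bits can carry type flags. The names below are assumptions for illustration. IPA_SIZE_MASK mirrors the 8-bit IPA size field, while EXAMPLE_VM_TYPE_REALM is a made-up placeholder and not the value returned by kvm_arm_rme_vm_type().

```c
/* Low byte of the arm64 machine type: requested IPA size in bits. */
#define IPA_SIZE_MASK 0xffUL

/* Placeholder flag for this example only, not a real KVM definition. */
#define EXAMPLE_VM_TYPE_REALM (1UL << 8)

/* Combine an IPA size and VM type flags into one KVM_CREATE_VM argument. */
unsigned long make_vm_type(int ipa_bits, unsigned long type_flags)
{
    return ((unsigned long)ipa_bits & IPA_SIZE_MASK) | type_flags;
}
```

Because the two fields occupy different bits, a plain OR is enough, which is what the patch does with max_vm_pa_size | vm_type.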

[PATCH v2 08/22] target/arm/kvm: Split kvm_arch_get/put_registers

2024-04-19 Thread Jean-Philippe Brucker

The confidential guest support in KVM limits the number of registers
that we can read and write. Split the get/put_registers function to
prepare for it.

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/kvm.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index b00077c1a5..3504276822 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2056,7 +2056,7 @@ static int kvm_arch_put_sve(CPUState *cs)
 return 0;
 }
 
-int kvm_arch_put_registers(CPUState *cs, int level)
+static int kvm_arm_put_core_regs(CPUState *cs, int level)
 {
 uint64_t val;
 uint32_t fpr;
@@ -2159,6 +2159,19 @@ int kvm_arch_put_registers(CPUState *cs, int level)
 return ret;
 }
 
+return 0;
+}
+
+int kvm_arch_put_registers(CPUState *cs, int level)
+{
+int ret;
+ARMCPU *cpu = ARM_CPU(cs);
+
+ret = kvm_arm_put_core_regs(cs, level);
+if (ret) {
+return ret;
+}
+
 write_cpustate_to_list(cpu, true);
 
 if (!write_list_to_kvmstate(cpu, level)) {
@@ -2240,7 +2253,7 @@ static int kvm_arch_get_sve(CPUState *cs)
 return 0;
 }
 
-int kvm_arch_get_registers(CPUState *cs)
+static int kvm_arm_get_core_regs(CPUState *cs)
 {
 uint64_t val;
 unsigned int el;
@@ -2343,6 +2356,19 @@ int kvm_arch_get_registers(CPUState *cs)
 }
 vfp_set_fpcr(env, fpr);
 
+return 0;
+}
+
+int kvm_arch_get_registers(CPUState *cs)
+{
+int ret;
+ARMCPU *cpu = ARM_CPU(cs);
+
+ret = kvm_arm_get_core_regs(cs);
+if (ret) {
+return ret;
+}
+
 ret = kvm_get_vcpu_events(cpu);
 if (ret) {
 return ret;
-- 
2.44.0

[PATCH v2 02/22] target/arm: Add confidential guest support

2024-04-19 Thread Jean-Philippe Brucker

Add a new RmeGuest object, inheriting from ConfidentialGuestSupport, to
support the Arm Realm Management Extension (RME). It is instantiated by
passing on the command-line:

  -M virt,confidential-guest-support=<id>
  -object guest-rme,id=<id>[,options...]

This is only the skeleton. Support will be added in following patches.

Cc: Eric Blake 
Cc: Markus Armbruster 
Cc: Daniel P. Berrangé 
Cc: Eduardo Habkost 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Richard Henderson 
Signed-off-by: Jean-Philippe Brucker 
---
 docs/system/confidential-guest-support.rst |  1 +
 qapi/qom.json  |  3 +-
 target/arm/kvm-rme.c   | 46 ++
 target/arm/meson.build |  7 +++-
 4 files changed, 55 insertions(+), 2 deletions(-)
 create mode 100644 target/arm/kvm-rme.c

diff --git a/docs/system/confidential-guest-support.rst b/docs/system/confidential-guest-support.rst
index 0c490dbda2..acf46d8856 100644
--- a/docs/system/confidential-guest-support.rst
+++ b/docs/system/confidential-guest-support.rst
@@ -40,5 +40,6 @@ Currently supported confidential guest mechanisms are:
 * AMD Secure Encrypted Virtualization (SEV) (see :doc:`i386/amd-memory-encryption`)
 * POWER Protected Execution Facility (PEF) (see :ref:`power-papr-protected-execution-facility-pef`)
 * s390x Protected Virtualization (PV) (see :doc:`s390x/protvirt`)
+* Arm Realm Management Extension (RME)
 
 Other mechanisms may be supported in future.
diff --git a/qapi/qom.json b/qapi/qom.json
index 85e6b4f84a..623ec8071f 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -996,7 +996,8 @@
 'tls-creds-x509',
 'tls-cipher-suites',
 { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
-{ 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
+{ 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
+'rme-guest'
   ] }
 
 ##
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
new file mode 100644
index 00..960dd75608
--- /dev/null
+++ b/target/arm/kvm-rme.c
@@ -0,0 +1,46 @@
+/*
+ * QEMU Arm RME support
+ *
+ * Copyright Linaro 2024
+ */
+
+#include "qemu/osdep.h"
+
+#include "exec/confidential-guest-support.h"
+#include "hw/boards.h"
+#include "hw/core/cpu.h"
+#include "kvm_arm.h"
+#include "migration/blocker.h"
+#include "qapi/error.h"
+#include "qom/object_interfaces.h"
+#include "sysemu/kvm.h"
+#include "sysemu/runstate.h"
+
+#define TYPE_RME_GUEST "rme-guest"
+OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
+
+struct RmeGuest {
+ConfidentialGuestSupport parent_obj;
+};
+
+static void rme_guest_class_init(ObjectClass *oc, void *data)
+{
+}
+
+static const TypeInfo rme_guest_info = {
+.parent = TYPE_CONFIDENTIAL_GUEST_SUPPORT,
+.name = TYPE_RME_GUEST,
+.instance_size = sizeof(struct RmeGuest),
+.class_init = rme_guest_class_init,
+.interfaces = (InterfaceInfo[]) {
+{ TYPE_USER_CREATABLE },
+{ }
+}
+};
+
+static void rme_register_types(void)
+{
+type_register_static(&rme_guest_info);
+}
+
+type_init(rme_register_types);
diff --git a/target/arm/meson.build b/target/arm/meson.build
index 2e10464dbb..c610c078f7 100644
--- a/target/arm/meson.build
+++ b/target/arm/meson.build
@@ -8,7 +8,12 @@ arm_ss.add(files(
 ))
 arm_ss.add(zlib)
 
-arm_ss.add(when: 'CONFIG_KVM', if_true: files('hyp_gdbstub.c', 'kvm.c'), if_false: files('kvm-stub.c'))
+arm_ss.add(when: 'CONFIG_KVM',
+  if_true: files(
+'hyp_gdbstub.c',
+'kvm.c',
+'kvm-rme.c'),
+  if_false: files('kvm-stub.c'))
 arm_ss.add(when: 'CONFIG_HVF', if_true: files('hyp_gdbstub.c'))
 
 arm_ss.add(when: 'TARGET_AARCH64', if_true: files(
-- 
2.44.0

Re: [PATCH v2] virtio-iommu: Use qemu_real_host_page_mask as default page_size_mask

2024-02-21 Thread Jean-Philippe Brucker

On Wed, Feb 21, 2024 at 11:41:57AM +0100, Eric Auger wrote:
> Hi,
> 
> On 2/13/24 13:00, Michael S. Tsirkin wrote:
> > On Tue, Feb 13, 2024 at 12:24:22PM +0100, Eric Auger wrote:
> >> Hi Michael,
> >> On 2/13/24 12:09, Michael S. Tsirkin wrote:
> >>> On Tue, Feb 13, 2024 at 11:32:13AM +0100, Eric Auger wrote:
>  Do you have an other concern?
> >>> I also worry a bit about migrating between hosts with different
> >>> page sizes. Not with kvm I am guessing but with tcg it does work I think?
> >> I have never tried but is it a valid use case? Adding Peter in CC.
> >>> Is this just for vfio and vdpa? Can we limit this to these setups
> >>> maybe?
> >> I am afraid we know the actual use case too late. If the VFIO device is
> >> hotplugged we have started working with 4kB granule.
> >>
> >> The other way is to introduce a min_granule option as done for aw-bits.
> >> But it is heavier.
> >>
> >> Thanks
> >>
> >> Eric
> > Let's say, if you are changing the default then we definitely want
> > a way to get the compatible behaviour for tcg.
> > So the compat machinery should be user-accessible too and documented.
> 
> I guess I need to add a new option to guarantee the machine compat.
> 
> I was thinking about an enum GranuleMode property taking the following
> values, 4KB, 64KB, host
> Jean, do you think there is a rationale offering something richer?

16KB seems to be gaining popularity, we should include that (I think it's
the only granule supported by Apple IOMMU?). Hopefully that will be
enough.

Thanks,
Jean

> 
> Obviously being able to set the exact page_size_mask + host mode would
> be better but this does not really fit into any std property type.
> 
> Thanks
> 
> Eric
> >
>

Re: [PATCH v3 0/3] VIRTIO-IOMMU: Introduce an aw-bits option

2024-02-08 Thread Jean-Philippe Brucker

On Thu, Feb 08, 2024 at 11:10:16AM +0100, Eric Auger wrote:
> In [1] and [2] we attempted to fix a case where a VFIO-PCI device
> protected with a virtio-iommu is assigned to an x86 guest. On x86
> the physical IOMMU may have an address width (gaw) of 39 or 48 bits
> whereas the virtio-iommu exposes a 64b input address space by default.
> Hence the guest may try to use the full 64b space and DMA MAP
> failures may be encountered. To work around this issue we endeavoured
> to pass usable host IOVA regions (excluding the out of range space) from
> VFIO to the virtio-iommu device so that the virtio-iommu driver can
> query those latter during the probe request and let the guest iommu
> kernel subsystem carve them out. 
> 
> However if there are several devices in the same iommu group,
> only the reserved regions of the first one are taken into
> account by the iommu subsystem of the guest. This generally
> works on baremetal because devices are not going to
> expose different reserved regions. However in our case, this
> may prevent from taking into account the host iommu geometry.
> 
> So the simplest solution to this problem looks to introduce an
> input address width option, aw-bits, which matches what is
> done on the intel-iommu. By default, from now on it is set
> to 39 bits with pc_q35 and 48 with arm virt. This replaces the
> previous default value of 64b. So we need to introduce a compat
> for machines older than 9.0 to behave similarly. We use
> hw_compat_8_2 to achieve that goal.

For the series:

Reviewed-by: Jean-Philippe Brucker 

> 
> Outstanding series [2] remains useful to let resv regions being
> communicated on time before the probe request.
> 
> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
> https://lore.kernel.org/all/20231019134651.842175-1-eric.au...@redhat.com/
> - This is merged -
> 
> [2] [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for 
> hotplugged devices
> https://lore.kernel.org/all/20240117080414.316890-1-eric.au...@redhat.com/
> - This is pending for review on the ML -
> 
> This series can be found at:
> https://github.com/eauger/qemu/tree/virtio-iommu-aw-bits-v3
> previous
> https://github.com/eauger/qemu/tree/virtio-iommu-aw-bits-v2
> 
> Applied on top of [3]
> [PATCH v2] virtio-iommu: Use qemu_real_host_page_mask as default 
> page_size_mask
> https://lore.kernel.org/all/20240117132039.332273-1-eric.au...@redhat.com/
> 
> History:
> v2 -> v3:
> - Collected Zhenzhong and Cédric's R-b + Yanghang's T-b
> - use &error_abort instead of NULL error handle
>   on object_property_get_uint() call (Cédric)
> - use VTD_HOST_AW_39BIT (Cédric)
> 
> v1 -> v2
> - Limit aw to 48b on ARM
> - Check aw is within [32,64]
> - Use hw_compat_8_2
> 
> 
> Eric Auger (3):
>   virtio-iommu: Add an option to define the input range width
>   virtio-iommu: Trace domain range limits as unsigned int
>   hw: Set virtio-iommu aw-bits default value on pc_q35 and arm virt
> 
>  include/hw/virtio/virtio-iommu.h | 1 +
>  hw/arm/virt.c| 6 ++
>  hw/core/machine.c| 5 -
>  hw/i386/pc.c | 6 ++
>  hw/virtio/virtio-iommu.c | 7 ++-
>  hw/virtio/trace-events   | 2 +-
>  6 files changed, 24 insertions(+), 3 deletions(-)
> 
> -- 
> 2.41.0
>

Re: [PATCH v2 1/3] virtio-iommu: Add an option to define the input range width

2024-02-08 Thread Jean-Philippe Brucker

On Thu, Feb 08, 2024 at 09:16:35AM +0100, Eric Auger wrote:
> >> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> >> index ec2ba11d1d..7870bdbeee 100644
> >> --- a/hw/virtio/virtio-iommu.c
> >> +++ b/hw/virtio/virtio-iommu.c
> >> @@ -1314,7 +1314,11 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
> >>   */
> >>  s->config.bypass = s->boot_bypass;
> >>  s->config.page_size_mask = qemu_real_host_page_mask();
> >> -s->config.input_range.end = UINT64_MAX;
> >> +if (s->aw_bits < 32 || s->aw_bits > 64) {
> > I'm wondering if we should lower this to 16 bits, just to support all
> > possible host SMMU configurations (the smallest address space configurable
> > with T0SZ is 25-bit, or 16-bit with the STT extension).
> Is it a valid use case to assign host devices protected by
> virtio-iommu with a physical SMMU featuring Small Translation Table?

Probably not, I'm guessing STT is for tiny embedded implementations where
QEMU or even virtualization is not a use case. But because smaller mobile
platforms now implement SMMUv3, using smaller IOVA spaces and thus fewer
page tables can be beneficial. One use case I have in mind is Android with
pKVM: each app has its own VM, and devices can be partitioned into lots of
address spaces with PASID, so you can save a lot of memory and table-walk
time by shrinking those address spaces. But that particular case will use
crosvm so isn't relevant here, it's only an example.

Mainly I was concerned that if the Linux driver decides to allow
configuring smaller address spaces (maybe a linux cmdline option), then
using an architectural limit here would be a safe bet that things can still
work. But we can always change it in a later version, or implement finer
controls (ideally the guest driver would configure the VA size in ATTACH).

> It leaves 64kB IOVA space only. Besides in the spec, it is written the
> min T0SZ can even be 12.
> 
> "The minimum valid value is 16 unless all of the following also hold, in
> which case the minimum permitted
> value is 12:
> – SMMUv3.1 or later is supported.
> – SMMU_IDR5.VAX indicates support for 52-bit Vas.
> – The corresponding CD.TGx selects a 64KB granule.
> "

Yes that's confusing because va_size = 64 - T0SZ, so T0SZ=12 actually
describes the largest address size, 52.
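
To make the inverse relationship concrete, here is a tiny sketch. The function name va_size_from_t0sz() is made up for this example, and the T0SZ values 39 and 48 are simply derived from the 25-bit and 16-bit sizes mentioned earlier in the thread.

```c
/* Virtual address size, in bits, described by a T0SZ field value. */
unsigned int va_size_from_t0sz(unsigned int t0sz)
{
    return 64 - t0sz;
}
```

So T0SZ=12 gives 52 bits and T0SZ=16 gives 48 bits, while the small address spaces correspond to large field values: T0SZ=39 for 25 bits, T0SZ=48 for 16 bits.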

> 
> At the moment I would prefer to stick to the limit suggested by Alex
> which looks also sensible for other archs whereas 16 doesn't.

Agreed, it should be sufficient.

Thanks,
Jean

Re: [PATCH v2 1/3] virtio-iommu: Add an option to define the input range width

2024-02-05 Thread Jean-Philippe Brucker

Hi Eric,

On Thu, Feb 01, 2024 at 05:32:22PM +0100, Eric Auger wrote:
> aw-bits is a new option that allows to set the bit width of
> the input address range. This value will be used as a default for
> the device config input_range.end. By default it is set to 64 bits
> which is the current value.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v1 -> v2:
> - Check the aw-bits value is within [32,64]
> ---
>  include/hw/virtio/virtio-iommu.h | 1 +
>  hw/virtio/virtio-iommu.c | 7 ++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/hw/virtio/virtio-iommu.h b/include/hw/virtio/virtio-iommu.h
> index 781ebaea8f..5fbe4677c2 100644
> --- a/include/hw/virtio/virtio-iommu.h
> +++ b/include/hw/virtio/virtio-iommu.h
> @@ -66,6 +66,7 @@ struct VirtIOIOMMU {
>  bool boot_bypass;
>  Notifier machine_done;
>  bool granule_frozen;
> +uint8_t aw_bits;
>  };
>  
>  #endif
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index ec2ba11d1d..7870bdbeee 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -1314,7 +1314,11 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
>   */
>  s->config.bypass = s->boot_bypass;
>  s->config.page_size_mask = qemu_real_host_page_mask();
> -s->config.input_range.end = UINT64_MAX;
> +if (s->aw_bits < 32 || s->aw_bits > 64) {

I'm wondering if we should lower this to 16 bits, just to support all
possible host SMMU configurations (the smallest address space configurable
with T0SZ is 25-bit, or 16-bit with the STT extension).

Thanks,
Jean

> +error_setg(errp, "aw-bits must be within [32,64]");
> +}
> +s->config.input_range.end =
> +s->aw_bits == 64 ? UINT64_MAX : BIT_ULL(s->aw_bits) - 1;
>  s->config.domain_range.end = UINT32_MAX;
>  s->config.probe_size = VIOMMU_PROBE_SIZE;
>  
> @@ -1525,6 +1529,7 @@ static Property virtio_iommu_properties[] = {
>  DEFINE_PROP_LINK("primary-bus", VirtIOIOMMU, primary_bus,
>   TYPE_PCI_BUS, PCIBus *),
>  DEFINE_PROP_BOOL("boot-bypass", VirtIOIOMMU, boot_bypass, true),
> +DEFINE_PROP_UINT8("aw-bits", VirtIOIOMMU, aw_bits, 64),
>  DEFINE_PROP_END_OF_LIST(),
>  };
>  
> -- 
> 2.41.0
>
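
For reference, the mapping from aw-bits to input_range.end in the quoted hunk can be sketched as a standalone function. This is an illustration only: aw_bits_to_input_range_end() is a hypothetical name, and unlike the hunk it reports an out-of-range width by returning false before computing anything.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the last valid input address for a given address width.
 * Returns false if aw_bits is outside [32,64], leaving *end untouched.
 */
bool aw_bits_to_input_range_end(unsigned int aw_bits, uint64_t *end)
{
    if (aw_bits < 32 || aw_bits > 64) {
        return false;
    }
    /* Shifting a 64-bit value by 64 is undefined, hence the special case. */
    *end = aw_bits == 64 ? UINT64_MAX : (UINT64_C(1) << aw_bits) - 1;
    return true;
}
```

With aw_bits=39 the range ends at 0x7fffffffff. With aw_bits=64 it ends at UINT64_MAX, the previous default.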

Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices

2024-01-30 Thread Jean-Philippe Brucker

On Mon, Jan 29, 2024 at 05:38:55PM +0100, Eric Auger wrote:
> > There may be a separate argument for clearing bypass. With a coldplugged
> > VFIO device the flow is:
> >
> > 1. Map the whole guest address space in VFIO to implement boot-bypass.
> >This allocates all guest pages, which takes a while and is wasteful.
> >I've actually crashed a host that way, when spawning a guest with too
> >much RAM.
> interesting
> > 2. Start the VM
> > 3. When the virtio-iommu driver attaches a (non-identity) domain to the
> >assigned endpoint, then unmap the whole address space in VFIO, and most
> >pages are given back to the host.
> >
> > We can't disable boot-bypass because the BIOS needs it. But instead the
> > flow could be:
> >
> > 1. Start the VM, with only the virtual endpoints. Nothing to pin.
> > 2. The virtio-iommu driver disables bypass during boot
> We needed this boot-bypass mode for booting with virtio-blk-scsi
> protected with virtio-iommu for instance.
> That was needed because we don't have any virtio-iommu driver in edk2 as
> opposed to intel iommu driver, right?

Yes. What I had in mind is the x86 SeaBIOS which doesn't have any IOMMU
driver and accesses the default SATA device:

 $ qemu-system-x86_64 -M q35 -device virtio-iommu,boot-bypass=off
 qemu: virtio_iommu_translate sid=250 is not known!!
 qemu: no buffer available in event queue to report event
 qemu: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address

But it's the same problem with edk2. Also a guest OS without a
virtio-iommu driver needs boot-bypass. Once firmware boot is complete, the
OS with a virtio-iommu driver normally can turn bypass off in the config
space, it's not useful anymore. If it needs to put some endpoints in
bypass, then it can attach them to a bypass domain.

> > 3. Hotplug the VFIO device. With bypass disabled there is no need to pin
> >the whole guest address space, unless the guest explicitly asks for an
> >identity domain.
> >
> > However, I don't know if this is a realistic scenario that will actually
> > be used.
> >
> > By the way, do you have an easy way to reproduce the issue described here?
> > I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
> > just allocates 32-bit IOVAs.
> I don't have a simple generic reproducer. It happens when assigning this
> device:
> Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)
> 
> I have not encountered that issue with another device yet.
> I see on guest side in dmesg:
> [    6.849292] ice :00:05.0: Using 64-bit DMA addresses
> 
> That's emitted in dma-iommu.c iommu_dma_alloc_iova().
> Looks like the guest first tries to allocate an iova in the 32-bit AS
> and if this fails use the whole dma_limit.
> Seems the 32b IOVA alloc failed here ;-)

Interesting, are you running some demanding workload and a lot of CPUs?
That's a lot of IOVAs used up, I'm curious about what kind of DMA pattern
does that.

Thanks,
Jean

Re: [PATCH 0/3] VIRTIO-IOMMU: Introduce an aw-bits option

2024-01-29 Thread Jean-Philippe Brucker

On Mon, Jan 29, 2024 at 03:07:41PM +0100, Eric Auger wrote:
> Hi Jean-Philippe,
> 
> On 1/29/24 13:23, Jean-Philippe Brucker wrote:
> > Hi Eric,
> >
> > On Tue, Jan 23, 2024 at 07:15:54PM +0100, Eric Auger wrote:
> >> In [1] and [2] we attempted to fix a case where a VFIO-PCI device
> >> protected with a virtio-iommu is assigned to an x86 guest. On x86
> >> the physical IOMMU may have an address width (gaw) of 39 or 48 bits
> >> whereas the virtio-iommu exposes a 64b input address space by default.
> >> Hence the guest may try to use the full 64b space and DMA MAP
> >> failures may be encountered. To work around this issue we endeavoured
> >> to pass usable host IOVA regions (excluding the out of range space) from
> >> VFIO to the virtio-iommu device so that the virtio-iommu driver can
> >> query those latter during the probe request and let the guest iommu
> >> kernel subsystem carve them out.
> >>
> >> However if there are several devices in the same iommu group,
> >> only the reserved regions of the first one are taken into
> >> account by the iommu subsystem of the guest. This generally
> >> works on baremetal because devices are not going to
> >> expose different reserved regions. However in our case, this
> >> may prevent from taking into account the host iommu geometry.
> >>
> >> So the simplest solution to this problem looks to introduce an
> >> input address width option, aw-bits, which matches what is
> >> done on the intel-iommu. By default, from now on it is set
> >> to 39 bits with pc_q35 and 64b with arm virt.
> > Doesn't Arm have the same problem?  The TTB0 page tables limit what can be
> > mapped to 48-bit, or 52-bit when SMMU_IDR5.VAX==1 and granule is 64kB.
> > A Linux host driver could configure smaller VA sizes:
> > * SMMUv2 limits the VA to SMMU_IDR2.UBS (upstream bus size) which
> >   can go as low as 32-bit (I'm assuming we don't care about 32-bit hosts).
> Yes I think we can ignore that use case.
> > * SMMUv3 currently limits the VA to CONFIG_ARM64_VA_BITS, which
> >   could be as low as 36 bits (but realistically 39, since 36 depends on
> >   16kB pages and CONFIG_EXPERT).
> Further reading "3.4.1 Input address size and Virtual Address size", it looks
> like SMMU_IDR5.VAX indeed gives info on the physical SMMU actual
> implementation max (which matches intel iommu gaw). I missed that. Now I
> am confused: should we limit VAS to 39 to accommodate the worst
> case host SW configuration or shall we use 48 instead?

I don't know what's best either. 48 should be fine if hosts normally
enable VA_BITS_48 (I see debian has it [1], not sure how to find the
others).

[1] 
https://salsa.debian.org/kernel-team/linux/-/blob/master/debian/config/arm64/config?ref_type=heads#L18

> If we set such a low 39b value, won't it prevent some guests from
> properly working?

It's not that low, since it gives each endpoint a private 512GB address
space, but yes there might be special cases that reach the limit. Maybe
assign a multi-queue NIC to a 256-vCPU guest, and if you want per-vCPU DMA
pools, then with a 39-bit address space you only get 2GB per vCPU. With
48-bit you get 1TB which should be plenty.
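
The numbers above come from a simple division, sketched here. The helper name iova_bytes_per_vcpu() is made up for illustration and assumes aw_bits is below 64.

```c
#include <stdint.h>

/* Share of a 2^aw_bits address space available to each vCPU, in bytes. */
uint64_t iova_bytes_per_vcpu(unsigned int aw_bits, unsigned int nr_vcpus)
{
    return (UINT64_C(1) << aw_bits) / nr_vcpus;
}
```

A 39-bit space is 512GB, so 256 vCPUs get 2GB each. A 48-bit space is 256TB, so they get 1TB each.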

52-bit private IOVA space doesn't seem useful, I doubt we'll ever need to
support that on the MAP/UNMAP interface.

So I guess 48-bit can be the default, and users with special setups can
override aw-bits.

Thanks,
Jean

Re: [PATCH 0/3] VIRTIO-IOMMU: Introduce an aw-bits option

2024-01-29 Thread Jean-Philippe Brucker

Hi Eric,

On Tue, Jan 23, 2024 at 07:15:54PM +0100, Eric Auger wrote:
> In [1] and [2] we attempted to fix a case where a VFIO-PCI device
> protected with a virtio-iommu is assigned to an x86 guest. On x86
> the physical IOMMU may have an address width (gaw) of 39 or 48 bits
> whereas the virtio-iommu exposes a 64b input address space by default.
> Hence the guest may try to use the full 64b space and DMA MAP
> failures may be encountered. To work around this issue we endeavoured
> to pass usable host IOVA regions (excluding the out of range space) from
> VFIO to the virtio-iommu device so that the virtio-iommu driver can
> query those latter during the probe request and let the guest iommu
> kernel subsystem carve them out.
> 
> However if there are several devices in the same iommu group,
> only the reserved regions of the first one are taken into
> account by the iommu subsystem of the guest. This generally
> works on baremetal because devices are not going to
> expose different reserved regions. However in our case, this
> may prevent from taking into account the host iommu geometry.
> 
> So the simplest solution to this problem looks to introduce an
> input address width option, aw-bits, which matches what is
> done on the intel-iommu. By default, from now on it is set
> to 39 bits with pc_q35 and 64b with arm virt.

Doesn't Arm have the same problem?  The TTB0 page tables limit what can be
mapped to 48-bit, or 52-bit when SMMU_IDR5.VAX==1 and granule is 64kB.
A Linux host driver could configure smaller VA sizes:
* SMMUv2 limits the VA to SMMU_IDR2.UBS (upstream bus size) which
  can go as low as 32-bit (I'm assuming we don't care about 32-bit hosts).
* SMMUv3 currently limits the VA to CONFIG_ARM64_VA_BITS, which
  could be as low as 36 bits (but realistically 39, since 36 depends on
  16kB pages and CONFIG_EXPERT).

But 64-bit definitely can't work for VFIO, and I suppose isn't useful for
virtual devices, so maybe 39 is also a reasonable default on Arm.

Thanks,
Jean

> This replaces the
> previous default value of 64b. So we need to introduce a compat
> for pc_q35 machines older than 9.0 to behave similarly.

Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices

2024-01-25 Thread Jean-Philippe Brucker

Hi,

On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote:
> Hi Zhenzhong,
> On 1/18/24 08:10, Duan, Zhenzhong wrote:
> > Hi Eric,
> >
> >> -Original Message-
> >> From: Eric Auger 
> >> Cc: m...@redhat.com; c...@redhat.com
> >> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
> >> for hotplugged devices
> >>
> >> In [1] we attempted to fix a case where a VFIO-PCI device protected
> >> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
> >> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
> >> virtio-iommu used to expose a 64b address space by default.
> >> Hence the guest was trying to use the full 64b space and we hit
> >> DMA MAP failures. To work around this issue we managed to pass
> >> usable IOVA regions (excluding the out of range space) from VFIO
> >> to the virtio-iommu device. This was made feasible by introducing
> >> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
> >> This latter gets called when the IOMMU MR is enabled which
> >> causes the vfio_listener_region_add() to be called.
> >>
> >> However with VFIO-PCI hotplug, this technique fails due to the
> >> race between the call to the callback in the add memory listener
> >> and the virtio-iommu probe request. Indeed the probe request gets
> >> called before the attach to the domain. So in that case the usable
> >> regions are communicated after the probe request and fail to be
> >> conveyed to the guest. To be honest the problem was hinted by
> >> Jean-Philippe in [1] and I should have been more careful at
> >> listening to him and testing with hotplug :-(
> > It looks the global virtio_iommu_config.bypass is never cleared in guest.
> > When guest virtio_iommu driver enable IOMMU, should it clear this
> > bypass attribute?
> > If it could be cleared in viommu_probe(), then qemu will call
> > virtio_iommu_set_config() then virtio_iommu_switch_address_space_all()
> > to enable IOMMU MR. Then both coldplugged and hotplugged devices will work.
> 
> this field is iommu wide while the probe applies on a one device.In
> general I would prefer not to be dependent on the MR enablement. We know
> that the device is likely to be protected and we can collect its
> requirements beforehand.
> 
> >
> > Intel iommu has a similar bit in register GCMD_REG.TE, when guest
> > intel_iommu driver probe set it, on qemu side, 
> > vtd_address_space_refresh_all()
> > is called to enable IOMMU MRs.
> interesting.
> 
> Would be curious to get Jean Philippe's pov.

I'd rather not rely on this, it's hard to justify a driver change based
only on QEMU internals. And QEMU can't count on the driver always clearing
bypass. There could be situations where the guest can't afford to do it,
like if an endpoint is owned by the firmware and has to keep running.

There may be a separate argument for clearing bypass. With a coldplugged
VFIO device the flow is:

1. Map the whole guest address space in VFIO to implement boot-bypass.
   This allocates all guest pages, which takes a while and is wasteful.
   I've actually crashed a host that way, when spawning a guest with too
   much RAM.
2. Start the VM
3. When the virtio-iommu driver attaches a (non-identity) domain to the
   assigned endpoint, then unmap the whole address space in VFIO, and most
   pages are given back to the host.

We can't disable boot-bypass because the BIOS needs it. But instead the
flow could be:

1. Start the VM, with only the virtual endpoints. Nothing to pin.
2. The virtio-iommu driver disables bypass during boot
3. Hotplug the VFIO device. With bypass disabled there is no need to pin
   the whole guest address space, unless the guest explicitly asks for an
   identity domain.

However, I don't know if this is a realistic scenario that will actually
be used.

By the way, do you have an easy way to reproduce the issue described here?
I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux
just allocates 32-bit IOVAs.

> >
> >> For coldplugged device the technique works because we make sure all
> >> the IOMMU MR are enabled once on the machine init done: 94df5b2180
> >> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
> >> for granule freeze. But I would be keen to get rid of this trick.
> >>
> >> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU
> >> MR to have been enabled and the corresponding vfio_listener_region_add()
> >> to be executed. Instead this series proposes to replace the usage of this
> >> API by the recently introduced PCIIOMMUOps: ba7d12eb8c  ("hw/pci:
> >> modify
> >> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
> >> called earlier, once the usable IOVA regions have been collected by
> >> VFIO, without the need for the IOMMU MR to be enabled.
> >>
> >> This looks cleaner. In the short term this may also be used for
> >> passing the page size mask, which would allow to get rid of the
> >> hacky transient IOMMU MR

Re: [PATCH] virtio-iommu: Use qemu_real_host_page_mask as default page_size_mask

2024-01-16 Thread Jean-Philippe Brucker

On Thu, Dec 21, 2023 at 08:45:05AM -0500, Eric Auger wrote:
> We used to set default page_size_mask to qemu_target_page_mask() but
> with VFIO assignment it makes more sense to use the actual host page mask
> instead.
> 
> So from now on qemu_real_host_page_mask() will be used as a default.
> To be able to migrate older code, we increase the vmstat version_id
> to 3 and if an older incoming v2 stream is detected we set the previous
> default value.
> 
> The new default is well adapted to configs where host and guest have
> the same page size. This allows to fix hotplugging VFIO devices on a
> 64kB guest and a 64kB host. This test case has been failing before
> and even crashing qemu with hw_error("vfio: DMA mapping failed,
> unable to continue") in VFIO common). Indeed the hot-attached VFIO
> device would call memory_region_iommu_set_page_size_mask with 64kB
> mask whereas after the granule was frozen to 4kB on machine init done.

I guess TARGET_PAGE_MASK is always 4kB on arm64 CPUs, since it's the
smallest supported and the guest configures its page size at runtime.
Even if QEMU's software IOMMU can deal with any page size, VFIO can't so
passing the host page size seems more accurate than forcing a value of
4kB.

> Now this works. However the new default will prevent 4kB guest on
> 64kB host because the granule will be set to 64kB which would be
> larger than the guest page size. In that situation, the virtio-iommu
> driver fails the viommu_domain_finalise() with
> "granule 0x1 larger than system page zie 0x1000".

"size"
(it could matter if someone searches for this message later)

> 
> The current limitation of global granule in the virtio-iommu
> should be removed and turned into per domain granule. But
> until we get this upgraded, this new default is probably
> better because I don't think anyone is currently interested in
> running a 4kB page size guest with virtio-iommu on a 64kB host.
> However supporting 64kB guest on 64kB host with virtio-iommu and
> VFIO looks a more important feature.
> 
> Signed-off-by: Eric Auger 

So to summarize the configurations that work for hotplug (tested with QEMU
system emulation with SMMU + QEMU VMM with virtio-iommu):

 Host | Guest | virtio-net | IGB passthrough
  4k  | 4k    | Y          | Y
  64k | 64k   | Y          | N -> Y (fixed by this patch)
  64k | 4k    | Y -> N     | N
  4k  | 64k   | Y          | Y

The change is a reasonable trade-off in my opinion. It fixes the more common
64k on 64k case, and for 4k on 64k, the error is now contained to the
guest and made clear ("granule 0x10000 larger than system page size
0x1000") instead of crashing the VMM. A guest OS now discovers that the
host needs DMA buffers aligned on 64k and could actually support this case
(but Linux won't because it can't control the origin of all DMA buffers).
Later, support for page tables will enable 4k on 64k for all devices.
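
To make the table concrete, here is a minimal sketch of the guest-side
check behind the error message above, assuming the granule is the lowest
set bit of the advertised page size mask and that the guest rejects a
granule larger than its own page size. The function names are hypothetical
and are not taken from QEMU or Linux.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Smallest page size advertised in a page-size bitmap (its lowest set bit). */
static uint64_t granule_from_mask(uint64_t page_size_mask)
{
    assert(page_size_mask != 0);
    return page_size_mask & (~page_size_mask + 1);
}

/* Assumed rule: the granule must fit within one guest page. */
static bool guest_can_use_granule(uint64_t page_size_mask,
                                  uint64_t guest_page_size)
{
    return granule_from_mask(page_size_mask) <= guest_page_size;
}
```

With a 64k host mask (~0xffff) the granule is 0x10000, so a 4k guest is
rejected while a 64k guest is accepted. With a 4k host mask a 64k guest is
accepted as well.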

Tested-by: Jean-Philippe Brucker 
Reviewed-by: Jean-Philippe Brucker 

> ---
>  hw/virtio/virtio-iommu.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 9d463efc52..b77e3644ea 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -1313,7 +1313,7 @@ static void virtio_iommu_device_realize(DeviceState 
> *dev, Error **errp)
>   * in vfio realize
>   */
>  s->config.bypass = s->boot_bypass;
> -s->config.page_size_mask = qemu_target_page_mask();
> +s->config.page_size_mask = qemu_real_host_page_mask();
>  s->config.input_range.end = UINT64_MAX;
>  s->config.domain_range.end = UINT32_MAX;
>  s->config.probe_size = VIOMMU_PROBE_SIZE;
> @@ -1491,13 +1491,16 @@ static int iommu_post_load(void *opaque, int 
> version_id)
>   * still correct.
>   */
>  virtio_iommu_switch_address_space_all(s);
> +if (version_id <= 2) {
> +s->config.page_size_mask = qemu_target_page_mask();
> +}
>  return 0;
>  }
>  
>  static const VMStateDescription vmstate_virtio_iommu_device = {
>  .name = "virtio-iommu-device",
>  .minimum_version_id = 2,
> -.version_id = 2,
> +.version_id = 3,
>  .post_load = iommu_post_load,
>  .fields = (VMStateField[]) {
>  VMSTATE_GTREE_DIRECT_KEY_V(domains, VirtIOIOMMU, 2,
> -- 
> 2.27.0
>

[PATCH] target/arm/helper: Propagate MDCR_EL2.HPMN into PMCR_EL0.N

2023-12-15 Thread Jean-Philippe Brucker

MDCR_EL2.HPMN allows a hypervisor to limit the number of PMU counters
available to EL1 and EL0 (to keep the others to itself). QEMU already
implements this split correctly, except for PMCR_EL0.N reads: the number
of counters read by EL1 or EL0 should be the one configured in
MDCR_EL2.HPMN.

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/helper.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index ff1970981e..bec293bc93 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -1475,6 +1475,22 @@ static void pmcr_write(CPUARMState *env, const ARMCPRegInfo *ri,
 pmu_op_finish(env);
 }
 
+static uint64_t pmcr_read(CPUARMState *env, const ARMCPRegInfo *ri)
+{
+uint64_t pmcr = env->cp15.c9_pmcr;
+
+/*
+ * If EL2 is implemented and enabled for the current security state, reads
+ * of PMCR.N from EL1 or EL0 return the value of MDCR_EL2.HPMN or HDCR.HPMN.
+ */
+if (arm_current_el(env) <= 1 && arm_is_el2_enabled(env)) {
+pmcr &= ~PMCRN_MASK;
+pmcr |= (env->cp15.mdcr_el2 & MDCR_HPMN) << PMCRN_SHIFT;
+}
+
+return pmcr;
+}
+
 static void pmswinc_write(CPUARMState *env, const ARMCPRegInfo *ri,
   uint64_t value)
 {
@@ -7137,8 +7153,9 @@ static void define_pmu_regs(ARMCPU *cpu)
 .fgt = FGT_PMCR_EL0,
 .type = ARM_CP_IO | ARM_CP_ALIAS,
 .fieldoffset = offsetoflow32(CPUARMState, cp15.c9_pmcr),
-.accessfn = pmreg_access, .writefn = pmcr_write,
-.raw_writefn = raw_write,
+.accessfn = pmreg_access,
+.readfn = pmcr_read, .raw_readfn = raw_read,
+.writefn = pmcr_write, .raw_writefn = raw_write,
 };
 ARMCPRegInfo pmcr64 = {
 .name = "PMCR_EL0", .state = ARM_CP_STATE_AA64,
@@ -7148,6 +7165,7 @@ static void define_pmu_regs(ARMCPU *cpu)
 .type = ARM_CP_IO,
 .fieldoffset = offsetof(CPUARMState, cp15.c9_pmcr),
 .resetvalue = cpu->isar.reset_pmcr_el0,
+.readfn = pmcr_read, .raw_readfn = raw_read,
 .writefn = pmcr_write, .raw_writefn = raw_write,
 };
 
-- 
2.43.0
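
As a standalone illustration of the PMCR_EL0.N substitution in the patch
above, here is a minimal sketch. It assumes the architectural field layout
(PMCR.N in bits [15:11], MDCR_EL2.HPMN in bits [4:0]). The macro and
function names are hypothetical and only mirror the logic of pmcr_read().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout: PMCR.N is bits [15:11], MDCR_EL2.HPMN is bits [4:0]. */
#define PMCR_N_SHIFT    11
#define PMCR_N_MASK     (UINT64_C(0x1f) << PMCR_N_SHIFT)
#define MDCR_HPMN_MASK  UINT64_C(0x1f)

/* Value of PMCR as seen by a read from the given exception level. */
static uint64_t pmcr_read_view(uint64_t pmcr, uint64_t mdcr_el2,
                               int current_el, bool el2_enabled)
{
    if (current_el <= 1 && el2_enabled) {
        /* EL1 and EL0 only see the counters that EL2 left to them. */
        pmcr &= ~PMCR_N_MASK;
        pmcr |= (mdcr_el2 & MDCR_HPMN_MASK) << PMCR_N_SHIFT;
    }
    return pmcr;
}
```

For example, with six counters implemented and HPMN set to 4, a read from
EL1 reports N = 4 while a read from EL2 still reports N = 6.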

Re: [PATCH] hw/arm/virt: fix GIC maintenance IRQ registration

2023-11-10 Thread Jean-Philippe Brucker

On Fri, Nov 10, 2023 at 10:19:30AM +0000, Peter Maydell wrote:
> On Fri, 10 Nov 2023 at 09:07, Jean-Philippe Brucker
>  wrote:
> >
> > Since commit 9036e917f8 ("{include/}hw/arm: refactor virt PPI logic"),
> > GIC maintenance IRQ registration fails on arm64:
> >
> > [0.979743] kvm [1]: Cannot register interrupt 9
> >
> > That commit re-defined VIRTUAL_PMU_IRQ to be an INTID but missed a case
> > where the maintenance IRQ is actually referred to by its PPI index. Just
> > like commit fa68ecb330db ("hw/arm/virt: fix PMU IRQ registration"), use
> > INTID_TO_PPI(). A search of "GIC_FDT_IRQ_TYPE_PPI" indicates that there
> > shouldn't be more similar issues.
> >
> > Fixes: 9036e917f8 ("{include/}hw/arm: refactor virt PPI logic")
> > Signed-off-by: Jean-Philippe Brucker 
> 
> Isn't this already fixed by commit fa68ecb330dbd ?

No, that commit fixed the PMU interrupt (I copied most of its commit
message and referenced it), but the GIC maintenance interrupt still needed
to be fixed.

Thanks,
Jean

[PATCH] hw/arm/virt: fix GIC maintenance IRQ registration

2023-11-10 Thread Jean-Philippe Brucker

Since commit 9036e917f8 ("{include/}hw/arm: refactor virt PPI logic"),
GIC maintenance IRQ registration fails on arm64:

[0.979743] kvm [1]: Cannot register interrupt 9

That commit re-defined VIRTUAL_PMU_IRQ to be an INTID but missed a case
where the maintenance IRQ is actually referred to by its PPI index. Just
like commit fa68ecb330db ("hw/arm/virt: fix PMU IRQ registration"), use
INTID_TO_PPI(). A search of "GIC_FDT_IRQ_TYPE_PPI" indicates that there
shouldn't be more similar issues.

Fixes: 9036e917f8 ("{include/}hw/arm: refactor virt PPI logic")
Signed-off-by: Jean-Philippe Brucker 
---
 hw/arm/virt.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 783d71a1b3..f5e685b060 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -591,7 +591,8 @@ static void fdt_add_gic_node(VirtMachineState *vms)
 
 if (vms->virt) {
 qemu_fdt_setprop_cells(ms->fdt, nodename, "interrupts",
-   GIC_FDT_IRQ_TYPE_PPI, ARCH_GIC_MAINT_IRQ,
+   GIC_FDT_IRQ_TYPE_PPI,
+   INTID_TO_PPI(ARCH_GIC_MAINT_IRQ),
GIC_FDT_IRQ_FLAGS_LEVEL_HI);
 }
 } else {
@@ -615,7 +616,8 @@ static void fdt_add_gic_node(VirtMachineState *vms)
  2, vms->memmap[VIRT_GIC_VCPU].base,
  2, vms->memmap[VIRT_GIC_VCPU].size);
 qemu_fdt_setprop_cells(ms->fdt, nodename, "interrupts",
-   GIC_FDT_IRQ_TYPE_PPI, ARCH_GIC_MAINT_IRQ,
+   GIC_FDT_IRQ_TYPE_PPI,
+   INTID_TO_PPI(ARCH_GIC_MAINT_IRQ),
GIC_FDT_IRQ_FLAGS_LEVEL_HI);
 }
 }
-- 
2.42.0
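
For readers unfamiliar with the INTID/PPI distinction in the patch above,
here is a minimal sketch of the conversion, assuming the GIC numbering
where PPIs occupy INTIDs 16 to 31 and the device tree "interrupts" cell
takes the zero-based PPI index. The names are hypothetical, and the INTID
value 25 in the note below is an illustrative assumption for the
maintenance interrupt.

```c
#include <assert.h>

/* GIC numbering assumed here: INTIDs 16..31 are the per-CPU PPIs. */
#define PPI_BASE_INTID 16
#define NUM_PPIS       16

/* Convert a GIC INTID to the zero-based PPI index used in the FDT. */
static int intid_to_ppi(int intid)
{
    assert(intid >= PPI_BASE_INTID && intid < PPI_BASE_INTID + NUM_PPIS);
    return intid - PPI_BASE_INTID;
}
```

Under that assumption, INTID 25 maps to PPI index 9. Passing the raw INTID
in the cell would describe a different interrupt than intended.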

Re: [PATCH v2 09/12] util/reserved-region: Add new ReservedRegion helpers

2023-09-29 Thread Jean-Philippe Brucker

list;
> +}
> +} else if (range_contains_range(range_iter, r)) {
> +/* new region is included in the current region */
> +if (range_lob(range_iter) == range_lob(r)) {
> +/* adjacent on the left side, derives into 2 regions */
> +range_set_bounds(range_iter, range_upb(r) + 1,
> + range_upb(range_iter));
> +return g_list_insert_before(list, l, reg);
> +} else if (range_upb(range_iter) == range_upb(r)) {
> +/* adjacent on the right side, derives into 2 regions */
> +range_set_bounds(range_iter, range_lob(range_iter),
> + range_lob(r) - 1);
> +l = l->next;
> +} else {
> +uint64_t lob = range_lob(range_iter);
> +/*
> + * the new range is in the middle of an existing one,
> + * split this latter into 3 regs instead
> + */
> +range_set_bounds(range_iter, range_upb(r) + 1,
> + range_upb(range_iter));
> +new_reg = g_new0(ReservedRegion, 1);
> +new_reg->type = resv_iter->type;
> +range_set_bounds(&new_reg->range,
> + lob, range_lob(r) - 1);
> +list = g_list_insert_before(list, l, new_reg);
> +return g_list_insert_before(list, l, reg);
> +}
> +} else if (range_lob(r) < range_lob(range_iter)) {
> +    range_set_bounds(range_iter, range_upb(r) + 1,
> + range_upb(range_iter));
> +return g_list_insert_before(list, l, reg);
> +} else { /* intersection on the upper range */
> +range_set_bounds(range_iter, range_lob(range_iter),
> + range_lob(r) - 1);
> +l = l->next;
> +}
> +} /* overlap */
> +}
> +return g_list_append(list, reg);

Looks correct overall

Reviewed-by: Jean-Philippe Brucker 

> +}
> +
> diff --git a/util/meson.build b/util/meson.build
> index c4827fd70a..eb677b40c2 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -51,6 +51,7 @@ util_ss.add(files('qdist.c'))
>  util_ss.add(files('qht.c'))
>  util_ss.add(files('qsp.c'))
>  util_ss.add(files('range.c'))
> +util_ss.add(files('reserved-region.c'))
>  util_ss.add(files('stats64.c'))
>  util_ss.add(files('systemd.c'))
>  util_ss.add(files('transactions.c'))
> -- 
> 2.41.0
>
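
The splitting cases in the quoted helper (new region adjacent on the left,
adjacent on the right, or in the middle of an existing one) can be sketched
as a single carve operation on inclusive ranges. This is a simplified
illustration with hypothetical names. It does not use the QEMU Range API
or GList.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lob, upb; } Span;  /* inclusive bounds */

/*
 * Carve 'hole' out of 'outer'. The surviving pieces are written to
 * out[0..1] and their count (0, 1 or 2) is returned.
 * 'hole' must lie entirely within 'outer'.
 */
static int span_carve(Span outer, Span hole, Span out[2])
{
    int n = 0;

    assert(hole.lob <= hole.upb);
    assert(outer.lob <= hole.lob && hole.upb <= outer.upb);

    if (hole.lob > outer.lob) {
        /* Piece left below the hole */
        out[n++] = (Span){ outer.lob, hole.lob - 1 };
    }
    if (hole.upb < outer.upb) {
        /* Piece left above the hole */
        out[n++] = (Span){ hole.upb + 1, outer.upb };
    }
    return n;
}
```

A hole in the middle leaves two pieces, which together with the new region
gives the three regions mentioned in the quoted comment.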

Re: [PATCH v2 07/12] virtio-iommu: Implement set_iova_ranges() callback

2023-09-29 Thread Jean-Philippe Brucker

On Wed, Sep 13, 2023 at 10:01:42AM +0200, Eric Auger wrote:
> The implementation populates the array of per IOMMUDevice
> host reserved regions.
> 
> It is forbidden to have conflicting sets of host IOVA ranges
> to be applied onto the same IOMMU MR (implied by different
> host devices).
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v1 -> v2:
> - Forbid conflicting sets of host resv regions
> ---
>  include/hw/virtio/virtio-iommu.h |  2 ++
>  hw/virtio/virtio-iommu.c | 48 
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/hw/virtio/virtio-iommu.h 
> b/include/hw/virtio/virtio-iommu.h
> index 70b8ace34d..31b69c8261 100644
> --- a/include/hw/virtio/virtio-iommu.h
> +++ b/include/hw/virtio/virtio-iommu.h
> @@ -40,6 +40,8 @@ typedef struct IOMMUDevice {
>  MemoryRegion root;  /* The root container of the device */
>  MemoryRegion bypass_mr; /* The alias of shared memory MR */
>  GList *resv_regions;
> +Range *host_resv_regions;
> +uint32_t nr_host_resv_regions;
>  } IOMMUDevice;
>  
>  typedef struct IOMMUPciBus {
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index ea359b586a..ed2df5116f 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -20,6 +20,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/log.h"
>  #include "qemu/iov.h"
> +#include "qemu/range.h"
>  #include "exec/target_page.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/virtio/virtio.h"
> @@ -1158,6 +1159,52 @@ static int 
> virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>  return 0;
>  }
>  
> +static int virtio_iommu_set_iova_ranges(IOMMUMemoryRegion *mr,
> +uint32_t nr_ranges,
> +struct Range *iova_ranges,
> +Error **errp)
> +{
> +IOMMUDevice *sdev = container_of(mr, IOMMUDevice, iommu_mr);
> +uint32_t nr_host_resv_regions;
> +Range *host_resv_regions;
> +int ret = -EINVAL;
> +
> +if (!nr_ranges) {
> +return 0;
> +}
> +
> +if (sdev->host_resv_regions) {
> +range_inverse_array(nr_ranges, iova_ranges,
> +&nr_host_resv_regions, &host_resv_regions,
> +0, UINT64_MAX);
> +if (nr_host_resv_regions != sdev->nr_host_resv_regions) {
> +goto error;
> +}
> +for (int i = 0; i < nr_host_resv_regions; i++) {
> +Range *new = &host_resv_regions[i];
> +Range *existing = &sdev->host_resv_regions[i];
> +
> +if (!range_contains_range(existing, new)) {
> +goto error;
> +}
> +}
> +ret = 0;
> +goto out;
> +}
> +
> +range_inverse_array(nr_ranges, iova_ranges,
> +&sdev->nr_host_resv_regions, &sdev->host_resv_regions,
> +0, UINT64_MAX);

Can set_iova_ranges() only be called for the first time before the guest
has had a chance to issue a probe request?  Maybe we could add a
sanity-check that the guest hasn't issued a probe request yet, since we
can't notify about updated reserved regions.

I'm probably misremembering because I thought Linux set up IOMMU contexts
(including probe requests) before enabling DMA master in PCI, which causes
QEMU VFIO to issue these calls. I'll double check.

Thanks,
Jean

> +
> +return 0;
> +error:
> +error_setg(errp, "IOMMU mr=%s Conflicting host reserved regions set!",
> +   mr->parent_obj.name);
> +out:
> +g_free(host_resv_regions);
> +return ret;
> +}
> +
>  static void virtio_iommu_system_reset(void *opaque)
>  {
>  VirtIOIOMMU *s = opaque;
> @@ -1453,6 +1500,7 @@ static void 
> virtio_iommu_memory_region_class_init(ObjectClass *klass,
>  imrc->replay = virtio_iommu_replay;
>  imrc->notify_flag_changed = virtio_iommu_notify_flag_changed;
>  imrc->iommu_set_page_size_mask = virtio_iommu_set_page_size_mask;
> +imrc->iommu_set_iova_ranges = virtio_iommu_set_iova_ranges;
>  }
>  
>  static const TypeInfo virtio_iommu_info = {
> -- 
> 2.41.0
>
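
The quoted patch derives host reserved regions by inverting the usable
IOVA ranges within [0, UINT64_MAX]. Here is a minimal sketch of that
inversion, assuming the usable ranges are sorted and do not overlap. The
type and function names are hypothetical and this is not the
range_inverse_array() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t lob, upb; } Span;  /* inclusive bounds */

/*
 * Write the gaps between 'n' sorted, non-overlapping usable spans to 'out'
 * (which needs room for n + 1 entries) and return the number of gaps.
 */
static size_t span_invert(const Span *usable, size_t n, Span *out)
{
    size_t count = 0;
    uint64_t next = 0;   /* first address not yet accounted for */
    int reached_top = 0; /* set once a usable span ends at UINT64_MAX */

    for (size_t i = 0; i < n && !reached_top; i++) {
        assert(usable[i].lob >= next && usable[i].lob <= usable[i].upb);
        if (usable[i].lob > next) {
            out[count++] = (Span){ next, usable[i].lob - 1 };
        }
        if (usable[i].upb == UINT64_MAX) {
            reached_top = 1;  /* avoid overflowing upb + 1 */
        } else {
            next = usable[i].upb + 1;
        }
    }
    if (!reached_top) {
        out[count++] = (Span){ next, UINT64_MAX };
    }
    return count;
}
```

Two usable ranges with a hole between them and an unusable first page
produce two reserved gaps. An empty input reserves the whole address
space.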

Re: [PATCH v2 05/12] virtio-iommu: Introduce per IOMMUDevice reserved regions

2023-09-29 Thread Jean-Philippe Brucker

Hi Eric,

On Wed, Sep 13, 2023 at 10:01:40AM +0200, Eric Auger wrote:
> For the time being the per device reserved regions are
> just a duplicate of IOMMU wide reserved regions. Subsequent
> patches will combine those with host reserved regions, if any.
> 
> Signed-off-by: Eric Auger 
> ---
>  include/hw/virtio/virtio-iommu.h |  1 +
>  hw/virtio/virtio-iommu.c | 42 ++--
>  2 files changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/include/hw/virtio/virtio-iommu.h 
> b/include/hw/virtio/virtio-iommu.h
> index eea4564782..70b8ace34d 100644
> --- a/include/hw/virtio/virtio-iommu.h
> +++ b/include/hw/virtio/virtio-iommu.h
> @@ -39,6 +39,7 @@ typedef struct IOMMUDevice {
>  AddressSpace  as;
>  MemoryRegion root;  /* The root container of the device */
>  MemoryRegion bypass_mr; /* The alias of shared memory MR */
> +GList *resv_regions;
>  } IOMMUDevice;
>  
>  typedef struct IOMMUPciBus {
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 979cdb5648..ea359b586a 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -624,22 +624,48 @@ static int virtio_iommu_unmap(VirtIOIOMMU *s,
>  return ret;
>  }
>  
> +static int consolidate_resv_regions(IOMMUDevice *sdev)
> +{
> +VirtIOIOMMU *s = sdev->viommu;
> +int i;
> +
> +for (i = 0; i < s->nr_prop_resv_regions; i++) {
> +ReservedRegion *reg = g_new0(ReservedRegion, 1);
> +
> +*reg = s->prop_resv_regions[i];
> +sdev->resv_regions = g_list_append(sdev->resv_regions, reg);
> +}
> +return 0;
> +}
> +
>  static ssize_t virtio_iommu_fill_resv_mem_prop(VirtIOIOMMU *s, uint32_t ep,
> uint8_t *buf, size_t free)
>  {
>  struct virtio_iommu_probe_resv_mem prop = {};
>  size_t size = sizeof(prop), length = size - sizeof(prop.head), total;
> -int i;
> +IOMMUDevice *sdev;
> +GList *l;
> +int ret;
>  
> -total = size * s->nr_prop_resv_regions;
> +sdev = container_of(virtio_iommu_mr(s, ep), IOMMUDevice, iommu_mr);
> +if (!sdev) {
> +return -EINVAL;
> +}
>  
> +ret = consolidate_resv_regions(sdev);
> +if (ret) {
> +return ret;
> +}
> +
> +total = size * g_list_length(sdev->resv_regions);
>  if (total > free) {
>  return -ENOSPC;
>  }
>  
> -for (i = 0; i < s->nr_prop_resv_regions; i++) {
> -unsigned subtype = s->prop_resv_regions[i].type;
> -Range *range = &s->prop_resv_regions[i].range;
> +for (l = sdev->resv_regions; l; l = l->next) {
> +ReservedRegion *reg = l->data;
> +unsigned subtype = reg->type;
> +Range *range = &reg->range;
>  
>  assert(subtype == VIRTIO_IOMMU_RESV_MEM_T_RESERVED ||
> subtype == VIRTIO_IOMMU_RESV_MEM_T_MSI);
> @@ -857,7 +883,7 @@ static IOMMUTLBEntry 
> virtio_iommu_translate(IOMMUMemoryRegion *mr, hwaddr addr,
>  bool bypass_allowed;
>  int granule;
>  bool found;
> -int i;
> +GList *l;
>  
>  interval.low = addr;
>  interval.high = addr + 1;
> @@ -895,8 +921,8 @@ static IOMMUTLBEntry 
> virtio_iommu_translate(IOMMUMemoryRegion *mr, hwaddr addr,
>  goto unlock;
>  }
>  
> -for (i = 0; i < s->nr_prop_resv_regions; i++) {
> -ReservedRegion *reg = &s->prop_resv_regions[i];
> +for (l = sdev->resv_regions; l; l = l->next) {
> +ReservedRegion *reg = l->data;

This means translate() now only takes reserved regions into account after
the guest issues a probe request, which only happens if the guest actually
supports the probe feature. It may be better to build the list earlier
(like when creating the IOMMUDevice), and complete it in
set_iova_ranges(). I guess both could call consolidate(), which would
rebuild the whole list, for example.

Thanks,
Jean

>  
>  if (range_contains(&reg->range, addr)) {
>  switch (reg->type) {
> -- 
> 2.41.0
>

Re: [PATCH v3 0/6] target/arm: Fixes for RME

2023-08-10 Thread Jean-Philippe Brucker

On Thu, Aug 10, 2023 at 02:16:56PM +0100, Peter Maydell wrote:
> This didn't build for the linux-user targets. I squashed
> this into patch 6:
> 
> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> index 7df1f7600b1..d906d2b1caa 100644
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -2169,9 +2169,11 @@ static void arm_cpu_realizefn(DeviceState *dev,
> Error **errp)
>  set_feature(env, ARM_FEATURE_VBAR);
>  }
> 
> -if (cpu_isar_feature(aa64_rme, cpu)) {
> +#ifndef CONFIG_USER_ONLY
> +if (tcg_enabled() && cpu_isar_feature(aa64_rme, cpu)) {
>  arm_register_el_change_hook(cpu, &gt_rme_post_el_change, 0);
>  }
> +#endif
> 
>  register_cp_regs_for_features(cpu);
>  arm_cpu_register_gdb_regs_for_features(cpu);
> 
> With that, I've applied the series to target-arm-for-8.2.

Thank you, and sorry about the build error. I'll add linux-user to my tests.

Thanks,
Jean

[PATCH v3 5/6] target/arm/helper: Check SCR_EL3.{NSE, NS} encoding for AT instructions

2023-08-09 Thread Jean-Philippe Brucker

The AT instruction is UNDEFINED if the {NSE,NS} configuration is
invalid. Add a function to check this on all AT instructions that apply
to an EL lower than 3.

Suggested-by: Peter Maydell 
Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/helper.c | 38 +++---
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index fbb03c364b..dbfe9f2f5e 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3616,6 +3616,22 @@ static void ats1h_write(CPUARMState *env, const ARMCPRegInfo *ri,
 #endif /* CONFIG_TCG */
 }
 
+static CPAccessResult at_e012_access(CPUARMState *env, const ARMCPRegInfo *ri,
+ bool isread)
+{
+/*
+ * R_NYXTL: instruction is UNDEFINED if it applies to an Exception level
+ * lower than EL3 and the combination SCR_EL3.{NSE,NS} is reserved. This can
+ * only happen when executing at EL3 because that combination also causes an
+ * illegal exception return. We don't need to check FEAT_RME either, because
+ * scr_write() ensures that the NSE bit is not set otherwise.
+ */
+if ((env->cp15.scr_el3 & (SCR_NSE | SCR_NS)) == SCR_NSE) {
+return CP_ACCESS_TRAP;
+}
+return CP_ACCESS_OK;
+}
+
 static CPAccessResult at_s1e2_access(CPUARMState *env, const ARMCPRegInfo *ri,
  bool isread)
 {
@@ -3623,7 +3639,7 @@ static CPAccessResult at_s1e2_access(CPUARMState *env, const ARMCPRegInfo *ri,
 !(env->cp15.scr_el3 & (SCR_NS | SCR_EEL2))) {
 return CP_ACCESS_TRAP;
 }
-return CP_ACCESS_OK;
+return at_e012_access(env, ri, isread);
 }
 
 static void ats_write64(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -5505,38 +5521,38 @@ static const ARMCPRegInfo v8_cp_reginfo[] = {
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 0,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E1R,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S1E1W", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 1,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E1W,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S1E0R", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 2,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E0R,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S1E0W", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 3,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E0W,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S12E1R", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 4,
   .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S12E1W", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 5,
   .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S12E0R", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 6,
   .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S12E0W", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 7,
   .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 /* AT S1E2* are elsewhere as they UNDEF from EL3 if EL2 is not present */
 { .name = "AT_S1E3R", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 6, .crn = 7, .crm = 8, .opc2 = 0,
@@ -8078,12 +8094,12 @@ static const ARMCPRegInfo ats1e1_reginfo[] = {
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 9, .opc2 = 0,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E1RP,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S1E1WP", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 9, .opc2 = 1,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E1WP,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 };
 
 static const ARMCPRegInfo ats1cp_reginfo[] = {
-- 
2.41.0
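
The check in the patch above relies on the two-bit SCR_EL3.{NSE,NS}
encoding. Here is a minimal sketch of the full decode, assuming NS is bit 0
and NSE is bit 62 of SCR_EL3 and that {1,0} is the reserved combination.
The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCR_NS   (UINT64_C(1) << 0)
#define SCR_NSE  (UINT64_C(1) << 62)

/* Security state selected for Exception levels below EL3. */
typedef enum {
    SS_SECURE,     /* {NSE,NS} = {0,0} */
    SS_NONSECURE,  /* {NSE,NS} = {0,1} */
    SS_RESERVED,   /* {NSE,NS} = {1,0} */
    SS_REALM,      /* {NSE,NS} = {1,1} */
} LowerElSpace;

static LowerElSpace lower_el_space(uint64_t scr_el3)
{
    bool nse = scr_el3 & SCR_NSE;
    bool ns = scr_el3 & SCR_NS;

    if (nse) {
        return ns ? SS_REALM : SS_RESERVED;
    }
    return ns ? SS_NONSECURE : SS_SECURE;
}

/* AT instructions targeting a lower EL are UNDEFINED in the reserved case. */
static bool at_lower_el_is_undefined(uint64_t scr_el3)
{
    return lower_el_space(scr_el3) == SS_RESERVED;
}
```

Only the NSE-set, NS-clear combination makes the instruction UNDEFINED. The
other three encodings select a valid security state.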

[PATCH v3 2/6] target/arm/helper: Fix tlbmask and tlbbits for TLBI VAE2*

2023-08-09 Thread Jean-Philippe Brucker

When HCR_EL2.E2H is enabled, TLB entries are formed using the EL2&0
translation regime, instead of the EL2 translation regime. The TLB VAE2*
instructions invalidate the regime that corresponds to the current value
of HCR_EL2.E2H.

At the moment we only invalidate the EL2 translation regime. This causes
problems with RMM, which issues TLBI VAE2IS instructions with
HCR_EL2.E2H enabled. Update vae2_tlbmask() to take HCR_EL2.E2H into
account.

Add vae2_tlbbits() as well, since the top-byte-ignore configuration is
different between the EL2&0 and EL2 regime.

Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Peter Maydell 
---
 target/arm/helper.c | 50 -
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index 2959d27543..a4c2c1bde5 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -4663,6 +4663,21 @@ static int vae1_tlbmask(CPUARMState *env)
 return mask;
 }
 
+static int vae2_tlbmask(CPUARMState *env)
+{
+uint64_t hcr = arm_hcr_el2_eff(env);
+uint16_t mask;
+
+if (hcr & HCR_E2H) {
+mask = ARMMMUIdxBit_E20_2 |
+   ARMMMUIdxBit_E20_2_PAN |
+   ARMMMUIdxBit_E20_0;
+} else {
+mask = ARMMMUIdxBit_E2;
+}
+return mask;
+}
+
 /* Return 56 if TBI is enabled, 64 otherwise. */
 static int tlbbits_for_regime(CPUARMState *env, ARMMMUIdx mmu_idx,
   uint64_t addr)
@@ -4689,6 +4704,25 @@ static int vae1_tlbbits(CPUARMState *env, uint64_t addr)
 return tlbbits_for_regime(env, mmu_idx, addr);
 }
 
+static int vae2_tlbbits(CPUARMState *env, uint64_t addr)
+{
+uint64_t hcr = arm_hcr_el2_eff(env);
+ARMMMUIdx mmu_idx;
+
+/*
+ * Only the regime of the mmu_idx below is significant.
+ * Regime EL2&0 has two ranges with separate TBI configuration, while EL2
+ * only has one.
+ */
+if (hcr & HCR_E2H) {
+mmu_idx = ARMMMUIdx_E20_2;
+} else {
+mmu_idx = ARMMMUIdx_E2;
+}
+
+return tlbbits_for_regime(env, mmu_idx, addr);
+}
+
 static void tlbi_aa64_vmalle1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
   uint64_t value)
 {
@@ -4781,10 +4815,11 @@ static void tlbi_aa64_vae2_write(CPUARMState *env, const ARMCPRegInfo *ri,
  * flush-last-level-only.
  */
 CPUState *cs = env_cpu(env);
-int mask = e2_tlbmask(env);
+int mask = vae2_tlbmask(env);
 uint64_t pageaddr = sextract64(value << 12, 0, 56);
+int bits = vae2_tlbbits(env, pageaddr);
 
-tlb_flush_page_by_mmuidx(cs, pageaddr, mask);
+tlb_flush_page_bits_by_mmuidx(cs, pageaddr, mask, bits);
 }
 
 static void tlbi_aa64_vae3_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -4838,11 +4873,11 @@ static void tlbi_aa64_vae2is_write(CPUARMState *env, const ARMCPRegInfo *ri,
uint64_t value)
 {
 CPUState *cs = env_cpu(env);
+int mask = vae2_tlbmask(env);
 uint64_t pageaddr = sextract64(value << 12, 0, 56);
-int bits = tlbbits_for_regime(env, ARMMMUIdx_E2, pageaddr);
+int bits = vae2_tlbbits(env, pageaddr);
 
-tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr,
-  ARMMMUIdxBit_E2, bits);
+tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, mask, bits);
 }
 
 static void tlbi_aa64_vae3is_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -5014,11 +5049,6 @@ static void tlbi_aa64_rvae1is_write(CPUARMState *env,
 do_rvae_write(env, value, vae1_tlbmask(env), true);
 }
 
-static int vae2_tlbmask(CPUARMState *env)
-{
-return ARMMMUIdxBit_E2;
-}
-
 static void tlbi_aa64_rvae2_write(CPUARMState *env,
   const ARMCPRegInfo *ri,
   uint64_t value)
-- 
2.41.0

[PATCH v3 0/6] target/arm: Fixes for RME

2023-08-09 Thread Jean-Philippe Brucker

A few patches to fix RME support and allow booting a realm guest, based
on "[PATCH v2 00/15] target/arm/ptw: Cleanups and a few bugfixes"
https://lore.kernel.org/all/20230807141514.19075-1-peter.mayd...@linaro.org/

Since v2:

* Updated the comment in patch 5. I also removed the check for FEAT_RME,
  because as pointed out in "target/arm: Catch illegal-exception-return
  from EL3 with bad NSE/NS", the SCR_NSE bit can only be set with
  FEAT_RME enabled. Because of this additional change, I didn't add the
  Reviewed-by.

* Added an EL-change hook to patch 6, to update the timer IRQ
  when changing the security state. I was wondering whether the
  el_change function should filter security state changes, since we only
  need to update IRQ state when switching between Root and
  Secure/NonSecure. But with a small syscall benchmark exercising
  EL0-EL1 switch with FEAT_RME enabled, I couldn't see any difference
  with and without the el_change hook, so I kept it simple.

* Also added the .raw_write callback for CNTHCTL_EL2.

v2: 
https://lore.kernel.org/all/20230802170157.401491-1-jean-phili...@linaro.org/

Jean-Philippe Brucker (6):
  target/arm/ptw: Load stage-2 tables from realm physical space
  target/arm/helper: Fix tlbmask and tlbbits for TLBI VAE2*
  target/arm: Skip granule protection checks for AT instructions
  target/arm: Pass security space rather than flag for AT instructions
  target/arm/helper: Check SCR_EL3.{NSE,NS} encoding for AT instructions
  target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK

 target/arm/cpu.h|   4 +
 target/arm/internals.h  |  25 +++---
 target/arm/cpu.c|   4 +
 target/arm/helper.c | 184 ++--
 target/arm/ptw.c|  39 ++---
 target/arm/trace-events |   7 +-
 6 files changed, 188 insertions(+), 75 deletions(-)

-- 
2.41.0

[PATCH v3 6/6] target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK

2023-08-09 Thread Jean-Philippe Brucker

When FEAT_RME is implemented, these bits override the value of
CNT[VP]_CTL_EL0.IMASK in Realm and Root state. Move the IRQ state update
into a new gt_update_irq() function and test those bits every time we
recompute the IRQ state.

Since we're removing the IRQ state from some trace events, add a new
trace event for gt_update_irq().
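
As an illustration of the rule described above, here is a minimal
standalone sketch of the IRQ level computation. It assumes the timer
control register layout (IMASK in bit 1, ISTATUS in bit 2) and the
CNTHCTL_EL2 bit positions defined by this patch. The enum and function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CNT[VP]_CTL_EL0 bits assumed here: IMASK = bit 1, ISTATUS = bit 2. */
#define CTL_IMASK         (1u << 1)
#define CTL_ISTATUS       (1u << 2)
/* CNTHCTL_EL2 mask bits, at the positions defined by the patch. */
#define CNTHCTL_CNTVMASK  (UINT64_C(1) << 18)
#define CNTHCTL_CNTPMASK  (UINT64_C(1) << 19)

typedef enum { SPACE_SECURE, SPACE_NONSECURE, SPACE_ROOT, SPACE_REALM } Space;
typedef enum { TIMER_PHYS, TIMER_VIRT } Timer;

static bool timer_irq_level(uint32_t ctl, uint64_t cnthctl,
                            Space space, Timer timer)
{
    /* Baseline: ISTATUS && !IMASK */
    bool level = (ctl & CTL_ISTATUS) && !(ctl & CTL_IMASK);
    bool override_applies = space == SPACE_ROOT || space == SPACE_REALM;
    uint64_t mask_bit = timer == TIMER_VIRT ? CNTHCTL_CNTVMASK
                                            : CNTHCTL_CNTPMASK;

    /* The CNTHCTL_EL2 mask bits only take effect in Root and Realm state. */
    if (override_applies && (cnthctl & mask_bit)) {
        level = false;
    }
    return level;
}
```

With ISTATUS set and IMASK clear, the virtual timer line is asserted in
NonSecure state even if CNTVMASK is set, but deasserted in Realm state.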

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/cpu.h|  4 +++
 target/arm/cpu.c|  4 +++
 target/arm/helper.c | 65 ++---
 target/arm/trace-events |  7 +++--
 4 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index bcd65a63ca..855a76ae81 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1115,6 +1115,7 @@ struct ArchCPU {
 };
 
 unsigned int gt_cntfrq_period_ns(ARMCPU *cpu);
+void gt_rme_post_el_change(ARMCPU *cpu, void *opaque);
 
 void arm_cpu_post_init(Object *obj);
 
@@ -1743,6 +1744,9 @@ static inline void xpsr_write(CPUARMState *env, uint32_t val, uint32_t mask)
 #define HSTR_TTEE (1 << 16)
 #define HSTR_TJDBX (1 << 17)
 
+#define CNTHCTL_CNTVMASK  (1 << 18)
+#define CNTHCTL_CNTPMASK  (1 << 19)
+
 /* Return the current FPSCR value.  */
 uint32_t vfp_get_fpscr(CPUARMState *env);
 void vfp_set_fpscr(CPUARMState *env, uint32_t val);
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 93c28d50e5..7df1f7600b 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -2169,6 +2169,10 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
 set_feature(env, ARM_FEATURE_VBAR);
 }
 
+if (cpu_isar_feature(aa64_rme, cpu)) {
+arm_register_el_change_hook(cpu, &gt_rme_post_el_change, 0);
+}
+
 register_cp_regs_for_features(cpu);
 arm_cpu_register_gdb_regs_for_features(cpu);
 
diff --git a/target/arm/helper.c b/target/arm/helper.c
index dbfe9f2f5e..86ce6a52bb 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -2608,6 +2608,39 @@ static uint64_t gt_get_countervalue(CPUARMState *env)
 return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) / gt_cntfrq_period_ns(cpu);
 }
 
+static void gt_update_irq(ARMCPU *cpu, int timeridx)
+{
+CPUARMState *env = &cpu->env;
+uint64_t cnthctl = env->cp15.cnthctl_el2;
+ARMSecuritySpace ss = arm_security_space(env);
+/* ISTATUS && !IMASK */
+int irqstate = (env->cp15.c14_timer[timeridx].ctl & 6) == 4;
+
+/*
+ * If bit CNTHCTL_EL2.CNT[VP]MASK is set, it overrides IMASK.
+ * It is RES0 in Secure and NonSecure state.
+ */
+if ((ss == ARMSS_Root || ss == ARMSS_Realm) &&
+((timeridx == GTIMER_VIRT && (cnthctl & CNTHCTL_CNTVMASK)) ||
+ (timeridx == GTIMER_PHYS && (cnthctl & CNTHCTL_CNTPMASK)))) {
+irqstate = 0;
+}
+
+qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
+trace_arm_gt_update_irq(timeridx, irqstate);
+}
+
+void gt_rme_post_el_change(ARMCPU *cpu, void *ignored)
+{
+/*
+ * Changing security state between Root and Secure/NonSecure, which may
+ * happen when switching EL, can change the effective value of CNTHCTL_EL2
+ * mask bits. Update the IRQ state accordingly.
+ */
+gt_update_irq(cpu, GTIMER_VIRT);
+gt_update_irq(cpu, GTIMER_PHYS);
+}
+
 static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 {
 ARMGenericTimer *gt = &cpu->env.cp15.c14_timer[timeridx];
@@ -2623,13 +2656,9 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 /* Note that this must be unsigned 64 bit arithmetic: */
 int istatus = count - offset >= gt->cval;
 uint64_t nexttick;
-int irqstate;
 
 gt->ctl = deposit32(gt->ctl, 2, 1, istatus);
 
-irqstate = (istatus && !(gt->ctl & 2));
-qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
-
 if (istatus) {
 /* Next transition is when count rolls back over to zero */
 nexttick = UINT64_MAX;
@@ -2648,14 +2677,14 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 } else {
 timer_mod(cpu->gt_timer[timeridx], nexttick);
 }
-trace_arm_gt_recalc(timeridx, irqstate, nexttick);
+trace_arm_gt_recalc(timeridx, nexttick);
 } else {
 /* Timer disabled: ISTATUS and timer output always clear */
 gt->ctl &= ~4;
-qemu_set_irq(cpu->gt_timer_outputs[timeridx], 0);
 timer_del(cpu->gt_timer[timeridx]);
 trace_arm_gt_recalc_disabled(timeridx);
 }
+gt_update_irq(cpu, timeridx);
 }
 
 static void gt_timer_reset(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -2759,10 +2788,8 @@ static void gt_ctl_write(CPUARMState *env, const ARMCPRegInfo *ri,
  * IMASK toggled: don't need to recalculate,
  * just set the interrupt line based on ISTATUS
  */
-int irqstate = (oldval & 4) && !(value & 2);
-
-trace_arm_g

[PATCH v3 3/6] target/arm: Skip granule protection checks for AT instructions

2023-08-09 Thread Jean-Philippe Brucker

GPC checks are not performed on the output address for AT instructions,
as stated by ARM DDI 0487J in D8.12.2:

  When populating PAR_EL1 with the result of an address translation
  instruction, granule protection checks are not performed on the final
  output address of a successful translation.

Rename get_phys_addr_with_secure(), since it's only used to handle AT
instructions.

Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Peter Maydell 
---
 target/arm/internals.h | 25 ++---
 target/arm/helper.c|  8 ++--
 target/arm/ptw.c   | 11 ++-
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index 0f01bc32a8..fc90c364f7 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1190,12 +1190,11 @@ typedef struct GetPhysAddrResult {
 } GetPhysAddrResult;
 
 /**
- * get_phys_addr_with_secure: get the physical address for a virtual address
+ * get_phys_addr: get the physical address for a virtual address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
@@ -1212,26 +1211,30 @@ typedef struct GetPhysAddrResult {
  *  * for PSMAv5 based systems we don't bother to return a full FSR format
  *value.
  */
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-   MMUAccessType access_type,
-   ARMMMUIdx mmu_idx, bool is_secure,
-   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr(CPUARMState *env, target_ulong address,
+   MMUAccessType access_type, ARMMMUIdx mmu_idx,
+   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
 __attribute__((nonnull));
 
 /**
- * get_phys_addr: get the physical address for a virtual address
+ * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
+ *  address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
+ * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similarly, but use the security regime of @mmu_idx.
+ * Similar to get_phys_addr, but use the given security regime and don't perform
+ * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr(CPUARMState *env, target_ulong address,
-   MMUAccessType access_type, ARMMMUIdx mmu_idx,
-   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+ MMUAccessType access_type,
+ ARMMMUIdx mmu_idx, bool is_secure,
+ GetPhysAddrResult *result,
+ ARMMMUFaultInfo *fi)
 __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index a4c2c1bde5..427de6bd2a 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3365,8 +3365,12 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
 ARMMMUFaultInfo fi = {};
 GetPhysAddrResult res = {};
 
-ret = get_phys_addr_with_secure(env, value, access_type, mmu_idx,
-is_secure, &res, &fi);
+/*
+ * I_MXTJT: Granule protection checks are not performed on the final address
+ * of a successful translation.
+ */
+ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
+  is_secure, &res, &fi);
 
 /*
  * ATS operations only do S1 or S1+S2 translations, so we never
diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 063adbd84a..33179f3471 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -3418,16 +3418,17 @@ static bool get_phys_addr_gpc(CPUARMState *env, S1Translate *ptw,
 return false;
 }
 
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-   MMUAccessType access_type, ARMMMUIdx mmu_idx,
-   bool is_secure, GetPhysAddrResult *result,
-   ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+ MMUAccessType access_type,
+ ARMMMUIdx mmu_idx, bool is_secure,
+ GetPhysAddrResult *result

[PATCH v3 4/6] target/arm: Pass security space rather than flag for AT instructions

2023-08-09 Thread Jean-Philippe Brucker

At the moment we only handle Secure and NonSecure security spaces for
the AT instructions. Add support for Realm and Root.

For AArch64, arm_security_space() gives the desired space. ARM DDI 0487J
says (R_NYXTL):

  If EL3 is implemented, then when an address translation instruction
  that applies to an Exception level lower than EL3 is executed, the
  Effective value of SCR_EL3.{NSE, NS} determines the target Security
  state that the instruction applies to.

For AArch32, some instructions can access NonSecure space from Secure,
so we still need to pass the state explicitly to do_ats_write().

Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Peter Maydell 
---
 target/arm/internals.h | 18 +-
 target/arm/helper.c| 27 ---
 target/arm/ptw.c   | 12 ++--
 3 files changed, 27 insertions(+), 30 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index fc90c364f7..cf13bb94f5 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1217,24 +1217,24 @@ bool get_phys_addr(CPUARMState *env, target_ulong address,
 __attribute__((nonnull));
 
 /**
- * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
- *  address
+ * get_phys_addr_with_space_nogpc: get the physical address for a virtual
+ * address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
+ * @space: security space for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similar to get_phys_addr, but use the given security regime and don't perform
+ * Similar to get_phys_addr, but use the given security space and don't perform
  * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
- MMUAccessType access_type,
- ARMMMUIdx mmu_idx, bool is_secure,
- GetPhysAddrResult *result,
- ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_space_nogpc(CPUARMState *env, target_ulong address,
+MMUAccessType access_type,
+ARMMMUIdx mmu_idx, ARMSecuritySpace space,
+GetPhysAddrResult *result,
+ARMMMUFaultInfo *fi)
 __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 427de6bd2a..fbb03c364b 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3357,7 +3357,7 @@ static int par_el1_shareability(GetPhysAddrResult *res)
 
 static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
  MMUAccessType access_type, ARMMMUIdx mmu_idx,
- bool is_secure)
+ ARMSecuritySpace ss)
 {
 bool ret;
 uint64_t par64;
@@ -3369,8 +3369,8 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
  * I_MXTJT: Granule protection checks are not performed on the final address
  * of a successful translation.
  */
-ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
-  is_secure, &res, &fi);
+ret = get_phys_addr_with_space_nogpc(env, value, access_type, mmu_idx, ss,
+ &res, &fi);
 
 /*
  * ATS operations only do S1 or S1+S2 translations, so we never
@@ -3535,7 +3535,7 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
 uint64_t par64;
 ARMMMUIdx mmu_idx;
 int el = arm_current_el(env);
-bool secure = arm_is_secure_below_el3(env);
+ARMSecuritySpace ss = arm_security_space(env);
 
 switch (ri->opc2 & 6) {
 case 0:
@@ -3543,10 +3543,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
 switch (el) {
 case 3:
 mmu_idx = ARMMMUIdx_E3;
-secure = true;
 break;
 case 2:
-g_assert(!secure);  /* ARMv8.4-SecEL2 is 64-bit only */
+g_assert(ss != ARMSS_Secure);  /* ARMv8.4-SecEL2 is 64-bit only */
 /* fall through */
 case 1:
 if (ri->crm == 9 && (env->uncached_cpsr & CPSR_PAN)) {
@@ -3564,10 +3563,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
 switch (el) {
 case 3:
 mmu_idx = ARMMMUIdx_E10_0;
-secure = true;
 break;
 case 2:
-g_assert(!secure);  /* ARMv

[PATCH v3 1/6] target/arm/ptw: Load stage-2 tables from realm physical space

2023-08-09 Thread Jean-Philippe Brucker

In realm state, stage-2 translation tables are fetched from the realm
physical address space (R_PGRQD).

Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Peter Maydell 
---
 target/arm/ptw.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index d1de934702..063adbd84a 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -157,22 +157,32 @@ static ARMMMUIdx ptw_idx_for_stage_2(CPUARMState *env, ARMMMUIdx stage2idx)
 
 /*
  * We're OK to check the current state of the CPU here because
- * (1) we always invalidate all TLBs when the SCR_EL3.NS bit changes
+ * (1) we always invalidate all TLBs when the SCR_EL3.NS or SCR_EL3.NSE bit
+ * changes.
  * (2) there's no way to do a lookup that cares about Stage 2 for a
  * different security state to the current one for AArch64, and AArch32
  * never has a secure EL2. (AArch32 ATS12NSO[UP][RW] allow EL3 to do
  * an NS stage 1+2 lookup while the NS bit is 0.)
  */
-if (!arm_is_secure_below_el3(env) || !arm_el_is_aa64(env, 3)) {
+if (!arm_el_is_aa64(env, 3)) {
 return ARMMMUIdx_Phys_NS;
 }
-if (stage2idx == ARMMMUIdx_Stage2_S) {
-s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW);
-} else {
-s2walk_secure = !(env->cp15.vtcr_el2 & VTCR_NSW);
-}
-return s2walk_secure ? ARMMMUIdx_Phys_S : ARMMMUIdx_Phys_NS;
 
+switch (arm_security_space_below_el3(env)) {
+case ARMSS_NonSecure:
+return ARMMMUIdx_Phys_NS;
+case ARMSS_Realm:
+return ARMMMUIdx_Phys_Realm;
+case ARMSS_Secure:
+if (stage2idx == ARMMMUIdx_Stage2_S) {
+s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW);
+} else {
+s2walk_secure = !(env->cp15.vtcr_el2 & VTCR_NSW);
+}
+return s2walk_secure ? ARMMMUIdx_Phys_S : ARMMMUIdx_Phys_NS;
+default:
+g_assert_not_reached();
+}
 }
 
 static bool regime_translation_big_endian(CPUARMState *env, ARMMMUIdx mmu_idx)
-- 
2.41.0
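
For readers following this change outside the QEMU tree, the selection
logic in the patch above can be sketched as a standalone function. The
enum names and stage2_walk_space() are hypothetical stand-ins for
ARMSecuritySpace, ARMMMUIdx_Phys_* and ptw_idx_for_stage_2(); the
walk_secure argument is assumed to carry the result of the VSTCR_EL2.SW
or VTCR_EL2.NSW test. This is an illustration under those assumptions,
not the QEMU code.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for QEMU's ARMSecuritySpace values. */
typedef enum { SS_SECURE, SS_NONSECURE, SS_ROOT, SS_REALM } SecuritySpace;

/* Hypothetical stand-ins for the ARMMMUIdx_Phys_* indexes. */
typedef enum { PHYS_S, PHYS_NS, PHYS_ROOT, PHYS_REALM } PhysIdx;

/*
 * Choose the physical address space used to fetch stage-2 tables.
 *
 * el3_is_aa64: true when EL3 is AArch64
 * ss:          security space of the EL below EL3
 * walk_secure: only consulted in Secure state; assumed to be
 *              !(VSTCR_EL2.SW) or !(VTCR_EL2.NSW), depending on which
 *              stage-2 regime is being walked
 */
static PhysIdx stage2_walk_space(bool el3_is_aa64, SecuritySpace ss,
                                 bool walk_secure)
{
    if (!el3_is_aa64) {
        /* AArch32 never has a secure EL2 */
        return PHYS_NS;
    }

    switch (ss) {
    case SS_NONSECURE:
        return PHYS_NS;
    case SS_REALM:
        /* Realm stage-2 tables live in the realm physical space */
        return PHYS_REALM;
    case SS_SECURE:
        return walk_secure ? PHYS_S : PHYS_NS;
    default:
        /* Root is not a valid space below EL3 */
        abort();
    }
}
```

The only behavioural difference from the old code is the Realm case,
which previously fell through to the NonSecure physical space.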

Re: [PATCH v2 5/6] target/arm/helper: Check SCR_EL3.{NSE,NS} encoding for AT instructions

2023-08-07 Thread Jean-Philippe Brucker

On Mon, Aug 07, 2023 at 10:54:05AM +0100, Peter Maydell wrote:
> On Fri, 4 Aug 2023 at 19:08, Peter Maydell  wrote:
> >
> > On Wed, 2 Aug 2023 at 18:02, Jean-Philippe Brucker
> >  wrote:
> > >
> > > The AT instruction is UNDEFINED if the {NSE,NS} configuration is
> > > invalid. Add a function to check this on all AT instructions that apply
> > > to an EL lower than 3.
> > >
> > > Suggested-by: Peter Maydell 
> > > Signed-off-by: Jean-Philippe Brucker 
> > > ---
> > >  target/arm/helper.c | 36 +---
> > >  1 file changed, 25 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/target/arm/helper.c b/target/arm/helper.c
> > > index fbb03c364b..77dd80ad28 100644
> > > --- a/target/arm/helper.c
> > > +++ b/target/arm/helper.c
> > > @@ -3616,6 +3616,20 @@ static void ats1h_write(CPUARMState *env, const 
> > > ARMCPRegInfo *ri,
> > >  #endif /* CONFIG_TCG */
> > >  }
> > >
> > > > +static CPAccessResult at_e012_access(CPUARMState *env, const ARMCPRegInfo *ri,
> > > + bool isread)
> > > +{
> > > +/*
> > > > + * R_NYXTL: instruction is UNDEFINED if it applies to an Exception level
> > > + * lower than EL3 and the combination SCR_EL3.{NSE,NS} is reserved.
> > > + */
> > > +if (cpu_isar_feature(aa64_rme, env_archcpu(env)) &&
> > > +(env->cp15.scr_el3 & (SCR_NSE | SCR_NS)) == SCR_NSE) {
> > > +return CP_ACCESS_TRAP;
> > > +}
> >
> > The AArch64.AT() pseudocode and the text in the individual
> > AT insn descriptions ("When FEAT_RME is implemented, if the Effective
> > value of SCR_EL3.{NSE, NS} is a reserved value, this instruction is
> > UNDEFINED at EL3") say that this check needs an "arm_current_el(env) == 3"
> > condition too.
> 
> It's been pointed out to me that since trying to return from
> EL3 with SCR_EL3.{NSE,NS} == {1,0} is an illegal exception return,
> it's not actually possible to try to execute these insns in this
> state from any other EL than EL3. So we don't actually need
> to check for EL3 here.
> 
> QEMU's implementation of exception return is missing that
> check for illegal-exception-return on bad {NSE,NS}, though.

I can add a patch to check that exception return condition, and add a
comment here explaining that this can only happen when executing at EL3.

Thanks,
Jean
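
For reference, the reserved-encoding test discussed above can be
reproduced in isolation. The bit positions are the architectural ones
for SCR_EL3 (NS is bit 0, NSE is bit 62); the helper name
scr_nse_ns_reserved() is made up for this sketch, and the FEAT_RME
check that the patch performs first is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCR_NS   (1ULL << 0)    /* SCR_EL3.NS */
#define SCR_NSE  (1ULL << 62)   /* SCR_EL3.NSE, present with FEAT_RME */

/*
 * {NSE,NS} encodings:
 *   {0,0} Secure, {0,1} NonSecure, {1,0} reserved, {1,1} Realm
 *
 * Hypothetical helper: returns true only for the reserved {1,0} case.
 */
static bool scr_nse_ns_reserved(uint64_t scr_el3)
{
    return (scr_el3 & (SCR_NSE | SCR_NS)) == SCR_NSE;
}
```

Masking both bits before comparing means the other SCR_EL3 fields have
no influence on the result.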

[PATCH v2 0/6] target/arm: Fixes for RME

2023-08-02 Thread Jean-Philippe Brucker

A few patches to fix RME support and allow booting a realm guest, based
on 
https://lore.kernel.org/qemu-devel/20230714154648.327466-1-peter.mayd...@linaro.org/

Since v1 I fixed patches 1, 2 and 6 following Peter's comments, and
added patch 5. Patch 6 now factors the timer IRQ update into a new
function, which is a bit invasive but seems cleaner.

v1: 
https://lore.kernel.org/qemu-devel/20230719153018.1456180-2-jean-phili...@linaro.org/

Jean-Philippe Brucker (6):
  target/arm/ptw: Load stage-2 tables from realm physical space
  target/arm/helper: Fix tlbmask and tlbbits for TLBI VAE2*
  target/arm: Skip granule protection checks for AT instructions
  target/arm: Pass security space rather than flag for AT instructions
  target/arm/helper: Check SCR_EL3.{NSE,NS} encoding for AT instructions
  target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK

 target/arm/cpu.h|   3 +
 target/arm/internals.h  |  25 +++---
 target/arm/helper.c | 171 +---
 target/arm/ptw.c|  39 +
 target/arm/trace-events |   7 +-
 5 files changed, 170 insertions(+), 75 deletions(-)

-- 
2.41.0

[PATCH v2 5/6] target/arm/helper: Check SCR_EL3.{NSE, NS} encoding for AT instructions

2023-08-02 Thread Jean-Philippe Brucker

The AT instruction is UNDEFINED if the {NSE,NS} configuration is
invalid. Add a function to check this on all AT instructions that apply
to an EL lower than 3.

Suggested-by: Peter Maydell 
Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/helper.c | 36 +---
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index fbb03c364b..77dd80ad28 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3616,6 +3616,20 @@ static void ats1h_write(CPUARMState *env, const ARMCPRegInfo *ri,
 #endif /* CONFIG_TCG */
 }
 
+static CPAccessResult at_e012_access(CPUARMState *env, const ARMCPRegInfo *ri,
+ bool isread)
+{
+/*
+ * R_NYXTL: instruction is UNDEFINED if it applies to an Exception level
+ * lower than EL3 and the combination SCR_EL3.{NSE,NS} is reserved.
+ */
+if (cpu_isar_feature(aa64_rme, env_archcpu(env)) &&
+(env->cp15.scr_el3 & (SCR_NSE | SCR_NS)) == SCR_NSE) {
+return CP_ACCESS_TRAP;
+}
+return CP_ACCESS_OK;
+}
+
 static CPAccessResult at_s1e2_access(CPUARMState *env, const ARMCPRegInfo *ri,
  bool isread)
 {
@@ -3623,7 +3637,7 @@ static CPAccessResult at_s1e2_access(CPUARMState *env, const ARMCPRegInfo *ri,
 !(env->cp15.scr_el3 & (SCR_NS | SCR_EEL2))) {
 return CP_ACCESS_TRAP;
 }
-return CP_ACCESS_OK;
+return at_e012_access(env, ri, isread);
 }
 
 static void ats_write64(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -5505,38 +5519,38 @@ static const ARMCPRegInfo v8_cp_reginfo[] = {
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 0,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E1R,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S1E1W", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 1,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E1W,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S1E0R", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 2,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E0R,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S1E0W", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 8, .opc2 = 3,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E0W,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S12E1R", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 4,
   .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S12E1W", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 5,
   .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S12E0R", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 6,
   .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S12E0W", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 4, .crn = 7, .crm = 8, .opc2 = 7,
   .access = PL2_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 /* AT S1E2* are elsewhere as they UNDEF from EL3 if EL2 is not present */
 { .name = "AT_S1E3R", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 6, .crn = 7, .crm = 8, .opc2 = 0,
@@ -8078,12 +8092,12 @@ static const ARMCPRegInfo ats1e1_reginfo[] = {
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 9, .opc2 = 0,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E1RP,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 { .name = "AT_S1E1WP", .state = ARM_CP_STATE_AA64,
   .opc0 = 1, .opc1 = 0, .crn = 7, .crm = 9, .opc2 = 1,
   .access = PL1_W, .type = ARM_CP_NO_RAW | ARM_CP_RAISES_EXC,
   .fgt = FGT_ATS1E1WP,
-  .writefn = ats_write64 },
+  .accessfn = at_e012_access, .writefn = ats_write64 },
 };
 
 static const ARMCPRegInfo ats1cp_reginfo[] = {
-- 
2.41.0

[PATCH v2 1/6] target/arm/ptw: Load stage-2 tables from realm physical space

2023-08-02 Thread Jean-Philippe Brucker

In realm state, stage-2 translation tables are fetched from the realm
physical address space (R_PGRQD).

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/ptw.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index d1de934702..063adbd84a 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -157,22 +157,32 @@ static ARMMMUIdx ptw_idx_for_stage_2(CPUARMState *env, ARMMMUIdx stage2idx)
 
 /*
  * We're OK to check the current state of the CPU here because
- * (1) we always invalidate all TLBs when the SCR_EL3.NS bit changes
+ * (1) we always invalidate all TLBs when the SCR_EL3.NS or SCR_EL3.NSE bit
+ * changes.
  * (2) there's no way to do a lookup that cares about Stage 2 for a
  * different security state to the current one for AArch64, and AArch32
  * never has a secure EL2. (AArch32 ATS12NSO[UP][RW] allow EL3 to do
  * an NS stage 1+2 lookup while the NS bit is 0.)
  */
-if (!arm_is_secure_below_el3(env) || !arm_el_is_aa64(env, 3)) {
+if (!arm_el_is_aa64(env, 3)) {
 return ARMMMUIdx_Phys_NS;
 }
-if (stage2idx == ARMMMUIdx_Stage2_S) {
-s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW);
-} else {
-s2walk_secure = !(env->cp15.vtcr_el2 & VTCR_NSW);
-}
-return s2walk_secure ? ARMMMUIdx_Phys_S : ARMMMUIdx_Phys_NS;
 
+switch (arm_security_space_below_el3(env)) {
+case ARMSS_NonSecure:
+return ARMMMUIdx_Phys_NS;
+case ARMSS_Realm:
+return ARMMMUIdx_Phys_Realm;
+case ARMSS_Secure:
+if (stage2idx == ARMMMUIdx_Stage2_S) {
+s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW);
+} else {
+s2walk_secure = !(env->cp15.vtcr_el2 & VTCR_NSW);
+}
+return s2walk_secure ? ARMMMUIdx_Phys_S : ARMMMUIdx_Phys_NS;
+default:
+g_assert_not_reached();
+}
 }
 
 static bool regime_translation_big_endian(CPUARMState *env, ARMMMUIdx mmu_idx)
-- 
2.41.0

[PATCH v2 3/6] target/arm: Skip granule protection checks for AT instructions

2023-08-02 Thread Jean-Philippe Brucker

GPC checks are not performed on the output address for AT instructions,
as stated by ARM DDI 0487J in D8.12.2:

  When populating PAR_EL1 with the result of an address translation
  instruction, granule protection checks are not performed on the final
  output address of a successful translation.

Rename get_phys_addr_with_secure(), since it's only used to handle AT
instructions.

Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Peter Maydell 
---
 target/arm/internals.h | 25 ++---
 target/arm/helper.c|  8 ++--
 target/arm/ptw.c   | 11 ++-
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index 0f01bc32a8..fc90c364f7 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1190,12 +1190,11 @@ typedef struct GetPhysAddrResult {
 } GetPhysAddrResult;
 
 /**
- * get_phys_addr_with_secure: get the physical address for a virtual address
+ * get_phys_addr: get the physical address for a virtual address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
@@ -1212,26 +1211,30 @@ typedef struct GetPhysAddrResult {
  *  * for PSMAv5 based systems we don't bother to return a full FSR format
  *value.
  */
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-   MMUAccessType access_type,
-   ARMMMUIdx mmu_idx, bool is_secure,
-   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr(CPUARMState *env, target_ulong address,
+   MMUAccessType access_type, ARMMMUIdx mmu_idx,
+   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
 __attribute__((nonnull));
 
 /**
- * get_phys_addr: get the physical address for a virtual address
+ * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
+ *  address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
+ * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similarly, but use the security regime of @mmu_idx.
+ * Similar to get_phys_addr, but use the given security regime and don't perform
+ * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr(CPUARMState *env, target_ulong address,
-   MMUAccessType access_type, ARMMMUIdx mmu_idx,
-   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+ MMUAccessType access_type,
+ ARMMMUIdx mmu_idx, bool is_secure,
+ GetPhysAddrResult *result,
+ ARMMMUFaultInfo *fi)
 __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index a4c2c1bde5..427de6bd2a 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3365,8 +3365,12 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
 ARMMMUFaultInfo fi = {};
 GetPhysAddrResult res = {};
 
-ret = get_phys_addr_with_secure(env, value, access_type, mmu_idx,
-is_secure, &res, &fi);
+/*
+ * I_MXTJT: Granule protection checks are not performed on the final address
+ * of a successful translation.
+ */
+ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
+  is_secure, &res, &fi);
 
 /*
  * ATS operations only do S1 or S1+S2 translations, so we never
diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 063adbd84a..33179f3471 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -3418,16 +3418,17 @@ static bool get_phys_addr_gpc(CPUARMState *env, S1Translate *ptw,
 return false;
 }
 
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-   MMUAccessType access_type, ARMMMUIdx mmu_idx,
-   bool is_secure, GetPhysAddrResult *result,
-   ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+ MMUAccessType access_type,
+ ARMMMUIdx mmu_idx, bool is_secure,
+ GetPhysAddrResult *result

[PATCH v2 6/6] target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK

2023-08-02 Thread Jean-Philippe Brucker

When FEAT_RME is implemented, these bits override the value of
CNT[VP]_CTL_EL0.IMASK in Realm and Root state. Move the IRQ state update
into a new gt_update_irq() function and test those bits every time we
recompute the IRQ state.

Since we're removing the IRQ state from some trace events, add a new
trace event for gt_update_irq().

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/cpu.h|  3 +++
 target/arm/helper.c | 54 -
 target/arm/trace-events |  7 +++---
 3 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index bcd65a63ca..bedc7ec6dc 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1743,6 +1743,9 @@ static inline void xpsr_write(CPUARMState *env, uint32_t val, uint32_t mask)
 #define HSTR_TTEE (1 << 16)
 #define HSTR_TJDBX (1 << 17)
 
+#define CNTHCTL_CNTVMASK  (1 << 18)
+#define CNTHCTL_CNTPMASK  (1 << 19)
+
 /* Return the current FPSCR value.  */
 uint32_t vfp_get_fpscr(CPUARMState *env);
 void vfp_set_fpscr(CPUARMState *env, uint32_t val);
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 77dd80ad28..68e915ddda 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -2608,6 +2608,28 @@ static uint64_t gt_get_countervalue(CPUARMState *env)
 return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) / gt_cntfrq_period_ns(cpu);
 }
 
+static void gt_update_irq(ARMCPU *cpu, int timeridx)
+{
+CPUARMState *env = &cpu->env;
+uint64_t cnthctl = env->cp15.cnthctl_el2;
+ARMSecuritySpace ss = arm_security_space(env);
+/* ISTATUS && !IMASK */
+int irqstate = (env->cp15.c14_timer[timeridx].ctl & 6) == 4;
+
+/*
+ * If bit CNTHCTL_EL2.CNT[VP]MASK is set, it overrides IMASK.
+ * It is RES0 in Secure and NonSecure state.
+ */
+if ((ss == ARMSS_Root || ss == ARMSS_Realm) &&
+((timeridx == GTIMER_VIRT && (cnthctl & CNTHCTL_CNTVMASK)) ||
+ (timeridx == GTIMER_PHYS && (cnthctl & CNTHCTL_CNTPMASK)))) {
+irqstate = 0;
+}
+
+qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
+trace_arm_gt_update_irq(timeridx, irqstate);
+}
+
 static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 {
 ARMGenericTimer *gt = &cpu->env.cp15.c14_timer[timeridx];
@@ -2623,13 +2645,9 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 /* Note that this must be unsigned 64 bit arithmetic: */
 int istatus = count - offset >= gt->cval;
 uint64_t nexttick;
-int irqstate;
 
 gt->ctl = deposit32(gt->ctl, 2, 1, istatus);
 
-irqstate = (istatus && !(gt->ctl & 2));
-qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
-
 if (istatus) {
 /* Next transition is when count rolls back over to zero */
 nexttick = UINT64_MAX;
@@ -2648,14 +2666,14 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 } else {
 timer_mod(cpu->gt_timer[timeridx], nexttick);
 }
-trace_arm_gt_recalc(timeridx, irqstate, nexttick);
+trace_arm_gt_recalc(timeridx, nexttick);
 } else {
 /* Timer disabled: ISTATUS and timer output always clear */
 gt->ctl &= ~4;
-qemu_set_irq(cpu->gt_timer_outputs[timeridx], 0);
 timer_del(cpu->gt_timer[timeridx]);
 trace_arm_gt_recalc_disabled(timeridx);
 }
+gt_update_irq(cpu, timeridx);
 }
 
 static void gt_timer_reset(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -2759,10 +2777,8 @@ static void gt_ctl_write(CPUARMState *env, const ARMCPRegInfo *ri,
  * IMASK toggled: don't need to recalculate,
  * just set the interrupt line based on ISTATUS
  */
-int irqstate = (oldval & 4) && !(value & 2);
-
-trace_arm_gt_imask_toggle(timeridx, irqstate);
-qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
+trace_arm_gt_imask_toggle(timeridx);
+gt_update_irq(cpu, timeridx);
 }
 }
 
@@ -2888,6 +2904,21 @@ static void gt_virt_ctl_write(CPUARMState *env, const ARMCPRegInfo *ri,
 gt_ctl_write(env, ri, GTIMER_VIRT, value);
 }
 
+static void gt_cnthctl_write(CPUARMState *env, const ARMCPRegInfo *ri,
+ uint64_t value)
+{
+ARMCPU *cpu = env_archcpu(env);
+uint32_t oldval = env->cp15.cnthctl_el2;
+
+raw_write(env, ri, value);
+
+if ((oldval ^ value) & CNTHCTL_CNTVMASK) {
+gt_update_irq(cpu, GTIMER_VIRT);
+} else if ((oldval ^ value) & CNTHCTL_CNTPMASK) {
+gt_update_irq(cpu, GTIMER_PHYS);
+}
+}
+
 static void gt_cntvoff_write(CPUARMState *env, const ARMCPRegInfo *ri,
   uint64_t value)
 {
@@ -6200,7 +6231,8 @@ static const ARMCPRegInfo el2_cp_reginfo[] = {
* reset values as IMPDEF. We choose to reset to 3 to comply with
* both ARMv7
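
To make the intended behaviour easier to check, here is a standalone
sketch of the IRQ level computation that gt_update_irq() performs. The
CNTHCTL bit positions are taken from the patch above, and the CTL bits
follow the architectural layout (IMASK is bit 1, ISTATUS is bit 2). The
enum types and timer_irq_level() are hypothetical names introduced only
for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define GT_CTL_IMASK      (1u << 1)     /* CNT[VP]_CTL_EL0.IMASK */
#define GT_CTL_ISTATUS    (1u << 2)     /* CNT[VP]_CTL_EL0.ISTATUS */
#define CNTHCTL_CNTVMASK  (1ULL << 18)
#define CNTHCTL_CNTPMASK  (1ULL << 19)

/* Hypothetical stand-ins for ARMSecuritySpace and GTIMER_* indexes. */
typedef enum { SS_SECURE, SS_NONSECURE, SS_ROOT, SS_REALM } SecuritySpace;
typedef enum { TIMER_PHYS, TIMER_VIRT } TimerIdx;

/* Return the level that should be driven on the timer output line. */
static bool timer_irq_level(uint32_t ctl, uint64_t cnthctl,
                            SecuritySpace ss, TimerIdx timer)
{
    /* ISTATUS && !IMASK */
    bool level = (ctl & (GT_CTL_ISTATUS | GT_CTL_IMASK)) == GT_CTL_ISTATUS;

    /*
     * CNTHCTL_EL2.CNT[VP]MASK only applies in Root and Realm state;
     * the bits are RES0 in Secure and NonSecure state.
     */
    if ((ss == SS_ROOT || ss == SS_REALM) &&
        ((timer == TIMER_VIRT && (cnthctl & CNTHCTL_CNTVMASK)) ||
         (timer == TIMER_PHYS && (cnthctl & CNTHCTL_CNTPMASK)))) {
        level = false;
    }

    return level;
}
```

Because the result depends on the security space as well as on the
register values, it has to be recomputed whenever the CPU moves between
Root and the other states, not only on register writes.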

[PATCH v2 4/6] target/arm: Pass security space rather than flag for AT instructions

2023-08-02 Thread Jean-Philippe Brucker

At the moment we only handle Secure and NonSecure security spaces for
the AT instructions. Add support for Realm and Root.

For AArch64, arm_security_space() gives the desired space. ARM DDI 0487J
says (R_NYXTL):

  If EL3 is implemented, then when an address translation instruction
  that applies to an Exception level lower than EL3 is executed, the
  Effective value of SCR_EL3.{NSE, NS} determines the target Security
  state that the instruction applies to.

For AArch32, some instructions can access NonSecure space from Secure,
so we still need to pass the state explicitly to do_ats_write().

Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Peter Maydell 
---
 target/arm/internals.h | 18 +-
 target/arm/helper.c| 27 ---
 target/arm/ptw.c   | 12 ++--
 3 files changed, 27 insertions(+), 30 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index fc90c364f7..cf13bb94f5 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1217,24 +1217,24 @@ bool get_phys_addr(CPUARMState *env, target_ulong address,
 __attribute__((nonnull));
 
 /**
- * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
- *  address
+ * get_phys_addr_with_space_nogpc: get the physical address for a virtual
+ * address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
+ * @space: security space for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similar to get_phys_addr, but use the given security regime and don't perform
+ * Similar to get_phys_addr, but use the given security space and don't perform
  * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
- MMUAccessType access_type,
- ARMMMUIdx mmu_idx, bool is_secure,
- GetPhysAddrResult *result,
- ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_space_nogpc(CPUARMState *env, target_ulong address,
+MMUAccessType access_type,
+ARMMMUIdx mmu_idx, ARMSecuritySpace space,
+GetPhysAddrResult *result,
+ARMMMUFaultInfo *fi)
 __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 427de6bd2a..fbb03c364b 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3357,7 +3357,7 @@ static int par_el1_shareability(GetPhysAddrResult *res)
 
 static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
  MMUAccessType access_type, ARMMMUIdx mmu_idx,
- bool is_secure)
+ ARMSecuritySpace ss)
 {
 bool ret;
 uint64_t par64;
@@ -3369,8 +3369,8 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
  * I_MXTJT: Granule protection checks are not performed on the final address
  * of a successful translation.
  */
-ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
-  is_secure, &res, &fi);
+ret = get_phys_addr_with_space_nogpc(env, value, access_type, mmu_idx, ss,
+ &res, &fi);
 
 /*
  * ATS operations only do S1 or S1+S2 translations, so we never
@@ -3535,7 +3535,7 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
 uint64_t par64;
 ARMMMUIdx mmu_idx;
 int el = arm_current_el(env);
-bool secure = arm_is_secure_below_el3(env);
+ARMSecuritySpace ss = arm_security_space(env);
 
 switch (ri->opc2 & 6) {
 case 0:
@@ -3543,10 +3543,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
 switch (el) {
 case 3:
 mmu_idx = ARMMMUIdx_E3;
-secure = true;
 break;
 case 2:
-g_assert(!secure);  /* ARMv8.4-SecEL2 is 64-bit only */
+g_assert(ss != ARMSS_Secure);  /* ARMv8.4-SecEL2 is 64-bit only */
 /* fall through */
 case 1:
 if (ri->crm == 9 && (env->uncached_cpsr & CPSR_PAN)) {
@@ -3564,10 +3563,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
 switch (el) {
 case 3:
 mmu_idx = ARMMMUIdx_E10_0;
-secure = true;
 break;
 case 2:
-g_assert(!secure);  /* ARMv

[PATCH v2 2/6] target/arm/helper: Fix tlbmask and tlbbits for TLBI VAE2*

2023-08-02 Thread Jean-Philippe Brucker

When HCR_EL2.E2H is enabled, TLB entries are formed using the EL2&0
translation regime, instead of the EL2 translation regime. The TLB VAE2*
instructions invalidate the regime that corresponds to the current value
of HCR_EL2.E2H.

At the moment we only invalidate the EL2 translation regime. This causes
problems with RMM, which issues TLBI VAE2IS instructions with
HCR_EL2.E2H enabled. Update vae2_tlbmask() to take HCR_EL2.E2H into
account.

Add vae2_tlbbits() as well, since the top-byte-ignore configuration is
different between the EL2&0 and EL2 regime.

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/helper.c | 50 -
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index 2959d27543..a4c2c1bde5 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -4663,6 +4663,21 @@ static int vae1_tlbmask(CPUARMState *env)
 return mask;
 }
 
+static int vae2_tlbmask(CPUARMState *env)
+{
+uint64_t hcr = arm_hcr_el2_eff(env);
+uint16_t mask;
+
+if (hcr & HCR_E2H) {
+mask = ARMMMUIdxBit_E20_2 |
+   ARMMMUIdxBit_E20_2_PAN |
+   ARMMMUIdxBit_E20_0;
+} else {
+mask = ARMMMUIdxBit_E2;
+}
+return mask;
+}
+
 /* Return 56 if TBI is enabled, 64 otherwise. */
 static int tlbbits_for_regime(CPUARMState *env, ARMMMUIdx mmu_idx,
   uint64_t addr)
@@ -4689,6 +4704,25 @@ static int vae1_tlbbits(CPUARMState *env, uint64_t addr)
 return tlbbits_for_regime(env, mmu_idx, addr);
 }
 
+static int vae2_tlbbits(CPUARMState *env, uint64_t addr)
+{
+uint64_t hcr = arm_hcr_el2_eff(env);
+ARMMMUIdx mmu_idx;
+
+/*
+ * Only the regime of the mmu_idx below is significant.
+ * Regime EL2&0 has two ranges with separate TBI configuration, while EL2
+ * only has one.
+ */
+if (hcr & HCR_E2H) {
+mmu_idx = ARMMMUIdx_E20_2;
+} else {
+mmu_idx = ARMMMUIdx_E2;
+}
+
+return tlbbits_for_regime(env, mmu_idx, addr);
+}
+
 static void tlbi_aa64_vmalle1is_write(CPUARMState *env, const ARMCPRegInfo *ri,
   uint64_t value)
 {
@@ -4781,10 +4815,11 @@ static void tlbi_aa64_vae2_write(CPUARMState *env, const ARMCPRegInfo *ri,
  * flush-last-level-only.
  */
 CPUState *cs = env_cpu(env);
-int mask = e2_tlbmask(env);
+int mask = vae2_tlbmask(env);
 uint64_t pageaddr = sextract64(value << 12, 0, 56);
+int bits = vae2_tlbbits(env, pageaddr);
 
-tlb_flush_page_by_mmuidx(cs, pageaddr, mask);
+tlb_flush_page_bits_by_mmuidx(cs, pageaddr, mask, bits);
 }
 
 static void tlbi_aa64_vae3_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -4838,11 +4873,11 @@ static void tlbi_aa64_vae2is_write(CPUARMState *env, const ARMCPRegInfo *ri,
uint64_t value)
 {
 CPUState *cs = env_cpu(env);
+int mask = vae2_tlbmask(env);
 uint64_t pageaddr = sextract64(value << 12, 0, 56);
-int bits = tlbbits_for_regime(env, ARMMMUIdx_E2, pageaddr);
+int bits = vae2_tlbbits(env, pageaddr);
 
-tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr,
-  ARMMMUIdxBit_E2, bits);
+tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, mask, bits);
 }
 
 static void tlbi_aa64_vae3is_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -5014,11 +5049,6 @@ static void tlbi_aa64_rvae1is_write(CPUARMState *env,
 do_rvae_write(env, value, vae1_tlbmask(env), true);
 }
 
-static int vae2_tlbmask(CPUARMState *env)
-{
-return ARMMMUIdxBit_E2;
-}
-
 static void tlbi_aa64_rvae2_write(CPUARMState *env,
   const ARMCPRegInfo *ri,
   uint64_t value)
-- 
2.41.0
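
For readers without a QEMU tree at hand, the selection logic described in this patch can be sketched as a standalone program. This is only an illustration under stated assumptions: the `SK_*` names, the bit values used for the MMU index mask and the `sk_tbi` struct are hypothetical stand-ins, not QEMU's definitions. Only the position of HCR_EL2.E2H (bit 34) and the 56/64 convention from the tlbbits_for_regime() comment are taken as given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* HCR_EL2.E2H is bit 34 of HCR_EL2. */
#define SK_HCR_E2H (UINT64_C(1) << 34)

/* Hypothetical stand-ins for ARMMMUIdxBit_*; the real values differ. */
enum {
    SK_IDXBIT_E2        = 1 << 0,
    SK_IDXBIT_E20_0     = 1 << 1,
    SK_IDXBIT_E20_2     = 1 << 2,
    SK_IDXBIT_E20_2_PAN = 1 << 3,
};

/* Hypothetical model of the top-byte-ignore controls of one regime. */
struct sk_tbi {
    bool tbi0; /* lower range, or the only range for EL2 */
    bool tbi1; /* upper range, EL2&0 only */
};

/* Pick the set of TLB indexes that a TLBI VAE2* must invalidate. */
static int sk_vae2_tlbmask(uint64_t hcr_el2)
{
    if (hcr_el2 & SK_HCR_E2H) {
        return SK_IDXBIT_E20_2 | SK_IDXBIT_E20_2_PAN | SK_IDXBIT_E20_0;
    }
    return SK_IDXBIT_E2;
}

/* Return 56 if the top byte is ignored for this address, 64 otherwise. */
static int sk_vae2_tlbbits(uint64_t hcr_el2, const struct sk_tbi *tbi,
                           uint64_t addr)
{
    bool ignore_top_byte;

    assert(tbi != NULL);
    if (hcr_el2 & SK_HCR_E2H) {
        /* EL2&0: address bit 55 selects the upper or lower range. */
        ignore_top_byte = ((addr >> 55) & 1) ? tbi->tbi1 : tbi->tbi0;
    } else {
        /* EL2: a single range, modelled here with tbi0 only. */
        ignore_top_byte = tbi->tbi0;
    }
    return ignore_top_byte ? 56 : 64;
}
```

With E2H set, the same address can yield a different number of significant bits depending on which of the two ranges it falls in, which is why the patch derives both the mask and the bits from HCR_EL2.E2H.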

Re: [PATCH 3/5] target/arm: Skip granule protection checks for AT instructions

2023-07-21 Thread Jean-Philippe Brucker

On Thu, Jul 20, 2023 at 05:39:56PM +0100, Peter Maydell wrote:
> On Wed, 19 Jul 2023 at 16:56, Jean-Philippe Brucker
>  wrote:
> >
> > GPC checks are not performed on the output address for AT instructions,
> > as stated by ARM DDI 0487J in D8.12.2:
> >
> >   When populating PAR_EL1 with the result of an address translation
> >   instruction, granule protection checks are not performed on the final
> >   output address of a successful translation.
> >
> > Rename get_phys_addr_with_secure(), since it's only used to handle AT
> > instructions.
> >
> > Signed-off-by: Jean-Philippe Brucker 
> > ---
> > This incidentally fixes a problem with AT S1E1 instructions which can
> > output an IPA and should definitely not cause a GPC.
> > ---
> >  target/arm/internals.h | 25 ++---
> >  target/arm/helper.c|  8 ++--
> >  target/arm/ptw.c   | 11 ++-
> >  3 files changed, 26 insertions(+), 18 deletions(-)
> >
> > diff --git a/target/arm/internals.h b/target/arm/internals.h
> > index 0f01bc32a8..fc90c364f7 100644
> > --- a/target/arm/internals.h
> > +++ b/target/arm/internals.h
> > @@ -1190,12 +1190,11 @@ typedef struct GetPhysAddrResult {
> >  } GetPhysAddrResult;
> >
> >  /**
> > - * get_phys_addr_with_secure: get the physical address for a virtual address
> > + * get_phys_addr: get the physical address for a virtual address
> >   * @env: CPUARMState
> >   * @address: virtual address to get physical address for
> >   * @access_type: 0 for read, 1 for write, 2 for execute
> >   * @mmu_idx: MMU index indicating required translation regime
> > - * @is_secure: security state for the access
> >   * @result: set on translation success.
> >   * @fi: set to fault info if the translation fails
> >   *
> > @@ -1212,26 +1211,30 @@ typedef struct GetPhysAddrResult {
> >   *  * for PSMAv5 based systems we don't bother to return a full FSR format
> >   *value.
> >   */
> > -bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
> > -   MMUAccessType access_type,
> > -   ARMMMUIdx mmu_idx, bool is_secure,
> > -   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
> > +bool get_phys_addr(CPUARMState *env, target_ulong address,
> > +   MMUAccessType access_type, ARMMMUIdx mmu_idx,
> > +   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
> >  __attribute__((nonnull));
> 
> 
> What is going on in this bit of the patch ? We already
> have a prototype for get_phys_addr() with a doc comment.
> Is this git's diff-output producing something silly
> for a change that is not logically touching get_phys_addr()'s
> prototype and comment at all?

I swapped the two prototypes in order to make the new comment for
get_phys_addr_with_secure_nogpc() more clear. Tried to convey that
get_phys_addr() is the normal function and
get_phys_addr_with_secure_nogpc() is special. So git is working as
expected and this is a style change, I can switch them back if you prefer.

Thanks,
Jean

> 
> >  /**
> > - * get_phys_addr: get the physical address for a virtual address
> > + * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
> > + *  address
> >   * @env: CPUARMState
> >   * @address: virtual address to get physical address for
> >   * @access_type: 0 for read, 1 for write, 2 for execute
> >   * @mmu_idx: MMU index indicating required translation regime
> > + * @is_secure: security state for the access
> >   * @result: set on translation success.
> >   * @fi: set to fault info if the translation fails
> >   *
> > - * Similarly, but use the security regime of @mmu_idx.
> > + * Similar to get_phys_addr, but use the given security regime and don't perform
> > + * a Granule Protection Check on the resulting address.
> >   */
> > -bool get_phys_addr(CPUARMState *env, target_ulong address,
> > -   MMUAccessType access_type, ARMMMUIdx mmu_idx,
> > -   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
> > +bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
> > + MMUAccessType access_type,
> > + ARMMMUIdx mmu_idx, bool is_secure,
> > + GetPhysAddrResult *result,
> > + ARMMMUFaultInfo *fi)
> >  __attribute__((nonnull));
> 
> thanks
> -- PMM

Re: [PATCH 2/5] target/arm/helper: Fix vae2_tlbmask()

2023-07-21 Thread Jean-Philippe Brucker

On Thu, Jul 20, 2023 at 05:35:49PM +0100, Peter Maydell wrote:
> On Wed, 19 Jul 2023 at 16:56, Jean-Philippe Brucker
>  wrote:
> >
> > When HCR_EL2.E2H is enabled, TLB entries are formed using the EL2&0
> > translation regime, instead of the EL2 translation regime. The TLB VAE2*
> > instructions invalidate the regime that corresponds to the current value
> > of HCR_EL2.E2H.
> >
> > At the moment we only invalidate the EL2 translation regime. This causes
> > problems with RMM, which issues TLBI VAE2IS instructions with
> > HCR_EL2.E2H enabled. Update vae2_tlbmask() to take HCR_EL2.E2H into
> > account.
> >
> > Signed-off-by: Jean-Philippe Brucker 
> > ---
> >  target/arm/helper.c | 26 ++
> >  1 file changed, 18 insertions(+), 8 deletions(-)
> >
> > diff --git a/target/arm/helper.c b/target/arm/helper.c
> > index e1b3db6f5f..07a9ac70f5 100644
> > --- a/target/arm/helper.c
> > +++ b/target/arm/helper.c
> > @@ -4663,6 +4663,21 @@ static int vae1_tlbmask(CPUARMState *env)
> >  return mask;
> >  }
> >
> > +static int vae2_tlbmask(CPUARMState *env)
> > +{
> > +uint64_t hcr = arm_hcr_el2_eff(env);
> > +uint16_t mask;
> > +
> > +if (hcr & HCR_E2H) {
> > +mask = ARMMMUIdxBit_E20_2 |
> > +   ARMMMUIdxBit_E20_2_PAN |
> > +   ARMMMUIdxBit_E20_0;
> > +} else {
> > +mask = ARMMMUIdxBit_E2;
> > +}
> > +return mask;
> > +}
> > +
> >  /* Return 56 if TBI is enabled, 64 otherwise. */
> >  static int tlbbits_for_regime(CPUARMState *env, ARMMMUIdx mmu_idx,
> >uint64_t addr)
> 
> > @@ -4838,11 +4853,11 @@ static void tlbi_aa64_vae2is_write(CPUARMState *env, const ARMCPRegInfo *ri,
> > uint64_t value)
> >  {
> >  CPUState *cs = env_cpu(env);
> > +int mask = vae2_tlbmask(env);
> >  uint64_t pageaddr = sextract64(value << 12, 0, 56);
> >  int bits = tlbbits_for_regime(env, ARMMMUIdx_E2, pageaddr);
> 
> Shouldn't the argument to tlbbits_for_regime() also change
> if we're dealing with the EL2&0 regime rather than EL2 ?

Yes, it affects the result since EL2&0 has two ranges

Thanks,
Jean

> 
> >
> > -tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr,
> > -  ARMMMUIdxBit_E2, bits);
> > +tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, mask, bits);
> >  }
> 
> thanks
> -- PMM

Re: [PATCH 0/5] target/arm: Fixes for RME

2023-07-20 Thread Jean-Philippe Brucker

On Thu, Jul 20, 2023 at 01:05:58PM +0100, Peter Maydell wrote:
> On Wed, 19 Jul 2023 at 16:56, Jean-Philippe Brucker
>  wrote:
> >
> > With these patches I'm able to boot a Realm guest under
> > "-cpu max,x-rme=on". They are based on Peter's series which fixes
> > handling of NSTable:
> > https://lore.kernel.org/qemu-devel/20230714154648.327466-1-peter.mayd...@linaro.org/
> 
> Thanks for testing this -- this is a lot closer to
> working out of the box than I thought we might be :-)
> I'm tempted to try to put these fixes (and my ptw patchset)
> into 8.1, but OTOH I suspect work on Realm guests will probably
> still want to use a bleeding-edge QEMU for other bugs we're
> going to discover over the next few months, so IDK. We'll
> see how code review goes on those, I guess.
> 
> > Running a Realm guest requires components at EL3 and R-EL2. Some rough
> > support for TF-A and RMM is available here:
> > https://jpbrucker.net/git/tf-a/log/?h=qemu-rme
> > https://jpbrucker.net/git/rmm/log/?h=qemu-rme
> > I'll clean this up before sending it out.
> >
> > I also need to manually disable FEAT_SME in QEMU in order to boot this,
> 
> Do you mean you needed to do something more invasive than
> '-cpu max,x-rme=on,sme=off' ?

Ah no, I hadn't noticed there was a sme=off switch, that's much better

Thanks,
Jean

> 
> > otherwise the Linux host fails to boot because hyp-stub accesses to SME
> > regs are trapped to EL3, which doesn't support RME+SME at the moment.
> > The right fix is probably in TF-A but I haven't investigated yet.
> 
> thanks
> -- PMM

Re: [PATCH for-8.1] virtio-iommu: Standardize granule extraction and formatting

2023-07-20 Thread Jean-Philippe Brucker

On Tue, Jul 18, 2023 at 08:21:36PM +0200, Eric Auger wrote:
> At several locations we compute the granule from the config
> page_size_mask using ctz() and then format it in traces using
> BIT(). As the page_size_mask is 64b we should use ctz64 and
> BIT_ULL() for formatting. We failed to be consistent.
> 
> Note the page_size_mask is guaranteed to be non-null. The spec
> mandates the device to set at least one bit, so ctz64 cannot
> return 64. This is guaranteed by the fact the device
> initializes the page_size_mask to qemu_target_page_mask()
> and then the page_size_mask is further constrained by
> virtio_iommu_set_page_size_mask() callback which can't
> result in a new mask being null. So if Coverity complains
> around those ctz64/BIT_ULL with CID 1517772 this is a false
> positive
> 
> Signed-off-by: Eric Auger 
> Fixes: 94df5b2180 ("virtio-iommu: Fix 64kB host page size VFIO device assignment")

Reviewed-by: Jean-Philippe Brucker 

> ---
>  hw/virtio/virtio-iommu.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 201127c488..c6ee4d7a3c 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -852,17 +852,19 @@ static IOMMUTLBEntry virtio_iommu_translate(IOMMUMemoryRegion *mr, hwaddr addr,
>  VirtIOIOMMUEndpoint *ep;
>  uint32_t sid, flags;
>  bool bypass_allowed;
> +int granule;
>  bool found;
>  int i;
>  
>  interval.low = addr;
>  interval.high = addr + 1;
> +granule = ctz64(s->config.page_size_mask);
>  
>  IOMMUTLBEntry entry = {
>  .target_as = &address_space_memory,
>  .iova = addr,
>  .translated_addr = addr,
> -.addr_mask = (1 << ctz32(s->config.page_size_mask)) - 1,
> +.addr_mask = BIT_ULL(granule) - 1,
>  .perm = IOMMU_NONE,
>  };
>  
> @@ -1115,7 +1117,7 @@ static int virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
>  if (s->granule_frozen) {
>  int cur_granule = ctz64(cur_mask);
>  
> -if (!(BIT(cur_granule) & new_mask)) {
> +if (!(BIT_ULL(cur_granule) & new_mask)) {
>  error_setg(errp, "virtio-iommu %s does not support frozen granule 0x%llx",
> mr->parent_obj.name, BIT_ULL(cur_granule));
>  return -1;
> @@ -1161,7 +1163,7 @@ static void virtio_iommu_freeze_granule(Notifier *notifier, void *data)
>  }
>  s->granule_frozen = true;
>  granule = ctz64(s->config.page_size_mask);
> -trace_virtio_iommu_freeze_granule(BIT(granule));
> +trace_virtio_iommu_freeze_granule(BIT_ULL(granule));
>  }
>  
>  static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
> -- 
> 2.38.1
>
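
As a rough illustration of the point made in the commit message above, the granule computation can be reproduced outside QEMU. This is a sketch under assumptions: `sk_ctz64()` and `sk_granule_addr_mask()` are hypothetical helpers written for this note, standing in for QEMU's ctz64() and the `BIT_ULL(granule) - 1` expression. It relies on the mask being non-zero, as the commit message argues.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit of a non-zero 64-bit value. */
static int sk_ctz64(uint64_t v)
{
    int n = 0;

    assert(v != 0); /* the device always sets at least one bit */
    while (!(v & 1)) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Address mask covering one page of the smallest supported size. */
static uint64_t sk_granule_addr_mask(uint64_t page_size_mask)
{
    int granule = sk_ctz64(page_size_mask);

    /* 64-bit shift: the equivalent of BIT_ULL(granule) - 1. */
    return (UINT64_C(1) << granule) - 1;
}
```

The replaced expression, `(1 << ctz32(...)) - 1`, passes the 64-bit mask through a 32-bit helper and shifts a plain int, so it cannot represent a granule whose lowest set bit is at position 32 or above. The 64-bit form does not have that limit.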

[PATCH 0/5] target/arm: Fixes for RME

2023-07-19 Thread Jean-Philippe Brucker

With these patches I'm able to boot a Realm guest under
"-cpu max,x-rme=on". They are based on Peter's series which fixes
handling of NSTable:
https://lore.kernel.org/qemu-devel/20230714154648.327466-1-peter.mayd...@linaro.org/


Running a Realm guest requires components at EL3 and R-EL2. Some rough
support for TF-A and RMM is available here:
https://jpbrucker.net/git/tf-a/log/?h=qemu-rme
https://jpbrucker.net/git/rmm/log/?h=qemu-rme
I'll clean this up before sending it out.

I also need to manually disable FEAT_SME in QEMU in order to boot this,
otherwise the Linux host fails to boot because hyp-stub accesses to SME
regs are trapped to EL3, which doesn't support RME+SME at the moment.
The right fix is probably in TF-A but I haven't investigated yet.

Jean-Philippe Brucker (5):
  target/arm/ptw: Load stage-2 tables from realm physical space
  target/arm/helper: Fix vae2_tlbmask()
  target/arm: Skip granule protection checks for AT instructions
  target/arm: Pass security space rather than flag for AT instructions
  target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK

 target/arm/internals.h | 25 --
 target/arm/helper.c| 78 --
 target/arm/ptw.c   | 19 ++
 3 files changed, 79 insertions(+), 43 deletions(-)

-- 
2.41.0

[PATCH 4/5] target/arm: Pass security space rather than flag for AT instructions

2023-07-19 Thread Jean-Philippe Brucker

At the moment we only handle Secure and Nonsecure security spaces for
the AT instructions. Add support for Realm and Root.

For AArch64, arm_security_space() gives the desired space. ARM DDI0487J
says (R_NYXTL):

  If EL3 is implemented, then when an address translation instruction
  that applies to an Exception level lower than EL3 is executed, the
  Effective value of SCR_EL3.{NSE, NS} determines the target Security
  state that the instruction applies to.

For AArch32, some instructions can access NonSecure space from Secure,
so we still need to pass the state explicitly to do_ats_write().

Signed-off-by: Jean-Philippe Brucker 
---
I haven't tested AT instructions in Realm/Root space yet, but it looks
like the patch is needed. RMM doesn't issue AT instructions like KVM
does in non-secure state (which triggered the bug in the previous
patch).
---
 target/arm/internals.h | 18 +-
 target/arm/helper.c| 27 ---
 target/arm/ptw.c   | 12 ++--
 3 files changed, 27 insertions(+), 30 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index fc90c364f7..cf13bb94f5 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1217,24 +1217,24 @@ bool get_phys_addr(CPUARMState *env, target_ulong address,
 __attribute__((nonnull));
 
 /**
- * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
- *  address
+ * get_phys_addr_with_space_nogpc: get the physical address for a virtual
+ * address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
+ * @space: security space for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similar to get_phys_addr, but use the given security regime and don't perform
+ * Similar to get_phys_addr, but use the given security space and don't perform
  * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
- MMUAccessType access_type,
- ARMMMUIdx mmu_idx, bool is_secure,
- GetPhysAddrResult *result,
- ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_space_nogpc(CPUARMState *env, target_ulong address,
+MMUAccessType access_type,
+ARMMMUIdx mmu_idx, ARMSecuritySpace space,
+GetPhysAddrResult *result,
+ARMMMUFaultInfo *fi)
 __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 3ee2bb5fe1..2017b11795 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3357,7 +3357,7 @@ static int par_el1_shareability(GetPhysAddrResult *res)
 
 static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
  MMUAccessType access_type, ARMMMUIdx mmu_idx,
- bool is_secure)
+ ARMSecuritySpace ss)
 {
 bool ret;
 uint64_t par64;
@@ -3369,8 +3369,8 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
  * I_MXTJT: Granule protection checks are not performed on the final address
  * of a successful translation.
  */
-ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
-  is_secure, &res, &fi);
+ret = get_phys_addr_with_space_nogpc(env, value, access_type, mmu_idx, ss,
+ &res, &fi);
 
 /*
  * ATS operations only do S1 or S1+S2 translations, so we never
@@ -3535,7 +3535,7 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
 uint64_t par64;
 ARMMMUIdx mmu_idx;
 int el = arm_current_el(env);
-bool secure = arm_is_secure_below_el3(env);
+ARMSecuritySpace ss = arm_security_space(env);
 
 switch (ri->opc2 & 6) {
 case 0:
@@ -3543,10 +3543,9 @@ static void ats_write(CPUARMState *env, const ARMCPRegInfo *ri, uint64_t value)
 switch (el) {
 case 3:
 mmu_idx = ARMMMUIdx_E3;
-secure = true;
 break;
 case 2:
-g_assert(!secure);  /* ARMv8.4-SecEL2 is 64-bit only */
+g_assert(ss != ARMSS_Secure);  /* ARMv8.4-SecEL2 is 64-bit only */
 /* fall through */
 case 1:
 if (ri->crm == 9 && (env->uncached_cpsr & CPSR_PAN)) {
@@ -3564,10 +3563,9 @@ static void ats_write(CPUARMState *env, const 
ARMCPRegInfo *ri, uint64_t v
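
The SCR_EL3.{NSE, NS} decoding that the commit message of this patch relies on can be sketched as follows. This is a simplified model, not QEMU's arm_security_space(): it assumes FEAT_RME is implemented and EL3 is AArch64, and the `SK_*` names and the `sk_*` functions are hypothetical. The bit positions (NS at bit 0, NSE at bit 62) and the four encodings follow the Arm architecture.

```c
#include <assert.h>
#include <stdint.h>

enum sk_space { SK_SECURE, SK_NONSECURE, SK_ROOT, SK_REALM, SK_RESERVED };

#define SK_SCR_NS  (UINT64_C(1) << 0)
#define SK_SCR_NSE (UINT64_C(1) << 62)

/* Security state selected by SCR_EL3.{NSE, NS} for levels below EL3. */
static enum sk_space sk_space_below_el3(uint64_t scr_el3)
{
    if (scr_el3 & SK_SCR_NSE) {
        /* {1, 1} is Realm; {1, 0} is a reserved encoding. */
        return (scr_el3 & SK_SCR_NS) ? SK_REALM : SK_RESERVED;
    }
    /* {0, 1} is Non-secure; {0, 0} is Secure. */
    return (scr_el3 & SK_SCR_NS) ? SK_NONSECURE : SK_SECURE;
}

/* Current security space for code running at the given exception level. */
static enum sk_space sk_security_space(int el, uint64_t scr_el3)
{
    assert(el >= 0 && el <= 3);
    if (el == 3) {
        return SK_ROOT; /* with FEAT_RME, EL3 always runs in Root */
    }
    return sk_space_below_el3(scr_el3);
}
```

A boolean is_secure flag can only distinguish the first two encodings, which is the limitation the patch removes by passing an ARMSecuritySpace instead.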

[PATCH 5/5] target/arm/helper: Implement CNTHCTL_EL2.CNT[VP]MASK

2023-07-19 Thread Jean-Philippe Brucker

When FEAT_RME is implemented, these bits override the value of
CNT[VP]_CTL_EL0.IMASK in Realm and Root state.

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/helper.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index 2017b11795..5b173a827f 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -2608,6 +2608,23 @@ static uint64_t gt_get_countervalue(CPUARMState *env)
 return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) / gt_cntfrq_period_ns(cpu);
 }
 
+static bool gt_is_masked(CPUARMState *env, int timeridx)
+{
+ARMSecuritySpace ss = arm_security_space(env);
+
+/*
+ * If bits CNTHCTL_EL2.CNT[VP]MASK are set, they override
+ * CNT[VP]_CTL_EL0.IMASK. They are RES0 in Secure and NonSecure state.
+ */
+if ((ss == ARMSS_Root || ss == ARMSS_Realm) &&
+((timeridx == GTIMER_VIRT && extract64(env->cp15.cnthctl_el2, 18, 1)) ||
+ (timeridx == GTIMER_PHYS && extract64(env->cp15.cnthctl_el2, 19, 1)))) {
+return true;
+}
+
+return env->cp15.c14_timer[timeridx].ctl & 2;
+}
+
 static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 {
 ARMGenericTimer *gt = &cpu->env.cp15.c14_timer[timeridx];
@@ -2627,7 +2644,7 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 
 gt->ctl = deposit32(gt->ctl, 2, 1, istatus);
 
-irqstate = (istatus && !(gt->ctl & 2));
+irqstate = (istatus && !gt_is_masked(&cpu->env, timeridx));
 qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
 
 if (istatus) {
@@ -2759,7 +2776,7 @@ static void gt_ctl_write(CPUARMState *env, const ARMCPRegInfo *ri,
  * IMASK toggled: don't need to recalculate,
  * just set the interrupt line based on ISTATUS
  */
-int irqstate = (oldval & 4) && !(value & 2);
+int irqstate = (oldval & 4) && !gt_is_masked(env, timeridx);
 
 trace_arm_gt_imask_toggle(timeridx, irqstate);
 qemu_set_irq(cpu->gt_timer_outputs[timeridx], irqstate);
-- 
2.41.0
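
The masking rule implemented by gt_is_masked() can be modelled in isolation as shown below. This is a sketch, not the patch itself: the `sk_*` names and the two-entry timer enum are hypothetical, and only the physical and virtual timers are modelled. The bit positions are the ones used in the patch (CNTVMASK at bit 18 and CNTPMASK at bit 19 of CNTHCTL_EL2, IMASK at bit 1 and ISTATUS at bit 2 of the timer control register).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sk_space { SK_SECURE, SK_NONSECURE, SK_ROOT, SK_REALM };
enum sk_timer { SK_TIMER_PHYS, SK_TIMER_VIRT };

#define SK_CNTHCTL_CNTVMASK (UINT64_C(1) << 18)
#define SK_CNTHCTL_CNTPMASK (UINT64_C(1) << 19)
#define SK_CTL_IMASK        (1u << 1)
#define SK_CTL_ISTATUS      (1u << 2)

/* True if the timer interrupt is masked, by the override or by IMASK. */
static bool sk_timer_is_masked(enum sk_space ss, enum sk_timer timer,
                               uint64_t cnthctl_el2, uint32_t ctl)
{
    assert(timer == SK_TIMER_PHYS || timer == SK_TIMER_VIRT);

    /* The override bits only take effect in Root and Realm state. */
    if (ss == SK_ROOT || ss == SK_REALM) {
        uint64_t override = (timer == SK_TIMER_VIRT) ? SK_CNTHCTL_CNTVMASK
                                                     : SK_CNTHCTL_CNTPMASK;
        if (cnthctl_el2 & override) {
            return true;
        }
    }
    return (ctl & SK_CTL_IMASK) != 0;
}

/* Level of the interrupt line: condition met and not masked. */
static bool sk_timer_irq_level(enum sk_space ss, enum sk_timer timer,
                               uint64_t cnthctl_el2, uint32_t ctl)
{
    return (ctl & SK_CTL_ISTATUS) &&
           !sk_timer_is_masked(ss, timer, cnthctl_el2, ctl);
}
```

In this model a set CNTVMASK bit silences the virtual timer in Realm state even when IMASK is clear, while the same bit has no effect in Non-secure state, where it is RES0.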

[PATCH 2/5] target/arm/helper: Fix vae2_tlbmask()

2023-07-19 Thread Jean-Philippe Brucker

When HCR_EL2.E2H is enabled, TLB entries are formed using the EL2&0
translation regime, instead of the EL2 translation regime. The TLB VAE2*
instructions invalidate the regime that corresponds to the current value
of HCR_EL2.E2H.

At the moment we only invalidate the EL2 translation regime. This causes
problems with RMM, which issues TLBI VAE2IS instructions with
HCR_EL2.E2H enabled. Update vae2_tlbmask() to take HCR_EL2.E2H into
account.

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/helper.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index e1b3db6f5f..07a9ac70f5 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -4663,6 +4663,21 @@ static int vae1_tlbmask(CPUARMState *env)
 return mask;
 }
 
+static int vae2_tlbmask(CPUARMState *env)
+{
+uint64_t hcr = arm_hcr_el2_eff(env);
+uint16_t mask;
+
+if (hcr & HCR_E2H) {
+mask = ARMMMUIdxBit_E20_2 |
+   ARMMMUIdxBit_E20_2_PAN |
+   ARMMMUIdxBit_E20_0;
+} else {
+mask = ARMMMUIdxBit_E2;
+}
+return mask;
+}
+
 /* Return 56 if TBI is enabled, 64 otherwise. */
 static int tlbbits_for_regime(CPUARMState *env, ARMMMUIdx mmu_idx,
   uint64_t addr)
@@ -4781,7 +4796,7 @@ static void tlbi_aa64_vae2_write(CPUARMState *env, const ARMCPRegInfo *ri,
  * flush-last-level-only.
  */
 CPUState *cs = env_cpu(env);
-int mask = e2_tlbmask(env);
+int mask = vae2_tlbmask(env);
 uint64_t pageaddr = sextract64(value << 12, 0, 56);
 
 tlb_flush_page_by_mmuidx(cs, pageaddr, mask);
@@ -4838,11 +4853,11 @@ static void tlbi_aa64_vae2is_write(CPUARMState *env, const ARMCPRegInfo *ri,
uint64_t value)
 {
 CPUState *cs = env_cpu(env);
+int mask = vae2_tlbmask(env);
 uint64_t pageaddr = sextract64(value << 12, 0, 56);
 int bits = tlbbits_for_regime(env, ARMMMUIdx_E2, pageaddr);
 
-tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr,
-  ARMMMUIdxBit_E2, bits);
+tlb_flush_page_bits_by_mmuidx_all_cpus_synced(cs, pageaddr, mask, bits);
 }
 
 static void tlbi_aa64_vae3is_write(CPUARMState *env, const ARMCPRegInfo *ri,
@@ -5014,11 +5029,6 @@ static void tlbi_aa64_rvae1is_write(CPUARMState *env,
 do_rvae_write(env, value, vae1_tlbmask(env), true);
 }
 
-static int vae2_tlbmask(CPUARMState *env)
-{
-return ARMMMUIdxBit_E2;
-}
-
 static void tlbi_aa64_rvae2_write(CPUARMState *env,
   const ARMCPRegInfo *ri,
   uint64_t value)
-- 
2.41.0

[PATCH 3/5] target/arm: Skip granule protection checks for AT instructions

2023-07-19 Thread Jean-Philippe Brucker

GPC checks are not performed on the output address for AT instructions,
as stated by ARM DDI 0487J in D8.12.2:

  When populating PAR_EL1 with the result of an address translation
  instruction, granule protection checks are not performed on the final
  output address of a successful translation.

Rename get_phys_addr_with_secure(), since it's only used to handle AT
instructions.

Signed-off-by: Jean-Philippe Brucker 
---
This incidentally fixes a problem with AT S1E1 instructions which can
output an IPA and should definitely not cause a GPC.
---
 target/arm/internals.h | 25 ++---
 target/arm/helper.c|  8 ++--
 target/arm/ptw.c   | 11 ++-
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index 0f01bc32a8..fc90c364f7 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -1190,12 +1190,11 @@ typedef struct GetPhysAddrResult {
 } GetPhysAddrResult;
 
 /**
- * get_phys_addr_with_secure: get the physical address for a virtual address
+ * get_phys_addr: get the physical address for a virtual address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
- * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
@@ -1212,26 +1211,30 @@ typedef struct GetPhysAddrResult {
  *  * for PSMAv5 based systems we don't bother to return a full FSR format
  *value.
  */
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-   MMUAccessType access_type,
-   ARMMMUIdx mmu_idx, bool is_secure,
-   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr(CPUARMState *env, target_ulong address,
+   MMUAccessType access_type, ARMMMUIdx mmu_idx,
+   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
 __attribute__((nonnull));
 
 /**
- * get_phys_addr: get the physical address for a virtual address
+ * get_phys_addr_with_secure_nogpc: get the physical address for a virtual
+ *  address
  * @env: CPUARMState
  * @address: virtual address to get physical address for
  * @access_type: 0 for read, 1 for write, 2 for execute
  * @mmu_idx: MMU index indicating required translation regime
+ * @is_secure: security state for the access
  * @result: set on translation success.
  * @fi: set to fault info if the translation fails
  *
- * Similarly, but use the security regime of @mmu_idx.
+ * Similar to get_phys_addr, but use the given security regime and don't perform
+ * a Granule Protection Check on the resulting address.
  */
-bool get_phys_addr(CPUARMState *env, target_ulong address,
-   MMUAccessType access_type, ARMMMUIdx mmu_idx,
-   GetPhysAddrResult *result, ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+ MMUAccessType access_type,
+ ARMMMUIdx mmu_idx, bool is_secure,
+ GetPhysAddrResult *result,
+ ARMMMUFaultInfo *fi)
 __attribute__((nonnull));
 
 bool pmsav8_mpu_lookup(CPUARMState *env, uint32_t address,
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 07a9ac70f5..3ee2bb5fe1 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -3365,8 +3365,12 @@ static uint64_t do_ats_write(CPUARMState *env, uint64_t value,
 ARMMMUFaultInfo fi = {};
 GetPhysAddrResult res = {};
 
-ret = get_phys_addr_with_secure(env, value, access_type, mmu_idx,
-is_secure, &res, &fi);
+/*
+ * I_MXTJT: Granule protection checks are not performed on the final address
+ * of a successful translation.
+ */
+ret = get_phys_addr_with_secure_nogpc(env, value, access_type, mmu_idx,
+  is_secure, &res, &fi);
 
 /*
  * ATS operations only do S1 or S1+S2 translations, so we never
diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 6318e13b98..1aef2b8cef 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -3412,16 +3412,17 @@ static bool get_phys_addr_gpc(CPUARMState *env, S1Translate *ptw,
 return false;
 }
 
-bool get_phys_addr_with_secure(CPUARMState *env, target_ulong address,
-   MMUAccessType access_type, ARMMMUIdx mmu_idx,
-   bool is_secure, GetPhysAddrResult *result,
-   ARMMMUFaultInfo *fi)
+bool get_phys_addr_with_secure_nogpc(CPUARMState *env, target_ulong address,
+ MMUAccessType access_type,
+ ARMMMUIdx mmu_idx, bool

[PATCH 1/5] target/arm/ptw: Load stage-2 tables from realm physical space

2023-07-19 Thread Jean-Philippe Brucker

In realm state, stage-2 translation tables are fetched from the realm
physical address space (R_PGRQD).

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/ptw.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index d1de934702..6318e13b98 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -164,7 +164,11 @@ static ARMMMUIdx ptw_idx_for_stage_2(CPUARMState *env, ARMMMUIdx stage2idx)
  * an NS stage 1+2 lookup while the NS bit is 0.)
  */
 if (!arm_is_secure_below_el3(env) || !arm_el_is_aa64(env, 3)) {
-return ARMMMUIdx_Phys_NS;
+if (arm_security_space_below_el3(env) == ARMSS_Realm) {
+return ARMMMUIdx_Phys_Realm;
+} else {
+return ARMMMUIdx_Phys_NS;
+}
 }
 if (stage2idx == ARMMMUIdx_Stage2_S) {
 s2walk_secure = !(env->cp15.vstcr_el2 & VSTCR_SW);
-- 
2.41.0

Re: [PATCH for-8.1 2/3] target/arm: Fix S1_ptw_translate() debug path

2023-07-11 Thread Jean-Philippe Brucker

On Mon, Jul 10, 2023 at 04:21:29PM +0100, Peter Maydell wrote:
> In commit XXX we rearranged the logic in S1_ptw_translate() so that
> the debug-access "call get_phys_addr_*" codepath is used both when S1
> is doing ptw reads from stage 2 and when it is doing ptw reads from
> physical memory.  However, we didn't update the calculation of
> s2ptw->in_space and s2ptw->in_secure to account for the "ptw reads
> from physical memory" case.  This meant that debug accesses when in
> Secure state broke.
> 
> Create a new function S2_security_space() which returns the
> correct security space to use for the ptw load, and use it to
> determine the correct .in_secure and .in_space fields for the
> stage 2 lookup for the ptw load.
> 
> Reported-by: Jean-Philippe Brucker 
> Fixes: fe4a5472ccd6 ("target/arm: Use get_phys_addr_with_struct in 
> S1_ptw_translate")
> Signed-off-by: Peter Maydell 

Thanks, this fixes tf-a boot with semihosting

Tested-by: Jean-Philippe Brucker 

> ---
>  target/arm/ptw.c | 37 -
>  1 file changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/target/arm/ptw.c b/target/arm/ptw.c
> index 21749375f97..c0b9cee5843 100644
> --- a/target/arm/ptw.c
> +++ b/target/arm/ptw.c
> @@ -485,11 +485,39 @@ static bool S2_attrs_are_device(uint64_t hcr, uint8_t 
> attrs)
>  }
>  }
>  
> +static ARMSecuritySpace S2_security_space(ARMSecuritySpace s1_space,
> +  ARMMMUIdx s2_mmu_idx)
> +{
> +/*
> + * Return the security space to use for stage 2 when doing
> + * the S1 page table descriptor load.
> + */
> +if (regime_is_stage2(s2_mmu_idx)) {
> +/*
> + * The security space for ptw reads is almost always the same
> + * as that of the security space of the stage 1 translation.
> + * The only exception is when stage 1 is Secure; in that case
> + * the ptw read might be to the Secure or the NonSecure space
> + * (but never Realm or Root), and the s2_mmu_idx tells us which.
> + * Root translations are always single-stage.
> + */
> +if (s1_space == ARMSS_Secure) {
> +return arm_secure_to_space(s2_mmu_idx == ARMMMUIdx_Stage2_S);
> +} else {
> +assert(s2_mmu_idx != ARMMMUIdx_Stage2_S);
> +assert(s1_space != ARMSS_Root);
> +return s1_space;
> +}
> +} else {
> +/* ptw loads are from phys: the mmu idx itself says which space */
> +return arm_phys_to_space(s2_mmu_idx);
> +}
> +}
> +
>  /* Translate a S1 pagetable walk through S2 if needed.  */
>  static bool S1_ptw_translate(CPUARMState *env, S1Translate *ptw,
>   hwaddr addr, ARMMMUFaultInfo *fi)
>  {
> -ARMSecuritySpace space = ptw->in_space;
>  bool is_secure = ptw->in_secure;
>  ARMMMUIdx mmu_idx = ptw->in_mmu_idx;
>  ARMMMUIdx s2_mmu_idx = ptw->in_ptw_idx;
> @@ -502,13 +530,12 @@ static bool S1_ptw_translate(CPUARMState *env, 
> S1Translate *ptw,
>   * From gdbstub, do not use softmmu so that we don't modify the
>   * state of the cpu at all, including softmmu tlb contents.
>   */
> +ARMSecuritySpace s2_space = S2_security_space(ptw->in_space, 
> s2_mmu_idx);
>  S1Translate s2ptw = {
>  .in_mmu_idx = s2_mmu_idx,
>  .in_ptw_idx = ptw_idx_for_stage_2(env, s2_mmu_idx),
> -.in_secure = s2_mmu_idx == ARMMMUIdx_Stage2_S,
> -.in_space = (s2_mmu_idx == ARMMMUIdx_Stage2_S ? ARMSS_Secure
> - : space == ARMSS_Realm ? ARMSS_Realm
> - : ARMSS_NonSecure),
> +.in_secure = arm_space_is_secure(s2_space),
> +.in_space = s2_space,
>  .in_debug = true,
>  };
>  GetPhysAddrResult s2 = { };
> -- 
> 2.34.1
>

Re: [PATCH v2 0/2] VIRTIO-IOMMU/VFIO page size related fixes

2023-07-07 Thread Jean-Philippe Brucker

On Wed, Jul 05, 2023 at 06:51:16PM +0200, Eric Auger wrote:
> When assigning a host device and protecting it with the virtio-iommu we may
> end up with qemu crashing with
> 
> qemu-kvm: virtio-iommu page mask 0xf000 is incompatible
> with mask 0x2001
> qemu: hardware error: vfio: DMA mapping failed, unable to continue
> 
> This happens if the host uses a 64kB page size and constrains
> the physical IOMMU to use a 64kB page size. By default 4kB granule is used
> by the qemu virtio-iommu device and this latter becomes aware of the 64kB
> requirement too late, after the machine init, when the vfio device domain is
> attached. virtio_iommu_set_page_size_mask() fails and this causes
> vfio_listener_region_add() to end up with hw_error(). Currently the
> granule is global to all domains.
> 
> To work around this issue, even though the IOMMU MR may be bypassed, we
> transiently enable it on machine init done to get vfio_listener_region_add
> and virtio_iommu_set_page_size_mask called earlier, before the domain
> attach. That way the page size requirement can be taken into account
> before the guest gets started.
> 
> Also take advantage of this series to clean up some traces
> which may confuse the end user.

For both patches:

Reviewed-by: Jean-Philippe Brucker 
Tested-by: Jean-Philippe Brucker

Re: [PATCH] target/arm: Fix ptw parameters in S1_ptw_translate() for debug contexts

2023-07-06 Thread Jean-Philippe Brucker

On Thu, Jul 06, 2023 at 04:42:02PM +0100, Peter Maydell wrote:
> > > Do you have a repro case for this bug? Did it work
> > > before commit fe4a5472ccd6 ?
> >
> > Yes I bisected to fe4a5472ccd6 by trying to run TF-A, following
> > instructions here:
> > https://github.com/ARM-software/arm-trusted-firmware/blob/master/docs/plat/qemu.rst
> >
> > Building TF-A (HEAD 8e31faa05):
> > make -j CROSS_COMPILE=aarch64-linux-gnu- 
> > BL33=edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd 
> > PLAT=qemu DEBUG=1 LOG_LEVEL=40 all fip
> >
> > Installing it to QEMU runtime dir:
> > ln -sf tf-a/build/qemu/debug/bl1.bin build/qemu-cca/run/
> > ln -sf tf-a/build/qemu/debug/bl2.bin build/qemu-cca/run/
> > ln -sf tf-a/build/qemu/debug/bl31.bin build/qemu-cca/run/
> > ln -sf edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd 
> > build/qemu-cca/run/bl33.bin
> 
> Could you put the necessary binary blobs up somewhere, to save
> me trying to rebuild TF-A ?

Uploaded to:
https://jpbrucker.net/tmp/2023-07-06-repro-qemu-tfa.tar.gz

Thanks,
Jean

> 
> 
> > > > ---
> > > >  target/arm/ptw.c | 6 ++
> > > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/target/arm/ptw.c b/target/arm/ptw.c
> > > > index 9aaff1546a..e3a738c28e 100644
> > > > --- a/target/arm/ptw.c
> > > > +++ b/target/arm/ptw.c
> > > > @@ -465,10 +465,8 @@ static bool S1_ptw_translate(CPUARMState *env, 
> > > > S1Translate *ptw,
> > > >  S1Translate s2ptw = {
> > > >  .in_mmu_idx = s2_mmu_idx,
> > > >  .in_ptw_idx = ptw_idx_for_stage_2(env, s2_mmu_idx),
> > > > -.in_secure = s2_mmu_idx == ARMMMUIdx_Stage2_S,
> > > > -.in_space = (s2_mmu_idx == ARMMMUIdx_Stage2_S ? 
> > > > ARMSS_Secure
> > > > - : space == ARMSS_Realm ? ARMSS_Realm
> > > > - : ARMSS_NonSecure),
> > > > +.in_secure = is_secure,
> > > > +.in_space = space,
> > >
> > > If the problem is fe4a5472ccd6 then this seems an odd change to
> > > be making, because in_secure and in_space were set that way
> > > before that commit too...
> >
> > I think that commit merged both sides of the
> > "regime_is_stage2(s2_mmu_idx)" test, but only kept testing for secure
> > through ARMMMUIdx_Stage2_S, and removed the test through ARMMMUIdx_Phys_S
> 
> Yes, I agree. I'm not sure your proposed fix is the right one,
> though. I need to re-work through what I did in fcc0b0418fff
> to remind myself of what the various fields in a S1Translate
> struct are supposed to be, but I think .in_secure (and now
> .in_space) are supposed to always match .in_mmu_idx, and
> that's not necessarily the same as what the local is_secure
> holds. (is_secure is the original ptw's in_secure, which
> matches that ptw's .in_mmu_idx, not its .in_ptw_idx.)
> So probably the right thing for the .in_secure check is
> to change to "(s2_mmu_idx == ARMMMUIdx_Stage2_S ||
> s2_mmu_idx == ARMMMUIdx_Phys_S)". Less sure about .in_space,
> because that conditional is a bit more complicated.
> 
> thanks
> -- PMM

Re: [PATCH] target/arm: Fix ptw parameters in S1_ptw_translate() for debug contexts

2023-07-06 Thread Jean-Philippe Brucker

On Thu, Jul 06, 2023 at 03:28:32PM +0100, Peter Maydell wrote:
> On Thu, 6 Jul 2023 at 15:12, Jean-Philippe Brucker
>  wrote:
> >
> > Arm TF-A fails to boot via semihosting following a recent change to the
> > MMU code. Semihosting attempts to read parameters passed by TF-A in
> > secure RAM via cpu_memory_rw_debug(). While performing the S1
> > translation, we call S1_ptw_translate() on the page table descriptor
> > address, with an MMU index of ARMMMUIdx_Phys_S. At the moment
> > S1_ptw_translate() doesn't interpret this as a secure access, and as a
> > result we attempt to read the page table descriptor from the non-secure
> > address space, which fails.
> >
> > Fixes: fe4a5472ccd6 ("target/arm: Use get_phys_addr_with_struct in 
> > S1_ptw_translate")
> > Signed-off-by: Jean-Philippe Brucker 
> > ---
> > I'm not entirely sure why the semihosting parameters are accessed
> > through stage-1 translation rather than directly as physical addresses,
> > but I'm not familiar with semihosting.
> 
> The semihosting ABI says the guest code should pass "a pointer
> to the parameter block". It doesn't say explicitly, but the
> straightforward interpretation is "a pointer that the guest
> itself could dereference to read/write the values", which means
> a virtual address, not a physical one. It would be pretty
> painful for the guest to have to figure out "what is the
> physaddr for this virtual address" to pass it to the semihosting
> call.
> 
> Do you have a repro case for this bug? Did it work
> before commit fe4a5472ccd6 ?

Yes I bisected to fe4a5472ccd6 by trying to run TF-A, following
instructions here:
https://github.com/ARM-software/arm-trusted-firmware/blob/master/docs/plat/qemu.rst

Building TF-A (HEAD 8e31faa05):
make -j CROSS_COMPILE=aarch64-linux-gnu- 
BL33=edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd PLAT=qemu 
DEBUG=1 LOG_LEVEL=40 all fip

Installing it to QEMU runtime dir:
ln -sf tf-a/build/qemu/debug/bl1.bin build/qemu-cca/run/
ln -sf tf-a/build/qemu/debug/bl2.bin build/qemu-cca/run/
ln -sf tf-a/build/qemu/debug/bl31.bin build/qemu-cca/run/
ln -sf edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd 
build/qemu-cca/run/bl33.bin

Running QEMU with HEAD=fe4a5472cc:
qemu-system-aarch64 -M virt,secure=on -cpu max -m 1G -nographic -bios bl1.bin 
-semihosting-config enable=on,target=native -d guest_errors

NOTICE:  Booting Trusted Firmware
NOTICE:  BL1: v2.9(debug):v2.9.0-280-g8e31faa05
NOTICE:  BL1: Built : 16:23:20, Jul  6 2023
INFO:BL1: RAM 0xe0ee000 - 0xe0f6000
INFO:BL1: Loading BL2
WARNING: Firmware Image Package header check failed.
Invalid read at addr 0xE0EF900, size 8, region '(null)', reason: rejected
WARNING: Failed to obtain reference to image id=1 (-2)
ERROR:   Failed to load BL2 firmware.

with HEAD=4a7d7702cd:
...
INFO:BL1: Loading BL2
WARNING: Firmware Image Package header check failed.
INFO:Loading image id=1 at address 0xe06b000
INFO:Image id=1 loaded: 0xe06b000 - 0xe073201
NOTICE:  BL1: Booting BL2
INFO:Entry point address = 0xe06b000
INFO:SPSR = 0x3c5
...


(Note that there is an issue with TF-A missing ENABLE_FEAT_FGT for qemu at
the moment, which prevents booting Linux with -cpu max. I'll send the fix
to TF-A after this, but this reproducer should at least boot edk2.)

> > ---
> >  target/arm/ptw.c | 6 ++
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/target/arm/ptw.c b/target/arm/ptw.c
> > index 9aaff1546a..e3a738c28e 100644
> > --- a/target/arm/ptw.c
> > +++ b/target/arm/ptw.c
> > @@ -465,10 +465,8 @@ static bool S1_ptw_translate(CPUARMState *env, 
> > S1Translate *ptw,
> >  S1Translate s2ptw = {
> >  .in_mmu_idx = s2_mmu_idx,
> >  .in_ptw_idx = ptw_idx_for_stage_2(env, s2_mmu_idx),
> > -.in_secure = s2_mmu_idx == ARMMMUIdx_Stage2_S,
> > -.in_space = (s2_mmu_idx == ARMMMUIdx_Stage2_S ? ARMSS_Secure
> > - : space == ARMSS_Realm ? ARMSS_Realm
> > - : ARMSS_NonSecure),
> > +.in_secure = is_secure,
> > +.in_space = space,
> 
> If the problem is fe4a5472ccd6 then this seems an odd change to
> be making, because in_secure and in_space were set that way
> before that commit too...

I think that commit merged both sides of the
"regime_is_stage2(s2_mmu_idx)" test, but only kept testing for secure
through ARMMMUIdx_Stage2_S, and removed the test through ARMMMUIdx_Phys_S

Thanks,
Jean
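
To illustrate the point above, here is a minimal standalone sketch in C. It
is not QEMU code: the enum and the two helper names are hypothetical
stand-ins for the ARMMMUIdx values discussed in this thread. It only shows
that a check keyed on the stage-2 index alone reports a walk through the
secure physical index as non-secure, while a check that also tests the
physical index does not.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the ARMMMUIdx values named in this thread. */
enum mmu_idx_model {
    IDX_STAGE2,      /* models ARMMMUIdx_Stage2 */
    IDX_STAGE2_S,    /* models ARMMMUIdx_Stage2_S */
    IDX_PHYS_NS,     /* models ARMMMUIdx_Phys_NS */
    IDX_PHYS_S,      /* models ARMMMUIdx_Phys_S */
};

/* Check that only looks at the stage-2 index to decide "secure". */
static bool in_secure_stage2_only(enum mmu_idx_model s2_mmu_idx)
{
    return s2_mmu_idx == IDX_STAGE2_S;
}

/* Same check with the test on the secure physical index added back. */
static bool in_secure_with_phys(enum mmu_idx_model s2_mmu_idx)
{
    return s2_mmu_idx == IDX_STAGE2_S || s2_mmu_idx == IDX_PHYS_S;
}
```

For IDX_PHYS_S the first helper returns false and the second returns true,
which is the difference that matters for the semihosting read described in
the commit message.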

Re: [PATCH 2/2] virtio-iommu: Rework the trace in virtio_iommu_set_page_size_mask()

2023-07-06 Thread Jean-Philippe Brucker

On Wed, Jul 05, 2023 at 03:16:31PM +0200, Eric Auger wrote:
> >>> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c index
> >>> 1eaf81bab5..0d9f7196fe 100644
> >>> --- a/hw/virtio/virtio-iommu.c
> >>> +++ b/hw/virtio/virtio-iommu.c
> >>> @@ -1101,29 +1101,24 @@ static int
> >>> virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
> >>>   new_mask);
> >>>
> >>> if ((cur_mask & new_mask) == 0) {
> >>> -error_setg(errp, "virtio-iommu page mask 0x%"PRIx64
> >>> -   " is incompatible with mask 0x%"PRIx64, cur_mask, 
> >>> new_mask);
> >>> +error_setg(errp, "virtio-iommu %s reports a page size mask 
> >>> 0x%"PRIx64
> >>> +   " incompatible with currently supported mask 
> >>> 0x%"PRIx64,
> >>> +   mr->parent_obj.name, new_mask, cur_mask);
> >>> return -1;
> >>> }
> >>>
> >>> /*
> >>>  * Once the granule is frozen we can't change the mask anymore. If by
> >>>  * chance the hotplugged device supports the same granule, we can 
> >>> still
> >>> - * accept it. Having a different masks is possible but the guest 
> >>> will use
> >>> - * sub-optimal block sizes, so warn about it.
> >>> + * accept it.
> >>>  */
> >>> if (s->granule_frozen) {
> >>> -int new_granule = ctz64(new_mask);
> >>> int cur_granule = ctz64(cur_mask);
> >>>
> >>> -if (new_granule != cur_granule) {
> >>> -error_setg(errp, "virtio-iommu page mask 0x%"PRIx64
> >>> -   " is incompatible with mask 0x%"PRIx64, cur_mask,
> >>> -   new_mask);
> >>> +if (!(BIT(cur_granule) & new_mask)) {
> > Sorry, I read this piece of code again and have a question: if new_mask has
> > finer granularity than cur_granule, should we allow it to pass even though
> > BIT(cur_granule) is not set?
> I think this should work but this is not straightforward to test.
> virtio-iommu would use the current granule for map/unmap. In map/unmap
> notifiers, this is split into pow2 ranges and cascaded to VFIO through
> vfio_dma_map/unmap. The iova and size are aligned with the smaller
> supported granule.
> 
> Jean, do you share this understanding, or am I missing something?

Yes, I also think that would work. The guest would only issue mappings
with the larger granularity, which can be applied by VFIO with a finer
granule. However I doubt we're going to encounter this case, because
seeing a cur_granule larger than 4k here means that a VFIO device has
already been assigned with a large granule like 64k, and we're trying to
add a new device with 4k. This indicates two HW IOMMUs supporting
different granules in the same system, which seems unlikely.

Hopefully by the time we actually need this (if ever) we will support
per-endpoint probe properties, which allow informing the guest about
different hardware properties instead of relying on one global property in
the virtio config.

Thanks,
Jean
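
As a rough illustration of the two checks discussed above, here is a
standalone sketch in C. It is not the virtio-iommu code: the helper names are
hypothetical, a portable loop stands in for ctz64(), and both masks are
assumed to be non-zero. The first function mirrors the condition in the
quoted patch (the new mask must contain the frozen granule); the second is
the relaxed variant raised in the question, which would also accept a device
whose smallest page size is finer than the frozen granule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Index of the lowest set bit. The mask must be non-zero. */
static unsigned lowest_bit(uint64_t mask)
{
    unsigned n = 0;

    while (!(mask & 1)) {
        mask >>= 1;
        n++;
    }
    return n;
}

/* Check from the quoted patch: new_mask must contain the frozen granule. */
static bool frozen_granule_supported(uint64_t cur_mask, uint64_t new_mask)
{
    return (new_mask >> lowest_bit(cur_mask)) & 1;
}

/*
 * Hypothetical relaxed check: accept any device whose smallest page
 * size is no larger than the frozen granule.
 */
static bool frozen_granule_or_finer(uint64_t cur_mask, uint64_t new_mask)
{
    return lowest_bit(new_mask) <= lowest_bit(cur_mask);
}
```

With a frozen mask of 64K | 512M and a new device reporting 4K | 2M | 1G,
the first check rejects the device and the relaxed one accepts it.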

[PATCH] target/arm: Fix ptw parameters in S1_ptw_translate() for debug contexts

2023-07-06 Thread Jean-Philippe Brucker

Arm TF-A fails to boot via semihosting following a recent change to the
MMU code. Semihosting attempts to read parameters passed by TF-A in
secure RAM via cpu_memory_rw_debug(). While performing the S1
translation, we call S1_ptw_translate() on the page table descriptor
address, with an MMU index of ARMMMUIdx_Phys_S. At the moment
S1_ptw_translate() doesn't interpret this as a secure access, and as a
result we attempt to read the page table descriptor from the non-secure
address space, which fails.

Fixes: fe4a5472ccd6 ("target/arm: Use get_phys_addr_with_struct in 
S1_ptw_translate")
Signed-off-by: Jean-Philippe Brucker 
---
I'm not entirely sure why the semihosting parameters are accessed
through stage-1 translation rather than directly as physical addresses,
but I'm not familiar with semihosting.
---
 target/arm/ptw.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 9aaff1546a..e3a738c28e 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -465,10 +465,8 @@ static bool S1_ptw_translate(CPUARMState *env, S1Translate 
*ptw,
 S1Translate s2ptw = {
 .in_mmu_idx = s2_mmu_idx,
 .in_ptw_idx = ptw_idx_for_stage_2(env, s2_mmu_idx),
-.in_secure = s2_mmu_idx == ARMMMUIdx_Stage2_S,
-.in_space = (s2_mmu_idx == ARMMMUIdx_Stage2_S ? ARMSS_Secure
- : space == ARMSS_Realm ? ARMSS_Realm
- : ARMSS_NonSecure),
+.in_secure = is_secure,
+.in_space = space,
 .in_debug = true,
 };
 GetPhysAddrResult s2 = { };
-- 
2.41.0

Re: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device assignment

2023-07-05 Thread Jean-Philippe Brucker

On Wed, Jul 05, 2023 at 10:13:11AM +, Duan, Zhenzhong wrote:
> >-Original Message-
> >From: Jean-Philippe Brucker 
> >Sent: Wednesday, July 5, 2023 4:29 PM
> >Subject: Re: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device
> >assignment
> >
> >On Wed, Jul 05, 2023 at 04:52:09AM +, Duan, Zhenzhong wrote:
> >> Hi Eric,
> >>
> >> >-Original Message-
> >> >From: Eric Auger 
> >> >Sent: Tuesday, July 4, 2023 7:15 PM
> >> >Subject: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO
> >> >device assignment
> >> >
> >> >When running on a 64kB page size host and protecting a VFIO device
> >> >with the virtio-iommu, qemu crashes with this kind of message:
> >> >
> >> >qemu-kvm: virtio-iommu page mask 0xf000 is incompatible
> >> >with mask 0x2001
> >>
> >> Does 0x2001 mean only  512MB and 64KB super page mapping is
> >> supported for host iommu hw? 4KB mapping not supported?
> >
> >It's not a restriction by the HW IOMMU, but the host kernel. An Arm SMMU
> >can implement 4KB, 16KB and/or 64KB granules, but the host kernel only
> >advertises through VFIO the granule corresponding to host PAGE_SIZE. This
> >restriction is done by arm_lpae_restrict_pgsizes() in order to choose a page
> >size when a device is driven by the host.
> 
> Just curious, why not advertise the granules implemented by the Arm SMMU to
> VFIO, e.g. 4KB, 16KB or 64KB granules?

That's possible, but the difficulty is setting up the page table
configuration afterwards. At the moment the host installs the HW page
tables early, when QEMU sets up the VFIO container. That initializes the
page size bitmap because configuring the HW page tables requires picking
one of the supported granules (setting TG0 in the SMMU Context
Descriptor).

If the guest could pick a granule via an ATTACH request, then QEMU would
need to tell the host kernel to install page tables with the desired
granule at that point. That would require a new interface in VFIO to
reconfigure a live container and replace the existing HW page tables
configuration (before ATTACH, the container must already be configured
with working page tables in order to implement boot-bypass, I think).

> Instead we get the ones restricted by arm_lpae_restrict_pgsizes(),
> e.g. for SZ_4K, (SZ_4K | SZ_2M | SZ_1G).
> (SZ_4K | SZ_2M | SZ_1G) does not look like the real hardware granules of
> the Arm SMMU.

Yes, the granule here is 4K, and other bits only indicate huge page sizes,
so the user can try to optimize large mappings to use fewer TLB entries
where possible.

Thanks,
Jean
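
A small sketch in C of how such a page size bitmap can be read, following the
explanation above. The SZ_* constants are defined locally for the example and
the two helper names are hypothetical; this is not the kernel's io-pgtable
code.

```c
#include <stdint.h>

#define SZ_4K    0x00001000ULL
#define SZ_64K   0x00010000ULL
#define SZ_2M    0x00200000ULL
#define SZ_512M  0x20000000ULL
#define SZ_1G    0x40000000ULL

/* The granule is the smallest page size, i.e. the lowest set bit. */
static uint64_t pgsize_granule(uint64_t bitmap)
{
    return bitmap & (~bitmap + 1);
}

/* All remaining bits describe larger block (huge page) sizes. */
static uint64_t pgsize_blocks(uint64_t bitmap)
{
    return bitmap & ~pgsize_granule(bitmap);
}
```

For (SZ_4K | SZ_2M | SZ_1G), pgsize_granule() yields SZ_4K and
pgsize_blocks() yields (SZ_2M | SZ_1G).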

Re: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device assignment

2023-07-05 Thread Jean-Philippe Brucker

On Wed, Jul 05, 2023 at 04:52:09AM +, Duan, Zhenzhong wrote:
> Hi Eric,
> 
> >-Original Message-
> >From: Eric Auger 
> >Sent: Tuesday, July 4, 2023 7:15 PM
> >Subject: [PATCH 1/2] virtio-iommu: Fix 64kB host page size VFIO device
> >assignment
> >
> >When running on a 64kB page size host and protecting a VFIO device
> >with the virtio-iommu, qemu crashes with this kind of message:
> >
> >qemu-kvm: virtio-iommu page mask 0xf000 is incompatible
> >with mask 0x2001
> 
> Does 0x2001 mean only  512MB and 64KB super page mapping is
> supported for host iommu hw? 4KB mapping not supported?

It's not a restriction by the HW IOMMU, but the host kernel. An Arm SMMU
can implement 4KB, 16KB and/or 64KB granules, but the host kernel only
advertises through VFIO the granule corresponding to host PAGE_SIZE. This
restriction is done by arm_lpae_restrict_pgsizes() in order to choose a
page size when a device is driven by the host.

> 
> There is a check on the guest kernel side hinting that only 4KB is supported.
> With this patch we force viommu->pgsize_bitmap to 0x2001
> and fail the check below? Does this device work in the guest?
> Please correct me if I understand this wrong.

Right, a guest using 4KB pages under a host that uses 64KB is not
supported, because if the guest attempts to dma_map a 4K page, the IOMMU
cannot create a mapping small enough: the mapping would have to spill over
into neighbouring guest memory.

One possible solution would be supporting multiple page granules. If we
added a page granule negotiation through VFIO and virtio-iommu then the
guest could pick the page size it wants. But this requires changes to
Linux UAPI so isn't a priority at the moment, because we're trying to
enable nesting which would support 64K-host/4K-guest as well.

See also the discussion on the patch that introduced the guest check
https://lore.kernel.org/linux-iommu/20200318114047.1518048-1-jean-phili...@linaro.org/

Thanks,
Jean
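
The spill-over can be worked through with a few lines of C. This is only an
arithmetic sketch under the assumption that the IOMMU can map nothing smaller
than one naturally aligned granule; the struct and function names are
hypothetical and the addresses are made up for the example.

```c
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t size;
};

/*
 * Smallest granule-aligned range covering [addr, addr + len).
 * The granule must be a power of two.
 */
static struct range covering_range(uint64_t addr, uint64_t len,
                                   uint64_t granule)
{
    uint64_t start = addr & ~(granule - 1);
    uint64_t end = (addr + len + granule - 1) & ~(granule - 1);

    return (struct range){ start, end - start };
}
```

Mapping a 4K page at 0x43000 with a 64K granule covers 0x40000 to 0x4ffff,
so 60K of memory the guest never asked to map would become reachable by the
device.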

Re: [PATCH v4 00/10] Add stage-2 translation for SMMUv3

2023-05-17 Thread Jean-Philippe Brucker

On Tue, May 16, 2023 at 08:33:07PM +, Mostafa Saleh wrote:
> This patch series can be used to run the Linux pKVM SMMUv3 patches (currently
> on the list), which control stage-2 (from EL2) while providing a
> paravirtualized interface to the host (EL1)
> https://lore.kernel.org/kvmarm/20230201125328.2186498-1-jean-phili...@linaro.org/

I've been using these patches for pKVM, and also tested the normal stage-2
flow with Linux and VFIO

Tested-by: Jean-Philippe Brucker

Re: [PATCH] kvm: Merge kvm_check_extension() and kvm_vm_check_extension()

2023-04-25 Thread Jean-Philippe Brucker

On Mon, Apr 24, 2023 at 03:01:54PM +0200, Cornelia Huck wrote:
> > @@ -2480,6 +2471,7 @@ static int kvm_init(MachineState *ms)
> >  }
> >  
> >  s->vmfd = ret;
> > +s->check_extension_vm = kvm_check_extension(s, 
> > KVM_CAP_CHECK_EXTENSION_VM);
> 
> Hm, it's a bit strange to set s->check_extension_vm by calling a
> function that already checks for the value of
> s->check_extension_vm... would it be better to call kvm_ioctl() directly
> on this line?

Yes, that's probably best. I'll use kvm_vm_ioctl() since the doc suggests
checking KVM_CAP_CHECK_EXTENSION_VM on the vm fd.

Thanks,
Jean

> 
> I think it would be good if some ppc folks could give this a look, but
> in general, it looks fine to me.
>
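
For readers following the thread, the selection logic under discussion can be
modelled outside QEMU roughly as below. This is a sketch, not the patch: the
struct, the function pointers and the example backends are all hypothetical,
and the real code issues KVM_CHECK_EXTENSION through kvm_ioctl() or
kvm_vm_ioctl() rather than through callbacks.

```c
#include <stdbool.h>

/* Hypothetical model of the state consulted by kvm_check_extension(). */
struct kvm_model {
    bool check_extension_vm;               /* VM fd can be queried */
    int (*global_query)(unsigned int ext); /* models the /dev/kvm ioctl */
    int (*vm_query)(unsigned int ext);     /* models the VM fd ioctl */
};

/* Prefer the VM fd when available, and treat errors as "not supported". */
static int model_check_extension(const struct kvm_model *s, unsigned int ext)
{
    int ret = s->check_extension_vm ? s->vm_query(ext) : s->global_query(ext);

    return ret < 0 ? 0 : ret;
}

/* Example backend: the global fd reports capability 1 as present. */
static int example_global(unsigned int ext)
{
    return ext == 1 ? 1 : -1;
}

/* Example backend: the VM fd refines capability 1 to absent. */
static int example_vm(unsigned int ext)
{
    return ext == 1 ? 0 : -1;
}
```

With these example backends, the same capability is reported as 1 while only
the global fd is usable and as 0 once the VM fd is preferred, which is the
kind of refinement the commit message describes.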

[PATCH] kvm: Merge kvm_check_extension() and kvm_vm_check_extension()

2023-04-21 Thread Jean-Philippe Brucker

The KVM_CHECK_EXTENSION ioctl can be issued either on the global fd
(/dev/kvm), or on the VM fd obtained with KVM_CREATE_VM. For most
extensions, KVM returns the same value with either method, but for some
of them it can refine the returned value depending on the VM type. The
KVM documentation [1] advises using the VM fd:

  Based on their initialization different VMs may have different
  capabilities. It is thus encouraged to use the vm ioctl to query for
  capabilities (available with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)

Ongoing work on Arm confidential VMs confirms this, as some capabilities
become unavailable to confidential VMs, requiring changes in QEMU to use
kvm_vm_check_extension() instead of kvm_check_extension() [2]. Rather
than changing each check one by one, change kvm_check_extension() to
always issue the ioctl on the VM fd when available, and remove
kvm_vm_check_extension().

Fall back to the global fd when the VM check is unavailable:

* Ancient kernels do not support KVM_CHECK_EXTENSION on the VM fd, since
  it was added by commit 92b591a4c46b ("KVM: Allow KVM_CHECK_EXTENSION
  on the vm fd") in Linux 3.17 [3]. Support for Linux 3.16 ended only in
  June 2020, but there may still be old images around.

* A couple of calls must be issued before the VM fd is available, since
  they determine the VM type: KVM_CAP_MIPS_VZ and KVM_CAP_ARM_VM_IPA_SIZE

Does any user actually depend on the check being done on the global fd
instead of the VM fd?  I surveyed all cases where KVM can return
different values depending on the query method. Luckily QEMU already
calls kvm_vm_check_extension() for most of those. Only three of them are
ambiguous, because they are currently done on the global fd:

* KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_VCPU_ID on Arm change value if the
  user requests a vGIC different from the default. But QEMU queries this
  before vGIC configuration, so the reported value will be the same.

* KVM_CAP_SW_TLB on PPC. When issued on the global fd, returns false if
  the kvm-hv module is loaded; when issued on the VM fd, returns false
  only if the VM type is HV instead of PR. If this returns false, then
  QEMU will fail to initialize a BOOKE206 MMU model.

  So this patch supposedly improves things, as it allows running this
  type of vCPU even when both KVM modules are loaded.

* KVM_CAP_PPC_SECURE_GUEST. Similarly, doing this check on a VM fd
  refines the returned value, and ensures that SVM is actually
  supported. Since QEMU follows the check with kvm_vm_enable_cap(), this
  patch should only provide better error reporting.

[1] https://www.kernel.org/doc/html/latest/virt/kvm/api.html#kvm-check-extension
[2] https://lore.kernel.org/kvm/875ybi0ytc@redhat.com/
[3] https://github.com/torvalds/linux/commit/92b591a4c46b

Suggested-by: Cornelia Huck 
Signed-off-by: Jean-Philippe Brucker 
---
 include/sysemu/kvm.h |  2 --
 include/sysemu/kvm_int.h |  1 +
 accel/kvm/kvm-all.c  | 26 +-
 target/i386/kvm/kvm.c|  6 +++---
 target/ppc/kvm.c | 34 +-
 5 files changed, 30 insertions(+), 39 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index c8281c07a7..d62054004e 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -438,8 +438,6 @@ bool kvm_arch_stop_on_emulation_error(CPUState *cpu);
 
 int kvm_check_extension(KVMState *s, unsigned int extension);
 
-int kvm_vm_check_extension(KVMState *s, unsigned int extension);
-
 #define kvm_vm_enable_cap(s, capability, cap_flags, ...) \
 ({   \
 struct kvm_enable_cap cap = {\
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a641c974ea..f6aa22ea09 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -122,6 +122,7 @@ struct KVMState
 uint32_t xen_caps;
 uint16_t xen_gnttab_max_frames;
 uint16_t xen_evtchn_max_pirq;
+bool check_extension_vm;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index cf3a88d90e..eca815e763 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1109,22 +1109,13 @@ int kvm_check_extension(KVMState *s, unsigned int 
extension)
 {
 int ret;
 
-ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
-if (ret < 0) {
-ret = 0;
+if (!s->check_extension_vm) {
+ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+} else {
+ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, extension);
 }
-
-return ret;
-}
-
-int kvm_vm_check_extension(KVMState *s, unsigned int extension)
-{
-int ret;
-
-ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, extension);
 if (ret < 0) {
-/* VM wide version not implemented, use global one instead */
-ret = kvm_check_extension(s, extension);
+ret = 0;

Re: virtio-iommu hotplug issue

2023-04-14 Thread Jean-Philippe Brucker

On Thu, Apr 13, 2023 at 08:01:54PM +0900, Akihiko Odaki wrote:
> Yes, that's right. The guest can dynamically create and delete VFs. The
> device is emulated by QEMU: igb, an Intel NIC recently added to QEMU and
> projected to be released as part of QEMU 8.0.

Ah great, that's really useful, I'll add it to my tests

> > Yes, I think this is an issue in the virtio-iommu driver, which should be
> > sending a DETACH request when the VF is disabled, likely from
> > viommu_release_device(). I'll work on a fix unless you would like to do it
> 
> It will be nice if you prepare a fix. I will test your patch with my
> workload if you share it with me.

I sent a fix:
https://lore.kernel.org/linux-iommu/20230414150744.562456-1-jean-phili...@linaro.org/

Thank you for reporting this, it must have been annoying to debug

Thanks,
Jean

Re: virtio-iommu hotplug issue

2023-04-13 Thread Jean-Philippe Brucker

Hello,

On Thu, Apr 13, 2023 at 01:49:43PM +0900, Akihiko Odaki wrote:
> Hi,
> 
> Recently I encountered a problem with the combination of Linux's
> virtio-iommu driver and QEMU when a SR-IOV virtual function gets disabled.
> I'd like to ask you what kind of solution is appropriate here and implement
> the solution if possible.
> 
> A PCIe device implementing the SR-IOV specification exports a virtual
> function, and the guest can enable or disable it at runtime by writing to a
> configuration register. This effectively looks like a PCI device is
> hotplugged for the guest.

Just so I understand this better: the guest gets a whole PCIe device PF
that implements SR-IOV, and so the guest can dynamically create VFs?  Out
of curiosity, is that a hardware device assigned to the guest with VFIO,
or a device emulated by QEMU?

> In such a case, the kernel assumes the endpoint is
> detached from the virtio-iommu domain, but QEMU actually does not detach it.
> 
> This inconsistent view of the removed device sometimes prevents the VM from
> correctly performing the following procedure, for example:
> 1. Enable a VF.
> 2. Disable the VF.
> 3. Open a vfio container.
> 4. Open the group which the PF belongs to.
> 5. Add the group to the vfio container.
> 6. Map some memory region.
> 7. Close the group.
> 8. Close the vfio container.
> 9. Repeat 3-8
> 
> When the VF gets disabled, the kernel assumes the endpoint is detached from
> the IOMMU domain, but QEMU actually doesn't detach it. Later, the domain
> will be reused in step 3-8.
> 
> In step 7, the PF will be detached, and the kernel thinks there is no
> endpoint attached and the mapping the domain holds is cleared, but the VF
> endpoint is still attached and the mapping is kept intact.
> 
> In step 9, the same domain will be reused again, and the kernel requests to
> create a new mapping, but it will conflict with the existing mapping and
> result in -EINVAL.
> 
> This problem can be fixed by either of:
> - requesting the detachment of the endpoint from the guest when the PCI
> device is unplugged (the VF is disabled)

Yes, I think this is an issue in the virtio-iommu driver, which should be
sending a DETACH request when the VF is disabled, likely from
viommu_release_device(). I'll work on a fix unless you would like to do it

> - detecting that the PCI device is gone and automatically detach it on
> QEMU-side.
> 
> It is not completely clear for me which solution is more appropriate as the
> virtio-iommu specification is written in a way independent of the endpoint
> mechanism and does not say what should be done when a PCI device is
> unplugged.

Yes, I'm not sure it's in scope for the specification, it's more about
software guidance

Thanks,
Jean

Re: [RFC PATCH 12/16] hw/arm/smmuv3: Add VMID to tlb tagging

2023-02-15 Thread Jean-Philippe Brucker

Hi Mostafa,

On Sun, Feb 05, 2023 at 09:44:07AM +, Mostafa Saleh wrote:
> Allow TLB to be tagged with VMID.
> 
> If only stage-1 is supported, VMID is set to -1 and ignored from STE
> and CMD_TLBI_NH* cmds.
> 
> Signed-off-by: Mostafa Saleh 
> ---
>  hw/arm/smmu-common.c | 24 +++-
>  hw/arm/smmu-internal.h   |  2 ++
>  hw/arm/smmuv3.c  | 12 +---
>  include/hw/arm/smmu-common.h |  5 +++--
>  4 files changed, 29 insertions(+), 14 deletions(-)
> 
> diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
> index 541c427684..028a60949a 100644
> --- a/hw/arm/smmu-common.c
> +++ b/hw/arm/smmu-common.c
> @@ -56,10 +56,11 @@ static gboolean smmu_iotlb_key_equal(gconstpointer v1, 
> gconstpointer v2)
> (k1->level == k2->level) && (k1->tg == k2->tg);

I'm getting some aliasing in the TLB, because smmu_iotlb_key_equal() is
missing the VMID comparison. With that fixed, my handful of tests pass.

Thanks,
Jean
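
The aliasing can be shown with a standalone model in C. The struct below is
hypothetical: only the level and tg comparisons are visible in the quoted
hunk, and the iova, asid and vmid fields are assumptions made for the
example. It is a sketch of the idea, not the smmu-common.c code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an IOTLB key that carries a VMID tag. */
struct iotlb_key_model {
    uint64_t iova;
    uint16_t asid;
    uint16_t vmid;
    uint8_t tg;
    uint8_t level;
};

/* Comparison that ignores the VMID: entries of two VMs can alias. */
static bool key_equal_without_vmid(const struct iotlb_key_model *k1,
                                   const struct iotlb_key_model *k2)
{
    return (k1->asid == k2->asid) && (k1->iova == k2->iova) &&
           (k1->level == k2->level) && (k1->tg == k2->tg);
}

/* Comparison that also requires the VMID to match. */
static bool key_equal_with_vmid(const struct iotlb_key_model *k1,
                                const struct iotlb_key_model *k2)
{
    return key_equal_without_vmid(k1, k2) && (k1->vmid == k2->vmid);
}
```

Two keys that differ only in VMID compare equal with the first function and
unequal with the second, so a hash table using the first one could return
another VM's translation.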

[PATCH v2 0/2] hw/arm/smmu: Fixes for TTB1

2023-02-14 Thread Jean-Philippe Brucker

Two small changes to support TTB1. Since [v1] I removed the unused
SMMU_MAX_VA_BITS and added tags, thanks!

[v1] 
https://lore.kernel.org/qemu-devel/20230210163731.970130-1-jean-phili...@linaro.org/

Jean-Philippe Brucker (2):
  hw/arm/smmu-common: Support 64-bit addresses
  hw/arm/smmu-common: Fix TTB1 handling

 include/hw/arm/smmu-common.h | 2 --
 hw/arm/smmu-common.c | 4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

-- 
2.39.0

[PATCH v2 1/2] hw/arm/smmu-common: Support 64-bit addresses

2023-02-14 Thread Jean-Philippe Brucker

Addresses targeting the second translation table (TTB1) in the SMMU have
all upper bits set. Ensure the IOMMU region covers all 64 bits.

Reviewed-by: Richard Henderson 
Signed-off-by: Jean-Philippe Brucker 
---
 include/hw/arm/smmu-common.h | 2 --
 hw/arm/smmu-common.c | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index c5683af07d..9fcff26357 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -27,8 +27,6 @@
 #define SMMU_PCI_DEVFN_MAX 256
 #define SMMU_PCI_DEVFN(sid)   (sid & 0xFF)
 
-#define SMMU_MAX_VA_BITS  48
-
 /*
  * Page table walk error types
  */
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 733c964778..2b8c67b9a1 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -439,7 +439,7 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void 
*opaque, int devfn)
 
 memory_region_init_iommu(&sdev->iommu, sizeof(sdev->iommu),
  s->mrtypename,
- OBJECT(s), name, 1ULL << SMMU_MAX_VA_BITS);
+ OBJECT(s), name, UINT64_MAX);
 address_space_init(&sdev->as,
MEMORY_REGION(&sdev->iommu), name);
 trace_smmu_add_mr(name);
-- 
2.39.0
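
As a rough illustration of why the region size matters (not part of the patch): an address translated through TTB1 has its upper bits set, so it lies above a region limited to 1ULL << 48. The helper below is hypothetical, and it assumes that a size of UINT64_MAX stands for the whole 64-bit space.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if addr falls inside a region that starts at 0 and spans `size` bytes.
 * Assumption for this sketch: UINT64_MAX means "the full 2^64 address space".
 */
static bool region_contains(uint64_t size, uint64_t addr)
{
    return size == UINT64_MAX || addr < size;
}
```

With a 48-bit region, region_contains(1ULL << 48, 0xffff800000001000ULL) is false, so such an access would never reach the IOMMU region; with UINT64_MAX it is accepted.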

[PATCH v2 2/2] hw/arm/smmu-common: Fix TTB1 handling

2023-02-14 Thread Jean-Philippe Brucker

Addresses targeting the second translation table (TTB1) in the SMMU have
all upper bits set (except for the top byte when TBI is enabled). Fix
the TTB1 check.

Reported-by: Ola Hugosson 
Reviewed-by: Eric Auger 
Reviewed-by: Richard Henderson 
Signed-off-by: Jean-Philippe Brucker 
---
 hw/arm/smmu-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 2b8c67b9a1..0a5a60ca1e 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -249,7 +249,7 @@ SMMUTransTableInfo *select_tt(SMMUTransCfg *cfg, dma_addr_t 
iova)
 /* there is a ttbr0 region and we are in it (high bits all zero) */
 return &cfg->tt[0];
 } else if (cfg->tt[1].tsz &&
-   !extract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte)) {
+sextract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte) == 
-1) {
 /* there is a ttbr1 region and we are in it (high bits all one) */
 return &cfg->tt[1];
 } else if (!cfg->tt[0].tsz) {
-- 
2.39.0
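
The difference between the old and new checks can be sketched in a standalone form. The helpers below are modelled on QEMU's extract64() and sextract64() but are local re-implementations for illustration; tsz and tbi_byte are plain parameters here, with tbi_byte assumed to be 8 when the top byte is ignored and 0 otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Zero-extended bitfield, modelled on QEMU's extract64(). */
static uint64_t ext64(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    return (value >> start) & (~0ULL >> (64 - length));
}

/*
 * Sign-extended bitfield, modelled on QEMU's sextract64(). Relies on
 * arithmetic right shift of signed values, as common compilers provide.
 */
static int64_t sext64(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    return (int64_t)(value << (64 - length - start)) >> (64 - length);
}

/* Old check: true when the upper bits are all ZERO, which is the TTB0 case. */
static bool old_ttb1_check(uint64_t iova, int tsz, int tbi_byte)
{
    return tsz && !ext64(iova, 64 - tsz, tsz - tbi_byte);
}

/* Fixed check: true only when the upper bits are all ONE. */
static bool iova_in_ttb1(uint64_t iova, int tsz, int tbi_byte)
{
    return tsz && sext64(iova, 64 - tsz, tsz - tbi_byte) == -1;
}
```

For example, with tsz = 25 (a 39-bit VA) the address 0xffffff8000001000 satisfies iova_in_ttb1() but not old_ttb1_check(), while the low address 0x1000 behaves the other way round.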

Re: [PATCH 2/2] hw/arm/smmu-common: Fix TTB1 handling

2023-02-14 Thread Jean-Philippe Brucker

On Mon, Feb 13, 2023 at 05:30:03PM +0100, Eric Auger wrote:
> Hi Jean,
> 
> On 2/10/23 17:37, Jean-Philippe Brucker wrote:
> > Addresses targeting the second translation table (TTB1) in the SMMU have
> > all upper bits set (except for the top byte when TBI is enabled). Fix
> > the TTB1 check.
> >
> > Reported-by: Ola Hugosson 
> > Signed-off-by: Jean-Philippe Brucker 
> > ---
> >  hw/arm/smmu-common.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
> > index 2b8c67b9a1..0a5a60ca1e 100644
> > --- a/hw/arm/smmu-common.c
> > +++ b/hw/arm/smmu-common.c
> > @@ -249,7 +249,7 @@ SMMUTransTableInfo *select_tt(SMMUTransCfg *cfg, 
> > dma_addr_t iova)
> >  /* there is a ttbr0 region and we are in it (high bits all zero) */
> >  return >tt[0];
> >  } else if (cfg->tt[1].tsz &&
> > -   !extract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - 
> > tbi_byte)) {
> > +sextract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte) 
> > == -1) {
> >  /* there is a ttbr1 region and we are in it (high bits all one) */
> >  return >tt[1];
> >  } else if (!cfg->tt[0].tsz) {
> 
> Reviewed-by: Eric Auger 
> 
> While reading the spec again, I noticed we do not support VAX. Is it
> something that we would need to support?

I guess it would be needed to support sharing page tables with the CPU, if
the CPU supports and the OS uses FEAT_LVA. But in order to share the
stage-1, Linux would need more complex features as well (ATS+PRI/Stall,
PASID).

For a private DMA address space, I think 48 bits of VA is already plenty.

Thanks,
Jean

[PATCH 2/2] hw/arm/smmu-common: Fix TTB1 handling

2023-02-10 Thread Jean-Philippe Brucker

Addresses targeting the second translation table (TTB1) in the SMMU have
all upper bits set (except for the top byte when TBI is enabled). Fix
the TTB1 check.

Reported-by: Ola Hugosson 
Signed-off-by: Jean-Philippe Brucker 
---
 hw/arm/smmu-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 2b8c67b9a1..0a5a60ca1e 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -249,7 +249,7 @@ SMMUTransTableInfo *select_tt(SMMUTransCfg *cfg, dma_addr_t 
iova)
 /* there is a ttbr0 region and we are in it (high bits all zero) */
 return &cfg->tt[0];
 } else if (cfg->tt[1].tsz &&
-   !extract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte)) {
+sextract64(iova, 64 - cfg->tt[1].tsz, cfg->tt[1].tsz - tbi_byte) == 
-1) {
 /* there is a ttbr1 region and we are in it (high bits all one) */
 return &cfg->tt[1];
 } else if (!cfg->tt[0].tsz) {
-- 
2.39.0

[PATCH 0/2] hw/arm/smmu: Fixes for TTB1

2023-02-10 Thread Jean-Philippe Brucker

Two small changes to support TTB1. Note that I had to modify the Linux
driver in order to test this (see below), but other OSes might use TTB1.

Jean-Philippe Brucker (2):
  hw/arm/smmu-common: Support 64-bit addresses
  hw/arm/smmu-common: Fix TTB1 handling

 hw/arm/smmu-common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


--- 8< --- Linux hacks:
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 8d772ea8a583..bf0ff699b64b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -276,6 +276,11 @@
 #define CTXDESC_CD_0_TCR_IRGN0 GENMASK_ULL(9, 8)
 #define CTXDESC_CD_0_TCR_ORGN0 GENMASK_ULL(11, 10)
 #define CTXDESC_CD_0_TCR_SH0   GENMASK_ULL(13, 12)
+#define CTXDESC_CD_0_TCR_T1SZ  GENMASK_ULL(21, 16)
+#define CTXDESC_CD_0_TCR_TG1   GENMASK_ULL(23, 22)
+#define CTXDESC_CD_0_TCR_IRGN1 GENMASK_ULL(25, 24)
+#define CTXDESC_CD_0_TCR_ORGN1 GENMASK_ULL(27, 26)
+#define CTXDESC_CD_0_TCR_SH1   GENMASK_ULL(29, 28)
 #define CTXDESC_CD_0_TCR_EPD0  (1ULL << 14)
 #define CTXDESC_CD_0_TCR_EPD1  (1ULL << 30)

@@ -293,6 +298,7 @@
 #define CTXDESC_CD_0_ASID  GENMASK_ULL(63, 48)

 #define CTXDESC_CD_1_TTB0_MASK GENMASK_ULL(51, 4)
+#define CTXDESC_CD_2_TTB1_MASK GENMASK_ULL(51, 4)

 /*
  * When the SMMU only supports linear context descriptor tables, pick a
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f2425b0f0cd6..3a4343e60a54 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1075,8 +1075,8 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain 
*smmu_domain, int ssid,
 * this substream's traffic
 */
} else { /* (1) and (2) */
-   cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
-   cdptr[2] = 0;
+   cdptr[1] = 0;
+   cdptr[2] = cpu_to_le64(cd->ttbr & CTXDESC_CD_2_TTB1_MASK);
cdptr[3] = cpu_to_le64(cd->mair);

/*
@@ -2108,13 +2108,13 @@ static int arm_smmu_domain_finalise_s1(struct 
arm_smmu_domain *smmu_domain,

cfg->cd.asid= (u16)asid;
cfg->cd.ttbr= pgtbl_cfg->arm_lpae_s1_cfg.ttbr;
-   cfg->cd.tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, tcr->tsz) |
- FIELD_PREP(CTXDESC_CD_0_TCR_TG0, tcr->tg) |
- FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, tcr->irgn) |
- FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
- FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
+   cfg->cd.tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T1SZ, tcr->tsz) |
+ FIELD_PREP(CTXDESC_CD_0_TCR_TG1, tcr->tg) |
+ FIELD_PREP(CTXDESC_CD_0_TCR_IRGN1, tcr->irgn) |
+ FIELD_PREP(CTXDESC_CD_0_TCR_ORGN1, tcr->orgn) |
+ FIELD_PREP(CTXDESC_CD_0_TCR_SH1, tcr->sh) |
  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
- CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
+ CTXDESC_CD_0_TCR_EPD0 | CTXDESC_CD_0_AA64;
cfg->cd.mair= pgtbl_cfg->arm_lpae_s1_cfg.mair;

/*
@@ -2212,6 +2212,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.pgsize_bitmap  = smmu->pgsize_bitmap,
.ias= ias,
.oas= oas,
+   .quirks = IO_PGTABLE_QUIRK_ARM_TTBR1,
.coherent_walk  = smmu->features & ARM_SMMU_FEAT_COHERENCY,
.tlb= &arm_smmu_flush_ops,
.iommu_dev  = smmu->dev,
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 38434203bf04..3fe154c9782d 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -677,6 +677,10 @@ static int dma_info_to_prot(enum dma_data_direction dir, 
bool coherent,
}
 }

+/* HACK */
+#define VA_SIZE 39
+#define VA_MASK (~0ULL << VA_SIZE)
+
 static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
size_t size, u64 dma_limit, struct device *dev)
 {
@@ -706,7 +710,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
   true);

-   return (dma_addr_t)iova << shift;
+   return (dma_addr_t)iova << shift | VA_MASK;
 }

 static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
@@ -714,6 +718,7 @@ static void iommu_dma_free_iova(struct iommu_dma_cookie 
*cookie,
 {
struct iova_dom

[PATCH 1/2] hw/arm/smmu-common: Support 64-bit addresses

2023-02-10 Thread Jean-Philippe Brucker

Addresses targeting the second translation table (TTB1) in the SMMU have
all upper bits set. Ensure the IOMMU region covers all 64 bits.

Signed-off-by: Jean-Philippe Brucker 
---
 hw/arm/smmu-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 733c964778..2b8c67b9a1 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -439,7 +439,7 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void 
*opaque, int devfn)
 
 memory_region_init_iommu(&sdev->iommu, sizeof(sdev->iommu),
  s->mrtypename,
- OBJECT(s), name, 1ULL << SMMU_MAX_VA_BITS);
+ OBJECT(s), name, UINT64_MAX);
 address_space_init(&sdev->as,
MEMORY_REGION(&sdev->iommu), name);
 trace_smmu_add_mr(name);
-- 
2.39.0

Re: [RFC PATCH 08/16] target/arm/kvm-rme: Populate the realm with boot images

2023-02-08 Thread Jean-Philippe Brucker

On Fri, Jan 27, 2023 at 01:54:23PM -1000, Richard Henderson wrote:
> >   static void rme_vm_state_change(void *opaque, bool running, RunState 
> > state)
> >   {
> >   int ret;
> > @@ -72,6 +115,9 @@ static void rme_vm_state_change(void *opaque, bool 
> > running, RunState state)
> >   }
> >   }
> > +g_slist_foreach(rme_images, rme_populate_realm, NULL);
> > +g_slist_free_full(g_steal_pointer(_images), g_free);
> 
> I suppose this technically works because you clear the list, and thus the
> hook is called only on the first transition to RUNNING.  On all subsequent
> transitions the list is empty.
> 
> I see that i386 sev does this immediately during machine init, alongside the
> kernel setup.  Since kvm_init has already been called, that seems workable,
> rather than queuing anything for later.

The problem I faced was that RME_POPULATE_REALM needs to be called after
rom_reset(), which copies all the blobs into guest memory, and that
happens at device reset time, after machine init and
kvm_cpu_synchronize_post_init().

> But I think ideally this would be handled generically in (approximately)
> kvm_cpu_synchronize_post_init, looping over all blobs.  This would handle
> any usage of '-device loader,...', instead of the 4 specific things you
> handle in the next patch.

I'd definitely prefer something generic that hooks into the loader, I'll
look into that. I didn't do it right away because the arm64 Linux kernel
loading is special, requires reserving extra RAM in addition to the blob
(hence the two parameters to kvm_arm_rme_add_blob()). But we could just
have a special case for the extra memory needed by Linux and make the rest
generic.

Thanks,
Jean
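
The deferral described above (queue the blobs at load time, populate once when the VM first runs) can be sketched without GLib. All names here are hypothetical, populate_one() only counts calls instead of invoking KVM, and insertion order is not preserved because the sketch prepends to the list.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct Blob {
    uint64_t base, src_size, dst_size;
    struct Blob *next;
} Blob;

static Blob *pending;       /* blobs queued before the guest first runs */
static unsigned populated;  /* number of populate calls, for illustration */

/* Called at load time: remember the blob, do not touch guest memory yet. */
static int queue_blob(uint64_t base, uint64_t src_size, uint64_t dst_size)
{
    Blob *b = calloc(1, sizeof(*b));

    if (!b) {
        return -1;
    }
    b->base = base;
    b->src_size = src_size;
    b->dst_size = dst_size;
    b->next = pending;
    pending = b;
    return 0;
}

/* Stand-in for the real populate step, which would call into KVM. */
static void populate_one(const Blob *b)
{
    (void)b;
    populated++;
}

/* Called on every transition to RUNNING; only the first one finds work. */
static void on_vm_running(void)
{
    Blob *b = pending;

    pending = NULL; /* steal the list so later transitions see it empty */
    while (b) {
        Blob *next = b->next;

        populate_one(b);
        free(b);
        b = next;
    }
}
```

Because the list is emptied on the first call, later calls to on_vm_running() are no-ops, which matches the behaviour Richard describes for the state-change hook.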

Re: [RFC PATCH 06/16] target/arm/kvm-rme: Initialize vCPU

2023-02-08 Thread Jean-Philippe Brucker

On Fri, Jan 27, 2023 at 12:37:12PM -1000, Richard Henderson wrote:
> On 1/27/23 05:07, Jean-Philippe Brucker wrote:
> > +static int kvm_arm_rme_get_core_regs(CPUState *cs)
> > +{
> > +int i, ret;
> > +struct kvm_one_reg reg;
> > +ARMCPU *cpu = ARM_CPU(cs);
> > +CPUARMState *env = >env;
> > +
> > +for (i = 0; i < 8; i++) {
> > +reg.id = AARCH64_CORE_REG(regs.regs[i]);
> > +reg.addr = (uintptr_t) >xregs[i];
> > +ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, );
> > +if (ret) {
> > +return ret;
> > +}
> > +}
> > +
> > +return 0;
> > +}
> 
> Wow, this is quite the restriction.
> 
> I get that this is just enough to seed the guest for boot, and take SMC
> traps, but I'm concerned that we can't do much with the machine underneath,
> when it comes to other things like "info registers" or gdbstub will be
> silently unusable.  I would prefer if we can somehow make this loudly
> unusable.

For "info registers", which currently displays zero values for all regs,
we can instead return an error message in aarch64_cpu_dump_state().

For gdbstub, I suspect we should disable it entirely since it seems
fundamentally incompatible with confidential VMs, but I need to spend more
time on this.

Thanks,
Jean

Re: [RFC PATCH 03/16] target/arm/kvm-rme: Initialize realm

2023-02-08 Thread Jean-Philippe Brucker

Hi Richard,

Thanks a lot for the review

On Fri, Jan 27, 2023 at 10:37:12AM -1000, Richard Henderson wrote:
> At present I would expect exactly one object class to be present in the
> qemu-system-aarch64 binary that would pass the
> machine_check_confidential_guest_support test done by core code.  But we are
> hoping to move toward a heterogeneous model where e.g. the TYPE_SEV_GUEST
> object might be discoverable within the same executable.

Yes, I'm not sure SEV can be supported on qemu-system-aarch64, but pKVM could
probably coexist with RME as another type of confidential guest support
(https://lwn.net/ml/linux-arm-kernel/20220519134204.5379-1-w...@kernel.org/)

Thanks,
Jean

Re: [RFC PATCH 04/16] hw/arm/virt: Add support for Arm RME

2023-02-08 Thread Jean-Philippe Brucker

On Fri, Jan 27, 2023 at 11:07:35AM -1000, Richard Henderson wrote:
> > +/*
> > + * Since the devicetree is included in the initial measurement, it must
> > + * not contain random data.
> > + */
> > +if (virt_machine_is_confidential(vms)) {
> > +vms->dtb_randomness = false;
> > +}
> 
> This property is default off, and the only way it can be on is user
> argument.  This should be an error, not a silent disable.

This one seems to default to true in virt_instance_init(), and I did need
to disable it in order to get deterministic measurements. Maybe I could
throw an error only when the user attempts to explicitly enable it.

> > +if (virt_machine_is_confidential(vms)) {
> > +/*
> > + * The host cannot write into a confidential guest's memory until 
> > the
> > + * guest shares it. Since the host writes the pvtime region before 
> > the
> > + * guest gets a chance to set it up, disable pvtime.
> > + */
> > +steal_time = false;
> > +}
> 
> This property is default on since 5.2, so falls into a different category.
> Since 5.2 it is auto-on for 64-bit guests.  Since it's auto-off for 32-bit
> guests, I don't see a problem with it being auto-off for RME guests.
> 
> I do wonder if we should change it to an OnOffAuto property, just to catch 
> silly usage.

I'll look into that

Thanks,
Jean
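
One possible shape for the tri-state handling discussed above, as a sketch only: the TriState enum is a hypothetical stand-in for QEMU's OnOffAuto, and the policy shown (auto resolves to off for a confidential VM, an explicit "on" is rejected) is the variant floated in this reply, not necessarily what gets merged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for QEMU's OnOffAuto. */
typedef enum { TRI_AUTO, TRI_ON, TRI_OFF } TriState;

/*
 * Resolve the dtb-randomness setting. Returns 0 and stores the effective
 * value in *enabled, or -1 if the user explicitly asked for a combination
 * that cannot work with a confidential VM.
 */
static int resolve_dtb_randomness(TriState setting, bool confidential,
                                  bool *enabled)
{
    assert(enabled);

    switch (setting) {
    case TRI_ON:
        if (confidential) {
            return -1;
        }
        *enabled = true;
        return 0;
    case TRI_OFF:
        *enabled = false;
        return 0;
    case TRI_AUTO:
    default:
        /* Random seeds in the DTB would change the initial measurement. */
        *enabled = !confidential;
        return 0;
    }
}
```

The same pattern would apply to steal-time: "auto" picks the safe default for the VM type, and only an explicit user request can produce an error.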

[RFC PATCH 13/16] target/arm/kvm-rme: Add breakpoints and watchpoints parameters

2023-01-27 Thread Jean-Philippe Brucker

Pass the num_bps and num_wps parameters to Realm creation. These
parameters contribute to the initial Realm measurement.

Signed-off-by: Jean-Philippe Brucker 
---
 qapi/qom.json|  8 +++-
 target/arm/kvm-rme.c | 34 +-
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 94ecb87f6f..86ed386f26 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -866,12 +866,18 @@
 #
 # @sve-vector-length: SVE vector length (default: 0, SVE disabled)
 #
+# @num-breakpoints: Number of breakpoints (default: 0)
+#
+# @num-watchpoints: Number of watchpoints (default: 0)
+#
 # Since: FIXME
 ##
 { 'struct': 'RmeGuestProperties',
   'data': { '*measurement-algo': 'str',
 '*personalization-value': 'str',
-'*sve-vector-length': 'uint32' } }
+'*sve-vector-length': 'uint32',
+'*num-breakpoints': 'uint32',
+'*num-watchpoints': 'uint32' } }
 
 ##
 # @ObjectType:
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 0b2153a45c..3f39f1f7ad 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -22,7 +22,9 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
 #define RME_PAGE_SIZE qemu_real_host_page_size()
 
-#define RME_MAX_CFG 3
+#define RME_MAX_BPS 0x10
+#define RME_MAX_WPS 0x10
+#define RME_MAX_CFG 4
 
 typedef struct RmeGuest RmeGuest;
 
@@ -31,6 +33,8 @@ struct RmeGuest {
 char *measurement_algo;
 char *personalization_value;
 uint32_t sve_vl;
+uint32_t num_wps;
+uint32_t num_bps;
 };
 
 struct RmeImage {
@@ -145,6 +149,14 @@ static int rme_configure_one(RmeGuest *guest, uint32_t 
cfg, Error **errp)
 args.sve_vq = guest->sve_vl / 128;
 cfg_str = "SVE";
 break;
+case KVM_CAP_ARM_RME_CFG_DBG:
+if (!guest->num_bps && !guest->num_wps) {
+return 0;
+}
+args.num_brps = guest->num_bps;
+args.num_wrps = guest->num_wps;
+cfg_str = "debug parameters";
+break;
 default:
 g_assert_not_reached();
 }
@@ -362,6 +374,10 @@ static void rme_get_uint32(Object *obj, Visitor *v, const 
char *name,
 
 if (strcmp(name, "sve-vector-length") == 0) {
 value = guest->sve_vl;
+} else if (strcmp(name, "num-breakpoints") == 0) {
+value = guest->num_bps;
+} else if (strcmp(name, "num-watchpoints") == 0) {
+value = guest->num_wps;
 } else {
 g_assert_not_reached();
 }
@@ -388,6 +404,12 @@ static void rme_set_uint32(Object *obj, Visitor *v, const 
char *name,
 error_setg(errp, "invalid SVE vector length %"PRIu32, value);
 return;
 }
+} else if (strcmp(name, "num-breakpoints") == 0) {
+max_value = RME_MAX_BPS;
+var = &guest->num_bps;
+} else if (strcmp(name, "num-watchpoints") == 0) {
+max_value = RME_MAX_WPS;
+var = &guest->num_wps;
 } else {
 g_assert_not_reached();
 }
@@ -424,6 +446,16 @@ static void rme_guest_class_init(ObjectClass *oc, void 
*data)
   rme_set_uint32, NULL, NULL);
 object_class_property_set_description(oc, "sve-vector-length",
 "SVE vector length. 0 disables SVE (the default)");
+
+object_class_property_add(oc, "num-breakpoints", "uint32", rme_get_uint32,
+  rme_set_uint32, NULL, NULL);
+object_class_property_set_description(oc, "num-breakpoints",
+"Number of breakpoints");
+
+object_class_property_add(oc, "num-watchpoints", "uint32", rme_get_uint32,
+  rme_set_uint32, NULL, NULL);
+object_class_property_set_description(oc, "num-watchpoints",
+"Number of watchpoints");
 }
 
 static const TypeInfo rme_guest_info = {
-- 
2.39.0
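
The setter above selects a per-property maximum and a destination pointer. A standalone sketch of that pattern follows; it assumes the part of rme_set_uint32() not visible in this diff rejects values above max_value, and the helper name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_BPS 0x10 /* mirrors RME_MAX_BPS in the patch */

/* Store value in *var if it does not exceed max_value, else fail. */
static int set_bounded_u32(uint32_t *var, uint32_t value, uint32_t max_value)
{
    assert(var);

    if (value > max_value) {
        return -EINVAL;
    }
    *var = value;
    return 0;
}
```

Rejecting out-of-range values at property-set time means a bad num-breakpoints value is reported before Realm configuration is attempted.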

[RFC PATCH 14/16] target/arm/kvm-rme: Add PMU num counters parameters

2023-01-27 Thread Jean-Philippe Brucker

Pass the num_cntrs parameter to Realm creation. This parameter
contributes to the initial Realm measurement.

Signed-off-by: Jean-Philippe Brucker 
---
 qapi/qom.json|  5 -
 target/arm/kvm-rme.c | 21 -
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 86ed386f26..13c85abde9 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -870,6 +870,8 @@
 #
 # @num-watchpoints: Number of watchpoints (default: 0)
 #
+# @num-pmu-counters: Number of PMU counters (default: 0, PMU disabled)
+#
 # Since: FIXME
 ##
 { 'struct': 'RmeGuestProperties',
@@ -877,7 +879,8 @@
 '*personalization-value': 'str',
 '*sve-vector-length': 'uint32',
 '*num-breakpoints': 'uint32',
-'*num-watchpoints': 'uint32' } }
+'*num-watchpoints': 'uint32',
+'*num-pmu-counters': 'uint32' } }
 
 ##
 # @ObjectType:
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 3f39f1f7ad..1baed79d46 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -24,7 +24,8 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
 #define RME_MAX_BPS 0x10
 #define RME_MAX_WPS 0x10
-#define RME_MAX_CFG 4
+#define RME_MAX_PMU_CTRS 0x20
+#define RME_MAX_CFG 5
 
 typedef struct RmeGuest RmeGuest;
 
@@ -35,6 +36,7 @@ struct RmeGuest {
 uint32_t sve_vl;
 uint32_t num_wps;
 uint32_t num_bps;
+uint32_t num_pmu_cntrs;
 };
 
 struct RmeImage {
@@ -157,6 +159,13 @@ static int rme_configure_one(RmeGuest *guest, uint32_t 
cfg, Error **errp)
 args.num_wrps = guest->num_wps;
 cfg_str = "debug parameters";
 break;
+case KVM_CAP_ARM_RME_CFG_PMU:
+if (!guest->num_pmu_cntrs) {
+return 0;
+}
+args.num_pmu_cntrs = guest->num_pmu_cntrs;
+cfg_str = "PMU";
+break;
 default:
 g_assert_not_reached();
 }
@@ -378,6 +387,8 @@ static void rme_get_uint32(Object *obj, Visitor *v, const 
char *name,
 value = guest->num_bps;
 } else if (strcmp(name, "num-watchpoints") == 0) {
 value = guest->num_wps;
+} else if (strcmp(name, "num-pmu-counters") == 0) {
+value = guest->num_pmu_cntrs;
 } else {
 g_assert_not_reached();
 }
@@ -410,6 +421,9 @@ static void rme_set_uint32(Object *obj, Visitor *v, const 
char *name,
 } else if (strcmp(name, "num-watchpoints") == 0) {
 max_value = RME_MAX_WPS;
 var = >num_wps;
+} else if (strcmp(name, "num-pmu-counters") == 0) {
+max_value = RME_MAX_PMU_CTRS;
+var = &guest->num_pmu_cntrs;
 } else {
 g_assert_not_reached();
 }
@@ -456,6 +470,11 @@ static void rme_guest_class_init(ObjectClass *oc, void 
*data)
   rme_set_uint32, NULL, NULL);
 object_class_property_set_description(oc, "num-watchpoints",
 "Number of watchpoints");
+
+object_class_property_add(oc, "num-pmu-counters", "uint32", rme_get_uint32,
+  rme_set_uint32, NULL, NULL);
+object_class_property_set_description(oc, "num-pmu-counters",
+"Number of PMU counters");
 }
 
 static const TypeInfo rme_guest_info = {
-- 
2.39.0

[RFC PATCH 08/16] target/arm/kvm-rme: Populate the realm with boot images

2023-01-27 Thread Jean-Philippe Brucker

Initialize the GPA space and populate it with boot images (kernel,
initrd, firmware, etc). Populating has to be done at VM start time,
because the images are loaded during reset by rom_reset().

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/kvm_arm.h |  6 
 target/arm/kvm-rme.c | 79 
 2 files changed, 85 insertions(+)

diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index e4dc7fbb8d..cec6500603 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -371,6 +371,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int 
level);
 
 int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp);
 int kvm_arm_rme_vm_type(MachineState *ms);
+void kvm_arm_rme_add_blob(hwaddr start, hwaddr src_size, hwaddr dst_size);
 
 bool kvm_arm_rme_enabled(void);
 int kvm_arm_rme_vcpu_init(CPUState *cs);
@@ -458,6 +459,11 @@ static inline int kvm_arm_rme_vm_type(MachineState *ms)
 {
 return 0;
 }
+
+static inline void kvm_arm_rme_add_blob(hwaddr start, hwaddr src_size,
+hwaddr dst_size)
+{
+}
 #endif
 
 static inline const char *gic_class_name(void)
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 3833b187f9..c8c019f78a 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -9,6 +9,7 @@
 #include "exec/confidential-guest-support.h"
 #include "hw/boards.h"
 #include "hw/core/cpu.h"
+#include "hw/loader.h"
 #include "kvm_arm.h"
 #include "migration/blocker.h"
 #include "qapi/error.h"
@@ -19,12 +20,22 @@
 #define TYPE_RME_GUEST "rme-guest"
 OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
+#define RME_PAGE_SIZE qemu_real_host_page_size()
+
 typedef struct RmeGuest RmeGuest;
 
 struct RmeGuest {
 ConfidentialGuestSupport parent_obj;
 };
 
+struct RmeImage {
+hwaddr base;
+hwaddr src_size;
+hwaddr dst_size;
+};
+
+static GSList *rme_images;
+
 static RmeGuest *cgs_to_rme(ConfidentialGuestSupport *cgs)
 {
 if (!cgs) {
@@ -51,6 +62,38 @@ static int rme_create_rd(RmeGuest *guest, Error **errp)
 return ret;
 }
 
+static void rme_populate_realm(gpointer data, gpointer user_data)
+{
+int ret;
+struct RmeImage *image = data;
+struct kvm_cap_arm_rme_init_ipa_args init_args = {
+.init_ipa_base = image->base,
+.init_ipa_size = image->dst_size,
+};
+struct kvm_cap_arm_rme_populate_realm_args populate_args = {
+.populate_ipa_base = image->base,
+.populate_ipa_size = image->src_size,
+};
+
+ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+KVM_CAP_ARM_RME_INIT_IPA_REALM,
+(intptr_t)&init_args);
+if (ret) {
+error_setg_errno(&error_fatal, -ret,
+ "RME: failed to initialize GPA range 
(0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx")",
+ image->base, image->dst_size);
+}
+
+ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+KVM_CAP_ARM_RME_POPULATE_REALM,
+(intptr_t)&populate_args);
+if (ret) {
+error_setg_errno(&error_fatal, -ret,
+ "RME: failed to populate realm (0x%"HWADDR_PRIx", 
0x%"HWADDR_PRIx")",
+ image->base, image->src_size);
+}
+}
+
 static void rme_vm_state_change(void *opaque, bool running, RunState state)
 {
 int ret;
@@ -72,6 +115,9 @@ static void rme_vm_state_change(void *opaque, bool running, 
RunState state)
 }
 }
 
+g_slist_foreach(rme_images, rme_populate_realm, NULL);
+g_slist_free_full(g_steal_pointer(&rme_images), g_free);
+
 ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
 KVM_CAP_ARM_RME_ACTIVATE_REALM);
 if (ret) {
@@ -118,6 +164,39 @@ int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error 
**errp)
 return 0;
 }
 
+/*
+ * kvm_arm_rme_add_blob - Initialize the Realm IPA range and set up the image.
+ *
+ * @src_size is the number of bytes of the source image, to be populated into
+ *   Realm memory.
+ * @dst_size is the effective image size, which may be larger than @src_size.
+ *   For a kernel @dst_size may include zero-initialized regions such as the 
BSS
+ *   and initial page directory.
+ */
+void kvm_arm_rme_add_blob(hwaddr base, hwaddr src_size, hwaddr dst_size)
+{
+struct RmeImage *image;
+
+if (!kvm_arm_rme_enabled()) {
+return;
+}
+
+base = QEMU_ALIGN_DOWN(base, RME_PAGE_SIZE);
+src_size = QEMU_ALIGN_UP(src_size, RME_PAGE_SIZE);
+dst_size = QEMU_ALIGN_UP(dst_size, RME_PAGE_SIZE);
+
+image = g_malloc0(sizeof(*image));
+image->base = base;
+image->src_size = src_size;
+image->dst_size = dst_size;
+
+/*
+ * The ROM loader will only load the images during reset, so postpone the
+

[RFC PATCH 01/16] NOMERGE: Add KVM Arm RME definitions to Linux headers

2023-01-27 Thread Jean-Philippe Brucker

Copy the KVM definitions for Arm RME from the development branch.
Don't merge, they will be added from the periodic Linux header sync.

Signed-off-by: Jean-Philippe Brucker 
---
 linux-headers/asm-arm64/kvm.h | 63 +++
 linux-headers/linux/kvm.h | 21 +---
 2 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
index 4bf2d7246e..8e04d6f7ff 100644
--- a/linux-headers/asm-arm64/kvm.h
+++ b/linux-headers/asm-arm64/kvm.h
@@ -108,6 +108,7 @@ struct kvm_regs {
 #define KVM_ARM_VCPU_SVE   4 /* enable SVE for this CPU */
 #define KVM_ARM_VCPU_PTRAUTH_ADDRESS   5 /* VCPU uses address authentication */
 #define KVM_ARM_VCPU_PTRAUTH_GENERIC   6 /* VCPU uses generic authentication */
+#define KVM_ARM_VCPU_REC   7 /* VCPU REC state as part of Realm */
 
 struct kvm_vcpu_init {
__u32 target;
@@ -391,6 +392,68 @@ enum {
 #define   KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
 #define   KVM_DEV_ARM_ITS_CTRL_RESET   4
 
+/* KVM_CAP_ARM_RME on VM fd */
+#define KVM_CAP_ARM_RME_CONFIG_REALM   0
+#define KVM_CAP_ARM_RME_CREATE_RD  1
+#define KVM_CAP_ARM_RME_INIT_IPA_REALM 2
+#define KVM_CAP_ARM_RME_POPULATE_REALM 3
+#define KVM_CAP_ARM_RME_ACTIVATE_REALM 4
+
+#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256 0
+#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512 1
+
+#define KVM_CAP_ARM_RME_RPV_SIZE 64
+
+/* List of configuration items accepted for KVM_CAP_ARM_RME_CONFIG_REALM */
+#define KVM_CAP_ARM_RME_CFG_RPV 0
+#define KVM_CAP_ARM_RME_CFG_HASH_ALGO  1
+#define KVM_CAP_ARM_RME_CFG_SVE 2
+#define KVM_CAP_ARM_RME_CFG_DBG 3
+#define KVM_CAP_ARM_RME_CFG_PMU 4
+
+struct kvm_cap_arm_rme_config_item {
+   __u32 cfg;
+   union {
+   /* cfg == KVM_CAP_ARM_RME_CFG_RPV */
+   struct {
+   __u8 rpv[KVM_CAP_ARM_RME_RPV_SIZE];
+   };
+
+   /* cfg == KVM_CAP_ARM_RME_CFG_HASH_ALGO */
+   struct {
+   __u32   hash_algo;
+   };
+
+   /* cfg == KVM_CAP_ARM_RME_CFG_SVE */
+   struct {
+   __u32   sve_vq;
+   };
+
+   /* cfg == KVM_CAP_ARM_RME_CFG_DBG */
+   struct {
+   __u32   num_brps;
+   __u32   num_wrps;
+   };
+
+   /* cfg == KVM_CAP_ARM_RME_CFG_PMU */
+   struct {
+   __u32   num_pmu_cntrs;
+   };
+   /* Fix the size of the union */
+   __u8 reserved[256];
+   };
+};
+
+struct kvm_cap_arm_rme_populate_realm_args {
+   __u64 populate_ipa_base;
+   __u64 populate_ipa_size;
+};
+
+struct kvm_cap_arm_rme_init_ipa_args {
+   __u64 init_ipa_base;
+   __u64 init_ipa_size;
+};
+
 /* Device Control API on vcpu fd */
 #define KVM_ARM_VCPU_PMU_V3_CTRL   0
 #define   KVM_ARM_VCPU_PMU_V3_IRQ  0
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index ebdafa576d..9d5affc98a 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -901,14 +901,25 @@ struct kvm_ppc_resize_hpt {
 #define KVM_S390_SIE_PAGE_OFFSET 1
 
 /*
- * On arm64, machine type can be used to request the physical
- * address size for the VM. Bits[7-0] are reserved for the guest
- * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
- * value 0 implies the default IPA size, 40bits.
+ * On arm64, machine type can be used to request both the machine type and
+ * the physical address size for the VM.
+ *
+ * Bits[11-8] are reserved for the ARM specific machine type.
+ *
+ * Bits[7-0] are reserved for the guest PA size shift (i.e, log2(PA_Size)).
+ * For backward compatibility, value 0 implies the default IPA size, 40bits.
  */
+#define KVM_VM_TYPE_ARM_SHIFT  8
+#define KVM_VM_TYPE_ARM_MASK   (0xfULL << KVM_VM_TYPE_ARM_SHIFT)
+#define KVM_VM_TYPE_ARM(_type) \
+   (((_type) << KVM_VM_TYPE_ARM_SHIFT) & KVM_VM_TYPE_ARM_MASK)
+#define KVM_VM_TYPE_ARM_NORMAL KVM_VM_TYPE_ARM(0)
+#define KVM_VM_TYPE_ARM_REALM  KVM_VM_TYPE_ARM(1)
+
 #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK  0xffULL
 #define KVM_VM_TYPE_ARM_IPA_SIZE(x)\
((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
+
 /*
  * ioctls for /dev/kvm fds:
  */
@@ -1176,6 +1187,8 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 
+#define KVM_CAP_ARM_RME 300 // FIXME: Large number to prevent conflicts
+
 #ifdef KVM_CAP_IRQ_ROUTING
 
 struct kvm_irq_routing_irqchip {
-- 
2.39.0
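
As an illustration of the encoding described in the header comment, the sketch below composes a VM type value from a Realm machine type in bits [11:8] and an IPA size shift in bits [7:0]. The macros are local copies with shortened names, and realm_vm_type() is a hypothetical helper, not a function from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the layout from the patch, with shortened names. */
#define VM_TYPE_ARM_SHIFT      8
#define VM_TYPE_ARM_MASK       (0xfULL << VM_TYPE_ARM_SHIFT)
#define VM_TYPE_ARM(type) \
    (((uint64_t)(type) << VM_TYPE_ARM_SHIFT) & VM_TYPE_ARM_MASK)
#define VM_TYPE_ARM_REALM      VM_TYPE_ARM(1)
#define VM_TYPE_IPA_SIZE_MASK  0xffULL
#define VM_TYPE_IPA_SIZE(x)    ((x) & VM_TYPE_IPA_SIZE_MASK)

/* Build the machine type for a Realm VM with the given IPA size in bits. */
static uint64_t realm_vm_type(unsigned ipa_bits)
{
    assert(ipa_bits <= VM_TYPE_IPA_SIZE_MASK);
    return VM_TYPE_ARM_REALM | VM_TYPE_IPA_SIZE(ipa_bits);
}
```

A 48-bit IPA Realm therefore encodes as 0x130, and an IPA size of 0 keeps the backward-compatible default of 40 bits while still selecting the Realm type (0x100).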

[RFC PATCH 11/16] target/arm/kvm-rme: Add Realm Personalization Value parameter

2023-01-27 Thread Jean-Philippe Brucker

The Realm Personalization Value (RPV) is provided by the user to
distinguish Realms that have the same initial measurement.

The user provides a 512-bit hexadecimal number.
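
The intended byte layout can be shown with a standalone sketch. It follows the rule stated in the patch comment (parse as a big-endian hexadecimal number, store little-endian, zero-padded), but it is an independent illustration that walks the string digit by digit instead of using sscanf(); parse_rpv() and hex_nibble() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RPV_SIZE 64 /* 512 bits, matching KVM_CAP_ARM_RME_RPV_SIZE */

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') {
        return c - '0';
    }
    if (c >= 'a' && c <= 'f') {
        return c - 'a' + 10;
    }
    if (c >= 'A' && c <= 'F') {
        return c - 'A' + 10;
    }
    return -1;
}

/* Parse a hex string (most significant digit first) into little-endian bytes. */
static int parse_rpv(uint8_t out[RPV_SIZE], const char *in)
{
    size_t len, i;

    assert(out && in);
    len = strlen(in);
    if (len > RPV_SIZE * 2) {
        return -E2BIG; /* two characters per byte */
    }

    memset(out, 0, RPV_SIZE);
    for (i = 0; i < len; i++) {
        /* Walk from the rightmost (least significant) digit. */
        int nibble = hex_nibble(in[len - 1 - i]);

        if (nibble < 0) {
            return -EINVAL;
        }
        out[i / 2] |= (uint8_t)(nibble << ((i % 2) * 4));
    }
    return 0;
}
```

With this layout "abcd" is stored as the bytes cd ab followed by zeroes, and an odd-length input such as "abc" becomes bc 0a.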

Signed-off-by: Jean-Philippe Brucker 
---
 qapi/qom.json|  5 ++-
 target/arm/kvm-rme.c | 72 +++-
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 87fe7c31fe..a012281628 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -862,10 +862,13 @@
 #
 # @measurement-algo: Realm measurement algorithm (default: RMM default)
 #
+# @personalization-value: Realm personalization value (default: 0)
+#
 # Since: FIXME
 ##
 { 'struct': 'RmeGuestProperties',
-  'data': { '*measurement-algo': 'str' } }
+  'data': { '*measurement-algo': 'str',
+'*personalization-value': 'str' } }
 
 ##
 # @ObjectType:
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index 3929b941ae..e974c27e5c 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -22,13 +22,14 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
 #define RME_PAGE_SIZE qemu_real_host_page_size()
 
-#define RME_MAX_CFG 1
+#define RME_MAX_CFG 2
 
 typedef struct RmeGuest RmeGuest;
 
 struct RmeGuest {
 ConfidentialGuestSupport parent_obj;
 char *measurement_algo;
+char *personalization_value;
 };
 
 struct RmeImage {
@@ -65,6 +66,45 @@ static int rme_create_rd(RmeGuest *guest, Error **errp)
 return ret;
 }
 
+static int rme_parse_rpv(uint8_t *out, const char *in, Error **errp)
+{
+int ret;
+size_t in_len = strlen(in);
+
+/* Two chars per byte */
+if (in_len > KVM_CAP_ARM_RME_RPV_SIZE * 2) {
+error_setg(errp, "Realm Personalization Value is too large");
+return -E2BIG;
+}
+
+/*
+ * Parse as big-endian hexadecimal number (most significant byte on the
+ * left), store little-endian, zero-padded on the right.
+ */
+while (in_len) {
+/*
+ * Do the lower nibble first to catch invalid inputs such as '2z', and
+ * to handle the last char.
+ */
+in_len--;
+ret = sscanf(in + in_len, "%1hhx", out);
+if (ret != 1) {
+error_setg(errp, "Invalid Realm Personalization Value");
+return -EINVAL;
+}
+if (!in_len) {
+break;
+}
+in_len--;
+ret = sscanf(in + in_len, "%2hhx", out++);
+if (ret != 1) {
+error_setg(errp, "Invalid Realm Personalization Value");
+return -EINVAL;
+}
+}
+return 0;
+}
+
 static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp)
 {
 int ret;
@@ -87,6 +127,16 @@ static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp)
 }
 cfg_str = "hash algorithm";
 break;
+case KVM_CAP_ARM_RME_CFG_RPV:
+if (!guest->personalization_value) {
+return 0;
+}
+ret = rme_parse_rpv(args.rpv, guest->personalization_value, errp);
+if (ret) {
+return ret;
+}
+cfg_str = "personalization value";
+break;
 default:
 g_assert_not_reached();
 }
@@ -281,6 +331,21 @@ static void rme_set_measurement_algo(Object *obj, const char *value,
 guest->measurement_algo = g_strdup(value);
 }
 
+static char *rme_get_rpv(Object *obj, Error **errp)
+{
+RmeGuest *guest = RME_GUEST(obj);
+
+return g_strdup(guest->personalization_value);
+}
+
+static void rme_set_rpv(Object *obj, const char *value, Error **errp)
+{
+RmeGuest *guest = RME_GUEST(obj);
+
+g_free(guest->personalization_value);
+guest->personalization_value = g_strdup(value);
+}
+
 static void rme_guest_class_init(ObjectClass *oc, void *data)
 {
 object_class_property_add_str(oc, "measurement-algo",
@@ -288,6 +353,11 @@ static void rme_guest_class_init(ObjectClass *oc, void *data)
   rme_set_measurement_algo);
 object_class_property_set_description(oc, "measurement-algo",
 "Realm measurement algorithm ('sha256', 'sha512')");
+
+object_class_property_add_str(oc, "personalization-value", rme_get_rpv,
+  rme_set_rpv);
+object_class_property_set_description(oc, "personalization-value",
+"Realm personalization value (512-bit hexadecimal number)");
 }
 
 static const TypeInfo rme_guest_info = {
-- 
2.39.0
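
To illustrate the byte order that rme_parse_rpv() aims for (input read as a big-endian hex number, output stored little-endian and zero-padded), here is a standalone sketch. It does not reproduce the patch: it replaces the sscanf() calls with a manual nibble lookup, and the names RPV_SKETCH_SIZE, hex_nibble and rpv_sketch_parse are hypothetical. The 64-byte size is assumed from the "512-bit" wording in the commit message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 512 bits = 64 bytes, per the commit message. */
#define RPV_SKETCH_SIZE 64

/* Return the value of one hex digit, or -1 if the character is not hex. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') {
        return c - '0';
    }
    if (c >= 'a' && c <= 'f') {
        return c - 'a' + 10;
    }
    if (c >= 'A' && c <= 'F') {
        return c - 'A' + 10;
    }
    return -1;
}

/*
 * Parse a big-endian hex string into a little-endian byte array.
 * Returns 0 on success, -1 if the input is too long or not hex.
 */
static int rpv_sketch_parse(uint8_t out[RPV_SKETCH_SIZE], const char *in)
{
    size_t len = strlen(in);
    size_t i = 0;

    if (len > RPV_SKETCH_SIZE * 2) {
        return -1;
    }
    memset(out, 0, RPV_SKETCH_SIZE);

    /* Walk from the rightmost character: low nibble first, then high. */
    while (len) {
        int lo = hex_nibble(in[--len]);
        int hi = 0;

        if (lo < 0) {
            return -1;
        }
        if (len) {
            hi = hex_nibble(in[--len]);
            if (hi < 0) {
                return -1;
            }
        }
        out[i++] = (uint8_t)((hi << 4) | lo);
    }
    return 0;
}
```

With this layout, an odd-length input such as "abc" is treated as 0x0abc, so the leading digit ends up alone in the second byte.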

[RFC PATCH 10/16] target/arm/kvm-rme: Add measurement algorithm property

2023-01-27 Thread Jean-Philippe Brucker

This option selects which measurement algorithm to use for attestation.
Supported values are sha256 and sha512.

Signed-off-by: Jean-Philippe Brucker 
---
 qapi/qom.json| 14 -
 target/arm/kvm-rme.c | 71 
 2 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 7ca27bb86c..87fe7c31fe 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -855,6 +855,17 @@
   'data': { '*cpu-affinity': ['uint16'],
 '*node-affinity': ['uint16'] } }
 
+##
+# @RmeGuestProperties:
+#
+# Properties for rme-guest objects.
+#
+# @measurement-algo: Realm measurement algorithm (default: RMM default)
+#
+# Since: FIXME
+##
+{ 'struct': 'RmeGuestProperties',
+  'data': { '*measurement-algo': 'str' } }
 
 ##
 # @ObjectType:
@@ -985,7 +996,8 @@
   'tls-creds-x509': 'TlsCredsX509Properties',
   'tls-cipher-suites':  'TlsCredsProperties',
   'x-remote-object':'RemoteObjectProperties',
-  'x-vfio-user-server': 'VfioUserServerProperties'
+  'x-vfio-user-server': 'VfioUserServerProperties',
+  'rme-guest':  'RmeGuestProperties'
   } }
 
 ##
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index c8c019f78a..3929b941ae 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -22,10 +22,13 @@ OBJECT_DECLARE_SIMPLE_TYPE(RmeGuest, RME_GUEST)
 
 #define RME_PAGE_SIZE qemu_real_host_page_size()
 
+#define RME_MAX_CFG 1
+
 typedef struct RmeGuest RmeGuest;
 
 struct RmeGuest {
 ConfidentialGuestSupport parent_obj;
+char *measurement_algo;
 };
 
 struct RmeImage {
@@ -62,6 +65,40 @@ static int rme_create_rd(RmeGuest *guest, Error **errp)
 return ret;
 }
 
+static int rme_configure_one(RmeGuest *guest, uint32_t cfg, Error **errp)
+{
+int ret;
+const char *cfg_str;
+struct kvm_cap_arm_rme_config_item args = {
+.cfg = cfg,
+};
+
+switch (cfg) {
+case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
+if (!guest->measurement_algo) {
+return 0;
+}
+if (!strcmp(guest->measurement_algo, "sha256")) {
+args.hash_algo = KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256;
+} else if (!strcmp(guest->measurement_algo, "sha512")) {
+args.hash_algo = KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512;
+} else {
+g_assert_not_reached();
+}
+cfg_str = "hash algorithm";
+break;
+default:
+g_assert_not_reached();
+}
+
+ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_ARM_RME, 0,
+KVM_CAP_ARM_RME_CONFIG_REALM, (intptr_t)&args);
+if (ret) {
+error_setg_errno(errp, -ret, "RME: failed to configure %s", cfg_str);
+}
+return ret;
+}
+
 static void rme_populate_realm(gpointer data, gpointer user_data)
 {
 int ret;
@@ -128,6 +165,7 @@ static void rme_vm_state_change(void *opaque, bool running, RunState state)
 int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp)
 {
 int ret;
+int cfg;
 static Error *rme_mig_blocker;
 RmeGuest *guest = cgs_to_rme(cgs);
 
@@ -146,6 +184,13 @@ int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp)
 return -ENODEV;
 }
 
+for (cfg = 0; cfg < RME_MAX_CFG; cfg++) {
+ret = rme_configure_one(guest, cfg, errp);
+if (ret) {
+return ret;
+}
+}
+
 ret = rme_create_rd(guest, errp);
 if (ret) {
 return ret;
@@ -215,8 +260,34 @@ int kvm_arm_rme_vm_type(MachineState *ms)
 return 0;
 }
 
+static char *rme_get_measurement_algo(Object *obj, Error **errp)
+{
+RmeGuest *guest = RME_GUEST(obj);
+
+return g_strdup(guest->measurement_algo);
+}
+
+static void rme_set_measurement_algo(Object *obj, const char *value,
+ Error **errp)
+{
+RmeGuest *guest = RME_GUEST(obj);
+
+if (strcmp(value, "sha256") &&
+strcmp(value, "sha512")) {
+error_setg(errp, "invalid Realm measurement algorithm '%s'", value);
+return;
+}
+g_free(guest->measurement_algo);
+guest->measurement_algo = g_strdup(value);
+}
+
 static void rme_guest_class_init(ObjectClass *oc, void *data)
 {
+object_class_property_add_str(oc, "measurement-algo",
+  rme_get_measurement_algo,
+  rme_set_measurement_algo);
+object_class_property_set_description(oc, "measurement-algo",
+"Realm measurement algorithm ('sha256', 'sha512')");
 }
 
 static const TypeInfo rme_guest_info = {
-- 
2.39.0
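
The property setter and rme_configure_one() both have to agree on which algorithm names are valid, otherwise a string accepted by the setter can later hit g_assert_not_reached(). One way to keep them in sync is a single exact-match lookup, sketched below. The enum and function names are hypothetical and do not correspond to the KVM UAPI constants.

```c
#include <string.h>

/* Hypothetical identifiers for illustration only. */
enum algo_sketch {
    ALGO_SKETCH_INVALID = -1,
    ALGO_SKETCH_SHA256,
    ALGO_SKETCH_SHA512,
};

/*
 * Map an algorithm name to an identifier using exact comparison.
 * A prefix comparison such as strncmp(name, "sha256", 6) would also
 * accept "sha256x", which is why strcmp() is used here.
 */
static enum algo_sketch algo_sketch_from_string(const char *name)
{
    if (strcmp(name, "sha256") == 0) {
        return ALGO_SKETCH_SHA256;
    }
    if (strcmp(name, "sha512") == 0) {
        return ALGO_SKETCH_SHA512;
    }
    return ALGO_SKETCH_INVALID;
}
```

A setter could reject the value when the lookup returns ALGO_SKETCH_INVALID, and the configuration step could reuse the same function instead of repeating the string comparisons.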

[RFC PATCH 06/16] target/arm/kvm-rme: Initialize vCPU

2023-01-27 Thread Jean-Philippe Brucker

The target code calls kvm_arm_rme_vcpu_init() to mark the vCPU as part
of a Realm. RME support does not use the register lists, because the
host can only set the boot PC and registers x0-x7. The rest is private
to the Realm and saved/restored by the RMM.

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/cpu.h |  3 ++
 target/arm/kvm_arm.h |  1 +
 target/arm/helper.c  |  8 ++
 target/arm/kvm-rme.c | 10 +++
 target/arm/kvm.c | 12 
 target/arm/kvm64.c   | 65 ++--
 6 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 9aeed3c848..7d8397985f 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -937,6 +937,9 @@ struct ArchCPU {
 /* KVM steal time */
 OnOffAuto kvm_steal_time;
 
+/* Realm Management Extension */
+bool kvm_rme;
+
 /* Uniprocessor system with MP extensions */
 bool mp_is_up;
 
diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 00d3df8cac..e4dc7fbb8d 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -373,6 +373,7 @@ int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp);
 int kvm_arm_rme_vm_type(MachineState *ms);
 
 bool kvm_arm_rme_enabled(void);
+int kvm_arm_rme_vcpu_init(CPUState *cs);
 
 #else
 
diff --git a/target/arm/helper.c b/target/arm/helper.c
index d8c8223ec3..52360ae2ff 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -126,6 +126,10 @@ bool write_cpustate_to_list(ARMCPU *cpu, bool kvm_sync)
 int i;
 bool ok = true;
 
+if (cpu->kvm_rme) {
+return ok;
+}
+
 for (i = 0; i < cpu->cpreg_array_len; i++) {
 uint32_t regidx = kvm_to_cpreg_id(cpu->cpreg_indexes[i]);
 const ARMCPRegInfo *ri;
@@ -171,6 +175,10 @@ bool write_list_to_cpustate(ARMCPU *cpu)
 int i;
 bool ok = true;
 
+if (cpu->kvm_rme) {
+return ok;
+}
+
 for (i = 0; i < cpu->cpreg_array_len; i++) {
 uint32_t regidx = kvm_to_cpreg_id(cpu->cpreg_indexes[i]);
 uint64_t v = cpu->cpreg_values[i];
diff --git a/target/arm/kvm-rme.c b/target/arm/kvm-rme.c
index d7cdca1cbf..3833b187f9 100644
--- a/target/arm/kvm-rme.c
+++ b/target/arm/kvm-rme.c
@@ -118,6 +118,16 @@ int kvm_arm_rme_init(ConfidentialGuestSupport *cgs, Error **errp)
 return 0;
 }
 
+int kvm_arm_rme_vcpu_init(CPUState *cs)
+{
+ARMCPU *cpu = ARM_CPU(cs);
+
+if (kvm_arm_rme_enabled()) {
+cpu->kvm_rme = true;
+}
+return 0;
+}
+
 int kvm_arm_rme_vm_type(MachineState *ms)
 {
 if (cgs_to_rme(ms->cgs)) {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index f022c644d2..fcddead4fe 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -449,6 +449,10 @@ int kvm_arm_init_cpreg_list(ARMCPU *cpu)
 int i, ret, arraylen;
 CPUState *cs = CPU(cpu);
 
+if (cpu->kvm_rme) {
+return 0;
+}
+
 rl.n = 0;
 ret = kvm_vcpu_ioctl(cs, KVM_GET_REG_LIST, &rl);
 if (ret != -E2BIG) {
@@ -521,6 +525,10 @@ bool write_kvmstate_to_list(ARMCPU *cpu)
 int i;
 bool ok = true;
 
+if (cpu->kvm_rme) {
+return ok;
+}
+
 for (i = 0; i < cpu->cpreg_array_len; i++) {
 struct kvm_one_reg r;
 uint64_t regidx = cpu->cpreg_indexes[i];
@@ -557,6 +565,10 @@ bool write_list_to_kvmstate(ARMCPU *cpu, int level)
 int i;
 bool ok = true;
 
+if (cpu->kvm_rme) {
+return ok;
+}
+
 for (i = 0; i < cpu->cpreg_array_len; i++) {
 struct kvm_one_reg r;
 uint64_t regidx = cpu->cpreg_indexes[i];
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 55191496f3..b6320672b2 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -887,6 +887,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
 return ret;
 }
 
+ret = kvm_arm_rme_vcpu_init(cs);
+if (ret) {
+return ret;
+}
+
 if (cpu_isar_feature(aa64_sve, cpu)) {
 ret = kvm_arm_sve_set_vls(cs);
 if (ret) {
@@ -1080,6 +1085,35 @@ static int kvm_arch_put_sve(CPUState *cs)
 return 0;
 }
 
+static int kvm_arm_rme_put_core_regs(CPUState *cs, int level)
+{
+int i, ret;
+struct kvm_one_reg reg;
+ARMCPU *cpu = ARM_CPU(cs);
+CPUARMState *env = &cpu->env;
+
+/*
+ * The RME ABI only allows us to set 8 GPRs and the PC
+ */
+for (i = 0; i < 8; i++) {
+reg.id = AARCH64_CORE_REG(regs.regs[i]);
+reg.addr = (uintptr_t) &env->xregs[i];
+ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+if (ret) {
+return ret;
+}
+}
+
+reg.id = AARCH64_CORE_REG(regs.pc);
+reg.addr = (uintptr_t) &env->pc;
+ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+if (ret) {
+return ret;
+}
+
+return 0;
+}
+
 static int kvm_arm_put_core_regs(CPUState *cs)
 {
 struct kvm_one_reg reg;
@@ -1208,7 +1242,11 @@ int kvm_arch_put_registers(CPUState *cs, int level)
 int ret;
 ARMCPU

[RFC PATCH 04/16] hw/arm/virt: Add support for Arm RME

2023-01-27 Thread Jean-Philippe Brucker

When confidential-guest-support is enabled for the virt machine, call
the RME init function, and add the RME flag to the VM type.

* The Realm differentiates non-secure from realm memory using the upper
  GPA bit. Reserve that bit when creating the memory map, to make sure
  that device MMIO located in high memory can still fit.

* pvtime is disabled for the moment. Since the hypervisor has to write
  into the shared pvtime page before scheduling a vcpu, it seems
  incompatible with confidential guests.

Signed-off-by: Jean-Philippe Brucker 
---
 hw/arm/virt.c | 48 
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index b871350856..df613e634a 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -210,6 +210,11 @@ static const char *valid_cpus[] = {
 ARM_CPU_TYPE_NAME("max"),
 };
 
+static bool virt_machine_is_confidential(VirtMachineState *vms)
+{
+return MACHINE(vms)->cgs;
+}
+
 static bool cpu_type_valid(const char *cpu)
 {
 int i;
@@ -247,6 +252,14 @@ static void create_fdt(VirtMachineState *vms)
 exit(1);
 }
 
+/*
+ * Since the devicetree is included in the initial measurement, it must
+ * not contain random data.
+ */
+if (virt_machine_is_confidential(vms)) {
+vms->dtb_randomness = false;
+}
+
 ms->fdt = fdt;
 
 /* Header */
@@ -1924,6 +1937,15 @@ static void virt_cpu_post_init(VirtMachineState *vms, MemoryRegion *sysmem)
 steal_time = object_property_get_bool(OBJECT(first_cpu),
   "kvm-steal-time", NULL);
 
+if (virt_machine_is_confidential(vms)) {
+/*
+ * The host cannot write into a confidential guest's memory until the
+ * guest shares it. Since the host writes the pvtime region before the
+ * guest gets a chance to set it up, disable pvtime.
+ */
+steal_time = false;
+}
+
 if (kvm_enabled()) {
 hwaddr pvtime_reg_base = vms->memmap[VIRT_PVTIME].base;
 hwaddr pvtime_reg_size = vms->memmap[VIRT_PVTIME].size;
@@ -2053,10 +2075,11 @@ static void machvirt_init(MachineState *machine)
  * if the guest has EL2 then we will use SMC as the conduit,
  * and otherwise we will use HVC (for backwards compatibility and
  * because if we're using KVM then we must use HVC).
+ * Realm guests must also use SMC.
  */
 if (vms->secure && firmware_loaded) {
 vms->psci_conduit = QEMU_PSCI_CONDUIT_DISABLED;
-} else if (vms->virt) {
+} else if (vms->virt || virt_machine_is_confidential(vms)) {
 vms->psci_conduit = QEMU_PSCI_CONDUIT_SMC;
 } else {
 vms->psci_conduit = QEMU_PSCI_CONDUIT_HVC;
@@ -2102,6 +2125,8 @@ static void machvirt_init(MachineState *machine)
 exit(1);
 }
 
+kvm_arm_rme_init(machine->cgs, &error_fatal);
+
 create_fdt(vms);
 
 assert(possible_cpus->len == max_cpus);
@@ -2854,15 +2879,26 @@ static HotplugHandler *virt_machine_get_hotplug_handler(MachineState *machine,
 static int virt_kvm_type(MachineState *ms, const char *type_str)
 {
 VirtMachineState *vms = VIRT_MACHINE(ms);
+int rme_vm_type = kvm_arm_rme_vm_type(ms);
 int max_vm_pa_size, requested_pa_size;
+int rme_reserve_bit = 0;
 bool fixed_ipa;
 
-max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa);
+if (rme_vm_type) {
+/*
+ * With RME, the upper GPA bit differentiates Realm from NS memory.
+ * Reserve the upper bit to guarantee that highmem devices will fit.
+ */
+rme_reserve_bit = 1;
+}
+
+max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa) -
+ rme_reserve_bit;
 
 /* we freeze the memory map to compute the highest gpa */
 virt_set_memmap(vms, max_vm_pa_size);
 
-requested_pa_size = 64 - clz64(vms->highest_gpa);
+requested_pa_size = 64 - clz64(vms->highest_gpa) + rme_reserve_bit;
 
 /*
  * KVM requires the IPA size to be at least 32 bits.
@@ -2883,7 +2919,11 @@ static int virt_kvm_type(MachineState *ms, const char *type_str)
  * the implicit legacy 40b IPA setting, in which case the kvm_type
  * must be 0.
  */
-return fixed_ipa ? 0 : requested_pa_size;
+if (fixed_ipa) {
+return 0;
+}
+
+return requested_pa_size | rme_vm_type;
 }
 
 static void virt_machine_class_init(ObjectClass *oc, void *data)
-- 
2.39.0
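
The IPA size arithmetic in virt_kvm_type() can be checked in isolation. The sketch below mirrors the "64 - clz64(highest_gpa) + rme_reserve_bit" expression from the hunk above, using a portable loop in place of clz64(). The function names are hypothetical, and the rest of virt_kvm_type() (the 32-bit minimum, the fixed_ipa case) is left out.

```c
#include <stdint.h>

/*
 * Number of bits needed to represent v.
 * Gives the same result as 64 - clz64(v), including 0 for v == 0.
 */
static int sketch_bit_length(uint64_t v)
{
    int n = 0;

    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/*
 * Hypothetical helper: IPA size to request from KVM. For a Realm, one
 * extra bit is added because the upper GPA bit differentiates Realm
 * from non-secure memory.
 */
static int sketch_requested_pa_size(uint64_t highest_gpa, int is_realm)
{
    int reserve_bit = is_realm ? 1 : 0;

    return sketch_bit_length(highest_gpa) + reserve_bit;
}
```

For example, a memory map whose highest GPA needs 40 bits leads to a 41-bit request for a Realm, which is why the patch also subtracts the reserved bit from max_vm_pa_size before laying out the memory map.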

[RFC PATCH 15/16] target/arm/kvm: Disable Realm reboot

2023-01-27 Thread Jean-Philippe Brucker

A Realm cannot be reset; it must be recreated from scratch. The RMM
specification defines states of a Realm as NEW -> ACTIVE -> SYSTEM_OFF,
after which the Realm can only be destroyed. A PSCI_SYSTEM_RESET call,
which normally reboots the system, puts the Realm in SYSTEM_OFF state.

QEMU does not support recreating a VM. Normally, a reboot request by the
guest causes all devices to reset, which cannot work for a Realm.
Indeed, loading images into Realm memory and changing the PC is only
allowed for a Realm in NEW state. Resetting the images for a Realm in
SYSTEM_OFF state will cause QEMU to crash with a bus error.

Handle reboot requests by the guest more gracefully, by indicating to
runstate.c that the vCPUs of a Realm are not resettable, and that QEMU
should exit.

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/kvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index d8655d9041..02b5e8009f 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1071,7 +1071,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 
 bool kvm_arch_cpu_check_are_resettable(void)
 {
-return true;
+/* A Realm cannot be reset */
+return !kvm_arm_rme_enabled();
 }
 
 void kvm_arch_accel_class_init(ObjectClass *oc)
-- 
2.39.0
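
The Realm lifecycle described in the commit message (NEW -> ACTIVE -> SYSTEM_OFF, with images and the PC settable only in NEW) can be modelled as a small forward-only state machine. This is a sketch with hypothetical names, meant only to show why a reset has no valid transition; it is not how QEMU or the RMM track the state.

```c
/* Hypothetical names; the three states are those in the commit message. */
enum realm_state_sketch {
    REALM_SKETCH_NEW,
    REALM_SKETCH_ACTIVE,
    REALM_SKETCH_SYSTEM_OFF,
};

/* Images and the boot PC may only be set while the Realm is NEW. */
static int realm_sketch_can_load_images(enum realm_state_sketch s)
{
    return s == REALM_SKETCH_NEW;
}

/* Only forward transitions exist: NEW -> ACTIVE -> SYSTEM_OFF. */
static int realm_sketch_transition_ok(enum realm_state_sketch from,
                                      enum realm_state_sketch to)
{
    return (from == REALM_SKETCH_NEW && to == REALM_SKETCH_ACTIVE) ||
           (from == REALM_SKETCH_ACTIVE && to == REALM_SKETCH_SYSTEM_OFF);
}

/*
 * A reset would require going back to NEW so that images can be
 * reloaded. No state allows that, so this always returns 0.
 */
static int realm_sketch_is_resettable(enum realm_state_sketch s)
{
    return realm_sketch_transition_ok(s, REALM_SKETCH_NEW);
}
```

Since nothing leads back to NEW, the only way to "reboot" is to destroy the Realm and create a new one, which is why the patch makes QEMU exit instead.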

[RFC PATCH 05/16] target/arm/kvm: Split kvm_arch_get/put_registers

2023-01-27 Thread Jean-Philippe Brucker

The confidential guest support in KVM limits the number of registers
that we can read and write. Split the get/put_registers function to
prepare for it.

Signed-off-by: Jean-Philippe Brucker 
---
 target/arm/kvm64.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 1197253d12..55191496f3 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -1080,7 +1080,7 @@ static int kvm_arch_put_sve(CPUState *cs)
 return 0;
 }
 
-int kvm_arch_put_registers(CPUState *cs, int level)
+static int kvm_arm_put_core_regs(CPUState *cs)
 {
 struct kvm_one_reg reg;
 uint64_t val;
@@ -1200,6 +1200,19 @@ int kvm_arch_put_registers(CPUState *cs, int level)
 return ret;
 }
 
+return 0;
+}
+
+int kvm_arch_put_registers(CPUState *cs, int level)
+{
+int ret;
+ARMCPU *cpu = ARM_CPU(cs);
+
+ret = kvm_arm_put_core_regs(cs);
+if (ret) {
+return ret;
+}
+
 write_cpustate_to_list(cpu, true);
 
 if (!write_list_to_kvmstate(cpu, level)) {
@@ -1293,7 +1306,7 @@ static int kvm_arch_get_sve(CPUState *cs)
 return 0;
 }
 
-int kvm_arch_get_registers(CPUState *cs)
+static int kvm_arm_get_core_regs(CPUState *cs)
 {
 struct kvm_one_reg reg;
 uint64_t val;
@@ -1413,6 +1426,19 @@ int kvm_arch_get_registers(CPUState *cs)
 }
 vfp_set_fpcr(env, fpr);
 
+return 0;
+}
+
+int kvm_arch_get_registers(CPUState *cs)
+{
+int ret;
+ARMCPU *cpu = ARM_CPU(cs);
+
+ret = kvm_arm_get_core_regs(cs);
+if (ret) {
+return ret;
+}
+
 ret = kvm_get_vcpu_events(cpu);
 if (ret) {
 return ret;
-- 
2.39.0
