Re: Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159

2024-06-04 Thread Ard Biesheuvel
On Tue, 4 Jun 2024 at 20:08, Jean-Philippe Brucker
 wrote:
>
> On Fri, May 31, 2024 at 05:24:44PM +0200, Ard Biesheuvel wrote:
> > > I'm able to reproduce this even without RME. This code was introduced
> > > recently by c98f7f755089 ("ArmVirtPkg: Use dynamic PCD to set the SMCCC
> > > conduit"). Maybe Ard (Cc'd) knows what could be going wrong here.
> > >
> > > A slightly reduced reproducer:
> > >
> > > $ cd edk2/
> > > $ build -b DEBUG -a AARCH64 -t GCC5 -p ArmVirtPkg/ArmVirtQemuKernel.dsc
> > > $ cd ..
> > >
> > > $ git clone https://github.com/ARM-software/arm-trusted-firmware.git tf-a
> > > $ cd tf-a/
> > > $ make -j CROSS_COMPILE=aarch64-linux-gnu- PLAT=qemu DEBUG=1 LOG_LEVEL=40 
> > > QEMU_USE_GIC_DRIVER=QEMU_GICV3 
> > > BL33=../edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd 
> > > all fip && \
> > >   dd if=build/qemu/debug/bl1.bin of=flash.bin && \
> > >   dd if=build/qemu/debug/fip.bin of=flash.bin seek=64 bs=4096
> > > $ qemu-system-aarch64 -M virt,virtualization=on,secure=on,gic-version=3 
> > > -cpu max -m 2G -smp 8 -monitor none -serial mon:stdio -nographic -bios 
> > > flash.bin
> > >
> >
> > Hmm, this is not something I anticipated.
> >
> > The problem here is that ArmVirtQemuKernel does not actually support
> > dynamic PCDs, so instead, the PCD here is 'patchable', which means
> > that the underlying value is just overwritten in the binary image, and
> > does not propagate to the rest of the firmware. I assume the write
> > ends up targeting a location that does not tolerate this.
>
> Yes, the QemuVirtMemInfoLib declares this region read-only, so we end up
> with a permission fault
>
>   // Map the FV region as normal executable memory
>   VirtualMemoryTable[2].PhysicalBase = PcdGet64 (PcdFvBaseAddress);
>   VirtualMemoryTable[2].VirtualBase  = VirtualMemoryTable[2].PhysicalBase;
>   VirtualMemoryTable[2].Length   = FixedPcdGet32 (PcdFvSize);
>   VirtualMemoryTable[2].Attributes   = 
> ARM_MEMORY_REGION_ATTRIBUTE_WRITE_BACK_RO;
>
> Making it writable doesn't seem sufficient, since I then get a "HVC issued
> at EL2" fault. I'll keep debugging.
>

That is expected, sadly. As I said, this code was never intended to run at EL2.

The dynamic PCD will propagate to other boot stages. However, the
'patchable' PCD that we use in ArmVirtQemuKernel is local to the
driver, and other users of the PCD will see the default value of
'HVC'. That would be fine if we only executed at EL1.
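
A rough sketch of what that means in practice (the PCD accessors are the
real ones used in the code quoted below; the two wrapper functions are
purely illustrative, not actual ArmVirtPkg code):

  /* Producer: runs once in PlatformPeim (quoted further down). */
  STATIC
  VOID
  RecordSmcccConduit (
    VOID
    )
  {
    EFI_STATUS  Status;

    if (ArmReadCurrentEL () == AARCH64_EL2) {
      Status = PcdSetBoolS (PcdMonitorConduitHvc, FALSE);   /* conduit = SMC */
      ASSERT_EFI_ERROR (Status);
    }
  }

  /* Consumer: any other module that issues SMCCC calls. */
  STATIC
  BOOLEAN
  UseHvcConduit (
    VOID
    )
  {
    return PcdGetBool (PcdMonitorConduitHvc);
  }

With a dynamic PCD, producer and consumer share a single value in the PCD
database, so the consumer sees FALSE and uses SMC. With a patchable-in-module
PCD, the set only rewrites the producer's own copy; every other module keeps
its build-time default of TRUE and issues an HVC, which is exactly what goes
wrong once we are already running at EL2.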

So I know exactly what is wrong and have an idea how to fix it - I
just need to find the time for it.



Re: Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159

2024-05-31 Thread Ard Biesheuvel
On Fri, 31 May 2024 at 17:09, Jean-Philippe Brucker
 wrote:
>
> Hi Gavin,
>
> On Fri, May 31, 2024 at 04:23:13PM +1000, Gavin Shan wrote:
> > I got a chance to try the CCA software components, as suggested by [1].
> > However, the edk2 firmware is stuck somewhere. I haven't reached the stage
> > of loading the guest kernel yet. I'm replying to see if anyone has an idea.
> ...
> > INFO:BL31: Preparing for EL3 exit to normal world
> > INFO:Entry point address = 0x6000
> > INFO:SPSR = 0x3c9
> > UEFI firmware (version  built at 01:31:23 on May 31 2024)
> >
> > The boot is stuck and no more output after that. I tried adding more 
> > verbose output
> > from edk2 and found it's stuck at the following point.
> >
> >
> > ArmVirtPkg/PrePi/PrePi.c::PrePiMain
> > ArmVirtPkg/Library/PlatformPeiLib/PlatformPeiLib.c::PlatformPeim
> >
> >  #ifdef MDE_CPU_AARCH64
> >   //
> >   // Set the SMCCC conduit to SMC if executing at EL2, which is typically 
> > the
> >   // exception level that services HVCs rather than the one that invokes 
> > them.
> >   //
> >   if (ArmReadCurrentEL () == AARCH64_EL2) {
> > Status = PcdSetBoolS (PcdMonitorConduitHvc, FALSE);   // The
> > function never returns in my case
> > ASSERT_EFI_ERROR (Status);
> >   }
> >  #endif
>
> I'm able to reproduce this even without RME. This code was introduced
> recently by c98f7f755089 ("ArmVirtPkg: Use dynamic PCD to set the SMCCC
> conduit"). Maybe Ard (Cc'd) knows what could be going wrong here.
>
> A slightly reduced reproducer:
>
> $ cd edk2/
> $ build -b DEBUG -a AARCH64 -t GCC5 -p ArmVirtPkg/ArmVirtQemuKernel.dsc
> $ cd ..
>
> $ git clone https://github.com/ARM-software/arm-trusted-firmware.git tf-a
> $ cd tf-a/
> $ make -j CROSS_COMPILE=aarch64-linux-gnu- PLAT=qemu DEBUG=1 LOG_LEVEL=40 
> QEMU_USE_GIC_DRIVER=QEMU_GICV3 
> BL33=../edk2/Build/ArmVirtQemuKernel-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd all 
> fip && \
>   dd if=build/qemu/debug/bl1.bin of=flash.bin && \
>   dd if=build/qemu/debug/fip.bin of=flash.bin seek=64 bs=4096
> $ qemu-system-aarch64 -M virt,virtualization=on,secure=on,gic-version=3 -cpu 
> max -m 2G -smp 8 -monitor none -serial mon:stdio -nographic -bios flash.bin
>

Hmm, this is not something I anticipated.

The problem here is that ArmVirtQemuKernel does not actually support
dynamic PCDs, so instead, the PCD here is 'patchable', which means
that the underlying value is just overwritten in the binary image, and
does not propagate to the rest of the firmware. I assume the write
ends up targeting a location that does not tolerate this.

Running ArmVirtQemu or ArmVirtQemuKernel at EL2 has really only ever
worked by accident; it was simply never intended for that. The fix in
question was a last-minute tweak to prevent some CVE fixes pushed by
Microsoft from breaking network boot entirely, and now that the
release has been made, I guess we should revisit this and fix it
properly.

So the underlying issue here is that on these platforms, we need to
decide at runtime whether to use HVC or SMC instructions for SMCCC
calls. This code attempts to record this into a dynamic PCD once at
boot, in a way that permits other users of the same library to simply
hardcode this in the platform definition (given that bare metal
platforms never need this flexibility).



Re: [edk2-devel] [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled

2024-04-19 Thread Ard Biesheuvel
On Fri, 19 Apr 2024 at 18:36, Ard Biesheuvel  wrote:
>
> On Fri, 19 Apr 2024 at 18:09, Jonathan Cameron via groups.io
>  wrote:
> >
> > On Fri, 19 Apr 2024 13:52:07 +0200
> > Gerd Hoffmann  wrote:
> >
> > >   Hi,
> > >
> > > > Gerd, any ideas?  Maybe I need something subtly different in my
> > > > edk2 build?  I've not looked at this bit of the qemu infrastructure
> > > > before - is there a document on how that image is built?
> > >
> > > There is roms/Makefile for that.
> > >
> > > make -C roms help
> > > make -C roms efi
> > >
> > > So easiest would be to just update the edk2 submodule to what you
> > > need, then rebuild.
> > >
> > > The build is handled by the roms/edk2-build.py script,
> > > with the build configuration being in roms/edk2-build.config.
> > > That is usable outside the qemu source tree too, i.e. like this:
> > >
> > >   python3 /path/to/qemu.git/roms/edk2-build.py \
> > > --config /path/to/qemu.git/roms/edk2-build.config \
> > > --core /path/to/edk2.git \
> > > --match armvirt \
> > > --silent --no-logs
> > >
> > > That'll try to place the built images in "../pc-bios", so it may be better
> > > to work with a copy of the config file where you adjust this.
> > >
> > > HTH,
> > >   Gerd
> > >
> >
> > Thanks Gerd!
> >
> > So the builds are very similar via the two methods...
> > However - the QEMU build sets -D CAVIUM_ERRATUM_27456=TRUE
> >
> > And that's the difference - with that set for my other builds the alignment
> > problems go away...
> >
> > Any idea why we have that set in roms/edk2-build.config?
> > Superficially it seems rather unlikely anyone cares about ThunderX1 bugs now
> > (if they do, we need to get them some new hardware with fresh bugs), and this
> > config file was only added last year.
> >
> >
> > However, the last comment in Ard's commit message below seems
> > highly likely to be relevant!
> >
> > Chasing through Ard's patch it has the side effect of dropping
> > an override of a requirement for strict alignment.
> > So without the errata
> > DEFINE GCC_AARCH64_CC_XIPFLAGS = -mstrict-align -mgeneral-regs-only
> > is replaced with
> >  [BuildOptions]
> > +!if $(CAVIUM_ERRATUM_27456) == TRUE^M
> > +  GCC:*_*_AARCH64_PP_FLAGS = -DCAVIUM_ERRATUM_27456^M
> > +!else^M
> >GCC:*_*_AARCH64_CC_XIPFLAGS ==
> > +!endif^M
> >
> > The edk2 commit that added this was the following +CC Ard.
> >
> > Given I wasn't sure of the syntax of that file I set it
> > manually to the original value and indeed it works.
> >
> >
> > commit ec54ce1f1ab41b92782b37ae59e752fff0ef9c41
> > Author: Ard Biesheuvel 
> > Date:   Wed Jan 4 16:51:35 2023 +0100
> >
> > ArmVirtPkg/ArmVirtQemu: Avoid early ID map on ThunderX
> >
> > The early ID map used by ArmVirtQemu uses ASID scoped non-global
> > mappings, as this allows us to switch to the permanent ID map seamlessly
> > without the need for explicit TLB maintenance.
> >
> > However, this triggers a known erratum on ThunderX, which does not
> > tolerate non-global mappings that are executable at EL1, as this appears
> > to result in I-cache corruption. (Linux disables the KPTI based Meltdown
> > mitigation on ThunderX for the same reason)
> >
> > So work around this, by detecting the CPU implementor and part number,
> > and proceeding without the early ID map if a ThunderX CPU is detected.
> >
> > Note that this requires the C code to be built with strict alignment
> > again, as we may end up executing it with the MMU and caches off.
> >
> > Signed-off-by: Ard Biesheuvel 
> > Acked-by: Laszlo Ersek 
> > Tested-by: dann frazier 
> >
> > Test case is
> > qemu-system-aarch64 -M virt,virtualization=true, -m 4g -cpu cortex-a76 \
> > -bios QEMU_EFI.fd -d int
> >
> > Which gets alignment faults since:
> > https://lore.kernel.org/all/20240301204110.656742-6-richard.hender...@linaro.org/
> >
> > So my feeling here is EDK2 should either have yet another config for QEMU 
> > as a host
> > or should always set the alignment without needing to pick the CAVIUM 27456
> > erratum, which I suspect will get dropped soonish anyway if anyone ever
> > cleans up old errata.
> >
>
> This code was never really intended for execution

Re: [edk2-devel] [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled

2024-04-19 Thread Ard Biesheuvel
On Fri, 19 Apr 2024 at 18:09, Jonathan Cameron via groups.io
 wrote:
>
> On Fri, 19 Apr 2024 13:52:07 +0200
> Gerd Hoffmann  wrote:
>
> >   Hi,
> >
> > > Gerd, any ideas?  Maybe I need something subtly different in my
> > > edk2 build?  I've not looked at this bit of the qemu infrastructure
> > > before - is there a document on how that image is built?
> >
> > There is roms/Makefile for that.
> >
> > make -C roms help
> > make -C roms efi
> >
> > So easiest would be to just update the edk2 submodule to what you
> > need, then rebuild.
> >
> > The build is handled by the roms/edk2-build.py script,
> > with the build configuration being in roms/edk2-build.config.
> > That is usable outside the qemu source tree too, i.e. like this:
> >
> >   python3 /path/to/qemu.git/roms/edk2-build.py \
> > --config /path/to/qemu.git/roms/edk2-build.config \
> > --core /path/to/edk2.git \
> > --match armvirt \
> > --silent --no-logs
> >
> > That'll try to place the built images in "../pc-bios", so it may be better
> > to work with a copy of the config file where you adjust this.
> >
> > HTH,
> >   Gerd
> >
>
> Thanks Gerd!
>
> So the builds are very similar via the two methods...
> However - the QEMU build sets -D CAVIUM_ERRATUM_27456=TRUE
>
> And that's the difference - with that set for my other builds the alignment
> problems go away...
>
> Any idea why we have that set in roms/edk2-build.config?
> Superficially it seems rather unlikely anyone cares about ThunderX1 bugs now
> (if they do, we need to get them some new hardware with fresh bugs), and this
> config file was only added last year.
>
>
> However, the last comment in Ard's commit message below seems
> highly likely to be relevant!
>
> Chasing through Ard's patch it has the side effect of dropping
> an override of a requirement for strict alignment.
> So without the errata
> DEFINE GCC_AARCH64_CC_XIPFLAGS = -mstrict-align -mgeneral-regs-only
> is replaced with
>  [BuildOptions]
> +!if $(CAVIUM_ERRATUM_27456) == TRUE^M
> +  GCC:*_*_AARCH64_PP_FLAGS = -DCAVIUM_ERRATUM_27456^M
> +!else^M
>GCC:*_*_AARCH64_CC_XIPFLAGS ==
> +!endif^M
>
> The edk2 commit that added this was the following +CC Ard.
>
> Given I wasn't sure of the syntax of that file I set it
> manually to the original value and indeed it works.
>
>
> commit ec54ce1f1ab41b92782b37ae59e752fff0ef9c41
> Author: Ard Biesheuvel 
> Date:   Wed Jan 4 16:51:35 2023 +0100
>
> ArmVirtPkg/ArmVirtQemu: Avoid early ID map on ThunderX
>
> The early ID map used by ArmVirtQemu uses ASID scoped non-global
> mappings, as this allows us to switch to the permanent ID map seamlessly
> without the need for explicit TLB maintenance.
>
> However, this triggers a known erratum on ThunderX, which does not
> tolerate non-global mappings that are executable at EL1, as this appears
> to result in I-cache corruption. (Linux disables the KPTI based Meltdown
> mitigation on ThunderX for the same reason)
>
> So work around this, by detecting the CPU implementor and part number,
> and proceeding without the early ID map if a ThunderX CPU is detected.
>
> Note that this requires the C code to be built with strict alignment
> again, as we may end up executing it with the MMU and caches off.
>
> Signed-off-by: Ard Biesheuvel 
> Acked-by: Laszlo Ersek 
> Tested-by: dann frazier 
>
> Test case is
> qemu-system-aarch64 -M virt,virtualization=true, -m 4g -cpu cortex-a76 \
> -bios QEMU_EFI.fd -d int
>
> Which gets alignment faults since:
> https://lore.kernel.org/all/20240301204110.656742-6-richard.hender...@linaro.org/
>
> So my feeling here is EDK2 should either have yet another config for QEMU as 
> a host
> or should always set the alignment without needing to pick the CAVIUM 27456
> erratum, which I suspect will get dropped soonish anyway if anyone ever
> cleans up old errata.
>

This code was never really intended for execution at EL2, but it
happened to work, partially because of TCG's lack of strict alignment
checking when the MMU is off.

Those assumptions no longer hold, so yes, let's get this fixed properly.
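
To illustrate the strict-alignment angle (a generic example, not code taken
from ArmVirtQemu): without -mstrict-align the compiler is free to emit an
unaligned access, which is fine for Normal memory with the MMU on, but with
the MMU off every access is treated as Device memory and an unaligned access
takes an alignment fault.

  #include <stdint.h>

  struct __attribute__((packed)) hdr {
      uint8_t  tag;
      uint32_t value;             /* sits at offset 1, i.e. misaligned */
  };

  uint32_t read_value(const struct hdr *h)
  {
      /* Without -mstrict-align this may compile to a single unaligned
       * 32-bit load; with -mstrict-align it becomes byte loads. */
      return h->value;
  }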

Given VHE and nested virt (which will likely imply VHE in practice), I
would like to extend this functionality (i.e., the use of preliminary
page tables in NOR flash) to EL2 as well, but with VHE enabled. This
means we can still elide TLB maintenance (and BBM checks) by using
different ASIDs, and otherwise fall back to entering with the MMU off
if VHE is not available. In that case, we should enforce strict
alignment too, so that needs to be fixed regardless.
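
Roughly sketched, the boot-time decision described above looks like this
(all function names are placeholders, not existing ArmVirtPkg code):

  #include <stdbool.h>

  extern unsigned int current_el(void);
  extern bool vhe_supported(void);                 /* ID_AA64MMFR1_EL1.VH */
  extern void enable_vhe(void);                    /* HCR_EL2.E2H = 1 */
  extern void enable_mmu_with_asid_scoped_id_map(void);
  extern void run_with_mmu_off_until_real_tables(void);

  void early_mmu_setup(void)
  {
      if (current_el() == 1) {
          /* Existing case: ASID-scoped non-global mappings let us swap the
           * temporary ID map for the permanent one without TLB maintenance
           * or break-before-make. */
          enable_mmu_with_asid_scoped_id_map();
      } else if (vhe_supported()) {
          /* EL2 with VHE: once E2H is set, the EL2&0 regime has ASIDs too,
           * so the same trick applies. */
          enable_vhe();
          enable_mmu_with_asid_scoped_id_map();
      } else {
          /* EL2 without VHE: this regime has no ASIDs, so enter with the
           * MMU off, which is why strict alignment must be enforced. */
          run_with_mmu_off_until_real_tables();
      }
  }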

I'll try to code something up and send it round. In the meantime,
feel free to propose a minimal patch that reinstates the strict
alignment if you are pressed for time, and I'll merge it right away.



Re: [PATCH] target/arm: Advertise Cortex-A53 erratum #843419 fix via REVIDR

2024-02-15 Thread Ard Biesheuvel
On Thu, 15 Feb 2024 at 21:47, Richard Henderson
 wrote:
>
> On 2/15/24 06:02, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel 
> >
> > The Cortex-A53 r0p4 revision that QEMU emulates is affected by a CatA
> > erratum #843419 (i.e., the most severe), which requires workarounds in
> > the toolchain as well as the OS.
> >
> > Since the emulation is obviously not affected in the same way, we can
> > indicate this via REVIDR bit #8, which on r0p4 has the meaning that no
> > workarounds for erratum #843419 are needed.
> >
> > Signed-off-by: Ard Biesheuvel 
> > ---
> >   target/arm/cpu64.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
> > index 8e30a7993e..0f7a44a28f 100644
> > --- a/target/arm/cpu64.c
> > +++ b/target/arm/cpu64.c
> > @@ -663,7 +663,7 @@ static void aarch64_a53_initfn(Object *obj)
> >   set_feature(&cpu->env, ARM_FEATURE_PMU);
> >   cpu->kvm_target = QEMU_KVM_ARM_TARGET_CORTEX_A53;
> >   cpu->midr = 0x410fd034;
> > -cpu->revidr = 0x00000000;
> > +cpu->revidr = 0x00000100;
>
> Is it worth indicating all three errata fixes (bits 7-9)?
>

835769 has a build time workaround in the linker which I don't think
we even bother to enable in the kernel build. It is definitely not
something the OS needs to worry about at runtime, so I don't think it
matters.

The other one is a performance related CatC without a workaround, so
that one can be ignored as well.

OTOH, our emulation is affected by neither, so setting the REVIDR bits
for them makes sense. But there is simply no software that I am aware
of that will behave differently as a result (as opposed to the one for
843419, which is read by the Linux kernel and triggers workaround
logic in the module loader).
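
For illustration, a consumer-side check might look roughly like this; it is a
sketch, not the actual Linux logic, and it only covers the r0p4 revision that
QEMU emulates (real detection spans several A53 revisions):

  #include <stdbool.h>
  #include <stdint.h>

  #define MIDR_CORTEX_A53_R0P4  0x410fd034u
  #define REVIDR_843419_FIXED   (1u << 8)        /* the bit set by the patch */

  static inline uint64_t read_midr(void)
  {
      uint64_t v;
      __asm__("mrs %0, midr_el1" : "=r"(v));
      return v;
  }

  static inline uint64_t read_revidr(void)
  {
      uint64_t v;
      __asm__("mrs %0, revidr_el1" : "=r"(v));
      return v;
  }

  static bool need_843419_workaround(void)
  {
      if ((uint32_t)read_midr() != MIDR_CORTEX_A53_R0P4) {
          return false;                          /* not the emulated r0p4 */
      }
      return !(read_revidr() & REVIDR_843419_FIXED);
  }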

> Anyway,
> Reviewed-by: Richard Henderson 
>

Thanks.



[PATCH] target/arm: Advertise Cortex-A53 erratum #843419 fix via REVIDR

2024-02-15 Thread Ard Biesheuvel
From: Ard Biesheuvel 

The Cortex-A53 r0p4 revision that QEMU emulates is affected by a CatA
erratum #843419 (i.e., the most severe), which requires workarounds in
the toolchain as well as the OS.

Since the emulation is obviously not affected in the same way, we can
indicate this via REVIDR bit #8, which on r0p4 has the meaning that no
workarounds for erratum #843419 are needed.

Signed-off-by: Ard Biesheuvel 
---
 target/arm/cpu64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 8e30a7993e..0f7a44a28f 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -663,7 +663,7 @@ static void aarch64_a53_initfn(Object *obj)
 set_feature(&cpu->env, ARM_FEATURE_PMU);
 cpu->kvm_target = QEMU_KVM_ARM_TARGET_CORTEX_A53;
 cpu->midr = 0x410fd034;
-cpu->revidr = 0x00000000;
+cpu->revidr = 0x00000100;
 cpu->reset_fpsid = 0x41034070;
 cpu->isar.mvfr0 = 0x10110222;
 cpu->isar.mvfr1 = 0x12111111;
-- 
2.43.0.687.g38aa6559b0-goog




Re: [PATCH v2 0/3] virt: wire up NS EL2 virtual timer IRQ

2024-02-14 Thread Ard Biesheuvel
On Mon, 22 Jan 2024 at 15:35, Peter Maydell  wrote:
>
> This patchset wires up the NS EL2 virtual timer IRQ on the virt
> board, similarly to what commit 058262e0a8b2 did for the sbsa-ref board.
>
> Version 1 was an RFC patchset, originally sent back in autumn:
> https://patchew.org/QEMU/20230919101240.2569334-1-peter.mayd...@linaro.org/
> The main reason for it being an RFC is that the change, while correct,
> triggers a bug in EDK2 guest firmware that makes EDK2 assert on bootup.
> Since the RFC, we've upgraded our in-tree version of the EDK2 binaries
> to a version that has the fix for that bug, so I think the QEMU side of
> these patches is ready to go in now.
>
> To accommodate users who might still be using older EDK2 binaries,
> we only expose the IRQ in the DTB and ACPI tables for virt-9.0 and
> later machine types.
>
> If you see in the guest:
>  ASSERT [ArmTimerDxe] 
> /home/kraxel/projects/qemu/roms/edk2/ArmVirtPkg/Library/ArmVirtTimerFdtClientLib/ArmVirtTimerFdtClientLib.c(72):
>  PropSize == 36 || PropSize == 48
>
> then your options are:
>  * update your EDK2 binaries to edk2-stable202311 or newer
>  * use the 'virt-8.2' versioned machine type
>  * not use 'virtualization=on'
>
> I'll put something about this into the release notes when this
> goes into git. (There are other reasons why you probably want a
> newer EDK2 for AArch64 guests, so this is worth flagging up to our
> downstream distros who don't take our pre-built firmware binaries.)
>
> changes v1->v2:
>  * the change in DTB and ACPI tables is now tied to the machine version
>  * handle change of the ARCH_TIMER_*_IRQ values from PPI numbers to INTIDs
>  * bump the FADT header to indicate ACPI v6.3, since we might be using
>a 6.3 feature in the GTDT
>  * the avocado tests now all pass, because we have updated our copy
>of EDK2 in pc-bios/ to a version which has the fix for the bug
>which would otherwise cause it to assert on bootup
>  * patch 2 commit message improved to give details of the EDK2 assert and
>state the options for dealing with it (this will also go into the
>QEMU release notes)
>
> thanks
> -- PMM
>
> Peter Maydell (3):
>   tests/qtest/bios-tables-test: Allow changes to virt GTDT
>   hw/arm/virt: Wire up non-secure EL2 virtual timer IRQ
>   tests/qtest/bios-tables-tests: Update virt golden reference
>

Reviewed-by: Ard Biesheuvel 



Re: EDK2 ArmVirtQemu behaviour with multiple UARTs

2023-09-21 Thread Ard Biesheuvel
On Thu, 21 Sept 2023 at 10:50, Peter Maydell  wrote:
>
> Hi; I've been looking again at a very long standing missing feature in
> the QEMU virt board, which is that we only have one UART. One of the
> things that has stalled this in the past has been the odd behaviour of
> EDK2 if the DTB that QEMU passes it describes two UARTs.
>
> I'm going to describe the behaviour I see in more detail below, but to
> put the summary up front:
>  * EDK2 puts some debug output on one UART and some on the other
>(the exact arrangement depends on ordering of the dtb nodes)
>  * EDK2 doesn't look at either stdout-path or the serial* aliases,
>so its choices about how to use the UARTs differ from those
>made by the guest kernel it is booting (and it also seems to be
>iterating through the dtb in the opposite order to the kernel)
>
> The current proposal for adding a second UART is that it only happens
> if you explicitly add one on the command line (with a second "-serial
> something" option), so whatever we do won't break existing user
> setups. So we have scope for saying "if you want to use a second UART,
> you're going to want a newer EDK2 which handles it better". Exactly
> what "better" means here is up for grabs, but honouring stdout-path
> and the serial aliases would be the ideal I think. It would also be
> possible to select a particular ordering for the DTB nodes to produce
> "least-worst" behaviour from an existing EDK2 binary, but I'm not
> sure if that's worth doing.
>
> What do the EDK2 folks think about what the correct behaviour
> should be for a 2-UART setup?
>

Hi Peter,

Thanks for the elaborate analysis.

EDK2's DEBUG output is extremely noisy, so being able to redirect this
output to a different UART would be very useful.

The stdout-path is the intended console, and so we should honour that.
This also means that we should parse aliases. But the console is
actually configurable [persistently] via the UEFI menu, and so it would
be nice if we could take advantage of this flexibility. This means in
principle that the UARTs should be represented via different device
paths (which would include the base address so they are
distinguishable) with perhaps a magical alias which is the default and
is tied to whatever stdout-path points to. This way, all the logic we
introduce is spec compliant and reusable on physical platforms with
multiple UARTs.
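
To make that concrete, resolving the console along those lines could look
roughly like this (a sketch on top of plain libfdt, not existing EDK2 code;
handling of a ":options" suffix in stdout-path is omitted):

  #include <libfdt.h>

  /* Returns the node offset of the UART named by /chosen/stdout-path,
   * following an alias if that is what the property contains. */
  static int find_console_uart(const void *fdt)
  {
      int chosen = fdt_path_offset(fdt, "/chosen");
      const char *path;

      if (chosen < 0) {
          return chosen;
      }
      path = fdt_getprop(fdt, chosen, "stdout-path", NULL);
      if (path == NULL) {
          return -FDT_ERR_NOTFOUND;
      }
      if (path[0] != '/') {                      /* alias, e.g. "serial0" */
          path = fdt_get_alias(fdt, path);
          if (path == NULL) {
              return -FDT_ERR_NOTFOUND;
          }
      }
      return fdt_path_offset(fdt, path);
  }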

The DEBUG output is a different matter. On physical hardware, this is
typically configured at build time, as the info is needed extremely
early and on a physical platform, the debug port generally doesn't
change. Currently, we just grab the first UART that we encounter in
the DT, but the code paths used by the DEBUG output and the ordinary console
driver are mostly separate.
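
As an illustration of why this ends up being a build-time choice: the
lowest-level DEBUG path is typically a SerialPortLib that pokes a fixed MMIO
base. A minimal PL011 example, assuming edk2's UINTN/MmioRead32/MmioWrite32
and a made-up PcdDebugUartBase for the base address (SerialPortWrite itself
is the standard SerialPortLib entry point):

  #define UARTDR   0x000                         /* PL011 data register */
  #define UARTFR   0x018                         /* PL011 flag register */
  #define FR_TXFF  (1u << 5)                     /* transmit FIFO full  */

  UINTN
  EFIAPI
  SerialPortWrite (
    IN UINT8  *Buffer,
    IN UINTN  NumberOfBytes
    )
  {
    UINTN  Base = FixedPcdGet64 (PcdDebugUartBase);   /* fixed at build time */
    UINTN  Index;

    for (Index = 0; Index < NumberOfBytes; Index++) {
      while ((MmioRead32 (Base + UARTFR) & FR_TXFF) != 0) {
      }
      MmioWrite32 (Base + UARTDR, Buffer[Index]);
    }
    return NumberOfBytes;
  }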

What we might do is use stdout-path here as well, unless a certain DT alias
exists, perhaps? We should probably align here with other projects,
although a distinction of the same nature may not exist there.

-- 
Ard.



Re: [RFC 2/3] hw/arm/virt: Wire up non-secure EL2 virtual timer IRQ

2023-09-19 Thread Ard Biesheuvel
On Tue, 19 Sept 2023 at 12:12, Peter Maydell  wrote:
>
> Armv8.1+ CPUs have the Virtual Host Extension (VHE) which adds
> a non-secure EL2 virtual timer. We implemented the timer itself
> in the CPU model, but never wired up its IRQ line to the GIC.
>
> Wire up the IRQ line (this is always safe whether the CPU has the
> interrupt or not, since it always creates the outbound IRQ line).
> Report it to the guest via dtb and ACPI if the CPU has the feature.
>
> The DTB binding is documented in the kernel's
> Documentation/devicetree/bindings/timer/arm\,arch_timer.yaml
> and the ACPI table entries are documented in the ACPI
> specification version 6.3 or later.
>
> Signed-off-by: Peter Maydell 

As mentioned in reply to the cover letter, this needs the hunk below
to avoid using ACPI 6.3 features while claiming compatibility with
ACPI 6.0

With that added,

Reviewed-by: Ard Biesheuvel 


--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -811,10 +811,10 @@ build_madt(GArray *table_data, BIOSLinker
*linker, VirtMachineState *vms)
 static void build_fadt_rev6(GArray *table_data, BIOSLinker *linker,
 VirtMachineState *vms, unsigned dsdt_tbl_offset)
 {
-/* ACPI v6.0 */
+/* ACPI v6.3 */
 AcpiFadtData fadt = {
 .rev = 6,
-.minor_ver = 0,
+.minor_ver = 3,
 .flags = 1 << ACPI_FADT_F_HW_REDUCED_ACPI,
 .xdsdt_tbl_offset = &dsdt_tbl_offset,
 };


> ---
>  include/hw/arm/virt.h|  2 ++
>  hw/arm/virt-acpi-build.c | 16 
>  hw/arm/virt.c| 29 -
>  3 files changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index e1ddbea96be..79b1f9b737d 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -49,6 +49,7 @@
>  #define ARCH_TIMER_S_EL1_IRQ  13
>  #define ARCH_TIMER_NS_EL1_IRQ 14
>  #define ARCH_TIMER_NS_EL2_IRQ 10
> +#define ARCH_TIMER_NS_EL2_VIRT_IRQ 12
>
>  #define VIRTUAL_PMU_IRQ 7
>
> @@ -183,6 +184,7 @@ struct VirtMachineState {
>  PCIBus *bus;
>  char *oem_id;
>  char *oem_table_id;
> +bool ns_el2_virt_timer_present;
>  };
>
>  #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 6b674231c27..7bc120a0f13 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -573,8 +573,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
> VirtMachineState *vms)
>  }
>
>  /*
> - * ACPI spec, Revision 5.1
> - * 5.2.24 Generic Timer Description Table (GTDT)
> + * ACPI spec, Revision 6.5
> + * 5.2.25 Generic Timer Description Table (GTDT)
>   */
>  static void
>  build_gtdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> @@ -588,7 +588,7 @@ build_gtdt(GArray *table_data, BIOSLinker *linker, 
> VirtMachineState *vms)
>  uint32_t irqflags = vmc->claim_edge_triggered_timers ?
>  1 : /* Interrupt is Edge triggered */
>  0;  /* Interrupt is Level triggered  */
> -AcpiTable table = { .sig = "GTDT", .rev = 2, .oem_id = vms->oem_id,
> +AcpiTable table = { .sig = "GTDT", .rev = 3, .oem_id = vms->oem_id,
>  .oem_table_id = vms->oem_table_id };
>
>  acpi_table_begin(&table, table_data);
> @@ -624,7 +624,15 @@ build_gtdt(GArray *table_data, BIOSLinker *linker, 
> VirtMachineState *vms)
>  build_append_int_noprefix(table_data, 0, 4);
>  /* Platform Timer Offset */
>  build_append_int_noprefix(table_data, 0, 4);
> -
> +if (vms->ns_el2_virt_timer_present) {
> +/* Virtual EL2 Timer GSIV */
> +build_append_int_noprefix(table_data, ARCH_TIMER_NS_EL2_VIRT_IRQ + 
> 16, 4);
> +/* Virtual EL2 Timer Flags */
> +build_append_int_noprefix(table_data, irqflags, 4);
> +} else {
> +build_append_int_noprefix(table_data, 0, 4);
> +build_append_int_noprefix(table_data, 0, 4);
> +}
>  acpi_table_end(linker, &table);
>  }
>
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 8ad78b23c24..4df7cd0a366 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -248,6 +248,19 @@ static void create_randomness(MachineState *ms, const 
> char *node)
>  qemu_fdt_setprop(ms->fdt, node, "rng-seed", seed.rng, sizeof(seed.rng));
>  }
>
> +/*
> + * The CPU object always exposes the NS EL2 virt timer IRQ line,
> + * but we don't want to advertise it to the guest in the dtb or ACPI
> + * table unless it's really going to do something.
> + */
> +static bool ns_el2_virt_timer_present(void)
> +{
> +ARMCPU *cpu = ARM_CPU(qemu_get_cpu(0));
&

Re: [RFC 0/3] virt: wire up NS EL2 virtual timer IRQ

2023-09-19 Thread Ard Biesheuvel
Hi Peter,

On Tue, 19 Sept 2023 at 12:12, Peter Maydell  wrote:
>
> This patchset is an RFC that wires up the NS EL2 virtual timer IRQ on
> the virt board, similarly to what
> https://patchew.org/QEMU/20230913140610.214893-1-marcin.juszkiew...@linaro.org/
> does for the sbsa-ref board.
>
> Patches 1 and 3 are the usual dance to keep the ACPI unit tests happy
> with the change to the ACPI table contents; patch 2 is the meat.
>
> This is an RFC for two reasons:
>
> (1) I'm not very familiar with ACPI, and patch 2 needs to update the
> ACPI GTDT table to report the interrupt number.  In particular, this
> means we need the rev 3 of this table (present in ACPI spec 6.3 and
> later), not the rev 2 we currently report.  I'm not sure if it's
> permitted to rev just this table, or if we would need to upgrade all
> our ACPI tables to the newer spec first.
>

Using a newer revision of a table is fine as long as the FADT
major/minor fields match the spec version that introduced it. No need
to update all the other tables.

> (2) The change causes EDK2 (UEFI) to assert on startup:
> ASSERT [ArmTimerDxe] 
> /home/kraxel/projects/qemu/roms/edk2/ArmVirtPkg/Library/ArmVirtTimerFdtClientLib/ArmVirtTimerFdtClientLib.c(72):
>  PropSize == 36 || PropSize == 48
> This is because the EDK2 code that consumes the QEMU generated device
> tree blob is incorrectly insisting that the architectural-timer
> interrupts property has only 3 or 4 entries, so it falls over if
> given a dtb with the 5th entry for the EL2 virtual timer irq.  In
> particular this breaks the avocado test:
> machine_aarch64_virt.py:Aarch64VirtMachine.test_alpine_virt_tcg_gic_max
> I'm not entirely sure what to do about this -- we can get EDK2 fixed
> and update our own test case, but there's a lot of binaries out there
> in the wild that won't run if we just update the virt board the way
> this patchset does.  We could perhaps make the virt board change be
> dependent on machine type version, as a way to let users fall back to
> the old behaviour.
>

ASSERT()s only fire in DEBUG builds so the impact should be limited.


> I'm putting this patchset out on the list to get opinions and
> review on those two points.
>



Re: [PATCH v3 05/19] crypto: Add generic 16-bit carry-less multiply routines

2023-09-10 Thread Ard Biesheuvel
On Mon, 21 Aug 2023 at 18:19, Richard Henderson
 wrote:
>
> Signed-off-by: Richard Henderson 
> ---
>  include/crypto/clmul.h | 16 
>  crypto/clmul.c | 21 +
>  2 files changed, 37 insertions(+)
>
> diff --git a/include/crypto/clmul.h b/include/crypto/clmul.h
> index 153b5e3057..c7ad28aa85 100644
> --- a/include/crypto/clmul.h
> +++ b/include/crypto/clmul.h
> @@ -38,4 +38,20 @@ uint64_t clmul_8x4_odd(uint64_t, uint64_t);
>   */
>  uint64_t clmul_8x4_packed(uint32_t, uint32_t);
>
> +/**
> + * clmul_16x2_even:
> + *
> + * Perform two 16x16->32 carry-less multiplies.
> + * The odd words of the inputs are ignored.
> + */
> +uint64_t clmul_16x2_even(uint64_t, uint64_t);
> +
> +/**
> + * clmul_16x2_odd:
> + *
> + * Perform two 16x16->32 carry-less multiplies.
> + * The even bytes of the inputs are ignored.

even words

Reviewed-by: Ard Biesheuvel 


> + */
> +uint64_t clmul_16x2_odd(uint64_t, uint64_t);
> +
>  #endif /* CRYPTO_CLMUL_H */
> diff --git a/crypto/clmul.c b/crypto/clmul.c
> index 82d873fee5..2c87cfbf8a 100644
> --- a/crypto/clmul.c
> +++ b/crypto/clmul.c
> @@ -58,3 +58,24 @@ uint64_t clmul_8x4_packed(uint32_t n, uint32_t m)
>  {
>  return clmul_8x4_even_int(unpack_8_to_16(n), unpack_8_to_16(m));
>  }
> +
> +uint64_t clmul_16x2_even(uint64_t n, uint64_t m)
> +{
> +uint64_t r = 0;
> +
> > +n &= 0x0000ffff0000ffffull;
> > +m &= 0x0000ffff0000ffffull;
> +
> +for (int i = 0; i < 16; ++i) {
> > +uint64_t mask = (n & 0x0000000100000001ull) * 0xffffull;
> +r ^= m & mask;
> +n >>= 1;
> +m <<= 1;
> +}
> +return r;
> +}
> +
> +uint64_t clmul_16x2_odd(uint64_t n, uint64_t m)
> +{
> +return clmul_16x2_even(n >> 16, m >> 16);
> +}
> --
> 2.34.1
>



Re: [PATCH v3 01/19] crypto: Add generic 8-bit carry-less multiply routines

2023-09-10 Thread Ard Biesheuvel
On Mon, 21 Aug 2023 at 18:18, Richard Henderson
 wrote:
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Ard Biesheuvel 


> ---
>  include/crypto/clmul.h | 41 +
>  crypto/clmul.c | 60 ++
>  crypto/meson.build |  9 ---
>  3 files changed, 107 insertions(+), 3 deletions(-)
>  create mode 100644 include/crypto/clmul.h
>  create mode 100644 crypto/clmul.c
>
> diff --git a/include/crypto/clmul.h b/include/crypto/clmul.h
> new file mode 100644
> index 00..153b5e3057
> --- /dev/null
> +++ b/include/crypto/clmul.h
> @@ -0,0 +1,41 @@
> +/*
> + * Carry-less multiply operations.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Copyright (C) 2023 Linaro, Ltd.
> + */
> +
> +#ifndef CRYPTO_CLMUL_H
> +#define CRYPTO_CLMUL_H
> +
> +/**
> + * clmul_8x8_low:
> + *
> + * Perform eight 8x8->8 carry-less multiplies.
> + */
> +uint64_t clmul_8x8_low(uint64_t, uint64_t);
> +
> +/**
> + * clmul_8x4_even:
> + *
> + * Perform four 8x8->16 carry-less multiplies.
> + * The odd bytes of the inputs are ignored.
> + */
> +uint64_t clmul_8x4_even(uint64_t, uint64_t);
> +
> +/**
> + * clmul_8x4_odd:
> + *
> + * Perform four 8x8->16 carry-less multiplies.
> + * The even bytes of the inputs are ignored.
> + */
> +uint64_t clmul_8x4_odd(uint64_t, uint64_t);
> +
> +/**
> + * clmul_8x4_packed:
> + *
> + * Perform four 8x8->16 carry-less multiplies.
> + */
> +uint64_t clmul_8x4_packed(uint32_t, uint32_t);
> +
> +#endif /* CRYPTO_CLMUL_H */
> diff --git a/crypto/clmul.c b/crypto/clmul.c
> new file mode 100644
> index 00..82d873fee5
> --- /dev/null
> +++ b/crypto/clmul.c
> @@ -0,0 +1,60 @@
> +/*
> + * Carry-less multiply operations.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Copyright (C) 2023 Linaro, Ltd.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "crypto/clmul.h"
> +
> +uint64_t clmul_8x8_low(uint64_t n, uint64_t m)
> +{
> +uint64_t r = 0;
> +
> +for (int i = 0; i < 8; ++i) {
> +uint64_t mask = (n & 0x0101010101010101ull) * 0xff;
> +r ^= m & mask;
> +m = (m << 1) & 0xfefefefefefefefeull;
> +n >>= 1;
> +}
> +return r;
> +}
> +
> +static uint64_t clmul_8x4_even_int(uint64_t n, uint64_t m)
> +{
> +uint64_t r = 0;
> +
> +for (int i = 0; i < 8; ++i) {
> > +uint64_t mask = (n & 0x0001000100010001ull) * 0xffff;
> +r ^= m & mask;
> +n >>= 1;
> +m <<= 1;
> +}
> +return r;
> +}
> +
> +uint64_t clmul_8x4_even(uint64_t n, uint64_t m)
> +{
> +n &= 0x00ff00ff00ff00ffull;
> +m &= 0x00ff00ff00ff00ffull;
> +return clmul_8x4_even_int(n, m);
> +}
> +
> +uint64_t clmul_8x4_odd(uint64_t n, uint64_t m)
> +{
> +return clmul_8x4_even(n >> 8, m >> 8);
> +}
> +
> +static uint64_t unpack_8_to_16(uint64_t x)
> +{
> > +return  (x & 0x000000ff)
> > + | ((x & 0x0000ff00) << 8)
> > + | ((x & 0x00ff0000) << 16)
> > + | ((x & 0xff000000) << 24);
> +}
> +
> +uint64_t clmul_8x4_packed(uint32_t n, uint32_t m)
> +{
> +return clmul_8x4_even_int(unpack_8_to_16(n), unpack_8_to_16(m));
> +}
> diff --git a/crypto/meson.build b/crypto/meson.build
> index 5f03a30d34..9ac1a89802 100644
> --- a/crypto/meson.build
> +++ b/crypto/meson.build
> @@ -48,9 +48,12 @@ if have_afalg
>  endif
>  crypto_ss.add(when: gnutls, if_true: files('tls-cipher-suites.c'))
>
> -util_ss.add(files('sm4.c'))
> -util_ss.add(files('aes.c'))
> -util_ss.add(files('init.c'))
> +util_ss.add(files(
> +  'aes.c',
> +  'clmul.c',
> +  'init.c',
> +  'sm4.c',
> +))
>  if gnutls.found()
>util_ss.add(gnutls)
>  endif
> --
> 2.34.1
>



Re: [PATCH v3 18/19] host/include/i386: Implement clmul.h

2023-09-10 Thread Ard Biesheuvel
On Mon, 21 Aug 2023 at 18:19, Richard Henderson
 wrote:
>
> Detect PCLMUL in cpuinfo; implement the accel hook.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Ard Biesheuvel 

> ---
>  host/include/i386/host/cpuinfo.h|  1 +
>  host/include/i386/host/crypto/clmul.h   | 29 +
>  host/include/x86_64/host/crypto/clmul.h |  1 +
>  include/qemu/cpuid.h|  3 +++
>  util/cpuinfo-i386.c |  1 +
>  5 files changed, 35 insertions(+)
>  create mode 100644 host/include/i386/host/crypto/clmul.h
>  create mode 100644 host/include/x86_64/host/crypto/clmul.h
>
> diff --git a/host/include/i386/host/cpuinfo.h 
> b/host/include/i386/host/cpuinfo.h
> index 073d0a426f..7ae21568f7 100644
> --- a/host/include/i386/host/cpuinfo.h
> +++ b/host/include/i386/host/cpuinfo.h
> @@ -27,6 +27,7 @@
>  #define CPUINFO_ATOMIC_VMOVDQA  (1u << 16)
>  #define CPUINFO_ATOMIC_VMOVDQU  (1u << 17)
>  #define CPUINFO_AES (1u << 18)
> +#define CPUINFO_PCLMUL  (1u << 19)
>
>  /* Initialized with a constructor. */
>  extern unsigned cpuinfo;
> diff --git a/host/include/i386/host/crypto/clmul.h 
> b/host/include/i386/host/crypto/clmul.h
> new file mode 100644
> index 00..dc3c814797
> --- /dev/null
> +++ b/host/include/i386/host/crypto/clmul.h
> @@ -0,0 +1,29 @@
> +/*
> + * x86 specific clmul acceleration.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef X86_HOST_CRYPTO_CLMUL_H
> +#define X86_HOST_CRYPTO_CLMUL_H
> +
> +#include "host/cpuinfo.h"
> +#include 
> +
> +#if defined(__PCLMUL__)
> +# define HAVE_CLMUL_ACCEL  true
> +# define ATTR_CLMUL_ACCEL
> +#else
> +# define HAVE_CLMUL_ACCEL  likely(cpuinfo & CPUINFO_PCLMUL)
> +# define ATTR_CLMUL_ACCEL  __attribute__((target("pclmul")))
> +#endif
> +
> +static inline Int128 ATTR_CLMUL_ACCEL
> +clmul_64_accel(uint64_t n, uint64_t m)
> +{
> +union { __m128i v; Int128 s; } u;
> +
> +u.v = _mm_clmulepi64_si128(_mm_set_epi64x(0, n), _mm_set_epi64x(0, m), 
> 0);
> +return u.s;
> +}
> +
> +#endif /* X86_HOST_CRYPTO_CLMUL_H */
> diff --git a/host/include/x86_64/host/crypto/clmul.h 
> b/host/include/x86_64/host/crypto/clmul.h
> new file mode 100644
> index 00..f25eced416
> --- /dev/null
> +++ b/host/include/x86_64/host/crypto/clmul.h
> @@ -0,0 +1 @@
> +#include "host/include/i386/host/crypto/clmul.h"
> diff --git a/include/qemu/cpuid.h b/include/qemu/cpuid.h
> index 35325f1995..b11161555b 100644
> --- a/include/qemu/cpuid.h
> +++ b/include/qemu/cpuid.h
> @@ -25,6 +25,9 @@
>  #endif
>
>  /* Leaf 1, %ecx */
> +#ifndef bit_PCLMUL
> +#define bit_PCLMUL  (1 << 1)
> +#endif
>  #ifndef bit_SSE4_1
>  #define bit_SSE4_1  (1 << 19)
>  #endif
> diff --git a/util/cpuinfo-i386.c b/util/cpuinfo-i386.c
> index 3a7b7e0ad1..36783fd199 100644
> --- a/util/cpuinfo-i386.c
> +++ b/util/cpuinfo-i386.c
> @@ -39,6 +39,7 @@ unsigned __attribute__((constructor)) cpuinfo_init(void)
>  info |= (c & bit_SSE4_1 ? CPUINFO_SSE4 : 0);
>  info |= (c & bit_MOVBE ? CPUINFO_MOVBE : 0);
>  info |= (c & bit_POPCNT ? CPUINFO_POPCNT : 0);
> +info |= (c & bit_PCLMUL ? CPUINFO_PCLMUL : 0);
>
>  /* Our AES support requires PSHUFB as well. */
>  info |= ((c & bit_AES) && (c & bit_SSSE3) ? CPUINFO_AES : 0);
> --
> 2.34.1
>



Re: [PATCH v3 13/19] crypto: Add generic 64-bit carry-less multiply routine

2023-09-10 Thread Ard Biesheuvel
On Mon, 21 Aug 2023 at 18:19, Richard Henderson
 wrote:
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Ard Biesheuvel 

> ---
>  host/include/generic/host/crypto/clmul.h | 15 +++
>  include/crypto/clmul.h   | 19 +++
>  crypto/clmul.c   | 18 ++
>  3 files changed, 52 insertions(+)
>  create mode 100644 host/include/generic/host/crypto/clmul.h
>
> diff --git a/host/include/generic/host/crypto/clmul.h 
> b/host/include/generic/host/crypto/clmul.h
> new file mode 100644
> index 00..915bfb88d3
> --- /dev/null
> +++ b/host/include/generic/host/crypto/clmul.h
> @@ -0,0 +1,15 @@
> +/*
> + * No host specific carry-less multiply acceleration.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef GENERIC_HOST_CRYPTO_CLMUL_H
> +#define GENERIC_HOST_CRYPTO_CLMUL_H
> +
> +#define HAVE_CLMUL_ACCEL  false
> +#define ATTR_CLMUL_ACCEL
> +
> +Int128 clmul_64_accel(uint64_t, uint64_t)
> +QEMU_ERROR("unsupported accel");
> +
> +#endif /* GENERIC_HOST_CRYPTO_CLMUL_H */
> diff --git a/include/crypto/clmul.h b/include/crypto/clmul.h
> index 0ea25a252c..c82d2d7559 100644
> --- a/include/crypto/clmul.h
> +++ b/include/crypto/clmul.h
> @@ -8,6 +8,9 @@
>  #ifndef CRYPTO_CLMUL_H
>  #define CRYPTO_CLMUL_H
>
> +#include "qemu/int128.h"
> +#include "host/crypto/clmul.h"
> +
>  /**
>   * clmul_8x8_low:
>   *
> @@ -61,4 +64,20 @@ uint64_t clmul_16x2_odd(uint64_t, uint64_t);
>   */
>  uint64_t clmul_32(uint32_t, uint32_t);
>
> +/**
> + * clmul_64:
> + *
> + * Perform a 64x64->128 carry-less multiply.
> + */
> +Int128 clmul_64_gen(uint64_t, uint64_t);
> +
> +static inline Int128 clmul_64(uint64_t a, uint64_t b)
> +{
> +if (HAVE_CLMUL_ACCEL) {
> +return clmul_64_accel(a, b);
> +} else {
> +return clmul_64_gen(a, b);
> +}
> +}
> +
>  #endif /* CRYPTO_CLMUL_H */
> diff --git a/crypto/clmul.c b/crypto/clmul.c
> index 36ada1be9d..abf79cc49a 100644
> --- a/crypto/clmul.c
> +++ b/crypto/clmul.c
> @@ -92,3 +92,21 @@ uint64_t clmul_32(uint32_t n, uint32_t m32)
>  }
>  return r;
>  }
> +
> +Int128 clmul_64_gen(uint64_t n, uint64_t m)
> +{
> +uint64_t rl = 0, rh = 0;
> +
> +/* Bit 0 can only influence the low 64-bit result.  */
> +if (n & 1) {
> +rl = m;
> +}
> +
> +for (int i = 1; i < 64; ++i) {
> +uint64_t mask = -(n & 1);
> +rl ^= (m << i) & mask;
> +rh ^= (m >> (64 - i)) & mask;
> +n >>= 1;
> +}
> +return int128_make128(rl, rh);
> +}
> --
> 2.34.1
>



Re: [PATCH v3 09/19] crypto: Add generic 32-bit carry-less multiply routines

2023-09-10 Thread Ard Biesheuvel
On Mon, 21 Aug 2023 at 18:19, Richard Henderson
 wrote:
>
> Signed-off-by: Richard Henderson 

Replied to v2 by accident:

Reviewed-by: Ard Biesheuvel 


> ---
>  include/crypto/clmul.h |  7 +++
>  crypto/clmul.c | 13 +
>  2 files changed, 20 insertions(+)
>
> diff --git a/include/crypto/clmul.h b/include/crypto/clmul.h
> index c7ad28aa85..0ea25a252c 100644
> --- a/include/crypto/clmul.h
> +++ b/include/crypto/clmul.h
> @@ -54,4 +54,11 @@ uint64_t clmul_16x2_even(uint64_t, uint64_t);
>   */
>  uint64_t clmul_16x2_odd(uint64_t, uint64_t);
>
> +/**
> + * clmul_32:
> + *
> + * Perform a 32x32->64 carry-less multiply.
> + */
> +uint64_t clmul_32(uint32_t, uint32_t);
> +
>  #endif /* CRYPTO_CLMUL_H */
> diff --git a/crypto/clmul.c b/crypto/clmul.c
> index 2c87cfbf8a..36ada1be9d 100644
> --- a/crypto/clmul.c
> +++ b/crypto/clmul.c
> @@ -79,3 +79,16 @@ uint64_t clmul_16x2_odd(uint64_t n, uint64_t m)
>  {
>  return clmul_16x2_even(n >> 16, m >> 16);
>  }
> +
> +uint64_t clmul_32(uint32_t n, uint32_t m32)
> +{
> +uint64_t r = 0;
> +uint64_t m = m32;
> +
> +for (int i = 0; i < 32; ++i) {
> +r ^= n & 1 ? m : 0;
> +n >>= 1;
> +m <<= 1;
> +}
> +return r;
> +}
> --
> 2.34.1
>



Re: [PATCH v2 09/18] crypto: Add generic 32-bit carry-less multiply routines

2023-09-10 Thread Ard Biesheuvel
On Sat, 19 Aug 2023 at 03:02, Richard Henderson
 wrote:
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Ard Biesheuvel 

> ---
>  include/crypto/clmul.h |  7 +++
>  crypto/clmul.c | 13 +
>  2 files changed, 20 insertions(+)
>
> diff --git a/include/crypto/clmul.h b/include/crypto/clmul.h
> index c7ad28aa85..0ea25a252c 100644
> --- a/include/crypto/clmul.h
> +++ b/include/crypto/clmul.h
> @@ -54,4 +54,11 @@ uint64_t clmul_16x2_even(uint64_t, uint64_t);
>   */
>  uint64_t clmul_16x2_odd(uint64_t, uint64_t);
>
> +/**
> + * clmul_32:
> + *
> + * Perform a 32x32->64 carry-less multiply.
> + */
> +uint64_t clmul_32(uint32_t, uint32_t);
> +
>  #endif /* CRYPTO_CLMUL_H */
> diff --git a/crypto/clmul.c b/crypto/clmul.c
> index 2c87cfbf8a..36ada1be9d 100644
> --- a/crypto/clmul.c
> +++ b/crypto/clmul.c
> @@ -79,3 +79,16 @@ uint64_t clmul_16x2_odd(uint64_t n, uint64_t m)
>  {
>  return clmul_16x2_even(n >> 16, m >> 16);
>  }
> +
> +uint64_t clmul_32(uint32_t n, uint32_t m32)
> +{
> +uint64_t r = 0;
> +uint64_t m = m32;
> +
> +for (int i = 0; i < 32; ++i) {
> +r ^= n & 1 ? m : 0;
> +n >>= 1;
> +m <<= 1;
> +}
> +return r;
> +}
> --
> 2.34.1
>



Re: [PATCH v2 05/18] crypto: Add generic 16-bit carry-less multiply routines

2023-09-10 Thread Ard Biesheuvel
On Sat, 19 Aug 2023 at 03:02, Richard Henderson
 wrote:
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Ard Biesheuvel 

> ---
>  include/crypto/clmul.h | 16 
>  crypto/clmul.c | 21 +
>  2 files changed, 37 insertions(+)
>
> diff --git a/include/crypto/clmul.h b/include/crypto/clmul.h
> index 153b5e3057..c7ad28aa85 100644
> --- a/include/crypto/clmul.h
> +++ b/include/crypto/clmul.h
> @@ -38,4 +38,20 @@ uint64_t clmul_8x4_odd(uint64_t, uint64_t);
>   */
>  uint64_t clmul_8x4_packed(uint32_t, uint32_t);
>
> +/**
> + * clmul_16x2_even:
> + *
> + * Perform two 16x16->32 carry-less multiplies.
> + * The odd words of the inputs are ignored.
> + */
> +uint64_t clmul_16x2_even(uint64_t, uint64_t);
> +
> +/**
> + * clmul_16x2_odd:
> + *
> + * Perform two 16x16->32 carry-less multiplies.
> + * The even bytes of the inputs are ignored.
> + */
> +uint64_t clmul_16x2_odd(uint64_t, uint64_t);
> +
>  #endif /* CRYPTO_CLMUL_H */
> diff --git a/crypto/clmul.c b/crypto/clmul.c
> index 82d873fee5..2c87cfbf8a 100644
> --- a/crypto/clmul.c
> +++ b/crypto/clmul.c
> @@ -58,3 +58,24 @@ uint64_t clmul_8x4_packed(uint32_t n, uint32_t m)
>  {
>  return clmul_8x4_even_int(unpack_8_to_16(n), unpack_8_to_16(m));
>  }
> +
> +uint64_t clmul_16x2_even(uint64_t n, uint64_t m)
> +{
> +uint64_t r = 0;
> +
> > +n &= 0x0000ffff0000ffffull;
> > +m &= 0x0000ffff0000ffffull;
> +
> +for (int i = 0; i < 16; ++i) {
> > +uint64_t mask = (n & 0x0000000100000001ull) * 0xffffull;
> +r ^= m & mask;
> +n >>= 1;
> +m <<= 1;
> +}
> +return r;
> +}
> +
> +uint64_t clmul_16x2_odd(uint64_t n, uint64_t m)
> +{
> +return clmul_16x2_even(n >> 16, m >> 16);
> +}
> --
> 2.34.1
>



Re: [PATCH v2 01/18] crypto: Add generic 8-bit carry-less multiply routines

2023-09-10 Thread Ard Biesheuvel
On Sat, 19 Aug 2023 at 03:02, Richard Henderson
 wrote:
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Ard Biesheuvel 

> ---
>  include/crypto/clmul.h | 41 +
>  crypto/clmul.c | 60 ++
>  crypto/meson.build |  9 ---
>  3 files changed, 107 insertions(+), 3 deletions(-)
>  create mode 100644 include/crypto/clmul.h
>  create mode 100644 crypto/clmul.c
>
> diff --git a/include/crypto/clmul.h b/include/crypto/clmul.h
> new file mode 100644
> index 00..153b5e3057
> --- /dev/null
> +++ b/include/crypto/clmul.h
> @@ -0,0 +1,41 @@
> +/*
> + * Carry-less multiply operations.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Copyright (C) 2023 Linaro, Ltd.
> + */
> +
> +#ifndef CRYPTO_CLMUL_H
> +#define CRYPTO_CLMUL_H
> +
> +/**
> + * clmul_8x8_low:
> + *
> + * Perform eight 8x8->8 carry-less multiplies.
> + */
> +uint64_t clmul_8x8_low(uint64_t, uint64_t);
> +
> +/**
> + * clmul_8x4_even:
> + *
> + * Perform four 8x8->16 carry-less multiplies.
> + * The odd bytes of the inputs are ignored.
> + */
> +uint64_t clmul_8x4_even(uint64_t, uint64_t);
> +
> +/**
> + * clmul_8x4_odd:
> + *
> + * Perform four 8x8->16 carry-less multiplies.
> + * The even bytes of the inputs are ignored.
> + */
> +uint64_t clmul_8x4_odd(uint64_t, uint64_t);
> +
> +/**
> + * clmul_8x4_packed:
> + *
> + * Perform four 8x8->16 carry-less multiplies.
> + */
> +uint64_t clmul_8x4_packed(uint32_t, uint32_t);
> +
> +#endif /* CRYPTO_CLMUL_H */
> diff --git a/crypto/clmul.c b/crypto/clmul.c
> new file mode 100644
> index 00..82d873fee5
> --- /dev/null
> +++ b/crypto/clmul.c
> @@ -0,0 +1,60 @@
> +/*
> + * Carry-less multiply operations.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Copyright (C) 2023 Linaro, Ltd.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "crypto/clmul.h"
> +
> +uint64_t clmul_8x8_low(uint64_t n, uint64_t m)
> +{
> +uint64_t r = 0;
> +
> +for (int i = 0; i < 8; ++i) {
> +uint64_t mask = (n & 0x0101010101010101ull) * 0xff;
> +r ^= m & mask;
> +m = (m << 1) & 0xfefefefefefefefeull;
> +n >>= 1;
> +}
> +return r;
> +}
> +
> +static uint64_t clmul_8x4_even_int(uint64_t n, uint64_t m)
> +{
> +uint64_t r = 0;
> +
> +for (int i = 0; i < 8; ++i) {
> > +uint64_t mask = (n & 0x0001000100010001ull) * 0xffff;
> +r ^= m & mask;
> +n >>= 1;
> +m <<= 1;
> +}
> +return r;
> +}
> +
> +uint64_t clmul_8x4_even(uint64_t n, uint64_t m)
> +{
> +n &= 0x00ff00ff00ff00ffull;
> +m &= 0x00ff00ff00ff00ffull;
> +return clmul_8x4_even_int(n, m);
> +}
> +
> +uint64_t clmul_8x4_odd(uint64_t n, uint64_t m)
> +{
> +return clmul_8x4_even(n >> 8, m >> 8);
> +}
> +
> +static uint64_t unpack_8_to_16(uint64_t x)
> +{
> > +return  (x & 0x000000ff)
> > + | ((x & 0x0000ff00) << 8)
> > + | ((x & 0x00ff0000) << 16)
> > + | ((x & 0xff000000) << 24);
> +}
> +
> +uint64_t clmul_8x4_packed(uint32_t n, uint32_t m)
> +{
> +return clmul_8x4_even_int(unpack_8_to_16(n), unpack_8_to_16(m));
> +}
> diff --git a/crypto/meson.build b/crypto/meson.build
> index 5f03a30d34..9ac1a89802 100644
> --- a/crypto/meson.build
> +++ b/crypto/meson.build
> @@ -48,9 +48,12 @@ if have_afalg
>  endif
>  crypto_ss.add(when: gnutls, if_true: files('tls-cipher-suites.c'))
>
> -util_ss.add(files('sm4.c'))
> -util_ss.add(files('aes.c'))
> -util_ss.add(files('init.c'))
> +util_ss.add(files(
> +  'aes.c',
> +  'clmul.c',
> +  'init.c',
> +  'sm4.c',
> +))
>  if gnutls.found()
>util_ss.add(gnutls)
>  endif
> --
> 2.34.1
>



[PATCH v2] target/riscv: Use accelerated helper for AES64KS1I

2023-08-31 Thread Ard Biesheuvel
Use the accelerated SubBytes/ShiftRows/AddRoundKey AES helper to
implement the first half of the key schedule derivation. This does not
actually involve shifting rows, so clone the same value into all four
columns of the AES vector to counter that operation.

Cc: Richard Henderson 
Cc: Philippe Mathieu-Daudé 
Cc: Palmer Dabbelt 
Cc: Alistair Francis 
Signed-off-by: Ard Biesheuvel 
---
v2: assign round constant to elements 0 and 1 only
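
A standalone illustration of the trick (not part of the patch): when every
column of the state holds the same word, each row is a single repeated byte,
so ShiftRows, being a per-row rotation, is the identity, and SB_SR_AK
degenerates to SubBytes plus AddRoundKey.

  #include <assert.h>
  #include <stdint.h>
  #include <string.h>

  static void shift_rows(uint8_t s[4][4])        /* s[row][col] */
  {
      for (int r = 1; r < 4; r++) {
          uint8_t t[4];
          for (int c = 0; c < 4; c++) {
              t[c] = s[r][(c + r) % 4];          /* rotate row r left by r */
          }
          memcpy(s[r], t, 4);
      }
  }

  int main(void)
  {
      uint32_t w = 0xdeadbeef;                   /* arbitrary column word */
      uint8_t s[4][4], orig[4][4];

      for (int c = 0; c < 4; c++) {              /* clone w into each column */
          for (int r = 0; r < 4; r++) {
              s[r][c] = (uint8_t)(w >> (8 * r));
          }
      }
      memcpy(orig, s, sizeof(s));
      shift_rows(s);
      assert(memcmp(orig, s, sizeof(s)) == 0);   /* ShiftRows was a no-op */
      return 0;
  }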

 target/riscv/crypto_helper.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/target/riscv/crypto_helper.c b/target/riscv/crypto_helper.c
index 4d65945429c6dcc4..bb084e00efe52d1b 100644
--- a/target/riscv/crypto_helper.c
+++ b/target/riscv/crypto_helper.c
@@ -148,24 +148,17 @@ target_ulong HELPER(aes64ks1i)(target_ulong rs1, 
target_ulong rnum)
 
 uint8_t enc_rnum = rnum;
 uint32_t temp = (RS1 >> 32) & 0xFFFFFFFF;
-uint8_t rcon_ = 0;
-target_ulong result;
+AESState t, rc = {};
 
 if (enc_rnum != 0xA) {
 temp = ror32(temp, 8); /* Rotate right by 8 */
-rcon_ = round_consts[enc_rnum];
+rc.w[0] = rc.w[1] = round_consts[enc_rnum];
 }
 
-temp = ((uint32_t)AES_sbox[(temp >> 24) & 0xFF] << 24) |
-   ((uint32_t)AES_sbox[(temp >> 16) & 0xFF] << 16) |
-   ((uint32_t)AES_sbox[(temp >> 8) & 0xFF] << 8) |
-   ((uint32_t)AES_sbox[(temp >> 0) & 0xFF] << 0);
+t.w[0] = t.w[1] = t.w[2] = t.w[3] = temp;
+aesenc_SB_SR_AK(, , , false);
 
-temp ^= rcon_;
-
-result = ((uint64_t)temp << 32) | temp;
-
-return result;
+return t.d[0];
 }
 
 target_ulong HELPER(aes64im)(target_ulong rs1)
-- 
2.39.2




Re: [PATCH v3 00/19] crypto: Provide clmul.h and host accel

2023-08-21 Thread Ard Biesheuvel
On Mon, 21 Aug 2023 at 18:18, Richard Henderson
 wrote:
>
> Inspired by Ard Biesheuvel's RFC patches [1] for accelerating
> carry-less multiply under emulation.
>
> Changes for v3:
>   * Update target/i386 ops_sse.h.
>   * Apply r-b.
>
> Changes for v2:
>   * Only accelerate clmul_64; keep generic helpers for other sizes.
>   * Drop most of the Int128 interfaces, except for clmul_64.
>   * Use the same acceleration format as aes-round.h.
>
>
> r~
>
>
> [1] https://patchew.org/QEMU/20230601123332.3297404-1-a...@kernel.org/
>
>
> Richard Henderson (19):
>   crypto: Add generic 8-bit carry-less multiply routines
>   target/arm: Use clmul_8* routines
>   target/s390x: Use clmul_8* routines
>   target/ppc: Use clmul_8* routines
>   crypto: Add generic 16-bit carry-less multiply routines
>   target/arm: Use clmul_16* routines
>   target/s390x: Use clmul_16* routines
>   target/ppc: Use clmul_16* routines
>   crypto: Add generic 32-bit carry-less multiply routines
>   target/arm: Use clmul_32* routines
>   target/s390x: Use clmul_32* routines
>   target/ppc: Use clmul_32* routines
>   crypto: Add generic 64-bit carry-less multiply routine
>   target/arm: Use clmul_64
>   target/i386: Use clmul_64
>   target/s390x: Use clmul_64
>   target/ppc: Use clmul_64
>   host/include/i386: Implement clmul.h
>   host/include/aarch64: Implement clmul.h
>

OK, I did the OpenSSL benchmark this time, using an x86_64 cross build
on arm64/ThunderX2, and the speedup is 7x (\o/)

Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 



Distro qemu (no acceleration):

$ qemu-x86_64 --version
qemu-x86_64 version 7.2.4 (Debian 1:7.2+dfsg-7+deb12u1)

$ apps/openssl speed -evp aes-128-gcm
version: 3.2.0-dev
built on: Mon Aug 21 17:57:37 2023 UTC
options: bn(64,64)
compiler: x86_64-linux-gnu-gcc -pthread -m64 -Wa,--noexecstack -Wall
-O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_BUILDING_OPENSSL
-DNDEBUG
CPUINFO: OPENSSL_ia32cap=0xfed8320b0fcbfffd:0x8001020c01d843a9
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes      64 bytes     256 bytes    1024 bytes    8192 bytes   16384 bytes
AES-128-GCM      8856.13k     13820.95k     17375.49k     16826.37k     16870.06k     17208.66k


QEMU built with this series applied onto latest master:

$ ~/build/qemu/build/qemu-x86_64 apps/openssl speed -evp aes-128-gcm
version: 3.2.0-dev
built on: Mon Aug 21 17:57:37 2023 UTC
options: bn(64,64)
compiler: x86_64-linux-gnu-gcc -pthread -m64 -Wa,--noexecstack -Wall
-O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_BUILDING_OPENSSL
-DNDEBUG
CPUINFO: OPENSSL_ia32cap=0xfffa320b0fcbfffd:0x8041020c01dc47a9
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes      64 bytes     256 bytes    1024 bytes    8192 bytes   16384 bytes
AES-128-GCM     14237.01k     34176.34k     70633.13k     97372.84k    119668.74k    122049.88k



Re: [PATCH v2 00/18] crypto: Provide clmul.h and host accel

2023-08-21 Thread Ard Biesheuvel
On Mon, 21 Aug 2023 at 17:15, Richard Henderson
 wrote:
>
> On 8/21/23 07:57, Ard Biesheuvel wrote:
> >> Richard Henderson (18):
> >>crypto: Add generic 8-bit carry-less multiply routines
> >>target/arm: Use clmul_8* routines
> >>target/s390x: Use clmul_8* routines
> >>target/ppc: Use clmul_8* routines
> >>crypto: Add generic 16-bit carry-less multiply routines
> >>target/arm: Use clmul_16* routines
> >>target/s390x: Use clmul_16* routines
> >>target/ppc: Use clmul_16* routines
> >>crypto: Add generic 32-bit carry-less multiply routines
> >>target/arm: Use clmul_32* routines
> >>target/s390x: Use clmul_32* routines
> >>target/ppc: Use clmul_32* routines
> >>crypto: Add generic 64-bit carry-less multiply routine
> >>target/arm: Use clmul_64
> >>target/s390x: Use clmul_64
> >>target/ppc: Use clmul_64
> >>host/include/i386: Implement clmul.h
> >>host/include/aarch64: Implement clmul.h
> >>
> >
> > I didn't re-run the OpenSSL benchmark, but the x86 Linux kernel still
> > passes all its crypto selftests when running under TCG emulation on a
> > TX2 arm64 host, so
> >
> > Tested-by: Ard Biesheuvel 
>
> Oh, whoops.  What's missing here?  Any target/i386 changes.
>

Ah yes - I hadn't spotted that. The below seems to do the trick.

--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2156,7 +2156,10 @@ void glue(helper_pclmulqdq, SUFFIX)(CPUX86State
*env, Reg *d, Reg *v, Reg *s,
 for (i = 0; i < 1 << SHIFT; i += 2) {
 a = v->Q(((ctrl & 1) != 0) + i);
 b = s->Q(((ctrl & 16) != 0) + i);
-clmulq(&d->Q(i), &d->Q(i + 1), a, b);
+
+Int128 r = clmul_64(a, b);
+d->Q(i) = int128_getlo(r);
+d->Q(i + 1) = int128_gethi(r);
 }
 }

[and the #include added and clmulq() dropped]

I did a quick RFC4106 benchmark with tcrypt (which doesn't speed up as
much as OpenSSL but it is a bit of a hassle cross-rebuilding that)

no acceleration:

tcrypt: test 7 (160 bit key, 8192 byte blocks): 1547 operations in 1
seconds (12673024 bytes)

AES only:

tcrypt: test 7 (160 bit key, 8192 byte blocks): 1679 operations in 1
seconds (13754368 bytes)

AES and PMULL

tcrypt: test 7 (160 bit key, 8192 byte blocks): 3298 operations in 1
seconds (27017216 bytes)



Re: [PATCH v2 00/18] crypto: Provide clmul.h and host accel

2023-08-21 Thread Ard Biesheuvel
On Sat, 19 Aug 2023 at 03:02, Richard Henderson
 wrote:
>
> Inspired by Ard Biesheuvel's RFC patches [1] for accelerating
> carry-less multiply under emulation.
>
> Changes for v2:
>   * Only accelerate clmul_64; keep generic helpers for other sizes.
>   * Drop most of the Int128 interfaces, except for clmul_64.
>   * Use the same acceleration format as aes-round.h.
>
>
> r~
>
>
> [1] https://patchew.org/QEMU/20230601123332.3297404-1-a...@kernel.org/
>
> Richard Henderson (18):
>   crypto: Add generic 8-bit carry-less multiply routines
>   target/arm: Use clmul_8* routines
>   target/s390x: Use clmul_8* routines
>   target/ppc: Use clmul_8* routines
>   crypto: Add generic 16-bit carry-less multiply routines
>   target/arm: Use clmul_16* routines
>   target/s390x: Use clmul_16* routines
>   target/ppc: Use clmul_16* routines
>   crypto: Add generic 32-bit carry-less multiply routines
>   target/arm: Use clmul_32* routines
>   target/s390x: Use clmul_32* routines
>   target/ppc: Use clmul_32* routines
>   crypto: Add generic 64-bit carry-less multiply routine
>   target/arm: Use clmul_64
>   target/s390x: Use clmul_64
>   target/ppc: Use clmul_64
>   host/include/i386: Implement clmul.h
>   host/include/aarch64: Implement clmul.h
>

I didn't re-run the OpenSSL benchmark, but the x86 Linux kernel still
passes all its crypto selftests when running under TCG emulation on a
TX2 arm64 host, so

Tested-by: Ard Biesheuvel 

for the series.

Thanks,
Ard.



Re: [PATCH] target/riscv: Use accelerated helper for AES64KS1I

2023-08-07 Thread Ard Biesheuvel
(cc riscv maintainers)

On Mon, 31 Jul 2023 at 11:39, Ard Biesheuvel  wrote:
>
> Use the accelerated SubBytes/ShiftRows/AddRoundKey AES helper to
> implement the first half of the key schedule derivation. This does not
> actually involve shifting rows, so clone the same uint32_t 4 times into
> the AES vector to counter that.
>
> Cc: Richard Henderson 
> Cc: Philippe Mathieu-Daudé 
> Signed-off-by: Ard Biesheuvel 
> ---
>  target/riscv/crypto_helper.c | 17 +
>  1 file changed, 5 insertions(+), 12 deletions(-)
>
> diff --git a/target/riscv/crypto_helper.c b/target/riscv/crypto_helper.c
> index 4d65945429c6dcc4..257c5c4863fb160f 100644
> --- a/target/riscv/crypto_helper.c
> +++ b/target/riscv/crypto_helper.c
> @@ -148,24 +148,17 @@ target_ulong HELPER(aes64ks1i)(target_ulong rs1, 
> target_ulong rnum)
>
>  uint8_t enc_rnum = rnum;
>  uint32_t temp = (RS1 >> 32) & 0xFFFFFFFF;
> -uint8_t rcon_ = 0;
> -target_ulong result;
> +AESState t, rc = {};
>
>  if (enc_rnum != 0xA) {
>  temp = ror32(temp, 8); /* Rotate right by 8 */
> -rcon_ = round_consts[enc_rnum];
> +rc.w[0] = rc.w[1] = rc.w[2] = rc.w[3] = round_consts[enc_rnum];

This can be simplified to

rc.w[0] = rc.w[1] = round_consts[enc_rnum];

(only t.d[0] is returned below, so lanes 2 and 3 of rc never affect the
result)


>  }
>
> -temp = ((uint32_t)AES_sbox[(temp >> 24) & 0xFF] << 24) |
> -   ((uint32_t)AES_sbox[(temp >> 16) & 0xFF] << 16) |
> -   ((uint32_t)AES_sbox[(temp >> 8) & 0xFF] << 8) |
> -   ((uint32_t)AES_sbox[(temp >> 0) & 0xFF] << 0);
> +t.w[0] = t.w[1] = t.w[2] = t.w[3] = temp;
> +aesenc_SB_SR_AK(&t, &t, &rc, false);
>
> -temp ^= rcon_;
> -
> -result = ((uint64_t)temp << 32) | temp;
> -
> -return result;
> +return t.d[0];
>  }
>
>  target_ulong HELPER(aes64im)(target_ulong rs1)
> --
> 2.39.2
>



Re: [RFC PATCH for-8.2 00/18] crypto: Provide clmul.h and host accel

2023-08-03 Thread Ard Biesheuvel
On Thu, 13 Jul 2023 at 23:14, Richard Henderson
 wrote:
>
> Inspired by Ard Biesheuvel's RFC patches [1] for accelerating
> carry-less multiply under emulation.
>
> This is less polished than the AES patch set:
>
> (1) Should I split HAVE_CLMUL_ACCEL into per-width HAVE_CLMUL{N}_ACCEL?
> The "_generic" and "_accel" split is different from aes-round.h
> because of the difference in support for different widths, and it
> means that each host accel has more boilerplate.
>
> (2) Should I bother trying to accelerate anything other than 64x64->128?

That is the only compelling use case afaict.

> That seems to be the one that GCM really wants anyway.  I'd keep all
> of the sizes implemented generically, since that centralizes the 3
> target implementations.
>
> (3) The use of Int128 isn't fantastic -- better would be a vector type,
> though that has its own special problems for ppc64le (see the
> endianness hoops within aes-round.h).  Perhaps leave things in
> env memory, like I was mostly able to do with AES?
>
> (4) No guest test case(s).
>
>
> r~
>
>
> [1] https://patchew.org/QEMU/20230601123332.3297404-1-a...@kernel.org/
>
> Richard Henderson (18):
>   crypto: Add generic 8-bit carry-less multiply routines
>   target/arm: Use clmul_8* routines
>   target/s390x: Use clmul_8* routines
>   target/ppc: Use clmul_8* routines
>   crypto: Add generic 16-bit carry-less multiply routines
>   target/arm: Use clmul_16* routines
>   target/s390x: Use clmul_16* routines
>   target/ppc: Use clmul_16* routines
>   crypto: Add generic 32-bit carry-less multiply routines
>   target/arm: Use clmul_32* routines
>   target/s390x: Use clmul_32* routines
>   target/ppc: Use clmul_32* routines
>   crypto: Add generic 64-bit carry-less multiply routine
>   target/arm: Use clmul_64
>   target/s390x: Use clmul_64
>   target/ppc: Use clmul_64
>   host/include/i386: Implement clmul.h
>   host/include/aarch64: Implement clmul.h
>
>  host/include/aarch64/host/cpuinfo.h  |   1 +
>  host/include/aarch64/host/crypto/clmul.h | 230 +++
>  host/include/generic/host/crypto/clmul.h |  28 +++
>  host/include/i386/host/cpuinfo.h |   1 +
>  host/include/i386/host/crypto/clmul.h| 187 ++
>  host/include/x86_64/host/crypto/clmul.h  |   1 +
>  include/crypto/clmul.h   | 123 
>  target/arm/tcg/vec_internal.h|  11 --
>  crypto/clmul.c   | 163 
>  target/arm/tcg/mve_helper.c  |  16 +-
>  target/arm/tcg/vec_helper.c  | 112 ++-
>  target/ppc/int_helper.c  |  63 +++
>  target/s390x/tcg/vec_int_helper.c| 175 +++--
>  util/cpuinfo-aarch64.c   |   4 +-
>  util/cpuinfo-i386.c  |   1 +
>  crypto/meson.build   |   9 +-
>  16 files changed, 865 insertions(+), 260 deletions(-)
>  create mode 100644 host/include/aarch64/host/crypto/clmul.h
>  create mode 100644 host/include/generic/host/crypto/clmul.h
>  create mode 100644 host/include/i386/host/crypto/clmul.h
>  create mode 100644 host/include/x86_64/host/crypto/clmul.h
>  create mode 100644 include/crypto/clmul.h
>  create mode 100644 crypto/clmul.c
>
> --
> 2.34.1
>



[PATCH] target/riscv: Use accelerated helper for AES64KS1I

2023-07-31 Thread Ard Biesheuvel
Use the accelerated SubBytes/ShiftRows/AddRoundKey AES helper to
implement the first half of the key schedule derivation. This does not
actually involve shifting rows, so clone the same uint32_t 4 times into
the AES vector to counter that.

Cc: Richard Henderson 
Cc: Philippe Mathieu-Daudé 
Signed-off-by: Ard Biesheuvel 
---
 target/riscv/crypto_helper.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/target/riscv/crypto_helper.c b/target/riscv/crypto_helper.c
index 4d65945429c6dcc4..257c5c4863fb160f 100644
--- a/target/riscv/crypto_helper.c
+++ b/target/riscv/crypto_helper.c
@@ -148,24 +148,17 @@ target_ulong HELPER(aes64ks1i)(target_ulong rs1, 
target_ulong rnum)
 
 uint8_t enc_rnum = rnum;
 uint32_t temp = (RS1 >> 32) & 0xFFFFFFFF;
-uint8_t rcon_ = 0;
-target_ulong result;
+AESState t, rc = {};
 
 if (enc_rnum != 0xA) {
 temp = ror32(temp, 8); /* Rotate right by 8 */
-rcon_ = round_consts[enc_rnum];
+rc.w[0] = rc.w[1] = rc.w[2] = rc.w[3] = round_consts[enc_rnum];
 }
 
-temp = ((uint32_t)AES_sbox[(temp >> 24) & 0xFF] << 24) |
-   ((uint32_t)AES_sbox[(temp >> 16) & 0xFF] << 16) |
-   ((uint32_t)AES_sbox[(temp >> 8) & 0xFF] << 8) |
-   ((uint32_t)AES_sbox[(temp >> 0) & 0xFF] << 0);
+t.w[0] = t.w[1] = t.w[2] = t.w[3] = temp;
+aesenc_SB_SR_AK(&t, &t, &rc, false);
 
-temp ^= rcon_;
-
-result = ((uint64_t)temp << 32) | temp;
-
-return result;
+return t.d[0];
 }
 
 target_ulong HELPER(aes64im)(target_ulong rs1)
-- 
2.39.2




[PATCH v2] target/riscv: Use existing lookup tables for MixColumns

2023-07-31 Thread Ard Biesheuvel
The AES MixColumns and InvMixColumns operations are relatively
expensive 4x4 matrix multiplications in GF(2^8), which is why C
implementations usually rely on precomputed lookup tables rather than
performing the calculations on demand.

Given that we already carry those tables in QEMU, we can just grab the
right value in the implementation of the RISC-V AES32 instructions. Note
that the tables in question are permuted according to the respective
Sbox, so we can omit the Sbox lookup as well in this case.

Cc: Richard Henderson 
Cc: Philippe Mathieu-Daudé 
Cc: Zewen Ye 
Cc: Weiwei Li 
Cc: Junqiang Wang 
Signed-off-by: Ard Biesheuvel 
---
v2:
- ignore host endianness and use be32_to_cpu() unconditionally

 crypto/aes.c |  4 +--
 include/crypto/aes.h |  7 
 target/riscv/crypto_helper.c | 34 +++-
 3 files changed, 13 insertions(+), 32 deletions(-)

diff --git a/crypto/aes.c b/crypto/aes.c
index 836d7d5c0bf1b392..df4362ac6022eac2 100644
--- a/crypto/aes.c
+++ b/crypto/aes.c
@@ -272,7 +272,7 @@ AES_Td3[x] = Si[x].[09, 0d, 0b, 0e];
 AES_Td4[x] = Si[x].[01, 01, 01, 01];
 */
 
-static const uint32_t AES_Te0[256] = {
+const uint32_t AES_Te0[256] = {
 0xc66363a5U, 0xf87c7c84U, 0xee777799U, 0xf67b7b8dU,
 0xfff2f20dU, 0xd66b6bbdU, 0xde6f6fb1U, 0x91c5c554U,
 0x60303050U, 0x02010103U, 0xce6767a9U, 0x562b2b7dU,
@@ -607,7 +607,7 @@ static const uint32_t AES_Te4[256] = {
 0xb0b0b0b0U, 0x54545454U, 0xbbbbbbbbU, 0x16161616U,
 };
 
-static const uint32_t AES_Td0[256] = {
+const uint32_t AES_Td0[256] = {
 0x51f4a750U, 0x7e416553U, 0x1a17a4c3U, 0x3a275e96U,
 0x3bab6bcbU, 0x1f9d45f1U, 0xacfa58abU, 0x4be30393U,
 0x2030fa55U, 0xad766df6U, 0x88cc7691U, 0xf5024c25U,
diff --git a/include/crypto/aes.h b/include/crypto/aes.h
index 709d4d226bfe158b..381f24c9022d2aa8 100644
--- a/include/crypto/aes.h
+++ b/include/crypto/aes.h
@@ -30,4 +30,11 @@ void AES_decrypt(const unsigned char *in, unsigned char *out,
 extern const uint8_t AES_sbox[256];
 extern const uint8_t AES_isbox[256];
 
+/*
+AES_Te0[x] = S [x].[02, 01, 01, 03];
+AES_Td0[x] = Si[x].[0e, 09, 0d, 0b];
+*/
+
+extern const uint32_t AES_Te0[256], AES_Td0[256];
+
 #endif
diff --git a/target/riscv/crypto_helper.c b/target/riscv/crypto_helper.c
index 99d85a618843e87e..4d65945429c6dcc4 100644
--- a/target/riscv/crypto_helper.c
+++ b/target/riscv/crypto_helper.c
@@ -25,29 +25,6 @@
 #include "crypto/aes-round.h"
 #include "crypto/sm4.h"
 
-#define AES_XTIME(a) \
-((a << 1) ^ ((a & 0x80) ? 0x1b : 0))
-
-#define AES_GFMUL(a, b) (( \
-(((b) & 0x1) ? (a) : 0) ^ \
-(((b) & 0x2) ? AES_XTIME(a) : 0) ^ \
-(((b) & 0x4) ? AES_XTIME(AES_XTIME(a)) : 0) ^ \
-(((b) & 0x8) ? AES_XTIME(AES_XTIME(AES_XTIME(a))) : 0)) & 0xFF)
-
-static inline uint32_t aes_mixcolumn_byte(uint8_t x, bool fwd)
-{
-uint32_t u;
-
-if (fwd) {
-u = (AES_GFMUL(x, 3) << 24) | (x << 16) | (x << 8) |
-(AES_GFMUL(x, 2) << 0);
-} else {
-u = (AES_GFMUL(x, 0xb) << 24) | (AES_GFMUL(x, 0xd) << 16) |
-(AES_GFMUL(x, 0x9) << 8) | (AES_GFMUL(x, 0xe) << 0);
-}
-return u;
-}
-
 #define sext32_xlen(x) (target_ulong)(int32_t)(x)
 
 static inline target_ulong aes32_operation(target_ulong shamt,
@@ -55,23 +32,20 @@ static inline target_ulong aes32_operation(target_ulong 
shamt,
bool enc, bool mix)
 {
 uint8_t si = rs2 >> shamt;
-uint8_t so;
 uint32_t mixed;
 target_ulong res;
 
 if (enc) {
-so = AES_sbox[si];
 if (mix) {
-mixed = aes_mixcolumn_byte(so, true);
+mixed = be32_to_cpu(AES_Te0[si]);
 } else {
-mixed = so;
+mixed = AES_sbox[si];
 }
 } else {
-so = AES_isbox[si];
 if (mix) {
-mixed = aes_mixcolumn_byte(so, false);
+mixed = be32_to_cpu(AES_Td0[si]);
 } else {
-mixed = so;
+mixed = AES_isbox[si];
 }
 }
 mixed = rol32(mixed, shamt);
-- 
2.39.2




Re: [RFC PATCH] target/i386: Truncate ESP when exiting from long mode

2023-07-31 Thread Ard Biesheuvel
On Wed, 26 Jul 2023 at 17:01, Richard Henderson
 wrote:
>
> On 7/26/23 01:17, Ard Biesheuvel wrote:
> > While working on some EFI boot changes for Linux/x86, I noticed that TCG 
> > deviates from
> > bare metal when it comes to how it handles the value of the stack pointer 
> > register RSP
> > when dropping out of long mode.
> >
> > On bare metal, RSP is truncated to 32 bits, even if the code that runs in 
> > 32-bit
> > protected mode never uses the stack at all (and uses a long jump rather 
> > than long
> > return to switch back to long mode). This means 64-bit code cannot rely on 
> > RSP
> > surviving any excursions into 32-bit protected mode (with paging disabled).
> >
> > Let's align TCG with this behavior, so that code that relies on RSP 
> > retaining its value
> > does not inadvertently work while bare metal does not.
> >
> > Observed on Intel Ice Lake cores.
> >
> > Cc: Paolo Bonzini
> > Cc: Richard Henderson
> > Cc: Eduardo Habkost
> > Link: https://lore.kernel.org/all/20230711091453.2543622-11-a...@kernel.org/
> > Signed-off-by: Ard Biesheuvel
> > ---
> > I used this patch locally to reproduce an issue that was reported on Ice
> > Lake but didn't trigger in my QEMU testing.
> >
> > Hints welcome on where the architectural behavior is specified, and in 
> > particular,
> > whether or not other 64-bit GPRs can be relied upon to preserve their full 
> > 64-bit
> > length values.
>
> No idea about chapter and verse, but it has the feel of being part and parcel 
> with the
> truncation of eip.  While esp is always special, I suspect that none of the 
> GPRs can be
> relied on carrying all bits.
>
> I'm happy with the change though, since similar behaviour can be observed on 
> hw.
>
> Acked-by: Richard Henderson 
>

I experimented with truncating all GPRs that exist in 32-bit mode, and
this actually breaks kexec on Linux if it happens to load the kernel
above 4G (which it appears to do reproducibly when sufficient memory
is available)

This is due to the 4/5 level paging switch trampoline, which is called
while RBX, RBP and RSI are live and refer to assets in memory that may
reside above 4G.

I am fixing that code, but it does mean we should probably limit this
change to ESP (as apparently, current hw only happens to truncate ESP
but no other GPRs)



Re: [RFC PATCH] target/i386: Truncate ESP when exiting from long mode

2023-07-28 Thread Ard Biesheuvel
On Fri, 28 Jul 2023 at 02:17, Richard Henderson
 wrote:
>
> On 7/27/23 14:36, Ard Biesheuvel wrote:
> > On Thu, 27 Jul 2023 at 19:56, Richard Henderson
> >  wrote:
> >>
> >> On 7/26/23 08:01, Richard Henderson wrote:
> >>> On 7/26/23 01:17, Ard Biesheuvel wrote:
> >>>> Hints welcome on where the architectural behavior is specified, and in 
> >>>> particular,
> >>>> whether or not other 64-bit GPRs can be relied upon to preserve their 
> >>>> full 64-bit
> >>>> length values.
> >>>
> >>> No idea about chapter and verse, but it has the feel of being part and 
> >>> parcel with the
> >>> truncation of eip.  While esp is always special, I suspect that none of 
> >>> the GPRs can be
> >>> relied on carrying all bits.
> >>
> >> Coincidentally, I was having a gander at the newly announced APX extension 
> >> [1],
> >> and happened across
> >>
> >> 3.1.4.1.2 Extended GPR Access (Direct and Indirect)
> >>
> >>   ... Entering/leaving 64-bit mode via traditional (explicit)
> >>   control flow does not directly alter the content of the EGPRs
> >>   (EGPRs behave similar to R8-R15 in this regard).
> >>
> >> which suggests to me that the 8 low registers are squashed to 32-bit
> >> on transition to 32-bit IA-32e mode.
> >>
> >> I still have not found similar language in the main architecture manual.
> >>
> >
> > Interesting - that matches my observations on those Ice Lake cores:
> > RSP will be truncated, but preserving/restoring it to/from R8 across
> > the exit from long mode works fine.
>
> Found it:
>
> Volume 1 Basic Architecture
> 3.4.1.1 General-Purpose Registers in 64-Bit Mode
>
> # Registers only available in 64-bit mode (R8-R15 and XMM8-XMM15)
> # are preserved across transitions from 64-bit mode into compatibility mode
> # then back into 64-bit mode. However, values of R8-R15 and XMM8-XMM15 are
> # undefined after transitions from 64-bit mode through compatibility mode
> # to legacy or real mode and then back through compatibility mode to 64-bit 
> mode.
>

Thanks. Not what I was hoping though ...



Re: [RFC PATCH] target/i386: Truncate ESP when exiting from long mode

2023-07-27 Thread Ard Biesheuvel
On Thu, 27 Jul 2023 at 19:56, Richard Henderson
 wrote:
>
> On 7/26/23 08:01, Richard Henderson wrote:
> > On 7/26/23 01:17, Ard Biesheuvel wrote:
> >> Hints welcome on where the architectural behavior is specified, and in 
> >> particular,
> >> whether or not other 64-bit GPRs can be relied upon to preserve their full 
> >> 64-bit
> >> length values.
> >
> > No idea about chapter and verse, but it has the feel of being part and 
> > parcel with the
> > truncation of eip.  While esp is always special, I suspect that none of the 
> > GPRs can be
> > relied on carrying all bits.
>
> Coincidentally, I was having a gander at the newly announced APX extension 
> [1],
> and happened across
>
> 3.1.4.1.2 Extended GPR Access (Direct and Indirect)
>
>  ... Entering/leaving 64-bit mode via traditional (explicit)
>  control flow does not directly alter the content of the EGPRs
>  (EGPRs behave similar to R8-R15 in this regard).
>
> which suggests to me that the 8 low registers are squashed to 32-bit
> on transition to 32-bit IA-32e mode.
>
> I still have not found similar language in the main architecture manual.
>

Interesting - that matches my observations on those Ice Lake cores:
RSP will be truncated, but preserving/restoring it to/from R8 across
the exit from long mode works fine.



[PATCH] target/riscv: Use existing lookup tables for AES MixColumns

2023-07-27 Thread Ard Biesheuvel
The AES MixColumns and InvMixColumns operations are relatively
expensive 4x4 matrix multiplications in GF(2^8), which is why C
implementations usually rely on precomputed lookup tables rather than
performing the calculations on demand.

Given that we already carry those tables in QEMU, we can just grab the
right value in the implementation of the RISC-V AES32 instructions. Note
that the tables in question are permuted according to the respective
Sbox, so we can omit the Sbox lookup as well in this case.

Cc: Richard Henderson 
Cc: Philippe Mathieu-Daudé 
Cc: Zewen Ye 
Cc: Weiwei Li 
Cc: Junqiang Wang 
Signed-off-by: Ard Biesheuvel 
---
 crypto/aes.c |  5 ++--
 include/crypto/aes.h |  7 +
 target/riscv/crypto_helper.c | 30 
 3 files changed, 14 insertions(+), 28 deletions(-)

diff --git a/crypto/aes.c b/crypto/aes.c
index 836d7d5c0bf1b392..27d7e1a22dfe8c74 100644
--- a/crypto/aes.c
+++ b/crypto/aes.c
@@ -272,7 +272,7 @@ AES_Td3[x] = Si[x].[09, 0d, 0b, 0e];
 AES_Td4[x] = Si[x].[01, 01, 01, 01];
 */
 
-static const uint32_t AES_Te0[256] = {
+const uint32_t AES_Te0[256] = {
 0xc66363a5U, 0xf87c7c84U, 0xee777799U, 0xf67b7b8dU,
 0xfff2f20dU, 0xd66b6bbdU, 0xde6f6fb1U, 0x91c5c554U,
 0x60303050U, 0x02010103U, 0xce6767a9U, 0x562b2b7dU,
@@ -606,8 +606,7 @@ static const uint32_t AES_Te4[256] = {
 0x41414141U, 0x99999999U, 0x2d2d2d2dU, 0x0f0f0f0fU,
 0xb0b0b0b0U, 0x54545454U, 0xbbbbbbbbU, 0x16161616U,
 };
-
-static const uint32_t AES_Td0[256] = {
+const uint32_t AES_Td0[256] = {
 0x51f4a750U, 0x7e416553U, 0x1a17a4c3U, 0x3a275e96U,
 0x3bab6bcbU, 0x1f9d45f1U, 0xacfa58abU, 0x4be30393U,
 0x2030fa55U, 0xad766df6U, 0x88cc7691U, 0xf5024c25U,
diff --git a/include/crypto/aes.h b/include/crypto/aes.h
index 709d4d226bfe158b..381f24c9022d2aa8 100644
--- a/include/crypto/aes.h
+++ b/include/crypto/aes.h
@@ -30,4 +30,11 @@ void AES_decrypt(const unsigned char *in, unsigned char *out,
 extern const uint8_t AES_sbox[256];
 extern const uint8_t AES_isbox[256];
 
+/*
+AES_Te0[x] = S [x].[02, 01, 01, 03];
+AES_Td0[x] = Si[x].[0e, 09, 0d, 0b];
+*/
+
+extern const uint32_t AES_Te0[256], AES_Td0[256];
+
 #endif
diff --git a/target/riscv/crypto_helper.c b/target/riscv/crypto_helper.c
index 99d85a618843e87e..40f95c71cef45877 100644
--- a/target/riscv/crypto_helper.c
+++ b/target/riscv/crypto_helper.c
@@ -25,29 +25,6 @@
 #include "crypto/aes-round.h"
 #include "crypto/sm4.h"
 
-#define AES_XTIME(a) \
-((a << 1) ^ ((a & 0x80) ? 0x1b : 0))
-
-#define AES_GFMUL(a, b) (( \
-(((b) & 0x1) ? (a) : 0) ^ \
-(((b) & 0x2) ? AES_XTIME(a) : 0) ^ \
-(((b) & 0x4) ? AES_XTIME(AES_XTIME(a)) : 0) ^ \
-(((b) & 0x8) ? AES_XTIME(AES_XTIME(AES_XTIME(a))) : 0)) & 0xFF)
-
-static inline uint32_t aes_mixcolumn_byte(uint8_t x, bool fwd)
-{
-uint32_t u;
-
-if (fwd) {
-u = (AES_GFMUL(x, 3) << 24) | (x << 16) | (x << 8) |
-(AES_GFMUL(x, 2) << 0);
-} else {
-u = (AES_GFMUL(x, 0xb) << 24) | (AES_GFMUL(x, 0xd) << 16) |
-(AES_GFMUL(x, 0x9) << 8) | (AES_GFMUL(x, 0xe) << 0);
-}
-return u;
-}
-
 #define sext32_xlen(x) (target_ulong)(int32_t)(x)
 
 static inline target_ulong aes32_operation(target_ulong shamt,
@@ -62,18 +39,21 @@ static inline target_ulong aes32_operation(target_ulong 
shamt,
 if (enc) {
 so = AES_sbox[si];
 if (mix) {
-mixed = aes_mixcolumn_byte(so, true);
+mixed = AES_Te0[si];
 } else {
 mixed = so;
 }
 } else {
 so = AES_isbox[si];
 if (mix) {
-mixed = aes_mixcolumn_byte(so, false);
+mixed = AES_Td0[si];
 } else {
 mixed = so;
 }
 }
+if (!HOST_BIG_ENDIAN && mix) {
+mixed = bswap32(mixed);
+}
 mixed = rol32(mixed, shamt);
 res = rs1 ^ mixed;
 
-- 
2.39.2




[RFC PATCH] target/i386: Truncate ESP when exiting from long mode

2023-07-26 Thread Ard Biesheuvel
While working on some EFI boot changes for Linux/x86, I noticed that TCG
deviates from bare metal when it comes to how it handles the value of
the stack pointer register RSP when dropping out of long mode.

On bare metal, RSP is truncated to 32 bits, even if the code that runs
in 32-bit protected mode never uses the stack at all (and uses a long
jump rather than long return to switch back to long mode). This means
64-bit code cannot rely on RSP surviving any excursions into 32-bit
protected mode (with paging disabled).

Let's align TCG with this behavior, so that code that relies on RSP
retaining its value does not inadvertently work while bare metal does
not.

Observed on Intel Ice Lake cores.

Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Link: https://lore.kernel.org/all/20230711091453.2543622-11-a...@kernel.org/
Signed-off-by: Ard Biesheuvel 
---
I used this patch locally to reproduce an issue that was reported on Ice
Lake but didn't trigger in my QEMU testing.

Hints welcome on where the architectural behavior is specified, and in
particular, whether or not other 64-bit GPRs can be relied upon to
preserve their full 64-bit length values.

 target/i386/helper.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/i386/helper.c b/target/i386/helper.c
index 89aa696c6d53d68c..a338da23a87746ed 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -149,6 +149,7 @@ void cpu_x86_update_cr0(CPUX86State *env, uint32_t new_cr0)
 env->efer &= ~MSR_EFER_LMA;
 env->hflags &= ~(HF_LMA_MASK | HF_CS64_MASK);
 env->eip &= 0xffffffff;
+env->regs[R_ESP] &= 0xffffffff;
 }
 #endif
 env->cr[0] = new_cr0 | CR0_ET_MASK;
-- 
2.39.2




Re: [PATCH v2 05/38] crypto/aes: Add constants for ShiftRows, InvShiftRows

2023-06-29 Thread Ard Biesheuvel
On Fri, 9 Jun 2023 at 04:24, Richard Henderson
 wrote:
>
> These symbols will avoid the indirection through memory
> when fully unrolling some new primitives.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Richard Henderson 
> ---
>  crypto/aes.c | 50 --
>  1 file changed, 48 insertions(+), 2 deletions(-)
>
> diff --git a/crypto/aes.c b/crypto/aes.c
> index 67bb74b8e3..cdf937883d 100644
> --- a/crypto/aes.c
> +++ b/crypto/aes.c
> @@ -108,12 +108,58 @@ const uint8_t AES_isbox[256] = {
>  0xE1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0C, 0x7D,
>  };
>
> +/* AES ShiftRows, for complete unrolling. */
> +enum {
> +AES_SH_0 = 0x0,
> +AES_SH_1 = 0x5,
> +AES_SH_2 = 0xa,
> +AES_SH_3 = 0xf,
> +AES_SH_4 = 0x4,
> +AES_SH_5 = 0x9,
> +AES_SH_6 = 0xe,
> +AES_SH_7 = 0x3,
> +AES_SH_8 = 0x8,
> +AES_SH_9 = 0xd,
> +AES_SH_A = 0x2,
> +AES_SH_B = 0x7,
> +AES_SH_C = 0xc,
> +AES_SH_D = 0x1,
> +AES_SH_E = 0x6,
> +AES_SH_F = 0xb,
> +};
> +

We might simplify this further by doing

#define AES_SH(n)  (((n) * 5) % 16)
#define AES_ISH(n)  (((n) * 13) % 16)
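
i.e. something along these lines (untested, just to illustrate the
suggestion - not code from the patch), which keeps the table contents
identical while making the *5 / *13 stride obvious:

#include <stdint.h>

#define AES_SH(n)   (((n) * 5) % 16)
#define AES_ISH(n)  (((n) * 13) % 16)

const uint8_t AES_shifts[16] = {
    AES_SH(0),  AES_SH(1),  AES_SH(2),  AES_SH(3),
    AES_SH(4),  AES_SH(5),  AES_SH(6),  AES_SH(7),
    AES_SH(8),  AES_SH(9),  AES_SH(10), AES_SH(11),
    AES_SH(12), AES_SH(13), AES_SH(14), AES_SH(15),
};

const uint8_t AES_ishifts[16] = {
    AES_ISH(0),  AES_ISH(1),  AES_ISH(2),  AES_ISH(3),
    AES_ISH(4),  AES_ISH(5),  AES_ISH(6),  AES_ISH(7),
    AES_ISH(8),  AES_ISH(9),  AES_ISH(10), AES_ISH(11),
    AES_ISH(12), AES_ISH(13), AES_ISH(14), AES_ISH(15),
};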

>  const uint8_t AES_shifts[16] = {
> -0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11
> +AES_SH_0, AES_SH_1, AES_SH_2, AES_SH_3,
> +AES_SH_4, AES_SH_5, AES_SH_6, AES_SH_7,
> +AES_SH_8, AES_SH_9, AES_SH_A, AES_SH_B,
> +AES_SH_C, AES_SH_D, AES_SH_E, AES_SH_F,
> +};
> +
> +/* AES InvShiftRows, for complete unrolling. */
> +enum {
> +AES_ISH_0 = 0x0,
> +AES_ISH_1 = 0xd,
> +AES_ISH_2 = 0xa,
> +AES_ISH_3 = 0x7,
> +AES_ISH_4 = 0x4,
> +AES_ISH_5 = 0x1,
> +AES_ISH_6 = 0xe,
> +AES_ISH_7 = 0xb,
> +AES_ISH_8 = 0x8,
> +AES_ISH_9 = 0x5,
> +AES_ISH_A = 0x2,
> +AES_ISH_B = 0xf,
> +AES_ISH_C = 0xc,
> +AES_ISH_D = 0x9,
> +AES_ISH_E = 0x6,
> +AES_ISH_F = 0x3,
>  };
>
>  const uint8_t AES_ishifts[16] = {
> -0, 13, 10, 7, 4, 1, 14, 11, 8, 5, 2, 15, 12, 9, 6, 3
> +AES_ISH_0, AES_ISH_1, AES_ISH_2, AES_ISH_3,
> +AES_ISH_4, AES_ISH_5, AES_ISH_6, AES_ISH_7,
> +AES_ISH_8, AES_ISH_9, AES_ISH_A, AES_ISH_B,
> +AES_ISH_C, AES_ISH_D, AES_ISH_E, AES_ISH_F,
>  };
>
>  /*
> --
> 2.34.1
>



Re: [PATCH 00/35] crypto: Provide aes-round.h and host accel

2023-06-04 Thread Ard Biesheuvel
On Sat, 3 Jun 2023 at 15:23, Ard Biesheuvel  wrote:
>
> On Sat, 3 Jun 2023 at 04:34, Richard Henderson
>  wrote:
> >
> > Inspired by Ard Biesheuvel's RFC patches for accelerating AES
> > under emulation, provide a set of primitives that maps between
> > the guest and host fragments.
> >
> > There is a small guest correctness test case.
> >
> > I think the end result is quite a bit cleaner, since the logic
> > is now centralized, rather than spread across 4 different guests.
> >
> > Further work could clean up crypto/aes.c itself to use these
> > instead of the tables directly.  I'm sure that's just an ultimate
> > fallback when an appropriate system library is not available, and
> > so not terribly important, but it could still significantly reduce
> > the amount of code we carry.
> >
> > I would imagine structuring a polynomial multiplication header
> > in a similar way.  There are 4 or 5 versions of those spread across
> > the different guests.
> >
> > Anyway, please review.
> >
> >
> > r~
> >
> >
> > Richard Henderson (35):
> >   tests/multiarch: Add test-aes
> >   target/arm: Move aesmc and aesimc tables to crypto/aes.c
> >   crypto/aes: Add constants for ShiftRows, InvShiftRows
> >   crypto: Add aesenc_SB_SR
> >   target/i386: Use aesenc_SB_SR
> >   target/arm: Demultiplex AESE and AESMC
> >   target/arm: Use aesenc_SB_SR
> >   target/ppc: Use aesenc_SB_SR
> >   target/riscv: Use aesenc_SB_SR
> >   crypto: Add aesdec_ISB_ISR
> >   target/i386: Use aesdec_ISB_ISR
> >   target/arm: Use aesdec_ISB_ISR
> >   target/ppc: Use aesdec_ISB_ISR
> >   target/riscv: Use aesdec_ISB_ISR
> >   crypto: Add aesenc_MC
> >   target/arm: Use aesenc_MC
> >   crypto: Add aesdec_IMC
> >   target/i386: Use aesdec_IMC
> >   target/arm: Use aesdec_IMC
> >   target/riscv: Use aesdec_IMC
> >   crypto: Add aesenc_SB_SR_MC_AK
> >   target/i386: Use aesenc_SB_SR_MC_AK
> >   target/ppc: Use aesenc_SB_SR_MC_AK
> >   target/riscv: Use aesenc_SB_SR_MC_AK
> >   crypto: Add aesdec_ISB_ISR_IMC_AK
> >   target/i386: Use aesdec_ISB_ISR_IMC_AK
> >   target/riscv: Use aesdec_ISB_ISR_IMC_AK
> >   crypto: Add aesdec_ISB_ISR_AK_IMC
> >   target/ppc: Use aesdec_ISB_ISR_AK_IMC
> >   host/include/i386: Implement aes-round.h
> >   host/include/aarch64: Implement aes-round.h
> >   crypto: Remove AES_shifts, AES_ishifts
> >   crypto: Implement aesdec_IMC with AES_imc_rot
> >   crypto: Remove AES_imc
> >   crypto: Unexport AES_*_rot, AES_TeN, AES_TdN
> >
>
> This is looking very good - it is clearly a much better abstraction
> than what I proposed, and I'd expect the performance boost to be the
> same.

Benchmark results for OpenSSL running in emulation on TX2:

Without acceleration:

$ ../qemu/build/qemu-x86_64 apps/openssl speed -evp aes-128-ctr
version: 3.2.0-dev
built on: Thu Jun  1 17:06:09 2023 UTC
options: bn(64,64)
compiler: x86_64-linux-gnu-gcc -pthread -m64 -Wa,--noexecstack -Wall
-O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_BUILDING_OPENSSL
-DNDEBUG
CPUINFO: OPENSSL_ia32cap=0xfed8320b0fcbfffd:0x8001020c01d843a9
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes   16384 bytes
AES-128-CTR      25146.07k    50482.19k    69373.44k    76236.80k    78391.98k    78381.06k


With acceleration:

$ ../qemu/build/qemu-x86_64 apps/openssl speed -evp aes-128-ctr
version: 3.2.0-dev
built on: Thu Jun  1 17:06:09 2023 UTC
options: bn(64,64)
compiler: x86_64-linux-gnu-gcc -pthread -m64 -Wa,--noexecstack -Wall
-O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_BUILDING_OPENSSL
-DNDEBUG
CPUINFO: OPENSSL_ia32cap=0xfed8320b0fcbfffd:0x8001020c01d843a9
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes256 bytes   1024 bytes
8192 bytes  16384 bytes
AES-128-CTR  28774.46k81173.59k   162346.24k   206301.53k
224214.22k   225600.56k



Re: [PATCH 00/35] crypto: Provide aes-round.h and host accel

2023-06-03 Thread Ard Biesheuvel
On Sat, 3 Jun 2023 at 04:34, Richard Henderson
 wrote:
>
> Inspired by Ard Biesheuvel's RFC patches for accelerating AES
> under emulation, provide a set of primitives that maps between
> the guest and host fragments.
>
> There is a small guest correctness test case.
>
> I think the end result is quite a bit cleaner, since the logic
> is now centralized, rather than spread across 4 different guests.
>
> Further work could clean up crypto/aes.c itself to use these
> instead of the tables directly.  I'm sure that's just an ultimate
> fallback when an appropriate system library is not available, and
> so not terribly important, but it could still significantly reduce
> the amount of code we carry.
>
> I would imagine structuring a polynomial multiplication header
> in a similar way.  There are 4 or 5 versions of those spread across
> the different guests.
>
> Anyway, please review.
>
>
> r~
>
>
> Richard Henderson (35):
>   tests/multiarch: Add test-aes
>   target/arm: Move aesmc and aesimc tables to crypto/aes.c
>   crypto/aes: Add constants for ShiftRows, InvShiftRows
>   crypto: Add aesenc_SB_SR
>   target/i386: Use aesenc_SB_SR
>   target/arm: Demultiplex AESE and AESMC
>   target/arm: Use aesenc_SB_SR
>   target/ppc: Use aesenc_SB_SR
>   target/riscv: Use aesenc_SB_SR
>   crypto: Add aesdec_ISB_ISR
>   target/i386: Use aesdec_ISB_ISR
>   target/arm: Use aesdec_ISB_ISR
>   target/ppc: Use aesdec_ISB_ISR
>   target/riscv: Use aesdec_ISB_ISR
>   crypto: Add aesenc_MC
>   target/arm: Use aesenc_MC
>   crypto: Add aesdec_IMC
>   target/i386: Use aesdec_IMC
>   target/arm: Use aesdec_IMC
>   target/riscv: Use aesdec_IMC
>   crypto: Add aesenc_SB_SR_MC_AK
>   target/i386: Use aesenc_SB_SR_MC_AK
>   target/ppc: Use aesenc_SB_SR_MC_AK
>   target/riscv: Use aesenc_SB_SR_MC_AK
>   crypto: Add aesdec_ISB_ISR_IMC_AK
>   target/i386: Use aesdec_ISB_ISR_IMC_AK
>   target/riscv: Use aesdec_ISB_ISR_IMC_AK
>   crypto: Add aesdec_ISB_ISR_AK_IMC
>   target/ppc: Use aesdec_ISB_ISR_AK_IMC
>   host/include/i386: Implement aes-round.h
>   host/include/aarch64: Implement aes-round.h
>   crypto: Remove AES_shifts, AES_ishifts
>   crypto: Implement aesdec_IMC with AES_imc_rot
>   crypto: Remove AES_imc
>   crypto: Unexport AES_*_rot, AES_TeN, AES_TdN
>

This is looking very good - it is clearly a much better abstraction
than what I proposed, and I'd expect the performance boost to be the
same.



Re: [PATCH 31/35] host/include/aarch64: Implement aes-round.h

2023-06-03 Thread Ard Biesheuvel
On Sat, 3 Jun 2023 at 04:34, Richard Henderson
 wrote:
>
> Detect AES in cpuinfo; implement the accel hooks.
>
> Signed-off-by: Richard Henderson 
> ---
>  host/include/aarch64/host/aes-round.h | 204 ++
>  host/include/aarch64/host/cpuinfo.h   |   1 +
>  util/cpuinfo-aarch64.c|   2 +
>  3 files changed, 207 insertions(+)
>  create mode 100644 host/include/aarch64/host/aes-round.h
>
> diff --git a/host/include/aarch64/host/aes-round.h 
> b/host/include/aarch64/host/aes-round.h
> new file mode 100644
> index 00..27ca823db6
> --- /dev/null
> +++ b/host/include/aarch64/host/aes-round.h
> @@ -0,0 +1,204 @@
> +/*
> + * AArch64 specific aes acceleration.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HOST_AES_ROUND_H
> +#define HOST_AES_ROUND_H
> +
> +#include "host/cpuinfo.h"
> +#include <arm_neon.h>
> +
> +#ifdef __ARM_FEATURE_AES
> +# define HAVE_AES_ACCEL  true
> +# define ATTR_AES_ACCEL
> +#else
> +# define HAVE_AES_ACCEL  likely(cpuinfo & CPUINFO_AES)
> +# define ATTR_AES_ACCEL  __attribute__((target("+crypto")))
> +#endif
> +
> +static inline uint8x16_t aes_accel_bswap(uint8x16_t x)
> +{
> +/* No arm_neon.h primitive, and the compilers don't share builtins. */

vqtbl1q_u8() perhaps?
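
Something like this, perhaps (an untested sketch only - the helper name
is made up, and it assumes the same AArch64 context as the patch):

#include <arm_neon.h>

/* Byte-reverse a 128-bit vector with a single NEON table lookup. */
static inline uint8x16_t aes_accel_bswap_tbl(uint8x16_t x)
{
    static const uint8_t rev[16] = { 15, 14, 13, 12, 11, 10, 9, 8,
                                      7,  6,  5,  4,  3,  2, 1, 0 };

    return vqtbl1q_u8(x, vld1q_u8(rev));
}

That would sidestep the clang/gcc builtin split, at the cost of loading
the index vector.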

> +#ifdef __clang__
> +return __builtin_shufflevector(x, x, 15, 14, 13, 12, 11, 10, 9, 8,
> +   7, 6, 5, 4, 3, 2, 1, 0);
> +#else
> +return __builtin_shuffle(x, (uint8x16_t)
> + { 15, 14, 13, 12, 11, 10, 9, 8,
> +   7,  6,  5,  4,  3,   2, 1, 0, });
> +#endif
> +}
> +
> +/*
> + * Through clang 15, the aes inlines are only defined if __ARM_FEATURE_AES;
> + * one cannot use __attribute__((target)) to make them appear after the fact.
> + * Therefore we must fallback to inline asm.
> + */
> +#ifdef __ARM_FEATURE_AES
> +# define aes_accel_aesd   vaesdq_u8
> +# define aes_accel_aese   vaeseq_u8
> +# define aes_accel_aesmc  vaesmcq_u8
> +# define aes_accel_aesimc vaesimcq_u8
> +#else
> +static inline uint8x16_t aes_accel_aesd(uint8x16_t d, uint8x16_t k)
> +{
> +asm(".arch_extension aes\n\t"
> +"aesd %0.16b, %1.16b" : "+w"(d) : "w"(k));
> +return d;
> +}
> +
> +static inline uint8x16_t aes_accel_aese(uint8x16_t d, uint8x16_t k)
> +{
> +asm(".arch_extension aes\n\t"
> +"aese %0.16b, %1.16b" : "+w"(d) : "w"(k));
> +return d;
> +}
> +
> +static inline uint8x16_t aes_accel_aesmc(uint8x16_t d)
> +{
> +asm(".arch_extension aes\n\t"
> +"aesmc %0.16b, %1.16b" : "=w"(d) : "w"(d));


Most ARM cores fuse aese/aesmc into a single uop (with the associated
performance boost) if the pattern is

aese x, y
aesmc x,x

aesd x, y
aesimc x,x

So it might make sense to use +w here at least, and use only a single
register (which the compiler will likely do in any case, but still)

I would assume that the compiler cannot issue these separately based
on the sequences below, but if it might, it may be worth it to emit
the aese/aesmc together in a single asm() block
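
E.g. (untested sketch, not part of the patch - the helper name is made
up):

#include <arm_neon.h>

static inline uint8x16_t aes_accel_aese_aesmc(uint8x16_t d, uint8x16_t k)
{
    /* Keep aese/aesmc adjacent so cores that fuse the pair can do so. */
    asm(".arch_extension aes\n\t"
        "aese %0.16b, %1.16b\n\t"
        "aesmc %0.16b, %0.16b"
        : "+w"(d) : "w"(k));
    return d;
}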

> +return d;
> +}
> +
> +static inline uint8x16_t aes_accel_aesimc(uint8x16_t d)
> +{
> +asm(".arch_extension aes\n\t"
> +"aesimc %0.16b, %1.16b" : "=w"(d) : "w"(d));
> +return d;
> +}
> +#endif /* __ARM_FEATURE_AES */
> +
> +static inline void ATTR_AES_ACCEL
> +aesenc_MC_accel(AESState *ret, const AESState *st, bool be)
> +{
> +uint8x16_t t = (uint8x16_t)st->v;
> +
> +if (be) {
> +t = aes_accel_bswap(t);
> +t = aes_accel_aesmc(t);
> +t = aes_accel_bswap(t);
> +} else {
> +t = aes_accel_aesmc(t);
> +}
> +ret->v = (AESStateVec)t;
> +}
> +
> +static inline void ATTR_AES_ACCEL
> +aesenc_SB_SR_accel(AESState *ret, const AESState *st, bool be)
> +{
> +uint8x16_t t = (uint8x16_t)st->v;
> +uint8x16_t z = { };
> +
> +if (be) {
> +t = aes_accel_bswap(t);
> +t = aes_accel_aese(t, z);
> +t = aes_accel_bswap(t);
> +} else {
> +t = aes_accel_aese(t, z);
> +}
> +ret->v = (AESStateVec)t;
> +}
> +
> +static inline void ATTR_AES_ACCEL
> +aesenc_SB_SR_MC_AK_accel(AESState *ret, const AESState *st,
> + const AESState *rk, bool be)
> +{
> +uint8x16_t t = (uint8x16_t)st->v;
> +uint8x16_t k = (uint8x16_t)rk->v;
> +uint8x16_t z = { };
> +
> +if (be) {
> +t = aes_accel_bswap(t);
> +k = aes_accel_bswap(k);
> +t = aes_accel_aese(t, z);
> +t = aes_accel_aesmc(t);
> +t = veorq_u8(t, k);
> +t = aes_accel_bswap(t);
> +} else {
> +t = aes_accel_aese(t, z);
> +t = aes_accel_aesmc(t);
> +t = veorq_u8(t, k);
> +}
> +ret->v = (AESStateVec)t;
> +}
> +
> +static inline void ATTR_AES_ACCEL
> +aesdec_IMC_accel(AESState *ret, const AESState *st, bool be)
> +{
> +uint8x16_t t = (uint8x16_t)st->v;
> +
> +if (be) {
> +t = aes_accel_bswap(t);
> +t = aes_accel_aesimc(t);

Re: [PATCH 04/35] crypto: Add aesenc_SB_SR

2023-06-03 Thread Ard Biesheuvel
On Sat, 3 Jun 2023 at 04:34, Richard Henderson
 wrote:
>
> Start adding infrastructure for accelerating guest AES.
> Begin with a SubBytes + ShiftRows primitive.
>
> Signed-off-by: Richard Henderson 
> ---
>  host/include/generic/host/aes-round.h | 15 +
>  include/crypto/aes-round.h| 41 +++
>  crypto/aes.c  | 47 +++
>  3 files changed, 103 insertions(+)
>  create mode 100644 host/include/generic/host/aes-round.h
>  create mode 100644 include/crypto/aes-round.h
>
> diff --git a/host/include/generic/host/aes-round.h 
> b/host/include/generic/host/aes-round.h
> new file mode 100644
> index 00..598242c603
> --- /dev/null
> +++ b/host/include/generic/host/aes-round.h
> @@ -0,0 +1,15 @@
> +/*
> + * No host specific aes acceleration.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HOST_AES_ROUND_H
> +#define HOST_AES_ROUND_H
> +
> +#define HAVE_AES_ACCEL  false
> +#define ATTR_AES_ACCEL
> +
> +void aesenc_SB_SR_accel(AESState *, const AESState *, bool)
> +QEMU_ERROR("unsupported accel");
> +
> +#endif
> diff --git a/include/crypto/aes-round.h b/include/crypto/aes-round.h
> new file mode 100644
> index 00..784e1daee6
> --- /dev/null
> +++ b/include/crypto/aes-round.h
> @@ -0,0 +1,41 @@
> +/*
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + * AES round fragments, generic version
> + *
> + * Copyright (C) 2023 Linaro, Ltd.
> + */
> +
> +#ifndef CRYPTO_AES_ROUND_H
> +#define CRYPTO_AES_ROUND_H
> +
> +/* Hosts with acceleration will usually need a 16-byte vector type. */
> +typedef uint8_t AESStateVec __attribute__((vector_size(16)));
> +
> +typedef union {
> +uint8_t b[16];
> +uint32_t w[4];
> +uint64_t d[4];
> +AESStateVec v;
> +} AESState;
> +
> +#include "host/aes-round.h"
> +
> +/*
> + * Perform SubBytes + ShiftRows.
> + */
> +
> +void aesenc_SB_SR_gen(AESState *ret, const AESState *st);
> +void aesenc_SB_SR_genrev(AESState *ret, const AESState *st);
> +
> +static inline void aesenc_SB_SR(AESState *r, const AESState *st, bool be)
> +{
> +if (HAVE_AES_ACCEL) {
> +aesenc_SB_SR_accel(r, st, be);
> +} else if (HOST_BIG_ENDIAN == be) {
> +aesenc_SB_SR_gen(r, st);
> +} else {
> +aesenc_SB_SR_genrev(r, st);
> +}
> +}
> +
> +#endif /* CRYPTO_AES_ROUND_H */
> diff --git a/crypto/aes.c b/crypto/aes.c
> index 1309a13e91..708838315a 100644
> --- a/crypto/aes.c
> +++ b/crypto/aes.c
> @@ -29,6 +29,7 @@
>   */
>  #include "qemu/osdep.h"
>  #include "crypto/aes.h"
> +#include "crypto/aes-round.h"
>
>  typedef uint32_t u32;
>  typedef uint8_t u8;
> @@ -1251,6 +1252,52 @@ static const u32 rcon[] = {
>  0x1B000000, 0x36000000, /* for 128-bit blocks, Rijndael never uses 
> more than 10 rcon values */
>  };
>
> +/* Perform SubBytes + ShiftRows. */
> +static inline void
> +aesenc_SB_SR_swap(AESState *r, const AESState *st, bool swap)
> +{
> +const int swap_b = swap ? 15 : 0;
> +uint8_t t;
> +
> +/* These four indexes are not swizzled. */
> +r->b[swap_b ^ 0x0] = AES_sbox[st->b[swap_b ^ AES_SH_0]];
> +r->b[swap_b ^ 0x4] = AES_sbox[st->b[swap_b ^ AES_SH_4]];
> +r->b[swap_b ^ 0x8] = AES_sbox[st->b[swap_b ^ AES_SH_8]];
> +r->b[swap_b ^ 0xc] = AES_sbox[st->b[swap_b ^ AES_SH_C]];
> +
> +/* Otherwise, break cycles. */
> +

This is only needed if r == st, right?

> +t = AES_sbox[st->b[swap_b ^ AES_SH_D]];
> +r->b[swap_b ^ 0x1] = AES_sbox[st->b[swap_b ^ AES_SH_1]];
> +r->b[swap_b ^ 0x5] = AES_sbox[st->b[swap_b ^ AES_SH_5]];
> +r->b[swap_b ^ 0x9] = AES_sbox[st->b[swap_b ^ AES_SH_9]];
> +r->b[swap_b ^ 0xd] = t;
> +
> +t = AES_sbox[st->b[swap_b ^ AES_SH_A]];
> +r->b[swap_b ^ 0x2] = AES_sbox[st->b[swap_b ^ AES_SH_2]];
> +r->b[swap_b ^ 0xa] = t;
> +
> +t = AES_sbox[st->b[swap_b ^ AES_SH_E]];
> +r->b[swap_b ^ 0x6] = AES_sbox[st->b[swap_b ^ AES_SH_6]];
> +r->b[swap_b ^ 0xe] = t;
> +
> +t = AES_sbox[st->b[swap_b ^ AES_SH_7]];
> +r->b[swap_b ^ 0x3] = AES_sbox[st->b[swap_b ^ AES_SH_3]];
> +r->b[swap_b ^ 0xf] = AES_sbox[st->b[swap_b ^ AES_SH_F]];
> +r->b[swap_b ^ 0xb] = AES_sbox[st->b[swap_b ^ AES_SH_B]];
> +r->b[swap_b ^ 0x7] = t;
> +}
> +
> +void aesenc_SB_SR_gen(AESState *r, const AESState *st)
> +{
> +aesenc_SB_SR_swap(r, st, false);
> +}
> +
> +void aesenc_SB_SR_genrev(AESState *r, const AESState *st)
> +{
> +aesenc_SB_SR_swap(r, st, true);
> +}
> +
>  /**
>   * Expand the cipher key into the encryption key schedule.
>   */
> --
> 2.34.1
>



Re: [PATCH 02/35] target/arm: Move aesmc and aesimc tables to crypto/aes.c

2023-06-03 Thread Ard Biesheuvel
On Sat, 3 Jun 2023 at 04:34, Richard Henderson
 wrote:
>
> We do not currently have a table in crypto/ for
> just MixColumns.  Move both tables for consistency.
>
> Signed-off-by: Richard Henderson 
> ---
>  include/crypto/aes.h   |   6 ++
>  crypto/aes.c   | 142 
>  target/arm/tcg/crypto_helper.c | 143 ++---
>  3 files changed, 153 insertions(+), 138 deletions(-)
>
> diff --git a/include/crypto/aes.h b/include/crypto/aes.h
> index 822d64588c..24b073d569 100644
> --- a/include/crypto/aes.h
> +++ b/include/crypto/aes.h
> @@ -34,6 +34,12 @@ extern const uint8_t AES_isbox[256];
>  extern const uint8_t AES_shifts[16];
>  extern const uint8_t AES_ishifts[16];
>
> +/* AES MixColumns, for use with rot32. */
> +extern const uint32_t AES_mc_rot[256];
> +
> +/* AES InvMixColumns, for use with rot32. */
> +extern const uint32_t AES_imc_rot[256];
> +
>  /* AES InvMixColumns */
>  /* AES_imc[x][0] = [x].[0e, 09, 0d, 0b]; */
>  /* AES_imc[x][1] = [x].[0b, 0e, 09, 0d]; */
> diff --git a/crypto/aes.c b/crypto/aes.c
> index af72ff7779..72c95c38fb 100644
> --- a/crypto/aes.c
> +++ b/crypto/aes.c
> @@ -116,6 +116,148 @@ const uint8_t AES_ishifts[16] = {
>  0, 13, 10, 7, 4, 1, 14, 11, 8, 5, 2, 15, 12, 9, 6, 3
>  };
>
> +/*
> + * MixColumns lookup table, for use with rot32.
> + * From Arm ARM pseudocode.

I remember writing the code to generate these tables, and my copy of
the ARM ARM doesn't appear to have them, so this comment seems
inaccurate to me.
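
For reference, the table can be generated along these lines - my own
sketch from the MixColumns definition, not something taken from the ARM
ARM or from this patch (the inverse table is the same idea with the
[0e, 09, 0d, 0b] coefficients):

#include <stdint.h>

/* Multiply by x (i.e. by 2) in GF(2^8) modulo the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t x)
{
    return (x << 1) ^ ((x & 0x80) ? 0x1b : 0);
}

/* tab[x] packs the MixColumns column [02.x, x, x, 03.x], with 02.x in
 * the least significant byte. */
static void gen_mc_rot(uint32_t tab[256])
{
    for (int x = 0; x < 256; x++) {
        uint8_t x2 = xtime(x);
        uint8_t x3 = x2 ^ x;

        tab[x] = ((uint32_t)x3 << 24) | (x << 16) | (x << 8) | x2;
    }
}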

> + */
> +const uint32_t AES_mc_rot[256] = {
> +0x00000000, 0x03010102, 0x06020204, 0x05030306,
> +0x0c040408, 0x0f05050a, 0x0a06060c, 0x0907070e,
> +0x18080810, 0x1b090912, 0x1e0a0a14, 0x1d0b0b16,
> +0x140c0c18, 0x170d0d1a, 0x120e0e1c, 0x110f0f1e,
> +0x30101020, 0x33111122, 0x36121224, 0x35131326,
> +0x3c141428, 0x3f15152a, 0x3a16162c, 0x3917172e,
> +0x28181830, 0x2b191932, 0x2e1a1a34, 0x2d1b1b36,
> +0x241c1c38, 0x271d1d3a, 0x221e1e3c, 0x211f1f3e,
> +0x60202040, 0x63212142, 0x66222244, 0x65232346,
> +0x6c242448, 0x6f25254a, 0x6a26264c, 0x6927274e,
> +0x78282850, 0x7b292952, 0x7e2a2a54, 0x7d2b2b56,
> +0x742c2c58, 0x772d2d5a, 0x722e2e5c, 0x712f2f5e,
> +0x50303060, 0x53313162, 0x56323264, 0x55333366,
> +0x5c343468, 0x5f35356a, 0x5a36366c, 0x5937376e,
> +0x48383870, 0x4b393972, 0x4e3a3a74, 0x4d3b3b76,
> +0x443c3c78, 0x473d3d7a, 0x423e3e7c, 0x413f3f7e,
> +0xc0404080, 0xc3414182, 0xc6424284, 0xc5434386,
> +0xcc444488, 0xcf45458a, 0xca46468c, 0xc947478e,
> +0xd8484890, 0xdb494992, 0xde4a4a94, 0xdd4b4b96,
> +0xd44c4c98, 0xd74d4d9a, 0xd24e4e9c, 0xd14f4f9e,
> +0xf05050a0, 0xf35151a2, 0xf65252a4, 0xf55353a6,
> +0xfc5454a8, 0xff5555aa, 0xfa5656ac, 0xf95757ae,
> +0xe85858b0, 0xeb5959b2, 0xee5a5ab4, 0xed5b5bb6,
> +0xe45c5cb8, 0xe75d5dba, 0xe25e5ebc, 0xe15f5fbe,
> +0xa06060c0, 0xa36161c2, 0xa66262c4, 0xa56363c6,
> +0xac6464c8, 0xaf6565ca, 0xaa6666cc, 0xa96767ce,
> +0xb86868d0, 0xbb6969d2, 0xbe6a6ad4, 0xbd6b6bd6,
> +0xb46c6cd8, 0xb76d6dda, 0xb26e6edc, 0xb16f6fde,
> +0x907070e0, 0x937171e2, 0x967272e4, 0x957373e6,
> +0x9c7474e8, 0x9f7575ea, 0x9a7676ec, 0x997777ee,
> +0x887878f0, 0x8b7979f2, 0x8e7a7af4, 0x8d7b7bf6,
> +0x847c7cf8, 0x877d7dfa, 0x827e7efc, 0x817f7ffe,
> +0x9b80801b, 0x98818119, 0x9d82821f, 0x9e83831d,
> +0x97848413, 0x94858511, 0x91868617, 0x92878715,
> +0x8388880b, 0x80898909, 0x858a8a0f, 0x868b8b0d,
> +0x8f8c8c03, 0x8c8d8d01, 0x898e8e07, 0x8a8f8f05,
> +0xab90903b, 0xa8919139, 0xad92923f, 0xae93933d,
> +0xa7949433, 0xa4959531, 0xa1969637, 0xa2979735,
> +0xb398982b, 0xb0999929, 0xb59a9a2f, 0xb69b9b2d,
> +0xbf9c9c23, 0xbc9d9d21, 0xb99e9e27, 0xba9f9f25,
> +0xfba0a05b, 0xf8a1a159, 0xfda2a25f, 0xfea3a35d,
> +0xf7a4a453, 0xf4a5a551, 0xf1a6a657, 0xf2a7a755,
> +0xe3a8a84b, 0xe0a9a949, 0xe5aaaa4f, 0xe6abab4d,
> +0xefacac43, 0xecadad41, 0xe9aeae47, 0xeaafaf45,
> +0xcbb0b07b, 0xc8b1b179, 0xcdb2b27f, 0xceb3b37d,
> +0xc7b4b473, 0xc4b5b571, 0xc1b6b677, 0xc2b7b775,
> +0xd3b8b86b, 0xd0b9b969, 0xd5baba6f, 0xd6bbbb6d,
> +0xdfbcbc63, 0xdcbdbd61, 0xd9bebe67, 0xdabfbf65,
> +0x5bc0c09b, 0x58c1c199, 0x5dc2c29f, 0x5ec3c39d,
> +0x57c4c493, 0x54c5c591, 0x51c6c697, 0x52c7c795,
> +0x43c8c88b, 0x40c9c989, 0x45caca8f, 0x46cbcb8d,
> +0x4fcccc83, 0x4ccdcd81, 0x49cece87, 0x4acfcf85,
> +0x6bd0d0bb, 0x68d1d1b9, 0x6dd2d2bf, 0x6ed3d3bd,
> +0x67d4d4b3, 0x64d5d5b1, 0x61d6d6b7, 0x62d7d7b5,
> +0x73d8d8ab, 0x70d9d9a9, 0x75dadaaf, 0x76dbdbad,
> +0x7fdcdca3, 0x7cdddda1, 0x79dedea7, 0x7adfdfa5,
> +0x3be0e0db, 0x38e1e1d9, 0x3de2e2df, 0x3ee3e3dd,
> +0x37e4e4d3, 0x34e5e5d1, 0x31e6e6d7, 0x32e7e7d5,
> +0x23e8e8cb, 0x20e9e9c9, 0x25eaeacf, 0x26ebebcd,
> +0x2fececc3, 0x2cededc1, 0x29eeeec7, 0x2aefefc5,
> +0x0bf0f0fb, 0x08f1f1f9, 0x0df2f2ff, 0x0ef3f3fd,
> +0x07f4f4f3, 0x04f5f5f1, 0x01f6f6f7, 

Re: [PATCH 1/1] hw/arm/sbsa-ref: use XHCI to replace EHCI

2023-06-01 Thread Ard Biesheuvel
On Thu, 1 Jun 2023 at 20:00, Leif Lindholm  wrote:
>
> +Ard
>
> On Thu, Jun 01, 2023 at 16:01:43 +0100, Peter Maydell wrote:
> > > >> Also has EHCI never worked, or has it worked in some modes and so this
> > > >> change should be versioned?
> > > >
> > > > AIUI, EHCI has never worked and can never have worked, because
> > > > this board's RAM is all above 4G and the QEMU EHCI controller
> > > > implementation only allows DMA descriptors with 32-bit addresses.
> > > >
> > > > Looking back at the archives, it seems we discussed XHCI vs
> > > > EHCI when the sbsa-ref board went in, and the conclusion was
> > > > that XHCI would be better. But there wasn't a sysbus XHCI device
> > > > at that point, so we ended up committing the sbsa-ref board
> > > > with EHCI and a plan to switch to XHCI when the sysbus-xhci
> > > > device was done, which we then forgot about:
> > > > https://mail.gnu.org/archive/html/qemu-arm/2018-11/msg00638.html
> > >
> > > Ah, thanks! That explains why we did the thing that made no sense :)
> > >
> > > To skip the migration hazard, my prefernece is we just leave the EHCI
> > > device in for now, and add a separate XHCI on PCIe. We can drop the
> > > EHCI device at some point in the future.
> >
> > Why PCIe for the XHCI and not sysbus? At the time the board
> > was originally added the argument was in favour of using
> > a sysbus USB controller (you can see Ard making that point
> > in the linked archive thread).
>
> The original argument was that having the device on the sysbus
> 1) enabled codepaths we wanted to exercise and
> 2) more closely resembled the development systems available at the
> time.
>
> 1 still applies, but I'm not sure 2 does. Ard?
>

It was always primarily about #1. This was also the reason for putting
all DRAM above 4G.

I'm surprised that the EHCI never worked, though - I don't see the
point in keeping it in that case.



Re: [PATCH 2/2] target/i386: Implement PCLMULQDQ using AArch64 PMULL instructions

2023-06-01 Thread Ard Biesheuvel
On Thu, 1 Jun 2023 at 14:33, Ard Biesheuvel  wrote:
>
> Use the AArch64 PMULL{2}.P64 instructions to implement PCLMULQDQ instead
> of emulating them in C code if the host supports this. This is used in
> the implementation of GCM, which is widely used in IPsec VPN and HTTPS.
>
> Somewhat surprising results: on my ThunderX2, with this enabled on top of
> the AES acceleration I sent out earlier, the speedup is substantial.
>
> (1420 is a typical IPsec block size - in HTTPS, GCM operates on much
> larger block sizes but the kernel mode benchmarks are not the best place
> to measure its performance in this mode)
>
> tcrypt: testing speed of rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption
>
> No acceleration
> tcrypt: test 5 (160 bit key, 1420 byte blocks): 10046 operations in 1 seconds 
> (14265320 bytes)
>
> AES acceleration
> tcrypt: test 5 (160 bit key, 1420 byte blocks): 13970 operations in 1 seconds 
> (19837400 bytes)
>
> AES + PMULL acceleration
> tcrypt: test 5 (160 bit key, 1420 byte blocks): 24372 operations in 1 seconds 
> (34608240 bytes)
>

User space benchmark (using OS's qemu-x86_64 vs one built with these
changes applied)

Speedup is about 5x


ard@gambale:~/build/openssl$ apps/openssl speed -evp aes-128-gcm
Doing AES-128-GCM for 3s on 16 size blocks: 1692138 AES-128-GCM's in 2.98s
Doing AES-128-GCM for 3s on 64 size blocks: 665012 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 256 size blocks: 203784 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 1024 size blocks: 49397 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 8192 size blocks: 6447 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 16384 size blocks: 3058 AES-128-GCM's in 3.00s
version: 3.2.0-dev
built on: Thu Jun  1 17:06:09 2023 UTC
options: bn(64,64)
compiler: x86_64-linux-gnu-gcc -pthread -m64 -Wa,--noexecstack -Wall
-O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_BUILDING_OPENSSL
-DNDEBUG
CPUINFO: OPENSSL_ia32cap=0xfed8320b0fcbfffd:0x8001020c01d843a9
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes   16384 bytes
AES-128-GCM      9085.30k     14186.92k    17389.57k    16860.84k    17604.61k    16700.76k



ard@gambale:~/build/openssl$ ../qemu/build/qemu-x86_64 apps/openssl
speed -evp aes-128-gcm
Doing AES-128-GCM for 3s on 16 size blocks: 2703271 AES-128-GCM's in 2.99s
Doing AES-128-GCM for 3s on 64 size blocks: 1537884 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 256 size blocks: 653008 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 1024 size blocks: 203579 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 8192 size blocks: 29020 AES-128-GCM's in 3.00s
Doing AES-128-GCM for 3s on 16384 size blocks: 14716 AES-128-GCM's in 2.99s
version: 3.2.0-dev
built on: Thu Jun  1 17:06:09 2023 UTC
options: bn(64,64)
compiler: x86_64-linux-gnu-gcc -pthread -m64 -Wa,--noexecstack -Wall
-O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_BUILDING_OPENSSL
-DNDEBUG
CPUINFO: OPENSSL_ia32cap=0xfed8320b0fcbfffd:0x8001020c01d843a9
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes   16384 bytes
AES-128-GCM      14465.66k    32808.19k    55723.35k    69488.30k    79243.95k    80637.77k



Re: [PATCH 1/2] target/arm: Use x86 intrinsics to implement PMULL.P64

2023-06-01 Thread Ard Biesheuvel
On Thu, 1 Jun 2023 at 15:01, Peter Maydell  wrote:
>
> On Thu, 1 Jun 2023 at 13:33, Ard Biesheuvel  wrote:
> >
> > Signed-off-by: Ard Biesheuvel 
> > ---
> >  host/include/i386/host/cpuinfo.h |  1 +
> >  target/arm/tcg/vec_helper.c  | 26 +++-
> >  util/cpuinfo-i386.c  |  1 +
> >  3 files changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/host/include/i386/host/cpuinfo.h 
> > b/host/include/i386/host/cpuinfo.h
> > index 073d0a426f31487d..cf4ced844760d28f 100644
> > --- a/host/include/i386/host/cpuinfo.h
> > +++ b/host/include/i386/host/cpuinfo.h
> > @@ -27,6 +27,7 @@
> >  #define CPUINFO_ATOMIC_VMOVDQA  (1u << 16)
> >  #define CPUINFO_ATOMIC_VMOVDQU  (1u << 17)
> >  #define CPUINFO_AES (1u << 18)
> > +#define CPUINFO_PMULL   (1u << 19)
> >
> >  /* Initialized with a constructor. */
> >  extern unsigned cpuinfo;
> > diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
> > index f59d3b26eacf08f8..fb422627588439b3 100644
> > --- a/target/arm/tcg/vec_helper.c
> > +++ b/target/arm/tcg/vec_helper.c
> > @@ -25,6 +25,14 @@
> >  #include "qemu/int128.h"
> >  #include "vec_internal.h"
> >
> > +#ifdef __x86_64__
> > +#include "host/cpuinfo.h"
> > +#include <immintrin.h>
> > +#define TARGET_PMULL  __attribute__((__target__("pclmul")))
> > +#else
> > +#define TARGET_PMULL
> > +#endif
> > +
> >  /*
> >   * Data for expanding active predicate bits to bytes, for byte elements.
> >   *
> > @@ -2010,12 +2018,28 @@ void HELPER(gvec_pmul_b)(void *vd, void *vn, void 
> > *vm, uint32_t desc)
> >   * Because of the lanes are not accessed in strict columns,
> >   * this probably cannot be turned into a generic helper.
> >   */
> > -void HELPER(gvec_pmull_q)(void *vd, void *vn, void *vm, uint32_t desc)
> > +void TARGET_PMULL HELPER(gvec_pmull_q)(void *vd, void *vn, void *vm, 
> > uint32_t desc)
> >  {
> >  intptr_t i, j, opr_sz = simd_oprsz(desc);
> >  intptr_t hi = simd_data(desc);
> >  uint64_t *d = vd, *n = vn, *m = vm;
> >
> > +#ifdef __x86_64__
> > +if (cpuinfo & CPUINFO_PMULL) {
> > +   switch (hi) {
> > +   case 0:
> > +   *(__m128i *)vd = _mm_clmulepi64_si128(*(__m128i *)vm, 
> > *(__m128i *)vn, 0x0);
> > +   break;
> > +   case 1:
> > +   *(__m128i *)vd = _mm_clmulepi64_si128(*(__m128i *)vm, 
> > *(__m128i *)vn, 0x11);
> > +   break;
> > +   default:
> > +   g_assert_not_reached();
> > +   }
> > +return;
> > +}
> > +#endif
>
> This needs to cope with the input vectors being more than
> just 128 bits wide, I think. Also you probably still
> need the clear_tail() to clear any high bits of the register.
>

Ah yes, I missed that completely.



[PATCH 1/2] target/arm: Use x86 intrinsics to implement PMULL.P64

2023-06-01 Thread Ard Biesheuvel
Signed-off-by: Ard Biesheuvel 
---
 host/include/i386/host/cpuinfo.h |  1 +
 target/arm/tcg/vec_helper.c  | 26 +++-
 util/cpuinfo-i386.c  |  1 +
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/host/include/i386/host/cpuinfo.h b/host/include/i386/host/cpuinfo.h
index 073d0a426f31487d..cf4ced844760d28f 100644
--- a/host/include/i386/host/cpuinfo.h
+++ b/host/include/i386/host/cpuinfo.h
@@ -27,6 +27,7 @@
 #define CPUINFO_ATOMIC_VMOVDQA  (1u << 16)
 #define CPUINFO_ATOMIC_VMOVDQU  (1u << 17)
 #define CPUINFO_AES (1u << 18)
+#define CPUINFO_PMULL   (1u << 19)
 
 /* Initialized with a constructor. */
 extern unsigned cpuinfo;
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index f59d3b26eacf08f8..fb422627588439b3 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -25,6 +25,14 @@
 #include "qemu/int128.h"
 #include "vec_internal.h"
 
+#ifdef __x86_64__
+#include "host/cpuinfo.h"
+#include <immintrin.h>
+#define TARGET_PMULL  __attribute__((__target__("pclmul")))
+#else
+#define TARGET_PMULL
+#endif
+
 /*
  * Data for expanding active predicate bits to bytes, for byte elements.
  *
@@ -2010,12 +2018,28 @@ void HELPER(gvec_pmul_b)(void *vd, void *vn, void *vm, 
uint32_t desc)
  * Because of the lanes are not accessed in strict columns,
  * this probably cannot be turned into a generic helper.
  */
-void HELPER(gvec_pmull_q)(void *vd, void *vn, void *vm, uint32_t desc)
+void TARGET_PMULL HELPER(gvec_pmull_q)(void *vd, void *vn, void *vm, uint32_t 
desc)
 {
 intptr_t i, j, opr_sz = simd_oprsz(desc);
 intptr_t hi = simd_data(desc);
 uint64_t *d = vd, *n = vn, *m = vm;
 
+#ifdef __x86_64__
+if (cpuinfo & CPUINFO_PMULL) {
+   switch (hi) {
+   case 0:
+   *(__m128i *)vd = _mm_clmulepi64_si128(*(__m128i *)vm, *(__m128i 
*)vn, 0x0);
+   break;
+   case 1:
+   *(__m128i *)vd = _mm_clmulepi64_si128(*(__m128i *)vm, *(__m128i 
*)vn, 0x11);
+   break;
+   default:
+   g_assert_not_reached();
+   }
+return;
+}
+#endif
+
 for (i = 0; i < opr_sz / 8; i += 2) {
 uint64_t nn = n[i + hi];
 uint64_t mm = m[i + hi];
diff --git a/util/cpuinfo-i386.c b/util/cpuinfo-i386.c
index 3043f066c0182dc8..8930e13451201a64 100644
--- a/util/cpuinfo-i386.c
+++ b/util/cpuinfo-i386.c
@@ -40,6 +40,7 @@ unsigned __attribute__((constructor)) cpuinfo_init(void)
 info |= (c & bit_MOVBE ? CPUINFO_MOVBE : 0);
 info |= (c & bit_POPCNT ? CPUINFO_POPCNT : 0);
 info |= (c & bit_AES ? CPUINFO_AES : 0);
+info |= (c & bit_PCLMULQDQ ? CPUINFO_PMULL : 0);
 
 /* For AVX features, we must check available and usable. */
 if ((c & bit_AVX) && (c & bit_OSXSAVE)) {
-- 
2.39.2




[PATCH 2/2] target/i386: Implement PCLMULQDQ using AArch64 PMULL instructions

2023-06-01 Thread Ard Biesheuvel
Use the AArch64 PMULL{2}.P64 instructions to implement PCLMULQDQ instead
of emulating them in C code if the host supports this. This is used in
the implementation of GCM, which is widely used in IPsec VPN and HTTPS.

Somewhat surprising results: on my ThunderX2, with this enabled on top of
the AES acceleration I sent out earlier, the speedup is substantial.

(1420 is a typical IPsec block size - in HTTPS, GCM operates on much
larger block sizes but the kernel mode benchmarks are not the best place
to measure its performance in this mode)

tcrypt: testing speed of rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption

No acceleration
tcrypt: test 5 (160 bit key, 1420 byte blocks): 10046 operations in 1 seconds 
(14265320 bytes)

AES acceleration
tcrypt: test 5 (160 bit key, 1420 byte blocks): 13970 operations in 1 seconds 
(19837400 bytes)

AES + PMULL acceleration
tcrypt: test 5 (160 bit key, 1420 byte blocks): 24372 operations in 1 seconds 
(34608240 bytes)

Signed-off-by: Ard Biesheuvel 
---
 host/include/aarch64/host/cpuinfo.h |  1 +
 target/i386/ops_sse.h   | 24 
 util/cpuinfo-aarch64.c  |  1 +
 3 files changed, 26 insertions(+)

diff --git a/host/include/aarch64/host/cpuinfo.h 
b/host/include/aarch64/host/cpuinfo.h
index 05feeb4f4369fc19..da268dce1390cac0 100644
--- a/host/include/aarch64/host/cpuinfo.h
+++ b/host/include/aarch64/host/cpuinfo.h
@@ -10,6 +10,7 @@
 #define CPUINFO_LSE (1u << 1)
 #define CPUINFO_LSE2(1u << 2)
 #define CPUINFO_AES (1u << 3)
+#define CPUINFO_PMULL   (1u << 4)
 
 /* Initialized with a constructor. */
 extern unsigned cpuinfo;
diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index db79132778efd211..d7e7bd8b733122a8 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2157,6 +2157,30 @@ void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,
 uint64_t a, b;
 int i;
 
+#ifdef __aarch64__
+if (cpuinfo & CPUINFO_PMULL) {
+aes_vec_t vv = *(aes_vec_t *)v, vs = *(aes_vec_t *)s;
+aes_vec_t *vd = (aes_vec_t *)d;
+
+switch (ctrl & 0x11) {
+case 0x1:
+asm("ext %0.16b, %0.16b, %0.16b, #8":"+w"(vv));
+/* fallthrough */
+case 0x0:
+asm(".arch_extension aes\n"
+"pmull %0.1q, %1.1d, %2.1d":"=w"(*vd):"w"(vv),"w"(vs));
+break;
+case 0x10:
+asm("ext %0.16b, %0.16b, %0.16b, #8":"+w"(vv));
+/* fallthrough */
+case 0x11:
+asm(".arch_extension aes\n"
+"pmull2 %0.1q, %1.2d, %2.2d":"=w"(*vd):"w"(vv),"w"(vs));
+}
+return;
+}
+#endif
+
 for (i = 0; i < 1 << SHIFT; i += 2) {
 a = v->Q(((ctrl & 1) != 0) + i);
 b = s->Q(((ctrl & 16) != 0) + i);
diff --git a/util/cpuinfo-aarch64.c b/util/cpuinfo-aarch64.c
index 769cdfeb2fc32d5e..95ec1f4adfc829b9 100644
--- a/util/cpuinfo-aarch64.c
+++ b/util/cpuinfo-aarch64.c
@@ -57,6 +57,7 @@ unsigned __attribute__((constructor)) cpuinfo_init(void)
 info |= (hwcap & HWCAP_ATOMICS ? CPUINFO_LSE : 0);
 info |= (hwcap & HWCAP_USCAT ? CPUINFO_LSE2 : 0);
 info |= (hwcap & HWCAP_AES ? CPUINFO_AES : 0);
+info |= (hwcap & HWCAP_PMULL ? CPUINFO_PMULL : 0);
 #endif
 #ifdef CONFIG_DARWIN
 info |= sysctl_for_bool("hw.optional.arm.FEAT_LSE") * CPUINFO_LSE;
-- 
2.39.2




[PATCH 0/2] Implement PMULL using host intrinsics

2023-06-01 Thread Ard Biesheuvel
Another set of RFC patches - this time for 64x64->128 polynomial
multiplication. Playing around with this on top of the AES changes I sent
out earlier this week, I noticed that the speedup is rather substantial.

PMULL is relevant for GCM encryption, which combines AES in counter mode
with GHASH, which is based on multiplication in GF(2^128). The
significance of PMULL to this encryption mode is basically why PMULL is
part of the AES crypto extension on AArch64.
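
For illustration, the operation in question -- a 64x64 -> 128-bit carry-less
multiplication, i.e. polynomial multiplication over GF(2) -- can be sketched
portably as a bitwise loop (a sketch only; it is roughly what a generic C
helper has to compute per 64-bit lane when no host instruction is available):

#include <stdint.h>

/* 64x64 -> 128-bit carry-less multiply: partial products are XORed rather
 * than added, so no carries propagate between bit positions.
 */
static void clmul_64x64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t l = 0, h = 0;
    int i;

    for (i = 0; i < 64; i++) {
        if (b & (1ULL << i)) {
            l ^= a << i;
            h ^= i ? a >> (64 - i) : 0;
        }
    }
    *lo = l;
    *hi = h;
}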

Note that user emulation on an AArch64 host of x86 binaries that perform
any kind of HTTPS communication under the hood would likely benefit from
this.

Again, this approach is likely too ad-hoc, but it helps span the space
of what we might want to cover in terms of host acceleration API. (I'm
not a TCG expert, but I guess this raises the question of what to cover in
helpers and what to cover using native TCG ops?)

Cc: Peter Maydell 
Cc: Alex Bennée 
Cc: Richard Henderson 
Cc: Philippe Mathieu-Daudé 

Ard Biesheuvel (2):
  target/arm: Use x86 intrinsics to implement PMULL.P64
  target/i386: Implement PCLMULQDQ using AArch64 PMULL instructions

 host/include/aarch64/host/cpuinfo.h |  1 +
 host/include/i386/host/cpuinfo.h|  1 +
 target/arm/tcg/vec_helper.c | 26 +++-
 target/i386/ops_sse.h   | 24 ++
 util/cpuinfo-aarch64.c  |  1 +
 util/cpuinfo-i386.c |  1 +
 6 files changed, 53 insertions(+), 1 deletion(-)

-- 
2.39.2




Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv

2023-05-31 Thread Ard Biesheuvel
On Wed, 31 May 2023 at 18:33, Richard Henderson
 wrote:
>
> On 5/31/23 04:22, Ard Biesheuvel wrote:
> > Use the host native instructions to implement the AES instructions
> > exposed by the emulated target. The mapping is not 1:1, so it requires a
> > bit of fiddling to get the right result.
> >
> > This is still RFC material - the current approach feels too ad-hoc, but
> > given the non-1:1 correspondence, doing a proper abstraction is rather
> > difficult.
> >
> > Changes since v1/RFC:
> > - add second patch to implement x86 AES instructions on ARM hosts - this
> >helps illustrate what an abstraction should cover.
> > - use cpuinfo framework to detect host support for AES instructions.
> > - implement ARM aesimc using x86 aesimc directly
> >
> > Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
> > tcrypt benchmark (mode=500)
> >
> > Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
> > the fact that ARM uses two instructions to implement a single AES round,
> > whereas x86 only uses one.
>
> Thanks.  I spent some time yesterday looking at this, with an encrypted disk 
> test case and
> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt 
> respectively.
>

I don't understand what 'overhead' means in this context. Are you
saying you saw barely any improvement?

> > As for the design of an abstraction: I imagine we could introduce a
> > host/aes.h API that implements some building blocks that the TCG helper
> > implementation could use.
>
> Indeed.  I was considering interfaces like
>
> /* Perform SubBytes + ShiftRows on state. */
> Int128 aesenc_SB_SR(Int128 state);
>
> /* Perform MixColumns on state. */
> Int128 aesenc_MC(Int128 state);
>
> /* Perform SubBytes + ShiftRows + MixColumns on state. */
> Int128 aesenc_SB_SR_MC(Int128 state);
>
> /* Perform SubBytes + ShiftRows + MixColumns + AddRoundKey. */
> Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey);
>
> and so forth for aesdec as well.  All but aesenc_MC should be implementable 
> on x86 and
> Power7, and all of them on aarch64.
>

aesenc_MC() can be implemented on x86 the way I did in patch #1, using
aesdeclast+aesenc
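
A minimal sketch of that composition (x86 intrinsics; the function name
follows the interface proposed above and is not an existing QEMU helper --
it needs the "aes" target attribute or -maes to build):

#include <immintrin.h>

/* MixColumns in isolation: aesdeclast applies InvShiftRows + InvSubBytes
 * (plus an XOR with the zero key), which aesenc's ShiftRows + SubBytes then
 * undo, leaving only aesenc's MixColumns step.
 */
__attribute__((__target__("aes")))
static __m128i aesenc_MC(__m128i state)
{
    const __m128i zero = _mm_setzero_si128();

    return _mm_aesenc_si128(_mm_aesdeclast_si128(state, zero), zero);
}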


> > I suppose it really depends on whether there is a third host
> > architecture that could make use of this, and how its AES instructions
> > map onto the primitive AES ops above.
>
> There is Power6 (v{,n}cipher{,last}) and RISC-V Zkn (aes64{es,esm,ds,dsm,im})
>
> I got hung up yesterday was understanding the different endian requirements 
> of x86 vs Power.
>
> ppc64:
>
>  asm("lxvd2x 32,0,%1;"
>  "lxvd2x 33,0,%2;"
>  "vcipher 0,0,1;"
>  "stxvd2x 32,0,%0"
>  : : "r"(o), "r"(i), "r"(k), : "memory", "v0", "v1", "v2");
>
> ppc64le:
>
>  unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>  asm("lxvd2x 32,0,%1;"
>  "lxvd2x 33,0,%2;"
>  "lxvd2x 34,0,%3;"
>  "vperm 0,0,0,2;"
>  "vperm 1,1,1,2;"
>  "vcipher 0,0,1;"
>  "vperm 0,0,0,2;"
>  "stxvd2x 32,0,%0"
>  : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>
> There are also differences in their AES_Te* based C routines as well, which 
> made me wonder
> if we are handling host endianness differences correctly in emulation right 
> now.  I think
> I should most definitely add some generic-ish tests for this...
>

The above kind of sums it up, no? Or isn't this working code?



[PATCH v2 2/2] target/i386: Implement AES instructions using AArch64 counterparts

2023-05-31 Thread Ard Biesheuvel
When available, use the AArch64 AES instructions to implement the x86
ones. These are not a 1:1 fit, but considerably more efficient, and
without data dependent timing.

For a typical benchmark (linux tcrypt mode=500), this gives a 2-3x
speedup when running on ThunderX2.

Signed-off-by: Ard Biesheuvel 
---
 host/include/aarch64/host/cpuinfo.h |  1 +
 target/i386/ops_sse.h   | 69 
 util/cpuinfo-aarch64.c  |  1 +
 3 files changed, 71 insertions(+)

diff --git a/host/include/aarch64/host/cpuinfo.h b/host/include/aarch64/host/cpuinfo.h
index 82227890b4b4db03..05feeb4f4369fc19 100644
--- a/host/include/aarch64/host/cpuinfo.h
+++ b/host/include/aarch64/host/cpuinfo.h
@@ -9,6 +9,7 @@
 #define CPUINFO_ALWAYS  (1u << 0)  /* so cpuinfo is nonzero */
 #define CPUINFO_LSE (1u << 1)
 #define CPUINFO_LSE2(1u << 2)
+#define CPUINFO_AES (1u << 3)
 
 /* Initialized with a constructor. */
 extern unsigned cpuinfo;
diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index fb63af7afa21588d..db79132778efd211 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -20,6 +20,11 @@
 
 #include "crypto/aes.h"
 
+#ifdef __aarch64__
+#include "host/cpuinfo.h"
+typedef uint8_t aes_vec_t __attribute__((vector_size(16)));
+#endif
+
 #if SHIFT == 0
 #define Reg MMXReg
 #define XMM_ONLY(...)
@@ -2165,6 +2170,20 @@ void glue(helper_aesdec, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 Reg st = *v;
 Reg rk = *s;
 
+#ifdef __aarch64__
+if (cpuinfo & CPUINFO_AES) {
+asm("   .arch_extension aes \n"
+"   aesd%0.16b, %1.16b  \n"
+"   aesimc  %0.16b, %0.16b  \n"
+"   eor %0.16b, %0.16b, %2.16b  \n"
+:   "=w"(*(aes_vec_t *)d)
+:   "w"((aes_vec_t){}),
+"w"(*(aes_vec_t *)s),
+"0"(*(aes_vec_t *)v));
+return;
+}
+#endif
+
 for (i = 0 ; i < 2 << SHIFT ; i++) {
 int j = i & 3;
 d->L(i) = rk.L(i) ^ bswap32(AES_Td0[st.B(AES_ishifts[4 * j + 0])] ^
@@ -2180,6 +2199,19 @@ void glue(helper_aesdeclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 Reg st = *v;
 Reg rk = *s;
 
+#ifdef __aarch64__
+if (cpuinfo & CPUINFO_AES) {
+asm("   .arch_extension aes \n"
+"   aesd%0.16b, %1.16b  \n"
+"   eor %0.16b, %0.16b, %2.16b  \n"
+:   "=w"(*(aes_vec_t *)d)
+:   "w"((aes_vec_t){}),
+"w"(*(aes_vec_t *)s),
+"0"(*(aes_vec_t *)v));
+return;
+}
+#endif
+
 for (i = 0; i < 8 << SHIFT; i++) {
 d->B(i) = rk.B(i) ^ (AES_isbox[st.B(AES_ishifts[i & 15] + (i & ~15))]);
 }
@@ -2191,6 +2223,20 @@ void glue(helper_aesenc, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 Reg st = *v;
 Reg rk = *s;
 
+#ifdef __aarch64__
+if (cpuinfo & CPUINFO_AES) {
+asm("   .arch_extension aes \n"
+"   aese%0.16b, %1.16b  \n"
+"   aesmc   %0.16b, %0.16b  \n"
+"   eor %0.16b, %0.16b, %2.16b  \n"
+:   "=w"(*(aes_vec_t *)d)
+:   "w"((aes_vec_t){}),
+"w"(*(aes_vec_t *)s),
+"0"(*(aes_vec_t *)v));
+return;
+}
+#endif
+
 for (i = 0 ; i < 2 << SHIFT ; i++) {
 int j = i & 3;
 d->L(i) = rk.L(i) ^ bswap32(AES_Te0[st.B(AES_shifts[4 * j + 0])] ^
@@ -2206,6 +2252,19 @@ void glue(helper_aesenclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
 Reg st = *v;
 Reg rk = *s;
 
+#ifdef __aarch64__
+if (cpuinfo & CPUINFO_AES) {
+asm("   .arch_extension aes \n"
+"   aese%0.16b, %1.16b  \n"
+"   eor %0.16b, %0.16b, %2.16b  \n"
+:   "=w"(*(aes_vec_t *)d)
+:   "w"((aes_vec_t){}),
+"w"(*(aes_vec_t *)s),
+"0"(*(aes_vec_t *)v));
+return;
+}
+#endif
+
 for (i = 0; i < 8 << SHIFT; i++) {
 d->B(i) = rk.B(i) ^ (AES_sbox[st.B(AES_shifts[i & 15] + (i & ~15))]);
 }
@@ -2217,6 +2276,16 @@ void glue(helper_aesimc, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 int i;
 Reg tmp = *s;
 
+#ifdef __aarch64__
+if (cpuinfo & CPUINFO_AES) {
+asm("   .arch_extension aes \n"
+"   aesimc  %0.16b, %1.16b  \n"
+:   "=w"(*(aes_vec_t *)d)
+   

[PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv

2023-05-31 Thread Ard Biesheuvel
Use the host native instructions to implement the AES instructions
exposed by the emulated target. The mapping is not 1:1, so it requires a
bit of fiddling to get the right result.

This is still RFC material - the current approach feels too ad-hoc, but
given the non-1:1 correspondence, doing a proper abstraction is rather
difficult.

Changes since v1/RFC:
- add second patch to implement x86 AES instructions on ARM hosts - this
  helps illustrate what an abstraction should cover.
- use cpuinfo framework to detect host support for AES instructions.
- implement ARM aesimc using x86 aesimc directly

Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
tcrypt benchmark (mode=500)

Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
the fact that ARM uses two instructions to implement a single AES round,
whereas x86 only uses one.

Note that using the ARM intrinsics is fiddly with Clang, as it does not
declare the prototypes unless some builtin CPP macro (__ARM_FEATURE_AES)
is defined, which will be set by the compiler based on the command line
arch/cpu options. However, setting this globally for a compilation unit
is dubious, given that we test cpuinfo for AES support, and only emit
the instructions conditionally. So I used inline asm() instead.

As for the design of an abstraction: I imagine we could introduce a
host/aes.h API that implements some building blocks that the TCG helper
implementation could use.

Quoting from my reply to Richard:

Using the primitive operations defined in the AES paper, we basically
perform the following transformation for n rounds of AES (for n in {10,
12, 14})

for (n-1 rounds) {
  AddRoundKey
  ShiftRows
  SubBytes
  MixColumns
}
AddRoundKey
ShiftRows
SubBytes
AddRoundKey

AddRoundKey is just XOR, but it is incorporated into the instructions
that combine a couple of these steps.

So on x86, we have

aesenc:
  ShiftRows
  SubBytes
  MixColumns
  AddRoundKey

aesenclast:
  ShiftRows
  SubBytes
  AddRoundKey

and on ARM we have

aese:
  AddRoundKey
  ShiftRows
  SubBytes

aesmc:
  MixColumns

So a generic routine that does only ShiftRows+SubBytes could be backed by
x86's aesenclast and ARM's aese, using a NULL round key argument in each
case. Then, it would be up to the TCG helper code for either ARM or x86
to incorporate those routines in the right way.
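
To make that concrete, here is a sketch of what such a building block could
look like (illustrative only, not an existing QEMU API; the runtime cpuinfo
checks are omitted for brevity):

#include <stdint.h>
#include <string.h>

#ifdef __x86_64__
#include <immintrin.h>
#define TARGET_AES  __attribute__((__target__("aes")))
#else
#define TARGET_AES
#endif

typedef union {
    uint8_t  bytes[16];
    uint64_t l[2];
} aes_block_t;

/* ShiftRows + SubBytes only, by issuing the host's "last round" encryption
 * instruction with an all-zero round key.
 */
static TARGET_AES void aesenc_SB_SR(aes_block_t *st)
{
#if defined(__x86_64__)
    __m128i s;

    memcpy(&s, st, 16);
    s = _mm_aesenclast_si128(s, _mm_setzero_si128()); /* SR + SB + XOR 0 */
    memcpy(st, &s, 16);
#elif defined(__aarch64__)
    typedef uint8_t aes_vec_t __attribute__((vector_size(16)));
    aes_vec_t s;

    memcpy(&s, st, 16);
    asm(".arch_extension aes\n"
        "aese %0.16b, %1.16b"                         /* XOR 0 + SR + SB */
        : "+w"(s) : "w"((aes_vec_t){}));
    memcpy(st, &s, 16);
#else
    (void)st; /* a portable fallback would index AES_sbox via AES_shifts */
#endif
}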

I suppose it really depends on whether there is a third host
architecture that could make use of this, and how its AES instructions
map onto the primitive AES ops above.

Cc: Peter Maydell 
Cc: Alex Bennée 
Cc: Richard Henderson 
Cc: Philippe Mathieu-Daudé 

Ard Biesheuvel (2):
  target/arm: use x86 intrinsics to implement AES instructions
  target/i386: Implement AES instructions using AArch64 counterparts

 host/include/aarch64/host/cpuinfo.h |  1 +
 host/include/i386/host/cpuinfo.h|  1 +
 target/arm/tcg/crypto_helper.c  | 37 ++-
 target/i386/ops_sse.h   | 69 
 util/cpuinfo-aarch64.c  |  1 +
 util/cpuinfo-i386.c |  1 +
 6 files changed, 107 insertions(+), 3 deletions(-)

-- 
2.39.2




[PATCH v2 1/2] target/arm: use x86 intrinsics to implement AES instructions

2023-05-31 Thread Ard Biesheuvel
ARM intrinsics for AES deviate from the x86 ones in the way they cover
the different stages of each round, and so mapping one to the other is
not entirely straight-forward. However, with a bit of care, we can still
use the x86 ones to emulate the ARM ones, which makes them constant time
(which is an important property in crypto) and substantially more
efficient.

Signed-off-by: Ard Biesheuvel 
---
 host/include/i386/host/cpuinfo.h |  1 +
 target/arm/tcg/crypto_helper.c   | 37 ++--
 util/cpuinfo-i386.c  |  1 +
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/host/include/i386/host/cpuinfo.h b/host/include/i386/host/cpuinfo.h
index a6537123cf80ec5b..073d0a426f31487d 100644
--- a/host/include/i386/host/cpuinfo.h
+++ b/host/include/i386/host/cpuinfo.h
@@ -26,6 +26,7 @@
 #define CPUINFO_AVX512VBMI2 (1u << 15)
 #define CPUINFO_ATOMIC_VMOVDQA  (1u << 16)
 #define CPUINFO_ATOMIC_VMOVDQU  (1u << 17)
+#define CPUINFO_AES (1u << 18)
 
 /* Initialized with a constructor. */
 extern unsigned cpuinfo;
diff --git a/target/arm/tcg/crypto_helper.c b/target/arm/tcg/crypto_helper.c
index d28690321f0b86ea..747c061b5a1b0e5e 100644
--- a/target/arm/tcg/crypto_helper.c
+++ b/target/arm/tcg/crypto_helper.c
@@ -18,10 +18,21 @@
 #include "crypto/sm4.h"
 #include "vec_internal.h"
 
+#ifdef __x86_64__
+#include "host/cpuinfo.h"
+#include 
+#define TARGET_AES  __attribute__((__target__("aes")))
+#else
+#define TARGET_AES
+#endif
+
 union CRYPTO_STATE {
 uint8_tbytes[16];
 uint32_t   words[4];
 uint64_t   l[2];
+#ifdef __x86_64__
+__m128ivec;
+#endif
 };
 
 #if HOST_BIG_ENDIAN
@@ -45,8 +56,8 @@ static void clear_tail_16(void *vd, uint32_t desc)
 clear_tail(vd, opr_sz, max_sz);
 }
 
-static void do_crypto_aese(uint64_t *rd, uint64_t *rn,
-   uint64_t *rm, bool decrypt)
+static void TARGET_AES do_crypto_aese(uint64_t *rd, uint64_t *rn,
+  uint64_t *rm, bool decrypt)
 {
 static uint8_t const * const sbox[2] = { AES_sbox, AES_isbox };
 static uint8_t const * const shift[2] = { AES_shifts, AES_ishifts };
@@ -54,6 +65,16 @@ static void do_crypto_aese(uint64_t *rd, uint64_t *rn,
 union CRYPTO_STATE st = { .l = { rn[0], rn[1] } };
 int i;
 
+#ifdef __x86_64__
+if (cpuinfo & CPUINFO_AES) {
+__m128i *d = (__m128i *)rd, z = {};
+
+*d = decrypt ? _mm_aesdeclast_si128(rk.vec ^ st.vec, z)
+ : _mm_aesenclast_si128(rk.vec ^ st.vec, z);
+return;
+}
+#endif
+
 /* xor state vector with round key */
 rk.l[0] ^= st.l[0];
 rk.l[1] ^= st.l[1];
@@ -78,7 +99,7 @@ void HELPER(crypto_aese)(void *vd, void *vn, void *vm, uint32_t desc)
 clear_tail(vd, opr_sz, simd_maxsz(desc));
 }
 
-static void do_crypto_aesmc(uint64_t *rd, uint64_t *rm, bool decrypt)
+static void TARGET_AES do_crypto_aesmc(uint64_t *rd, uint64_t *rm, bool decrypt)
 {
 static uint32_t const mc[][256] = { {
 /* MixColumns lookup table */
@@ -217,6 +238,16 @@ static void do_crypto_aesmc(uint64_t *rd, uint64_t *rm, bool decrypt)
 union CRYPTO_STATE st = { .l = { rm[0], rm[1] } };
 int i;
 
+#ifdef __x86_64__
+if (cpuinfo & CPUINFO_AES) {
+__m128i *d = (__m128i *)rd, z = {};
+
+*d = decrypt ? _mm_aesimc_si128(st.vec)
+ : _mm_aesenc_si128(_mm_aesdeclast_si128(st.vec, z), z);
+return;
+}
+#endif
+
 for (i = 0; i < 16; i += 4) {
 CR_ST_WORD(st, i >> 2) =
 mc[decrypt][CR_ST_BYTE(st, i)] ^
diff --git a/util/cpuinfo-i386.c b/util/cpuinfo-i386.c
index ab6143d9e77291f1..3043f066c0182dc8 100644
--- a/util/cpuinfo-i386.c
+++ b/util/cpuinfo-i386.c
@@ -39,6 +39,7 @@ unsigned __attribute__((constructor)) cpuinfo_init(void)
 info |= (c & bit_SSE4_1 ? CPUINFO_SSE4 : 0);
 info |= (c & bit_MOVBE ? CPUINFO_MOVBE : 0);
 info |= (c & bit_POPCNT ? CPUINFO_POPCNT : 0);
+info |= (c & bit_AES ? CPUINFO_AES : 0);
 
 /* For AVX features, we must check available and usable. */
 if ((c & bit_AVX) && (c & bit_OSXSAVE)) {
-- 
2.39.2




Re: [RFC PATCH] target/arm: use x86 intrinsics to implement AES instructions

2023-05-30 Thread Ard Biesheuvel
On Tue, 30 May 2023 at 18:45, Peter Maydell  wrote:
>
> On Tue, 30 May 2023 at 14:52, Ard Biesheuvel  wrote:
> >
> > ARM intrinsics for AES deviate from the x86 ones in the way they cover
> > the different stages of each round, and so mapping one to the other is
> > not entirely straight-forward. However, with a bit of care, we can still
> > use the x86 ones to emulate the ARM ones, which makes them constant time
> > (which is an important property in crypto) and substantially more
> > efficient.
>
> Do you have examples of workloads and speedups obtained,
> by the way?
>

I don't have any actual numbers to share, unfortunately.

I implemented this when I was experimenting with TPM-based measured
boot and disk encryption in the guest. I'd say that running an OS
under emulation that uses disk encryption would be the most relevant
use case here.

Accelerated AES is typically at least an order of magnitude faster
than a table based C implementation, and does not stress the D-cache
as much (the tables involved are not tiny).



Re: [RFC PATCH] target/arm: use x86 intrinsics to implement AES instructions

2023-05-30 Thread Ard Biesheuvel
On Tue, 30 May 2023 at 18:43, Richard Henderson
 wrote:
>
> On 5/30/23 06:52, Ard Biesheuvel wrote:
> > +#ifdef __x86_64__
> > +if (have_aes()) {
> > +__m128i *d = (__m128i *)rd;
> > +
> > +*d = decrypt ? _mm_aesdeclast_si128(rk.vec ^ st.vec, (__m128i){})
> > + : _mm_aesenclast_si128(rk.vec ^ st.vec, (__m128i){});
>
> Do I correctly understand that the ARM xor is pre-shift
>
> > +return;
> > +}
> > +#endif
> > +
> >   /* xor state vector with round key */
> >   rk.l[0] ^= st.l[0];
> >   rk.l[1] ^= st.l[1];
>
> (like so)
>
> whereas the x86 xor is post-shift
>
> > void glue(helper_aesenclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg 
> > *s)
> > {
> > int i;
> > Reg st = *v;
> > Reg rk = *s;
> >
> > for (i = 0; i < 8 << SHIFT; i++) {
> > d->B(i) = rk.B(i) ^ (AES_sbox[st.B(AES_shifts[i & 15] + (i & 
> > ~15))]);
> > }
>
> (like so, from target/i386/ops_sse.h)?
>

Indeed. Using the primitive operations defined in the AES paper, we
basically have the following for n rounds of AES (for n in {10, 12,
14})

for (n-1 rounds) {
  AddRoundKey
  ShiftRows
  SubBytes
  MixColumns
}
AddRoundKey
ShiftRows
SubBytes
AddRoundKey

AddRoundKey is just XOR, but it is incorporated into the instructions
that combine a couple of these steps.

So on x86, we have

aesenc:
  ShiftRows
  SubBytes
  MixColumns
  AddRoundKey

aesenclast:
  ShiftRows
  SubBytes
  AddRoundKey

and on ARM we have

aese:
  AddRoundKey
  ShiftRows
  SubBytes

aesmc:
  MixColumns


> What might help: could we do the reverse -- emulate the x86 aesdeclast 
> instruction with
> the aarch64 aesd instruction?
>

Help in what sense? To emulate the x86 instructions on an ARM host?

But yes, aesenclast can be implemented using aese in a similar way,
i.e., by passing a {0} vector as the round key into the instruction,
and performing the XOR explicitly using the real round key afterwards.



[RFC PATCH] target/arm: use x86 intrinsics to implement AES instructions

2023-05-30 Thread Ard Biesheuvel
ARM intrinsics for AES deviate from the x86 ones in the way they cover
the different stages of each round, and so mapping one to the other is
not entirely straight-forward. However, with a bit of care, we can still
use the x86 ones to emulate the ARM ones, which makes them constant time
(which is an important property in crypto) and substantially more
efficient.

Cc: Peter Maydell 
Cc: Alex Bennée 
Cc: Richard Henderson 
Cc: Philippe Mathieu-Daudé 
Signed-off-by: Ard Biesheuvel 
---
Suggestions welcome on how to make this more generic across targets and
compilers etc.

 target/arm/tcg/crypto_helper.c | 43 
 1 file changed, 43 insertions(+)

diff --git a/target/arm/tcg/crypto_helper.c b/target/arm/tcg/crypto_helper.c
index d28690321f..961112b6bd 100644
--- a/target/arm/tcg/crypto_helper.c
+++ b/target/arm/tcg/crypto_helper.c
@@ -18,10 +18,32 @@
 #include "crypto/sm4.h"
 #include "vec_internal.h"
 
+#ifdef __x86_64
+#pragma GCC target ("aes")
+#include 
+#include 
+
+static bool have_aes(void)
+{
+static int cpuid_have_aes = -1;
+
+if (cpuid_have_aes == -1) {
+unsigned int eax, ebx, ecx, edx;
+int ret = __get_cpuid(0x1, &eax, &ebx, &ecx, &edx);
+
+cpuid_have_aes = ret && (ecx & bit_AES);
+}
+return cpuid_have_aes > 0;
+}
+#endif
+
 union CRYPTO_STATE {
 uint8_tbytes[16];
 uint32_t   words[4];
 uint64_t   l[2];
+#ifdef __x86_64
+__m128ivec;
+#endif
 };
 
 #if HOST_BIG_ENDIAN
@@ -54,6 +76,16 @@ static void do_crypto_aese(uint64_t *rd, uint64_t *rn,
 union CRYPTO_STATE st = { .l = { rn[0], rn[1] } };
 int i;
 
+#ifdef __x86_64__
+if (have_aes()) {
+__m128i *d = (__m128i *)rd;
+
+*d = decrypt ? _mm_aesdeclast_si128(rk.vec ^ st.vec, (__m128i){})
+ : _mm_aesenclast_si128(rk.vec ^ st.vec, (__m128i){});
+return;
+}
+#endif
+
 /* xor state vector with round key */
 rk.l[0] ^= st.l[0];
 rk.l[1] ^= st.l[1];
@@ -217,6 +249,17 @@ static void do_crypto_aesmc(uint64_t *rd, uint64_t *rm, bool decrypt)
 union CRYPTO_STATE st = { .l = { rm[0], rm[1] } };
 int i;
 
+#ifdef __x86_64__
+if (have_aes()) {
+__m128i *d = (__m128i *)rd;
+
+*d = decrypt ? _mm_aesdec_si128(_mm_aesenclast_si128(st.vec, (__m128i){}),
+(__m128i){})
+ : _mm_aesenc_si128(_mm_aesdeclast_si128(st.vec, (__m128i){}),
+(__m128i){});
+return;
+}
+#endif
 for (i = 0; i < 16; i += 4) {
 CR_ST_WORD(st, i >> 2) =
 mc[decrypt][CR_ST_BYTE(st, i)] ^
-- 
2.39.2




[PATCH v2] hw: arm: Support direct boot for Linux/arm64 EFI zboot images

2023-03-03 Thread Ard Biesheuvel
Fedora 39 will ship its arm64 kernels in the new generic EFI zboot
format, using gzip compression for the payload.

For doing EFI boot in QEMU, this is completely transparent, as the
firmware or bootloader will take care of this. However, for direct
kernel boot without firmware, we will lose the ability to boot such
distro kernels unless we deal with the new format directly.

EFI zboot images contain metadata in the header regarding the placement
of the compressed payload inside the image, and the type of compression
used. This means we can wire up the existing gzip support without too
much hassle, by parsing the header and grabbing the payload from inside
the loaded zboot image.

Cc: Peter Maydell 
Cc: Alex Bennée 
Cc: Richard Henderson 
Cc: Philippe Mathieu-Daudé 
Signed-off-by: Ard Biesheuvel 
---
v2:
- only attempt EFI zboot unpacking of raw images
- check MS-DOS magic as well
- document the origin of the zboot header definition
- use strmcp() for comparing NUL terminated header field against "gzip"
- limit %s consumption of the compression type field in case it is not
  NUL terminated
- add doc-comment description of unpack_efi_zboot_image()
- use special accessors for fields that may appear misaligned in memory
- avoid // commenting style
- sanity check payload offset and size fields before use

 hw/arm/boot.c   |  6 ++
 hw/core/loader.c| 90 
 include/hw/loader.h | 18 
 3 files changed, 114 insertions(+)

diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 1e021c4a340c7c61..50e5141116b9137d 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -926,6 +926,12 @@ static uint64_t load_aarch64_image(const char *filename, hwaddr mem_base,
 return -1;
 }
 size = len;
+
+/* Unpack the image if it is a EFI zboot image */
+if (unpack_efi_zboot_image(&buffer, &size) < 0) {
+g_free(buffer);
+return -1;
+}
 }
 
 /* check the arm64 magic header value -- very old kernels may not have it */
diff --git a/hw/core/loader.c b/hw/core/loader.c
index 173f8f67f6e3e79c..7a5e6bcc070b7bf8 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -857,6 +857,96 @@ ssize_t load_image_gzipped(const char *filename, hwaddr addr, uint64_t max_sz)
 return bytes;
 }
 
+/* The PE/COFF MS-DOS stub magic number */
+#define EFI_PE_MSDOS_MAGIC "MZ"
+
+/*
+ * The Linux header magic number for a EFI PE/COFF
+ * image targetting an unspecified architecture.
+ */
+#define EFI_PE_LINUX_MAGIC "\xcd\x23\x82\x81"
+
+/*
+ * Bootable Linux kernel images may be packaged as EFI zboot images, which are
+ * self-decompressing executables when loaded via EFI. The compressed payload
+ * can also be extracted from the image and decompressed by a non-EFI loader.
+ *
+ * The de facto specification for this format is at the following URL:
+ *
+ * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/firmware/efi/libstub/zboot-header.S
+ *
+ * This definition is based on Linux upstream commit 29636a5ce87beba.
+ */
+struct linux_efi_zboot_header {
+uint8_t msdos_magic[2]; /* PE/COFF 'MZ' magic number */
+uint8_t reserved0[2];
+uint8_t zimg[4]; /* "zimg" for Linux EFI zboot images */
+uint32_t payload_offset; /* LE offset to the compressed payload */
+uint32_t payload_size;   /* LE size of the compressed payload */
+uint8_t reserved1[8];
+char compression_type[32];   /* Compression type, NUL terminated */
+uint8_t linux_magic[4]; /* Linux header magic */
+uint32_t pe_header_offset;   /* LE offset to the PE header */
+};
+
+/*
+ * Check whether *buffer points to a Linux EFI zboot image in memory.
+ *
+ * If it does, attempt to decompress it to a new buffer, and free the old one.
+ * If any of this fails, return an error to the caller.
+ *
+ * If the image is not a Linux EFI zboot image, do nothing and return success.
+ */
+ssize_t unpack_efi_zboot_image(uint8_t **buffer, int *size)
+{
+const struct linux_efi_zboot_header *header;
+uint8_t *data = NULL;
+int ploff, plsize;
+ssize_t bytes;
+
+/* ignore if this is too small to be a EFI zboot image */
+if (*size < sizeof(*header)) {
+return 0;
+}
+
+header = (struct linux_efi_zboot_header *)*buffer;
+
+/* ignore if this is not a Linux EFI zboot image */
+if (memcmp(&header->msdos_magic, EFI_PE_MSDOS_MAGIC, 2) != 0 ||
+memcmp(&header->zimg, "zimg", 4) != 0 ||
+memcmp(&header->linux_magic, EFI_PE_LINUX_MAGIC, 4) != 0) {
+return 0;
+}
+
+if (strcmp(header->compression_type, "gzip") != 0) {
+fprintf(stderr, "unable to handle EFI zboot image with \"%.*s\" 
compression\n",
+(int)sizeof(header->compression_type) - 1,
+header->compression_type);
+

Re: [RFC PATCH] hw: arm: Support direct boot for Linux/arm64 EFI zboot images

2023-03-03 Thread Ard Biesheuvel
On Fri, 3 Mar 2023 at 15:25, Peter Maydell  wrote:
>
> On Thu, 23 Feb 2023 at 10:53, Ard Biesheuvel  wrote:
> >
> > Fedora 39 will ship its arm64 kernels in the new generic EFI zboot
> > format, using gzip compression for the payload.
> >
> > For doing EFI boot in QEMU, this is completely transparent, as the
> > firmware or bootloader will take care of this. However, for direct
> > kernel boot without firmware, we will lose the ability to boot such
> > distro kernels unless we deal with the new format directly.
> >
> > EFI zboot images contain metadata in the header regarding the placement
> > of the compressed payload inside the image, and the type of compression
> > used. This means we can wire up the existing gzip support without too
> > much hassle, by parsing the header and grabbing the payload from inside
> > the loaded zboot image.
>
> Seems reasonable to me. Any particular reason for marking the
> patch RFC ?
>

Nothing except for the fact that I contribute so rarely that I may
have violated some coding style rules inadvertently.

> > Cc: Peter Maydell 
> > Cc: Alex Bennée 
> > Cc: Richard Henderson 
> > Cc: Philippe Mathieu-Daudé 
> > Signed-off-by: Ard Biesheuvel 
> > ---
> >  hw/arm/boot.c   |  4 ++
> >  hw/core/loader.c| 64 
> >  include/hw/loader.h |  2 +
> >  3 files changed, 70 insertions(+)
> >
> > diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> > index 3d7d11f782feb5da..dc10a0788227443e 100644
> > --- a/hw/arm/boot.c
> > +++ b/hw/arm/boot.c
> > @@ -924,6 +924,10 @@ static uint64_t load_aarch64_image(const char 
> > *filename, hwaddr mem_base,
> >  size = len;
> >  }
> >
> > +if (unpack_efi_zboot_image(, )) {
> > +return -1;
> > +}
>
> It seems a bit odd that we will now accept a gzipped file, unzip
> it and then look inside it for the EFI zboot image that tells us
> to do a second unzip step. Is that intentional/useful?

No and no.

> If not, probably better to do something like "if this is an
> EFI zboot image, load-and-decompress, otherwise if a plain gzipped
> file, load-and-decompress, otherwise assume a raw file".
>
> > +
> >  /* check the arm64 magic header value -- very old kernels may not have 
> > it */
> >  if (size > ARM64_MAGIC_OFFSET + 4 &&
> >  memcmp(buffer + ARM64_MAGIC_OFFSET, "ARM\x64", 4) == 0) {
> > diff --git a/hw/core/loader.c b/hw/core/loader.c
> > index 173f8f67f6e3e79c..7e7f49261a309012 100644
> > --- a/hw/core/loader.c
> > +++ b/hw/core/loader.c
> > @@ -857,6 +857,70 @@ ssize_t load_image_gzipped(const char *filename, 
> > hwaddr addr, uint64_t max_sz)
> >  return bytes;
> >  }
>
> I assume there's a spec somewhere that defines the file format;
> this would be a good place for a comment giving a reference to it
> (URL, document name, etc).
>

It is de facto defined by the Linux kernel's EFI stub - I can link to
the right file here

> > +// The Linux header magic number for a EFI PE/COFF
> > +// image targetting an unspecified architecture.
> > +#define LINUX_EFI_PE_MAGIC"\xcd\x23\x82\x81"
> > +
> > +struct linux_efi_zboot_header {
> > +uint8_t msdos_magic[4]; // PE/COFF 'MZ' magic number
> > +uint8_t zimg[4];// "zimg" for Linux EFI zboot 
> > images
> > +uint32_tpayload_offset; // LE offset to the compressed 
> > payload
> > +uint32_tpayload_size;   // LE size of the compressed 
> > payload
> > +uint8_t reserved[8];
> > +charcompression_type[32];   // Compression type, e.g., "gzip"
> > +uint8_t linux_magic[4]; // Linux header magic
> > +uint32_tpe_header_offset;   // LE offset to the PE header
> > +};
>
> QEMU coding standard doesn't use '//' style comments.
>

OK

> > +
> > +/*
> > + * Check whether *buffer points to a Linux EFI zboot image in memory.
> > + *
> > + * If it does, attempt to decompress it to a new buffer, and free the old 
> > one.
> > + * If any of this fails, return an error to the caller.
> > + *
> > + * If the image is not a Linux EFI zboot image, do nothing and return 
> > success.
> > + */
> > +int unpack_efi_zboot_image(uint8_t **buffer, int *size)
> > +{
> > +const struct linux_efi_zboot_header *header;
> > +uint8_t *data = NULL;
> > +ssize_t bytes;
> > +
> > +/* ignore if this is too small t

[RFC PATCH] hw: arm: Support direct boot for Linux/arm64 EFI zboot images

2023-02-23 Thread Ard Biesheuvel
Fedora 39 will ship its arm64 kernels in the new generic EFI zboot
format, using gzip compression for the payload.

For doing EFI boot in QEMU, this is completely transparent, as the
firmware or bootloader will take care of this. However, for direct
kernel boot without firmware, we will lose the ability to boot such
distro kernels unless we deal with the new format directly.

EFI zboot images contain metadata in the header regarding the placement
of the compressed payload inside the image, and the type of compression
used. This means we can wire up the existing gzip support without too
much hassle, by parsing the header and grabbing the payload from inside
the loaded zboot image.

Cc: Peter Maydell 
Cc: Alex Bennée 
Cc: Richard Henderson 
Cc: Philippe Mathieu-Daudé 
Signed-off-by: Ard Biesheuvel 
---
 hw/arm/boot.c   |  4 ++
 hw/core/loader.c| 64 
 include/hw/loader.h |  2 +
 3 files changed, 70 insertions(+)

diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 3d7d11f782feb5da..dc10a0788227443e 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -924,6 +924,10 @@ static uint64_t load_aarch64_image(const char *filename, hwaddr mem_base,
 size = len;
 }
 
+if (unpack_efi_zboot_image(&buffer, &size)) {
+return -1;
+}
+
 /* check the arm64 magic header value -- very old kernels may not have it */
 if (size > ARM64_MAGIC_OFFSET + 4 &&
 memcmp(buffer + ARM64_MAGIC_OFFSET, "ARM\x64", 4) == 0) {
diff --git a/hw/core/loader.c b/hw/core/loader.c
index 173f8f67f6e3e79c..7e7f49261a309012 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -857,6 +857,70 @@ ssize_t load_image_gzipped(const char *filename, hwaddr addr, uint64_t max_sz)
 return bytes;
 }
 
+// The Linux header magic number for a EFI PE/COFF
+// image targetting an unspecified architecture.
+#define LINUX_EFI_PE_MAGIC "\xcd\x23\x82\x81"
+
+struct linux_efi_zboot_header {
+uint8_t msdos_magic[4]; // PE/COFF 'MZ' magic number
+uint8_t zimg[4];// "zimg" for Linux EFI zboot images
+uint32_tpayload_offset; // LE offset to the compressed payload
+uint32_tpayload_size;   // LE size of the compressed payload
+uint8_t reserved[8];
+charcompression_type[32];   // Compression type, e.g., "gzip"
+uint8_t linux_magic[4]; // Linux header magic
+uint32_tpe_header_offset;   // LE offset to the PE header
+};
+
+/*
+ * Check whether *buffer points to a Linux EFI zboot image in memory.
+ *
+ * If it does, attempt to decompress it to a new buffer, and free the old one.
+ * If any of this fails, return an error to the caller.
+ *
+ * If the image is not a Linux EFI zboot image, do nothing and return success.
+ */
+int unpack_efi_zboot_image(uint8_t **buffer, int *size)
+{
+const struct linux_efi_zboot_header *header;
+uint8_t *data = NULL;
+ssize_t bytes;
+
+/* ignore if this is too small to be a EFI zboot image */
+if (*size < sizeof(*header)) {
+return 0;
+}
+
+header = (struct linux_efi_zboot_header *)*buffer;
+
+/* ignore if this is not a Linux EFI zboot image */
+if (memcmp(&header->zimg, "zimg", 4) != 0 ||
+memcmp(&header->linux_magic, LINUX_EFI_PE_MAGIC, 4) != 0) {
+return 0;
+}
+
+if (strncmp(header->compression_type, "gzip", 4) != 0) {
+fprintf(stderr, "unable to handle EFI zboot image with \"%s\" 
compression\n",
+header->compression_type);
+return -1;
+}
+
+data = g_malloc(LOAD_IMAGE_MAX_GUNZIP_BYTES);
+bytes = gunzip(data, LOAD_IMAGE_MAX_GUNZIP_BYTES,
+   *buffer + le32_to_cpu(header->payload_offset),
+   le32_to_cpu(header->payload_size));
+if (bytes < 0) {
+fprintf(stderr, "failed to decompress EFI zboot image\n");
+g_free(data);
+return -1;
+}
+
+g_free(*buffer);
+*buffer = g_realloc(data, bytes);
+*size = bytes;
+return 0;
+}
+
 /*
  * Functions for reboot-persistent memory regions.
  *  - used for vga bios and option roms.
diff --git a/include/hw/loader.h b/include/hw/loader.h
index 70248e0da77908c1..d1092c8bfbd903c7 100644
--- a/include/hw/loader.h
+++ b/include/hw/loader.h
@@ -86,6 +86,8 @@ ssize_t load_image_gzipped_buffer(const char *filename, uint64_t max_sz,
   uint8_t **buffer);
 ssize_t load_image_gzipped(const char *filename, hwaddr addr, uint64_t max_sz);
 
+int unpack_efi_zboot_image(uint8_t **buffer, int *size);
+
 #define ELF_LOAD_FAILED   -1
 #define ELF_LOAD_NOT_ELF  -2
 #define ELF_LOAD_WRONG_ARCH   -3
-- 
2.39.1




Re: [PATCH] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block

2023-01-04 Thread Ard Biesheuvel
ence with KVM acceleration -- the DWORD accesses still work,
> despite "valid.max_access_size = 1".
>
> As commit 5d971f9e6725 suggests, fix the problem by raising
> "valid.max_access_size" to 4 -- the spec now clearly instructs the guest
> to perform DWORD accesses to the legacy register block too, for enabling
> (and verifying!) the modern block.  In order to keep compatibility for the
> device model implementation though, set "impl.max_access_size = 1", so
> that wide accesses be split before they reach the legacy read/write
> handlers, like they always have been on KVM, and like they were on TCG
> before 5d971f9e6725 (v5.1.0).
>
> Tested with:
>
> - OVMF IA32 + qemu-system-i386, CPU hotplug/hot-unplug with SMM,
>   intermixed with ACPI S3 suspend/resume, using KVM accel
>   (regression-test);
>
> - OVMF IA32X64 + qemu-system-x86_64, CPU hotplug/hot-unplug with SMM,
>   intermixed with ACPI S3 suspend/resume, using KVM accel
>   (regression-test);
>
> - OVMF IA32 + qemu-system-i386, SMM enabled, using TCG accel; verified the
>   register block switch and the present/possible CPU counting through the
>   modern hotplug interface, during OVMF boot (bugfix test);
>
> - I do not have any testcase (guest payload) for regression-testing CPU
>   hotplug through the *legacy* CPU hotplug register block.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Ani Sinha 
> Cc: Ard Biesheuvel 
> Cc: Igor Mammedov 
> Cc: Paolo Bonzini 
> Cc: Peter Maydell 
> Cc: qemu-sta...@nongnu.org
> Ref: "IO port write width clamping differs between TCG and KVM"
> Link: http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
> Link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
> Reported-by: Ard Biesheuvel 
> Signed-off-by: Laszlo Ersek 

Thanks for going down this rabbit hole.

With this patch applied, the QEMU IA32 regression that would only
manifest when using KVM now also happens in TCG mode.

Yay

Tested-by: Ard Biesheuvel 

> ---
>
> Notes:
> This should be applied to:
>
> - stable-5.2 (new branch)
>
> - stable-6.2 (new branch)
>
> - stable-7.2 (new branch)
>
> whichever is still considered maintained, as there is currently *no*
> public QEMU release in which the modern CPU hotplug register block
> works, when using TCG acceleration.  v5.0.0 works, but that minor
> release has been obsoleted by v5.2.0, which does not work.
>
>  hw/acpi/cpu_hotplug.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
> index 53654f863830..ff14c3f4106f 100644
> --- a/hw/acpi/cpu_hotplug.c
> +++ b/hw/acpi/cpu_hotplug.c
> @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
>  .endianness = DEVICE_LITTLE_ENDIAN,
>  .valid = {
>  .min_access_size = 1,
> +.max_access_size = 4,
> +},
> +.impl = {
>  .max_access_size = 1,
>  },
>  };



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2023-01-02 Thread Ard Biesheuvel
On Mon, 2 Jan 2023 at 14:37, Borislav Petkov  wrote:
>
> On Mon, Jan 02, 2023 at 10:32:03AM +0100, Ard Biesheuvel wrote:
> > So instead of appending data to the compressed image and assuming that
> > it will stay in place, create or extend a memory reservation
> > elsewhere, and refer to its absolute address in setup_data.
>
> From my limited experience with all those boot protocols, I'd say hardcoding
> stuff is always a bad idea. But, we already more or less hardcode, or rather
> codify through the setup header contract how stuff needs to get accessed.
>
> And yeah, maybe specifying an absolute address and size for a blob of data and
> putting that address and size in the setup header so that all the parties
> involved are where what is, is probably better.
>

Exactly. In the EFI case, this was problematic because we would need
to introduce a new way to pass memory reservations between QEMU and
the firmware. But I don't think that issue should affect legacy BIOS
boot, and we could just reserve the memory in the E820 table AFAIK.



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2023-01-02 Thread Ard Biesheuvel
On Mon, 2 Jan 2023 at 07:17, Borislav Petkov  wrote:
>
> On Mon, Jan 02, 2023 at 07:01:50AM +0100, Borislav Petkov wrote:
> > On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote:
> > > It would probably be a good idea to add a "maximum physical address for
> > > initrd/setup_data/cmdline" field to struct kernel_info, though. It appears
> > > right now that those fields are being identity-mapped in the decompressor,
> > > and that means that if 48-bit addressing is used, physical memory may 
> > > extend
> > > past the addressable range.
> >
> > Yeah, we will probably need that too.
> >
> > Btw, looka here - it can't get any more obvious than that after dumping
> > setup_data too:
> >
> > early console in setup code
> > early console in extract_kernel
> > input_data: 0x040f92bf
> > input_len: 0x00f1c325
> > output: 0x0100
> > output_len: 0x03c5e7d8
> > kernel_total_size: 0x04428000
> > needed_size: 0x0460
> > boot_params->hdr.setup_data: 0x010203b0
> > trampoline_32bit: 0x0009d000
> >
> > Decompressing Linux... Parsing ELF... done.
> > Booting the kernel.
> > 
> >
> > Aligning them vertically:
> >
> > output:   0x0100
> > output_len:   0x03c5e7d8
> > kernel_total_size:0x04428000
> > needed_size:  0x0460
> > boot_params->hdr.setup_data:  0x010203b0
>
> Ok, waait a minute:
>
> 
> Field name: pref_address
> Type:   read (reloc)
> Offset/size:0x258/8
> Protocol:   2.10+
> 
>
>   This field, if nonzero, represents a preferred load address for the
>   kernel.  A relocating bootloader should attempt to load at this
>   address if possible.
>
>   A non-relocatable kernel will unconditionally move itself and to run
>   at this address.
>
> so a kernel loader (qemu in this case) already knows where the kernel goes:
>
> boot_params->hdr.setup_data: 0x01020450
> boot_params->hdr.pref_address: 0x0100
> ^
>
> now, considering that same kernel loader (qemu) knows how big that kernel is:
>
> kernel_total_size: 0x04428000
>
> should that loader *not* put anything that the kernel will use in the range
>
> pref_addr + kernel_total_size
>

This seems to be related to another issue that was discussed in the
context of this change, but affecting EFI boot rather than legacy BIOS
boot [0].

So, in a nutshell, we have the following pieces:
- QEMU, which manages a directory of files and other data blobs, and
exposes them via its fw_cfg interface.
- SeaBIOS, which invokes the fw_cfg interface to load the 'kernel'
blob at its preferred address
- The boot code in the kernel, which interprets the various fields in
the setup header to figure out where the compressed image lives etc

So the problem here, which applies to SETUP_DTB as well as
SETUP_RNG_SEED, is that the internal file representation of the kernel
blob (which does not have an absolute address at this point, it's just
a file in the fw_cfg filesystem) is augmented with:
1) setup_data linked-list entries carrying absolute addresses that are
assumed to be valid once SeaBIOS loads the file to memory
2) DTB and/or RNG seed blobs appended to the compressed 'kernel' blob,
but without updating that file's internal metadata

Issue 1) is what broke EFI boot, given that EFI interprets the kernel
blob as a PE/COFF image and hands it to the Loadimage() boot service,
which has no awareness of boot_params or setup_data and so just
ignores it and loads the image at an arbitrary address, resulting in
setup_data absolute address values pointing to bogus places.

It seems that now, we have another issue 2), where the fw_cfg view of
the file size goes out of sync with the compressed image's own view of
its size.

As a fix for issue 1), we explored another solution, which was to
allocate fixed areas in memory for the RNG seed, so that the absolute
address added to setup_data is guaranteed to be correct regardless of
where the compressed image is loaded, but that was shot down for other
reasons, and we ended up enabling this feature only for legacy BIOS
boot. But apparently, this approach has other issues so perhaps it is
better to revisit that solution again.

So instead of appending data to the compressed image and assuming that
it will stay in place, create or extend a memory reservation
elsewhere, and refer to its absolute address in setup_data.
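
For reference, a setup_data node as defined by the x86 boot protocol looks
like this (sketched with stdint types; the kernel headers spell the fields
with __u64/__u32):

#include <stdint.h>

/* One node of the boot_params->hdr.setup_data singly linked list. 'next'
 * holds the physical address of the following node (0 terminates the list),
 * which is why every node must end up at an address that remains valid
 * regardless of where the loader places the kernel image itself.
 */
struct setup_data {
    uint64_t next;   /* physical address of the next node, or 0 */
    uint32_t type;   /* e.g. SETUP_DTB, SETUP_RNG_SEED */
    uint32_t len;    /* size of data[] in bytes */
    uint8_t  data[]; /* payload */
};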

-- 
Ard.


[0] 
https://lore.kernel.org/all/camj1kxfr6bv4_g0-wctu4fp_icrg060nhjx_j2dbnyifjky...@mail.gmail.com/



Re: [PATCH v2] pflash: Only read non-zero parts of backend image

2022-12-23 Thread Ard Biesheuvel
On Tue, 20 Dec 2022 at 16:33, Gerd Hoffmann  wrote:
>
> On Tue, Dec 20, 2022 at 10:30:43AM +0100, Philippe Mathieu-Daudé wrote:
> > [Extending to people using UEFI VARStore on Virt machines]
> >
> > On 20/12/22 09:42, Gerd Hoffmann wrote:
> > > From: Xiang Zheng 
> > >
> > > Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> > > when using persistent UEFI variables on virt board. Actually we only use
> > > a very small(non-zero) part of the memory while the rest significant
> > > large(zero) part of memory is wasted.
> > >
> > > So this patch checks the block status and only writes the non-zero part
> > > into memory. This requires pflash devices to use sparse files for
> > > backends.
> >
> > I like the idea, but I'm not sure how to relate with NOR flash devices.
> >
> > From the block layer, we get BDRV_BLOCK_ZERO when a block is fully
> > filled by zeroes ('\0').
> >
> > We don't want to waste host memory, I get it.
> >
> > Now what "sees" the guest? Is the UEFI VARStore filled with zeroes?
>
> The varstore is filled with 0xff.  It's 768k in size.  The padding
> following (63M plus a bit) is 0x00.  To be exact:
>
> kraxel@sirius ~# hex /usr/share/edk2/aarch64/vars-template-pflash.raw
>   00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  
> 0010  8d 2b f1 ff  96 76 8b 4c  a9 85 27 47  07 5b 4f 50  .+...v.L..'G.[OP
> 0020  00 00 0c 00  00 00 00 00  5f 46 56 48  ff fe 04 00  _FVH
> 0030  48 00 28 09  00 00 00 02  03 00 00 00  00 00 04 00  H.(.
> 0040  00 00 00 00  00 00 00 00  78 2c f3 aa  7b 94 9a 43  x,..{..C
> 0050  a1 80 2e 14  4e c3 77 92  b8 ff 03 00  5a fe 00 00  N.w.Z...
> 0060  00 00 00 00  ff ff ff ff  ff ff ff ff  ff ff ff ff  
> 0070  ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff  
> *
> 0004  2b 29 58 9e  68 7c 7d 49  a0 ce 65 00  fd 9f 1b 95  +)X.h|}I..e.
> 00040010  5b e7 c6 86  fe ff ff ff  e0 ff 03 00  00 00 00 00  [...
> 00040020  ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff  
> *
> 000c  00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  
> *
>
> > If so, is it a EDK2 specific case for all virt machines?  This would
> > be a virtualization optimization and in that case, this patch would
> > work.
>
> vars-template-pflash.raw (padded image) is simply QEMU_VARS.fd (unpadded
> image) with 'truncate --size 64M' applied.
>
> Yes, that's a pure virtual machine thing.  On physical hardware you
> would probably just flash the first 768k and leave the remaining flash
> capacity untouched.
>
> > * or you are trying to optimize paravirtualized guests.
>
> This.  Ideally without putting everything upside-down.
>
> >   In that case why insist with emulated NOR devices? Why not have EDK2
> >   directly use a paravirtualized block driver which we can optimize /
> >   tune without interfering with emulated models?
>
> While that probably would work for the variable store (I think we could
> very well do with variable store not being mapped and requiring explicit
> read/write requests) that idea is not going to work very well for the
> firmware code which must be mapped into the address space.  pflash is
> almost the only device we have which serves that need.  The only other
> option I can see would be a rom (the code is usually mapped r/o anyway),
> but that has pretty much the same problem space.  We would likewise want
> a big enough fixed size ROM, to avoid life migration problems and all
> that, and we want the unused space not waste memory.
>
> > Keeping insisting on optimizing guests using the QEMU pflash device
> > seems wrong to me. I'm pretty sure we can do better optimizing clouds
> > payloads.
>
> Moving away from pflash for efi variable storage would cause alot of
> churn through the whole stack.  firmware, qemu, libvirt, upper
> management, all affected.  Is that worth the trouble?  Using pflash
> isn't that much of a problem IMHO.
>

Agreed. pflash is a bit clunky but not a huge problem atm (although
setting up and tearing down the r/o memslot for every read and write, respectively,
results in some performance issues under kvm/arm64)

*If* we decide to replace it, I would suggest an emulated ROM for the
executable image (without any emulated programming facility
whatsoever) and a paravirtualized get/setvariable interface which can
be used in a sane way to virtualize secure boot without having to
emulate SMM or other secure world firmware interfaces.



Re: regression: insmod module failed in VM with nvdimm on

2022-12-02 Thread Ard Biesheuvel
On Fri, 2 Dec 2022 at 03:48, chenxiang (M)  wrote:
>
> Hi Ard,
>
>
> On 2022/12/1 19:07, Ard Biesheuvel wrote:
> > On Thu, 1 Dec 2022 at 09:07, Ard Biesheuvel  wrote:
> >> On Thu, 1 Dec 2022 at 08:15, chenxiang (M)  
> >> wrote:
> >>> Hi Ard,
> >>>
> >>>
> >>> On 2022/11/30 16:18, Ard Biesheuvel wrote:
> >>>> On Wed, 30 Nov 2022 at 08:53, Marc Zyngier  wrote:
> >>>>> On Wed, 30 Nov 2022 02:52:35 +,
> >>>>> "chenxiang (M)"  wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> We boot the VM using following commands (with nvdimm on)  (qemu
> >>>>>> version 6.1.50, kernel 6.0-r4):
> >>>>> How relevant is the presence of the nvdimm? Do you observe the failure
> >>>>> without this?
> >>>>>
> >>>>>> qemu-system-aarch64 -machine
> >>>>>> virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
> >>>>>> /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
> >>>>>> /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
> >>>>>> 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
> >>>>>> ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
> >>>>>> -object memory-backend-ram,id=ram1,size=10G -device
> >>>>>> nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
> >>>>>> -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1
> >>>>>>
> >>>>>> Then in VM we insmod a module, vmalloc error occurs as follows (kernel
> >>>>>> 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4):
> >>>>>>
> >>>>>> estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
> >>>>>> [8.186563] vmap allocation for size 20480 failed: use
> >>>>>> vmalloc= to increase size
> >>>>> Have you tried increasing the vmalloc size to check that this is
> >>>>> indeed the problem?
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>> We git bisect the code, and find the patch c5a89f75d2a ("arm64: kaslr:
> >>>>>> defer initialization to initcall where permitted").
> >>>>> I guess you mean commit fc5a89f75d2a instead, right?
> >>>>>
> >>>>>> Do you have any idea about the issue?
> >>>>> I sort of suspect that the nvdimm gets vmap-ed and consumes a large
> >>>>> portion of the vmalloc space, but you give very little information
> >>>>> that could help here...
> >>>>>
> >>>> Ouch. I suspect what's going on here: that patch defers the
> >>>> randomization of the module region, so that we can decouple it from
> >>>> the very early init code.
> >>>>
> >>>> Obviously, it is happening too late now, and the randomized module
> >>>> region is overlapping with a vmalloc region that is in use by the time
> >>>> the randomization occurs.
> >>>>
> >>>> Does the below fix the issue?
> >>> The issue still occurs, but it seems decrease the probability, before it
> >>> occured almost every time, after the change, i tried 2-3 times, and it
> >>> occurs.
> >>> But i change back "subsys_initcall" to "core_initcall", and i test more
> >>> than 20 times, and it is still ok.
> >>>
> >> Thank you for confirming. I will send out a patch today.
> >>
> > ...but before I do that, could you please check whether the change
> > below fixes your issue as well?
> >
> > diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
> > index 6ccc7ef600e7c1e1..c8c205b630da1951 100644
> > --- a/arch/arm64/kernel/kaslr.c
> > +++ b/arch/arm64/kernel/kaslr.c
> > @@ -20,7 +20,11 @@
> >   #include 
> >   #include 
> >
> > -u64 __ro_after_init module_alloc_base;
> > +/*
> > + * Set a reasonable default for module_alloc_base in case
> > + * we end up running with module randomization disabled.
> > + */
> > +u64 __ro_after_init module_alloc_base = (u64)_etext - MODULES_VSIZE;
> >   u16 __initdata memstart_offset_seed;
> >
> >   struct arm64_ftr_override kaslr_feature_override __initdata;
> > @@ -30,12 +34,6 @@ static int __init kaslr_init(void)
> >  u64 module_range;
> >  u32 seed;
> >
> > -   /*
> > -* Set a reasonable default for module_alloc_base in case
> > -* we end up running with module randomization disabled.
> > -*/
> > -   module_alloc_base = (u64)_etext - MODULES_VSIZE;
> > -
> >  if (kaslr_feature_override.val & kaslr_feature_override.mask & 
> > 0xf) {
> >  pr_info("KASLR disabled on command line\n");
> >  return 0;
> > .
>
> We have tested this change, the issue is still and it doesn't fix the issue.
>

Thanks for the report.



Re: regression: insmod module failed in VM with nvdimm on

2022-12-01 Thread Ard Biesheuvel
On Thu, 1 Dec 2022 at 13:07, chenxiang (M)  wrote:
>
>
>
> On 2022/12/1 19:07, Ard Biesheuvel wrote:
> > On Thu, 1 Dec 2022 at 09:07, Ard Biesheuvel  wrote:
> >> On Thu, 1 Dec 2022 at 08:15, chenxiang (M)  
> >> wrote:
> >>> Hi Ard,
> >>>
> >>>
> >>> On 2022/11/30 16:18, Ard Biesheuvel wrote:
> >>>> On Wed, 30 Nov 2022 at 08:53, Marc Zyngier  wrote:
> >>>>> On Wed, 30 Nov 2022 02:52:35 +,
> >>>>> "chenxiang (M)"  wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> We boot the VM using following commands (with nvdimm on)  (qemu
> >>>>>> version 6.1.50, kernel 6.0-r4):
> >>>>> How relevant is the presence of the nvdimm? Do you observe the failure
> >>>>> without this?
> >>>>>
> >>>>>> qemu-system-aarch64 -machine
> >>>>>> virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
> >>>>>> /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
> >>>>>> /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
> >>>>>> 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
> >>>>>> ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
> >>>>>> -object memory-backend-ram,id=ram1,size=10G -device
> >>>>>> nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
> >>>>>> -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1
> >>>>>>
> >>>>>> Then in VM we insmod a module, vmalloc error occurs as follows (kernel
> >>>>>> 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4):
> >>>>>>
> >>>>>> estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
> >>>>>> [8.186563] vmap allocation for size 20480 failed: use
> >>>>>> vmalloc= to increase size
> >>>>> Have you tried increasing the vmalloc size to check that this is
> >>>>> indeed the problem?
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>> We git bisect the code, and find the patch c5a89f75d2a ("arm64: kaslr:
> >>>>>> defer initialization to initcall where permitted").
> >>>>> I guess you mean commit fc5a89f75d2a instead, right?
> >>>>>
> >>>>>> Do you have any idea about the issue?
> >>>>> I sort of suspect that the nvdimm gets vmap-ed and consumes a large
> >>>>> portion of the vmalloc space, but you give very little information
> >>>>> that could help here...
> >>>>>
> >>>> Ouch. I suspect what's going on here: that patch defers the
> >>>> randomization of the module region, so that we can decouple it from
> >>>> the very early init code.
> >>>>
> >>>> Obviously, it is happening too late now, and the randomized module
> >>>> region is overlapping with a vmalloc region that is in use by the time
> >>>> the randomization occurs.
> >>>>
> >>>> Does the below fix the issue?
> >>> The issue still occurs, but it seems decrease the probability, before it
> >>> occured almost every time, after the change, i tried 2-3 times, and it
> >>> occurs.
> >>> But i change back "subsys_initcall" to "core_initcall", and i test more
> >>> than 20 times, and it is still ok.
> >>>
> >> Thank you for confirming. I will send out a patch today.
> >>
> > ...but before I do that, could you please check whether the change
> > below fixes your issue as well?
>
> Yes, but i can only reply to you tomorrow as other guy is testing on the
> only environment today.
>

That is fine, thanks.



Re: regression: insmod module failed in VM with nvdimm on

2022-12-01 Thread Ard Biesheuvel
On Thu, 1 Dec 2022 at 09:07, Ard Biesheuvel  wrote:
>
> On Thu, 1 Dec 2022 at 08:15, chenxiang (M)  wrote:
> >
> > Hi Ard,
> >
> >
> > > On 2022/11/30 16:18, Ard Biesheuvel wrote:
> > > On Wed, 30 Nov 2022 at 08:53, Marc Zyngier  wrote:
> > >> On Wed, 30 Nov 2022 02:52:35 +,
> > >> "chenxiang (M)"  wrote:
> > >>> Hi,
> > >>>
> > >>> We boot the VM using following commands (with nvdimm on)  (qemu
> > >>> version 6.1.50, kernel 6.0-r4):
> > >> How relevant is the presence of the nvdimm? Do you observe the failure
> > >> without this?
> > >>
> > >>> qemu-system-aarch64 -machine
> > >>> virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
> > >>> /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
> > >>> /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
> > >>> 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
> > >>> ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
> > >>> -object memory-backend-ram,id=ram1,size=10G -device
> > >>> nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
> > >>> -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1
> > >>>
> > >>> Then in VM we insmod a module, vmalloc error occurs as follows (kernel
> > >>> 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4):
> > >>>
> > >>> estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
> > >>> [8.186563] vmap allocation for size 20480 failed: use
> > >>> vmalloc= to increase size
> > >> Have you tried increasing the vmalloc size to check that this is
> > >> indeed the problem?
> > >>
> > >> [...]
> > >>
> > >>> We git bisect the code, and find the patch c5a89f75d2a ("arm64: kaslr:
> > >>> defer initialization to initcall where permitted").
> > >> I guess you mean commit fc5a89f75d2a instead, right?
> > >>
> > >>> Do you have any idea about the issue?
> > >> I sort of suspect that the nvdimm gets vmap-ed and consumes a large
> > >> portion of the vmalloc space, but you give very little information
> > >> that could help here...
> > >>
> > > Ouch. I suspect what's going on here: that patch defers the
> > > randomization of the module region, so that we can decouple it from
> > > the very early init code.
> > >
> > > Obviously, it is happening too late now, and the randomized module
> > > region is overlapping with a vmalloc region that is in use by the time
> > > the randomization occurs.
> > >
> > > Does the below fix the issue?
> >
> > The issue still occurs, but the change seems to decrease the probability:
> > before, it occurred almost every time; after the change, I tried 2-3 times
> > and it still occurs.
> > But if I change "subsys_initcall" back to "core_initcall" and test more
> > than 20 times, it is still OK.
> >
>
> Thank you for confirming. I will send out a patch today.
>

...but before I do that, could you please check whether the change
below fixes your issue as well?

diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 6ccc7ef600e7c1e1..c8c205b630da1951 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -20,7 +20,11 @@
 #include 
 #include 

-u64 __ro_after_init module_alloc_base;
+/*
+ * Set a reasonable default for module_alloc_base in case
+ * we end up running with module randomization disabled.
+ */
+u64 __ro_after_init module_alloc_base = (u64)_etext - MODULES_VSIZE;
 u16 __initdata memstart_offset_seed;

 struct arm64_ftr_override kaslr_feature_override __initdata;
@@ -30,12 +34,6 @@ static int __init kaslr_init(void)
u64 module_range;
u32 seed;

-   /*
-* Set a reasonable default for module_alloc_base in case
-* we end up running with module randomization disabled.
-*/
-   module_alloc_base = (u64)_etext - MODULES_VSIZE;
-
if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
pr_info("KASLR disabled on command line\n");
return 0;



Re: regression: insmod module failed in VM with nvdimm on

2022-12-01 Thread Ard Biesheuvel
On Thu, 1 Dec 2022 at 08:15, chenxiang (M)  wrote:
>
> Hi Ard,
>
>
> > On 2022/11/30 16:18, Ard Biesheuvel wrote:
> > On Wed, 30 Nov 2022 at 08:53, Marc Zyngier  wrote:
> >> On Wed, 30 Nov 2022 02:52:35 +,
> >> "chenxiang (M)"  wrote:
> >>> Hi,
> >>>
> >>> We boot the VM using following commands (with nvdimm on)  (qemu
> >>> version 6.1.50, kernel 6.0-r4):
> >> How relevant is the presence of the nvdimm? Do you observe the failure
> >> without this?
> >>
> >>> qemu-system-aarch64 -machine
> >>> virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
> >>> /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
> >>> /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
> >>> 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
> >>> ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
> >>> -object memory-backend-ram,id=ram1,size=10G -device
> >>> nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
> >>> -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1
> >>>
> >>> Then in VM we insmod a module, vmalloc error occurs as follows (kernel
> >>> 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4):
> >>>
> >>> estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
> >>> [8.186563] vmap allocation for size 20480 failed: use
> >>> vmalloc= to increase size
> >> Have you tried increasing the vmalloc size to check that this is
> >> indeed the problem?
> >>
> >> [...]
> >>
> >>> We git bisect the code, and find the patch c5a89f75d2a ("arm64: kaslr:
> >>> defer initialization to initcall where permitted").
> >> I guess you mean commit fc5a89f75d2a instead, right?
> >>
> >>> Do you have any idea about the issue?
> >> I sort of suspect that the nvdimm gets vmap-ed and consumes a large
> >> portion of the vmalloc space, but you give very little information
> >> that could help here...
> >>
> > Ouch. I suspect what's going on here: that patch defers the
> > randomization of the module region, so that we can decouple it from
> > the very early init code.
> >
> > Obviously, it is happening too late now, and the randomized module
> > region is overlapping with a vmalloc region that is in use by the time
> > the randomization occurs.
> >
> > Does the below fix the issue?
>
> The issue still occurs, but the change seems to decrease the probability:
> before, it occurred almost every time; after the change, I tried 2-3 times
> and it still occurs.
> But if I change "subsys_initcall" back to "core_initcall" and test more
> than 20 times, it is still OK.
>

Thank you for confirming. I will send out a patch today.

> >
> > diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
> > index 37a9deed2aec..71fb18b2f304 100644
> > --- a/arch/arm64/kernel/kaslr.c
> > +++ b/arch/arm64/kernel/kaslr.c
> > @@ -90,4 +90,4 @@ static int __init kaslr_init(void)
> >
> >  return 0;
> >   }
> > -subsys_initcall(kaslr_init)
> > +arch_initcall(kaslr_init)
> > .
> >
>



Re: regression: insmod module failed in VM with nvdimm on

2022-11-30 Thread Ard Biesheuvel
On Wed, 30 Nov 2022 at 08:53, Marc Zyngier  wrote:
>
> On Wed, 30 Nov 2022 02:52:35 +,
> "chenxiang (M)"  wrote:
> >
> > Hi,
> >
> > We boot the VM using following commands (with nvdimm on)  (qemu
> > version 6.1.50, kernel 6.0-r4):
>
> How relevant is the presence of the nvdimm? Do you observe the failure
> without this?
>
> >
> > qemu-system-aarch64 -machine
> > virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
> > /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
> > /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
> > 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
> > ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
> > -object memory-backend-ram,id=ram1,size=10G -device
> > nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
> > -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1
> >
> > Then in VM we insmod a module, vmalloc error occurs as follows (kernel
> > 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4):
> >
> > estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
> > [8.186563] vmap allocation for size 20480 failed: use
> > vmalloc= to increase size
>
> Have you tried increasing the vmalloc size to check that this is
> indeed the problem?
>
> [...]
>
> > We git bisect the code, and find the patch c5a89f75d2a ("arm64: kaslr:
> > defer initialization to initcall where permitted").
>
> I guess you mean commit fc5a89f75d2a instead, right?
>
> > Do you have any idea about the issue?
>
> I sort of suspect that the nvdimm gets vmap-ed and consumes a large
> portion of the vmalloc space, but you give very little information
> that could help here...
>

Ouch. I suspect what's going on here: that patch defers the
randomization of the module region, so that we can decouple it from
the very early init code.

Obviously, it is happening too late now, and the randomized module
region is overlapping with a vmalloc region that is in use by the time
the randomization occurs.

Does the below fix the issue?

diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 37a9deed2aec..71fb18b2f304 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -90,4 +90,4 @@ static int __init kaslr_init(void)

return 0;
 }
-subsys_initcall(kaslr_init)
+arch_initcall(kaslr_init)



[PATCH v2] target/arm: Use signed quantity to represent VMSAv8-64 translation level

2022-11-22 Thread Ard Biesheuvel
The LPA2 extension implements 52-bit virtual addressing for 4k and 16k
translation granules, and for the former, this means an additional level
of translation is needed. This means we start counting at -1 instead of
0 when doing a walk, and so 'level' is now a signed quantity, and should
be typed as such. So turn it from uint32_t into int32_t.

This avoids a level of -1 getting misinterpreted as being >= 3, and
terminating a page table walk prematurely with a bogus output address.

Cc: Peter Maydell 
Cc: Philippe Mathieu-Daudé 
Cc: Richard Henderson 
Signed-off-by: Ard Biesheuvel 
---
 target/arm/ptw.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 9a6277d862fac229..1d9bb4448761ddf4 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -1172,7 +1172,7 @@ static bool get_phys_addr_lpae(CPUARMState *env, 
S1Translate *ptw,
 ARMCPU *cpu = env_archcpu(env);
 ARMMMUIdx mmu_idx = ptw->in_mmu_idx;
 bool is_secure = ptw->in_secure;
-uint32_t level;
+int32_t level;
 ARMVAParameters param;
 uint64_t ttbr;
 hwaddr descaddr, indexmask, indexmask_grainsize;
@@ -1302,7 +1302,7 @@ static bool get_phys_addr_lpae(CPUARMState *env, 
S1Translate *ptw,
  */
 uint32_t sl0 = extract32(tcr, 6, 2);
 uint32_t sl2 = extract64(tcr, 33, 1);
-uint32_t startlevel;
+int32_t startlevel;
 bool ok;
 
 /* SL2 is RES0 unless DS=1 & 4kb granule. */
-- 
2.35.1




Re: [PATCH] target/arm: Use signed quantity to represent VMSAv8-64 translation level

2022-11-22 Thread Ard Biesheuvel
On Tue, 22 Nov 2022 at 14:21, Peter Maydell  wrote:
>
> On Mon, 21 Nov 2022 at 19:02, Ard Biesheuvel  wrote:
> >
> > On Mon, 21 Nov 2022 at 19:51, Peter Maydell  
> > wrote:
> > >
> > > On Mon, 21 Nov 2022 at 17:43, Ard Biesheuvel  wrote:
> > > >
> > > > The LPA2 extension implements 52-bit virtual addressing for 4k and 16k
> > > > translation granules, and for the former, this means an additional level
> > > > of translation is needed. This means we start counting at -1 instead of
> > > > 0 when doing a walk, and so 'level' is now a signed quantity, and should
> > > > be typed as such. So turn it from uint32_t into int32_t.
> > > >
> > >
> > > Does this cause any visible wrong behaviour, or is it just
> > > a cleanup thing ?
> > >
> >
> > No, 5 level paging is completely broken because of this, given that
> > the 'level < 3' tests give the wrong result for (uint32_t)-1
>
> Right, thanks. This seems like a bug worth fixing for 7.2.
>

Indeed. And the other patch I sent is needed too if you want to run with LPA2

'target/arm: Limit LPA2 effective output address when TCR.DS == 0'

In case it is useful, I have a WIP kernel branch here which can be
built with 52-bit virtual addressing for 4k or 16k pages.

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=arm64-4k-lpa2


> We should make 'uint32_t startlevel' also an int32_t
> for consistency, I think, given that it is also sometimes
> negative, though in that case it doesn't get used in any
> comparisons so it's not going to cause wrong behaviour.
>

Indeed. I'll send a v2 and fold that in.



Re: [PATCH] target/arm: Use signed quantity to represent VMSAv8-64 translation level

2022-11-21 Thread Ard Biesheuvel
On Mon, 21 Nov 2022 at 19:51, Peter Maydell  wrote:
>
> On Mon, 21 Nov 2022 at 17:43, Ard Biesheuvel  wrote:
> >
> > The LPA2 extension implements 52-bit virtual addressing for 4k and 16k
> > translation granules, and for the former, this means an additional level
> > of translation is needed. This means we start counting at -1 instead of
> > 0 when doing a walk, and so 'level' is now a signed quantity, and should
> > be typed as such. So turn it from uint32_t into int32_t.
> >
>
> Does this cause any visible wrong behaviour, or is it just
> a cleanup thing ?
>

No, 5 level paging is completely broken because of this, given that
the 'level < 3' tests give the wrong result for (uint32_t)-1
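(As a minimal illustration only -- this is not QEMU code, and the variable
names are made up for the example -- the unsigned type makes -1 compare as a
huge level:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t ulevel = -1;   /* wraps to 0xffffffff */
    int32_t  slevel = -1;

    /*
     * With an unsigned 'level', checks of the form 'level < 3' treat -1 as
     * an enormous positive value; with a signed type they behave as intended.
     */
    printf("unsigned: level < 3 is %s\n", ulevel < 3 ? "true" : "false"); /* false */
    printf("signed:   level < 3 is %s\n", slevel < 3 ? "true" : "false"); /* true */
    return 0;
}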



[PATCH] target/arm: Use signed quantity to represent VMSAv8-64 translation level

2022-11-21 Thread Ard Biesheuvel
The LPA2 extension implements 52-bit virtual addressing for 4k and 16k
translation granules, and for the former, this means an additional level
of translation is needed. This means we start counting at -1 instead of
0 when doing a walk, and so 'level' is now a signed quantity, and should
be typed as such. So turn it from uint32_t into int32_t.

Cc: Peter Maydell 
Cc: Philippe Mathieu-Daudé 
Cc: Richard Henderson 
Signed-off-by: Ard Biesheuvel 
---
 target/arm/ptw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 3745ac9723..6d6992580a 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -1172,7 +1172,7 @@ static bool get_phys_addr_lpae(CPUARMState *env, 
S1Translate *ptw,
 ARMCPU *cpu = env_archcpu(env);
 ARMMMUIdx mmu_idx = ptw->in_mmu_idx;
 bool is_secure = ptw->in_secure;
-uint32_t level;
+int32_t level;
 ARMVAParameters param;
 uint64_t ttbr;
 hwaddr descaddr, indexmask, indexmask_grainsize;
-- 
2.35.1




[PATCH] target/arm: Limit LPA2 effective output address when TCR.DS == 0

2022-11-16 Thread Ard Biesheuvel
With LPA2, the effective output address size is at most 48 bits when
TCR.DS == 0. This case is currently unhandled in the page table walker,
where we happily assume LVA/64k granule when outputsize > 48 and
param.ds == 0, resulting in the wrong conversion being used from a
page table descriptor to a physical address.

    if (outputsize > 48) {
        if (param.ds) {
            descaddr |= extract64(descriptor, 8, 2) << 50;
        } else {
            descaddr |= extract64(descriptor, 12, 4) << 48;
        }

So cap the outputsize to 48 when TCR.DS is cleared, as per the
architecture.

Cc: Peter Maydell 
Cc: Philippe Mathieu-Daudé 
Cc: Richard Henderson 
Signed-off-by: Ard Biesheuvel 
---
 target/arm/ptw.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 3745ac9723474332..9a6277d862fac229 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -1222,6 +1222,14 @@ static bool get_phys_addr_lpae(CPUARMState *env, 
S1Translate *ptw,
 ps = MIN(ps, param.ps);
 assert(ps < ARRAY_SIZE(pamax_map));
 outputsize = pamax_map[ps];
+
+/*
+ * With LPA2, the effective output address (OA) size is at most 48 bits
+ * unless TCR.DS == 1
+ */
+if (!param.ds && param.gran != Gran64K) {
+outputsize = MIN(outputsize, 48);
+}
 } else {
 param = aa32_va_parameters(env, address, mmu_idx);
 level = 1;
-- 
2.35.1




Re: [PATCH v4 1/2] x86: return modified setup_data only if read as memory, not as file

2022-09-16 Thread Ard Biesheuvel
On Wed, 14 Sept 2022 at 01:42, Jason A. Donenfeld  wrote:
>
> If setup_data is being read into a specific memory location, then
> generally the setup_data address parameter is read first, so that the
> caller knows where to read it into. In that case, we should return
> setup_data containing the absolute addresses that are hard coded and
> determined a priori. This is the case when kernels are loaded by BIOS,
> for example. In contrast, when setup_data is read as a file, then we
> shouldn't modify setup_data, since the absolute address will be wrong by
> definition. This is the case when OVMF loads the image.
>
> This allows setup_data to be used like normal, without crashing when EFI
> tries to use it.
>
> (As a small development note, strangely, fw_cfg_add_file_callback() was
> exported but fw_cfg_add_bytes_callback() wasn't, so this makes that
> consistent.)
>
> Cc: Gerd Hoffmann 
> Cc: Laurent Vivier 
> Cc: Michael S. Tsirkin 
> Cc: Paolo Bonzini 
> Cc: Peter Maydell 
> Cc: Philippe Mathieu-Daudé 
> Cc: Richard Henderson 
> Suggested-by: Ard Biesheuvel 
> Signed-off-by: Jason A. Donenfeld 

Reviewed-by: Ard Biesheuvel 

This is still somewhat of a crutch, but at least we can now
disambiguate between loaders that treat the setup data as a file
(OVMF) and ones that treat it as an object that lives at a fixed
address in memory (SeaBIOS)

I'll note that this also addresses the existing issue with -dtb on
x86, which currently breaks the OVMF direct kernel boot in the same
way as the RNG seed does.

> ---
>  hw/i386/x86.c | 37 +++--
>  hw/nvram/fw_cfg.c | 12 ++--
>  include/hw/nvram/fw_cfg.h | 22 ++
>  3 files changed, 55 insertions(+), 16 deletions(-)
>
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 050eedc0c8..933bbdd836 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -764,6 +764,18 @@ static bool load_elfboot(const char *kernel_filename,
>  return true;
>  }
>
> +struct setup_data_fixup {
> +void *pos;
> +hwaddr val;
> +uint32_t addr;
> +};
> +
> +static void fixup_setup_data(void *opaque)
> +{
> +struct setup_data_fixup *fixup = opaque;
> +stq_p(fixup->pos, fixup->val);
> +}
> +
>  void x86_load_linux(X86MachineState *x86ms,
>  FWCfgState *fw_cfg,
>  int acpi_data_size,
> @@ -1088,8 +1100,11 @@ void x86_load_linux(X86MachineState *x86ms,
>  qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
>  }
>
> -/* Offset 0x250 is a pointer to the first setup_data link. */
> -stq_p(header + 0x250, first_setup_data);
> +fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
> +fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
> +fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
> +sev_load_ctx.kernel_data = (char *)kernel;
> +sev_load_ctx.kernel_size = kernel_size;
>
>  /*
>   * If we're starting an encrypted VM, it will be OVMF based, which uses 
> the
> @@ -1099,16 +1114,18 @@ void x86_load_linux(X86MachineState *x86ms,
>   * file the user passed in.
>   */
>  if (!sev_enabled()) {
> +struct setup_data_fixup *fixup = g_malloc(sizeof(*fixup));
> +
>  memcpy(setup, header, MIN(sizeof(header), setup_size));
> +/* Offset 0x250 is a pointer to the first setup_data link. */
> +fixup->pos = setup + 0x250;
> +fixup->val = first_setup_data;
> +fixup->addr = real_addr;
> +fw_cfg_add_bytes_callback(fw_cfg, FW_CFG_SETUP_ADDR, 
> fixup_setup_data, NULL,
> +  fixup, >addr, sizeof(fixup->addr), 
> true);
> +} else {
> +fw_cfg_add_i32(fw_cfg, FW_CFG_SETUP_ADDR, real_addr);
>  }
> -
> -fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
> -fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
> -fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
> -sev_load_ctx.kernel_data = (char *)kernel;
> -sev_load_ctx.kernel_size = kernel_size;
> -
> -fw_cfg_add_i32(fw_cfg, FW_CFG_SETUP_ADDR, real_addr);
>  fw_cfg_add_i32(fw_cfg, FW_CFG_SETUP_SIZE, setup_size);
>  fw_cfg_add_bytes(fw_cfg, FW_CFG_SETUP_DATA, setup, setup_size);
>  sev_load_ctx.setup_data = (char *)setup;
> diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
> index d605f3f45a..564bda3395 100644
> --- a/hw/nvram/fw_cfg.c
> +++ b/hw/nvram/fw_cfg.c
> @@ -692,12 +692,12 @@ static const VMStateDescription vmstate_fw_cfg = {
>  }
>  };
>
> -static void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t key,
> -   

Re: [PATCH v2 1/2] x86: only modify setup_data if the boot protocol indicates safety

2022-09-08 Thread Ard Biesheuvel
On Thu, 8 Sept 2022 at 13:30, Laszlo Ersek  wrote:
>
> On 09/06/22 13:33, Daniel P. Berrangé wrote:
> > On Tue, Sep 06, 2022 at 01:14:50PM +0200, Ard Biesheuvel wrote:
> >> (cc Laszlo)
> >>
> >> On Tue, 6 Sept 2022 at 12:45, Michael S. Tsirkin  wrote:
> >>>
> >>> On Tue, Sep 06, 2022 at 12:43:55PM +0200, Jason A. Donenfeld wrote:
> >>>> On Tue, Sep 6, 2022 at 12:40 PM Michael S. Tsirkin  
> >>>> wrote:
> >>>>>
> >>>>> On Tue, Sep 06, 2022 at 12:36:56PM +0200, Jason A. Donenfeld wrote:
> >>>>>> It's only safe to modify the setup_data pointer on newer kernels where
> >>>>>> the EFI stub loader will ignore it. So condition setting that offset on
> >>>>>> the newer boot protocol version. While we're at it, gate this on SEV 
> >>>>>> too.
> >>>>>> This depends on the kernel commit linked below going upstream.
> >>>>>>
> >>>>>> Cc: Gerd Hoffmann 
> >>>>>> Cc: Laurent Vivier 
> >>>>>> Cc: Michael S. Tsirkin 
> >>>>>> Cc: Paolo Bonzini 
> >>>>>> Cc: Peter Maydell 
> >>>>>> Cc: Philippe Mathieu-Daudé 
> >>>>>> Cc: Richard Henderson 
> >>>>>> Cc: Ard Biesheuvel 
> >>>>>> Link: 
> >>>>>> https://lore.kernel.org/linux-efi/20220904165321.1140894-1-ja...@zx2c4.com/
> >>>>>> Signed-off-by: Jason A. Donenfeld 
> >>>>>
> >>>>> BTW what does it have to do with SEV?
> >>>>> Is this because SEV is not going to trust the data to be random anyway?
> >>>>
> >>>> Daniel (now CC'd) pointed out in one of the previous threads that this
> >>>> breaks SEV, because the image hash changes.
> >>>>
> >>>> Jason
> >>>
> >>> Oh I see. I'd add a comment maybe and definitely mention this
> >>> in the commit log.
> >>>
> >>
> >> This does raise the question (as I mentioned before) how things like
> >> secure boot and measured boot are affected when combined with direct
> >> kernel boot: AIUI, libvirt uses direct kernel boot at guest
> >> installation time, and modifying setup_data will corrupt the image
> >> signature.
> >
> > IIUC, qemu already modifies setup_data when using direct kernel boot.
> >
> > It put in logic to skip this if SEV is enabled, to avoid interfering
> > with SEV hashes over the kernel, but there's nothing doing this more
> > generally for non-SEV cases using UEFI. So potentially use of SecureBoot
> > may already be impacted when using direct kernel boot.
>
> Yes,
>
> https://github.com/tianocore/edk2/commit/82808b422617
>

Ah yes, thanks for jogging my memory.

So virt-install --network already ignores secure boot failures on
direct kernel boot, so this is not going to make it any worse.



Re: [PATCH v2 1/2] x86: only modify setup_data if the boot protocol indicates safety

2022-09-06 Thread Ard Biesheuvel
(cc Laszlo)

On Tue, 6 Sept 2022 at 12:45, Michael S. Tsirkin  wrote:
>
> On Tue, Sep 06, 2022 at 12:43:55PM +0200, Jason A. Donenfeld wrote:
> > On Tue, Sep 6, 2022 at 12:40 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Sep 06, 2022 at 12:36:56PM +0200, Jason A. Donenfeld wrote:
> > > > It's only safe to modify the setup_data pointer on newer kernels where
> > > > the EFI stub loader will ignore it. So condition setting that offset on
> > > > the newer boot protocol version. While we're at it, gate this on SEV 
> > > > too.
> > > > This depends on the kernel commit linked below going upstream.
> > > >
> > > > Cc: Gerd Hoffmann 
> > > > Cc: Laurent Vivier 
> > > > Cc: Michael S. Tsirkin 
> > > > Cc: Paolo Bonzini 
> > > > Cc: Peter Maydell 
> > > > Cc: Philippe Mathieu-Daudé 
> > > > Cc: Richard Henderson 
> > > > Cc: Ard Biesheuvel 
> > > > Link: 
> > > > https://lore.kernel.org/linux-efi/20220904165321.1140894-1-ja...@zx2c4.com/
> > > > Signed-off-by: Jason A. Donenfeld 
> > >
> > > BTW what does it have to do with SEV?
> > > Is this because SEV is not going to trust the data to be random anyway?
> >
> > Daniel (now CC'd) pointed out in one of the previous threads that this
> > breaks SEV, because the image hash changes.
> >
> > Jason
>
> Oh I see. I'd add a comment maybe and definitely mention this
> in the commit log.
>

This does raise the question (as I mentioned before) how things like
secure boot and measured boot are affected when combined with direct
kernel boot: AIUI, libvirt uses direct kernel boot at guest
installation time, and modifying setup_data will corrupt the image
signature.



Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory

2022-08-19 Thread Ard Biesheuvel
On Fri, 19 Aug 2022 at 08:41, Gerd Hoffmann  wrote:
>
> On Thu, Aug 18, 2022 at 05:38:37PM +0200, Jason A. Donenfeld wrote:
> > Hey Gerd,
> >
> > > Joining the party late (and still catching up the thread).  Given we
> > > don't need that anyway with EFI, only with legacy BIOS:  Can't that just
> > > be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
> > > those 48 bytes random seed?
> >
> > Actually, I want this to work with EFI, very much so.

Even if we wire this up for EFI in some way, it will only affect
direct kernel boot using -kernel/-initrd etc, which is a niche use
case at best (AIUI libvirt uses it for the initial boot only)

I personally rely on it heavily for development, and I imagine others
might too, but that is hardly relevant here.

> With EFI the kernel stub gets some random seed via EFI_RNG_PROTOCOL.
> I can't see any good reason to derive from that.  It works no matter
> how the kernel gets loaded.
>
> OVMF ships with a driver for the virtio-rng device.  So just add that
> to your virtual machine and seeding works fine ...
>

... or we find other ways for the platform to speak to the OS, using
EFI protocols or other EFI methods.

Currently, the 'pure EFI' boot code is arch agnostic - it can be built
and run on any architecture that supports EFI boot. Adding Linux+x86
specific hacks to it is out of the question.

So that means that setup_data provided by QEMU will be consumed
directly by the kernel (when doing direct kernel boot only), using an
out of band channel that EFI/OVMF are completely unaware of, putting
it outside the scope of secure boot, measured boot, etc.

This is not acceptable to me.

> > If our objective was to just not break EFI, the solution would be
> > simple: in the kernel we can have EFISTUB ignore the setup_data field
> > from the image, and then bump the boot header protocol number. If QEMU
> > sees the boot protocol number is below this one, then it won't set
> > setup_data. Done, fixed.
>
> As mentioned elsewhere in the thread patching in physical addresses on
> qemu side isn't going to fly due to the different load methods we have.
>

And it conflates the file representation with the in-memory
representation, which I object to fundamentally - setup_data is part
of the file image, and becomes part of the in-memory representation
when it gets staged in memory for booting, which only happens in the
EFI stub when using pure EFI boot.

Using setup_data as a hidden comms channel between QEMU and the core
kernel breaks that distinction.

> > Your option ROM idea is interesting; somebody mentioned that elsewhere
> > too I think.
>
> Doing the setup_data patching in the option rom has the advantage that
> it'll only happen with that specific load method being used.  Also the
> option rom knows where it places stuff in memory so it is in a much
> better position to find a good & non-conflicting place for the random
> seed.  Also reserve/allocate memory if needed etc.
>

Exactly. This is the only sensible place to do this.

> > I'm wondering, though: do option ROMs still run when
> > EFI/OVMF is being used?
>
> No, they are not used with EFI.  OVMF has a completely independent
> implementation for direct kernel boot.
>
> The options I see for EFI are:
>
>   (1) Do nothing and continue to depend on virtio-rng.
>   (2) Implement an efi driver which gets those 48 seed bytes from
>   qemu by whatever means we'll define and hands them out via
>   EFI_RNG_PROTOCOL.
>

We could explore other options, but SETUP_RNG_SEED is fundamentally
incompatible with EFI boot (or any other boot method where the image
is treated as an opaque file by the firmware/loader), so that is not
an acceptable approach to me.



Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory

2022-08-05 Thread Ard Biesheuvel
On Fri, 5 Aug 2022 at 19:29, Paolo Bonzini  wrote:
>
> On 8/5/22 13:08, Ard Biesheuvel wrote:
> >>
> >> Does it work to place setup_data at the end of the cmdline file instead
> >> of having it at the end of the kernel file?  This way the first item
> >> will be at 0x2 + cmdline_size.
> >>
> > Does QEMU always allocate the command line statically like that?
> > AFAIK, OVMF never accesses that memory to read the command line, it
> > uses fw_cfg to copy it into a buffer it allocates itself. And I guess
> > that implies that this region could be clobbered by OVMF unless it is
> > told to preserve it.
>
> No it's not. :(  It also goes to gBS->AllocatePages in the end.
>
> At this point it seems to me that without extra changes the whole
> setup_data concept is dead on arrival for OVMF.  In principle there's no
> reason why the individual setup_data items couldn't include interior
> pointers, meaning that the setup_data _has_ to be at the address
> provided in fw_cfg by QEMU.
>

AIUI, the setup_data nodes are appended at the end, so they are not
covered by the setup_data fw_cfg file but by the kernel one.

> One way to "fix" it would be for OVMF to overwrite the pointer to the
> head of the list, so that the kernel ignores the setup data provided by
> QEMU. Another way would be to put it in the command line fw_cfg blob and
> teach OVMF to use a fixed address for the command line.  Both are ugly,
> and both are also broken for new QEMU / old OVMF.
>

This is the 'pure EFI' boot path in OVMF, which means that the
firmware does not rely on definitions of struct bootparams or struct
setup_header at all. Introducing that dependency just for this is
something I'd really prefer to avoid.

> In any case, I don't think this should be fixed so close to the release.
>   We have two possibilities:
>
> 1) if we believe "build setup_data in QEMU" is a feasible design that
> only needs more yak shaving, we can keep the code in, but disabled by
> default, and sort it out in 7.2.
>

As I argued before, conflating the 'file' representation with the
'memory' representation like this is fundamentally flawed. fw_cfg
happily DMA's those files anywhere you like, so their contents should
not be position dependent like this.

So Jason's fix gets us halfway there, although we now pass information
to the kernel that is not covered by signatures or measurements, whereas
the setup_data pointer itself is. This means you can replace a single
SETUP_RNG_SEED node in memory with a whole set of SETUP_xxx nodes that
might be rigged to manipulate the boot in a way that measured boot
won't detect.

This is perhaps a bit of a stretch, and arguably only a problem if
secure or measured boot are enabled to begin with, in which case we
could impose additional policy on the use of setup_data. But still ...

> 2) if we go for an alternative design, it needs to be reverted.  For
> example the randomness could be in _another_ fw_cfg file, and the
> linuxboot DMA can patch it in the setup_data.
>
>
> With (2) the OVMF breakage would be limited to -dtb, which more or less
> nobody cares about, and we can just look the other way.
>
> Paolo



Re: [PATCH v3] hw/i386: place setup_data at fixed place in memory

2022-08-05 Thread Ard Biesheuvel
On Fri, 5 Aug 2022 at 10:10, Paolo Bonzini  wrote:
>
> On 8/5/22 01:04, Jason A. Donenfeld wrote:
> > +/* Nothing else uses this part of the hardware mapped region */
> > +setup_data_base = 0xf - 0x1000;
>
> Isn't this where the BIOS lives?  I don't think this works.
>
> Does it work to place setup_data at the end of the cmdline file instead
> of having it at the end of the kernel file?  This way the first item
> will be at 0x2 + cmdline_size.
>

Does QEMU always allocate the command line statically like that?
AFAIK, OVMF never accesses that memory to read the command line, it
uses fw_cfg to copy it into a buffer it allocates itself. And I guess
that implies that this region could be clobbered by OVMF unless it is
told to preserve it.



Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory

2022-08-04 Thread Ard Biesheuvel
On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé  wrote:
>
> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> > Hi Daniel,
> >
> > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > > is arguably correct by design, as the QEMU <-> firmware interface is
> > > supposed to be arbitrarily pluggable for any firmware implementation
> > > not  limited to merely UEFI + seabios.
> >
> > Indeed, I agree with this.
> >
> > >
> > > > For now I suggest either reverting the original patch, or at least not
> > > > enabling the knob by default for any machine types. In particular, when
> > > > using MicroVM, the user must leave the knob disabled when direct booting
> > > > a kernel on OVMF, and the user may or may not enable the knob when
> > > > direct booting a kernel on SeaBIOS.
> > >
> > > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > > available automatically.
> >
> > Yes, adding a knob is absolutely out of the question.
> >
> > It also doesn't actually solve the problem: this triggers when QEMU
> > passes a DTB too. It's not just for the new RNG seed thing. This bug
> > isn't new.
>
> In the other thread I also mentioned that this RNG Seed addition has
> caused a bug with AMD SEV too, making boot measurement attestation
> fail because the kernel blob passed to the firmware no longer matches
> what the tenant expects, due to the injected seed.
>

I was actually expecting this to be an issue in the
signing/attestation department as well, and you just confirmed my
suspicion.

But does this mean that populating the setup_data pointer is out of
the question altogether? Or only that putting the setup_data linked
list nodes inside the image is a problem?



Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory

2022-08-04 Thread Ard Biesheuvel
On Thu, 4 Aug 2022 at 11:25, Daniel P. Berrangé  wrote:
>
> On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote:
> > On 08/04/22 09:03, Michael S. Tsirkin wrote:
> > > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> > >> The boot parameter header refers to setup_data at an absolute address,
> > >> and each setup_data refers to the next setup_data at an absolute address
> > >> too. Currently QEMU simply puts the setup_datas right after the kernel
> > >> image, and since the kernel_image is loaded at prot_addr -- a fixed
> > >> address knowable to QEMU apriori -- the setup_data absolute address
> > >> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> > >>
> > >> This mostly works fine, so long as the kernel image really is loaded at
> > >> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> > >> generally EFI doesn't give a good way of predicting where it's going to
> > >> load the kernel. So when it loads it at some address != prot_addr, the
> > >> absolute addresses in setup_data now point somewhere bogus, causing
> > >> crashes when EFI stub tries to follow the next link.
> > >>
> > >> Fix this by placing setup_data at some fixed place in memory, relative
> > >> to real_addr, not as part of the kernel image, and then pointing the
> > >> setup_data absolute address to that fixed place in memory. This way,
> > >> even if OVMF or other chains relocate the kernel image, the boot
> > >> parameter still points to the correct absolute address.
> > >>
> > >> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> > >> Reported-by: Xiaoyao Li 
> > >> Cc: Paolo Bonzini 
> > >> Cc: Richard Henderson 
> > >> Cc: Peter Maydell 
> > >> Cc: Michael S. Tsirkin 
> > >> Cc: Daniel P. Berrangé 
> > >> Cc: Gerd Hoffmann 
> > >> Cc: Ard Biesheuvel 
> > >> Cc: linux-...@vger.kernel.org
> > >> Signed-off-by: Jason A. Donenfeld 
> > >
> > > Didn't read the patch yet.
> > > Adding Laszlo.
> >
> > As I said in
> > <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
> > I don't believe that the setup_data chaining described in
> > <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
> > for UEFI guests at all, with QEMU pre-populating the links with GPAs.
> >
> > However, rather than introducing a new info channel, or reusing an
> > existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
> > for the sake of this feature, I suggest simply disabling this feature
> > for UEFI guests. setup_data chaining has not been necessary for UEFI
> > guests for years (this is the first time I've heard about it in more
> > than a decade), and the particular use case (provide guests with good
> > random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.
> >
> > ... Now, here's my problem: microvm, and Xen.
> >
> > As far as I can tell, QEMU can determine (it already does determine)
> > whether the guest uses UEFI or not, for the "pc" and "q35" machine
> > types. But not for microvm or Xen!
> >
> > Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
> > derive that
> >
> >   pflash_cfi01_get_blk(pcms->flash[0])
> >
> > returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
> > is only valid after the inital part of pc_system_firmware_init() has run
> > ("Map legacy -drive if=pflash to machine properties"), but that is not a
> > problem, given the following call tree:
>
> I don't beleve that's a valid check anymore since Gerd introduced the
> ability to load UEFI via -bios, for UEFI builds without persistent
> variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 )
>

I think there is a fundamental flaw in the QEMU logic where it adds
setup_data nodes to the *file* representation of the kernel image.

IOW, loading the kernel image at address x, creating setup_data nodes
in memory relative to x and then invoking the kernel image directly
(as kexec does as well, AIUI) is perfectly fine.

Managing a file system abstraction and a generic interface to load its
contents, and using it to load the kernel image anywhere in memory is
also fine, and OVMF -kernel relies on this.

It is the combination of the two that explodes, unsurprisingly. Making

Re: [PATCH v4] virt: vmgenid: introduce driver for reinitializing RNG on VM fork

2022-02-25 Thread Ard Biesheuvel
On Fri, 25 Feb 2022 at 16:12, Alexander Graf  wrote:
>
>
> On 25.02.22 15:33, Jason A. Donenfeld wrote:
> > On Fri, Feb 25, 2022 at 03:18:43PM +0100, Alexander Graf wrote:
> >>> I recall this part of the old thread. From what I understood, using
> >>> "VMGENID" + "QEMUVGID" worked /well enough/, even if that wasn't
> >>> technically in-spec. Ard noted that relying on _CID like that is
> >>> technically an ACPI spec notification. So we're between one spec and
> >>> another, basically, and doing "VMGENID" + "QEMUVGID" requires fewer
> >>> changes, as mentioned, appears to work fine in my testing.
> >>>
> >>> However, with that said, I think supporting this via "VM_Gen_Counter"
> >>> would be a better eventual thing to do, but will require acks and
> >>> changes from the ACPI maintainers. Do you think you could prepare your
> >>> patch proposal above as something on-top of my tree [1]? And if you can
> >>> convince the ACPI maintainers that that's okay, then I'll happily take
> >>> the patch.
> >>
> >> Sure, let me send the ACPI patch stand alone. No need to include the
> >> VMGenID change in there.
> > That's fine. If the ACPI people take it for 5.18, then we can count on
> > it being there and adjust the vmgenid driver accordingly also for 5.18.
> >
> > I just booted up a Windows VM, and it looks like Hyper-V uses
> > "Hyper_V_Gen_Counter_V1", which is also quite long, so we can't really
> > HID match on that either.
>
>
> Yes, due to the same problem. I'd really prefer we sort out the ACPI
> matching before this goes mainline. Matching on _HID is explicitly
> discouraged in the VMGenID spec.
>

OK, this really sucks. Quoting the ACPI spec:

"""
A _HID object evaluates to either a numeric 32-bit compressed EISA
type ID or a string. If a string, the format must be an alphanumeric
PNP or ACPI ID with no asterisk or other leading characters.
A valid PNP ID must be of the form "AAA####" where A is an uppercase
letter and # is a hex digit.
A valid ACPI ID must be of the form "NNNN####" where N is an uppercase
letter or a digit ('0'-'9') and # is a hex digit. This specification
reserves the string "ACPI" for use only with devices defined herein.
It further reserves all strings representing 4 HEX digits for
exclusive use with PCI-assigned Vendor IDs.
"""

So now we have to implement Microsoft's fork of ACPI to be able to use
this device, even if we expose it from QEMU instead of Hyper-V? I
strongly object to that.

Instead, we can match on _HID exposed by QEMU, and cordially invite
Microsoft to align their spec with the ACPI spec.



Re: [PATCH v4] virt: vmgenid: introduce driver for reinitializing RNG on VM fork

2022-02-25 Thread Ard Biesheuvel
On Fri, 25 Feb 2022 at 13:53, Greg KH  wrote:
>
> On Fri, Feb 25, 2022 at 01:48:48PM +0100, Jason A. Donenfeld wrote:
> > +static struct acpi_driver acpi_driver = {
> > + .name = "vmgenid",
> > + .ids = vmgenid_ids,
> > + .owner = THIS_MODULE,
> > + .ops = {
> > + .add = vmgenid_acpi_add,
> > + .notify = vmgenid_acpi_notify,
> > + }
> > +};
> > +
> > +static int __init vmgenid_init(void)
> > +{
> > + return acpi_bus_register_driver(_driver);
> > +}
> > +
> > +static void __exit vmgenid_exit(void)
> > +{
> > + acpi_bus_unregister_driver(_driver);
> > +}
> > +
> > +module_init(vmgenid_init);
> > +module_exit(vmgenid_exit);
>
> Nit, you could use module_acpi_driver() to make this even smaller if you
> want to.
>

With that suggestion adopted,

Reviewed-by: Ard Biesheuvel 
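(Purely for illustration, and assuming the acpi_driver definition quoted
above -- the struct name below is hypothetical -- Greg's suggestion boils
down to something like this minimal sketch:)

#include <linux/acpi.h>
#include <linux/module.h>

/*
 * Sketch only: module_acpi_driver() generates the module init/exit pair,
 * replacing the hand-rolled vmgenid_init()/vmgenid_exit() above. The ids
 * table and the add/notify callbacks are the ones from the driver quoted
 * above.
 */
static struct acpi_driver vmgenid_acpi_driver = {
	.name = "vmgenid",
	.ids = vmgenid_ids,
	.owner = THIS_MODULE,
	.ops = {
		.add = vmgenid_acpi_add,
		.notify = vmgenid_acpi_notify,
	},
};
module_acpi_driver(vmgenid_acpi_driver);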



Re: [PATCH v4] virt: vmgenid: introduce driver for reinitializing RNG on VM fork

2022-02-25 Thread Ard Biesheuvel
On Fri, 25 Feb 2022 at 14:58, Alexander Graf  wrote:
>
>
> On 25.02.22 13:48, Jason A. Donenfeld wrote:
> >
> > VM Generation ID is a feature from Microsoft, described at
> > <https://go.microsoft.com/fwlink/?LinkId=260709>, and supported by
> > Hyper-V and QEMU. Its usage is described in Microsoft's RNG whitepaper,
> > <https://aka.ms/win10rng>, as:
> >
> >  If the OS is running in a VM, there is a problem that most
> >  hypervisors can snapshot the state of the machine and later rewind
> >  the VM state to the saved state. This results in the machine running
> >  a second time with the exact same RNG state, which leads to serious
> >  security problems.  To reduce the window of vulnerability, Windows
> >  10 on a Hyper-V VM will detect when the VM state is reset, retrieve
> >  a unique (not random) value from the hypervisor, and reseed the root
> >  RNG with that unique value.  This does not eliminate the
> >  vulnerability, but it greatly reduces the time during which the RNG
> >  system will produce the same outputs as it did during a previous
> >  instantiation of the same VM state.
> >
> > Linux has the same issue, and given that vmgenid is supported already by
> > multiple hypervisors, we can implement more or less the same solution.
> > So this commit wires up the vmgenid ACPI notification to the RNG's newly
> > added add_vmfork_randomness() function.
> >
> > It can be used from qemu via the `-device vmgenid,guid=auto` parameter.
> > After setting that, use `savevm` in the monitor to save the VM state,
> > then quit QEMU, start it again, and use `loadvm`. That will trigger this
> > driver's notify function, which hands the new UUID to the RNG. This is
> > described in 
> > <https://git.qemu.org/?p=qemu.git;a=blob;f=docs/specs/vmgenid.txt>.
> > And there are hooks for this in libvirt as well, described in
> > <https://libvirt.org/formatdomain.html#general-metadata>.
> >
> > Note, however, that the treatment of this as a UUID is considered to be
> > an accidental QEMU nuance, per
> > <https://github.com/libguestfs/virt-v2v/blob/master/docs/vm-generation-id-across-hypervisors.txt>,
> > so this driver simply treats these bytes as an opaque 128-bit binary
> > blob, as per the spec. This doesn't really make a difference anyway,
> > considering that's how it ends up when handed to the RNG in the end.
> >
> > Cc: Adrian Catangiu 
> > Cc: Daniel P. Berrangé 
> > Cc: Dominik Brodowski 
> > Cc: Ard Biesheuvel 
> > Cc: Greg Kroah-Hartman 
> > Reviewed-by: Laszlo Ersek 
> > Signed-off-by: Jason A. Donenfeld 
> > ---
...
> > +
> > +   device->driver_data = state;
> > +
> > +out:
> > +   ACPI_FREE(parsed.pointer);
> > +   return ret;
> > +}
> > +
> > +static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
> > +{
> > +   struct vmgenid_state *state = acpi_driver_data(device);
> > +   u8 old_id[VMGENID_SIZE];
> > +
> > +   memcpy(old_id, state->this_id, sizeof(old_id));
> > +   memcpy(state->this_id, state->next_id, sizeof(state->this_id));
> > +   if (!memcmp(old_id, state->this_id, sizeof(old_id)))
> > +   return;
> > +   add_vmfork_randomness(state->this_id, sizeof(state->this_id));
> > +}
> > +
> > +static const struct acpi_device_id vmgenid_ids[] = {
> > +   { "VMGENID", 0 },
> > +   { "QEMUVGID", 0 },
>
>
> According to the VMGenID spec[1], you can only rely on _CID and _DDN for
> matching. They both contain "VM_Gen_Counter". The list above contains
> _HID values which are not an official identifier for the VMGenID device.
>
> IIRC the ACPI device match logic does match _CID in addition to _HID.
> However, it is limited to 8 characters. Let me paste an experimental
> hack I did back then to do the _CID matching instead.
>
> [1]
> https://download.microsoft.com/download/3/1/C/31CFC307-98CA-4CA5-914C-D9772691E214/VirtualMachineGenerationID.docx
>

I think matching on the HIDs of two known existing implementations is
fine, as opposed to matching on the (broken) CID of any implementation
that claims to be compatible with it. And dumping random strings into
the _CID property doesn't mesh well with the ACPI spec either, which
is why we don't currently support it.

We could still check _DDN if we wanted to, but I don't think this is
necessary. Other implementations that want to target this driver
explicitly can always put VMGENID or QEMUVGID into the _CID.



Re: [PATCH v3 1/2] random: add mechanism for VM forks to reinitialize crng

2022-02-25 Thread Ard Biesheuvel
On Fri, 25 Feb 2022 at 12:44, Jason A. Donenfeld  wrote:
>
> On Fri, Feb 25, 2022 at 12:26 PM Ard Biesheuvel  wrote:
> >
> > On Thu, 24 Feb 2022 at 14:39, Jason A. Donenfeld  wrote:
> > >
> > > When a VM forks, we must immediately mix in additional information to
> > > the stream of random output so that two forks or a rollback don't
> > > produce the same stream of random numbers, which could have catastrophic
> > > cryptographic consequences. This commit adds a simple API, add_vmfork_
> > > randomness(), for that, by force reseeding the crng.
> > >
> > > This has the added benefit of also draining the entropy pool and setting
> > > its timer back, so that any old entropy that was there prior -- which
> > > could have already been used by a different fork, or generally gone
> > > stale -- does not contribute to the accounting of the next 256 bits.
> > >
> > > Cc: Dominik Brodowski 
> > > Cc: Theodore Ts'o 
> > > Cc: Jann Horn 
> > > Cc: Eric Biggers 
> > > Signed-off-by: Jason A. Donenfeld 
> >
> > Acked-by: Ard Biesheuvel 
>
> Okay if I treat this as a Reviewed-by instead?

Sure no problem.

Reviewed-by: Ard Biesheuvel 



Re: [PATCH v3 1/2] random: add mechanism for VM forks to reinitialize crng

2022-02-25 Thread Ard Biesheuvel
On Thu, 24 Feb 2022 at 14:39, Jason A. Donenfeld  wrote:
>
> When a VM forks, we must immediately mix in additional information to
> the stream of random output so that two forks or a rollback don't
> produce the same stream of random numbers, which could have catastrophic
> cryptographic consequences. This commit adds a simple API, add_vmfork_
> randomness(), for that, by force reseeding the crng.
>
> This has the added benefit of also draining the entropy pool and setting
> its timer back, so that any old entropy that was there prior -- which
> could have already been used by a different fork, or generally gone
> stale -- does not contribute to the accounting of the next 256 bits.
>
> Cc: Dominik Brodowski 
> Cc: Theodore Ts'o 
> Cc: Jann Horn 
> Cc: Eric Biggers 
> Signed-off-by: Jason A. Donenfeld 

Acked-by: Ard Biesheuvel 

> ---
>  drivers/char/random.c  | 50 +-
>  include/linux/random.h |  1 +
>  2 files changed, 36 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 9fb06fc298d3..e8b84791cefe 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -289,14 +289,14 @@ static DEFINE_PER_CPU(struct crng, crngs) = {
>  };
>
>  /* Used by crng_reseed() to extract a new seed from the input pool. */
> -static bool drain_entropy(void *buf, size_t nbytes);
> +static bool drain_entropy(void *buf, size_t nbytes, bool force);
>
>  /*
>   * This extracts a new crng key from the input pool, but only if there is a
> - * sufficient amount of entropy available, in order to mitigate bruteforcing
> - * of newly added bits.
> + * sufficient amount of entropy available or force is true, in order to
> + * mitigate bruteforcing of newly added bits.
>   */
> -static void crng_reseed(void)
> +static void crng_reseed(bool force)
>  {
> unsigned long flags;
> unsigned long next_gen;
> @@ -304,7 +304,7 @@ static void crng_reseed(void)
> bool finalize_init = false;
>
> /* Only reseed if we can, to prevent brute forcing a small amount of 
> new bits. */
> -   if (!drain_entropy(key, sizeof(key)))
> +   if (!drain_entropy(key, sizeof(key), force))
> return;
>
> /*
> @@ -406,7 +406,7 @@ static void crng_make_state(u32 
> chacha_state[CHACHA_STATE_WORDS],
>  * in turn bumps the generation counter that we check below.
>  */
> if (unlikely(time_after(jiffies, READ_ONCE(base_crng.birth) + 
> CRNG_RESEED_INTERVAL)))
> -   crng_reseed();
> +   crng_reseed(false);
>
> local_lock_irqsave(, flags);
> crng = raw_cpu_ptr();
> @@ -771,10 +771,10 @@ EXPORT_SYMBOL(get_random_bytes_arch);
>   *
>   * Finally, extract entropy via these two, with the latter one
>   * setting the entropy count to zero and extracting only if there
> - * is POOL_MIN_BITS entropy credited prior:
> + * is POOL_MIN_BITS entropy credited prior or force is true:
>   *
>   * static void extract_entropy(void *buf, size_t nbytes)
> - * static bool drain_entropy(void *buf, size_t nbytes)
> + * static bool drain_entropy(void *buf, size_t nbytes, bool force)
>   *
>   **/
>
> @@ -832,7 +832,7 @@ static void credit_entropy_bits(size_t nbits)
> } while (cmpxchg(_pool.entropy_count, orig, entropy_count) != 
> orig);
>
> if (crng_init < 2 && entropy_count >= POOL_MIN_BITS)
> -   crng_reseed();
> +   crng_reseed(false);
>  }
>
>  /*
> @@ -882,16 +882,16 @@ static void extract_entropy(void *buf, size_t nbytes)
>  }
>
>  /*
> - * First we make sure we have POOL_MIN_BITS of entropy in the pool, and then 
> we
> - * set the entropy count to zero (but don't actually touch any data). Only 
> then
> - * can we extract a new key with extract_entropy().
> + * First we make sure we have POOL_MIN_BITS of entropy in the pool unless 
> force
> + * is true, and then we set the entropy count to zero (but don't actually 
> touch
> + * any data). Only then can we extract a new key with extract_entropy().
>   */
> -static bool drain_entropy(void *buf, size_t nbytes)
> +static bool drain_entropy(void *buf, size_t nbytes, bool force)
>  {
> unsigned int entropy_count;
> do {
> entropy_count = READ_ONCE(input_pool.entropy_count);
> -   if (entropy_count < POOL_MIN_BITS)
> +   if (!force && entropy_count < POOL_MIN_BITS)
> return false;
> } while (cmpxchg(_pool.entropy_count, entropy_count, 0) != 
> 

Re: [PATCH v3 2/2] virt: vmgenid: introduce driver for reinitializing RNG on VM fork

2022-02-25 Thread Ard Biesheuvel
On Thu, 24 Feb 2022 at 14:39, Jason A. Donenfeld  wrote:
>
> VM Generation ID is a feature from Microsoft, described at
> <https://go.microsoft.com/fwlink/?LinkId=260709>, and supported by
> Hyper-V and QEMU. Its usage is described in Microsoft's RNG whitepaper,
> <https://aka.ms/win10rng>, as:
>
> If the OS is running in a VM, there is a problem that most
> hypervisors can snapshot the state of the machine and later rewind
> the VM state to the saved state. This results in the machine running
> a second time with the exact same RNG state, which leads to serious
> security problems.  To reduce the window of vulnerability, Windows
> 10 on a Hyper-V VM will detect when the VM state is reset, retrieve
> a unique (not random) value from the hypervisor, and reseed the root
> RNG with that unique value.  This does not eliminate the
> vulnerability, but it greatly reduces the time during which the RNG
> system will produce the same outputs as it did during a previous
> instantiation of the same VM state.
>
> Linux has the same issue, and given that vmgenid is supported already by
> multiple hypervisors, we can implement more or less the same solution.
> So this commit wires up the vmgenid ACPI notification to the RNG's newly
> added add_vmfork_randomness() function.
>
> It can be used from qemu via the `-device vmgenid,guid=auto` parameter.
> After setting that, use `savevm` in the monitor to save the VM state,
> then quit QEMU, start it again, and use `loadvm`. That will trigger this
> driver's notify function, which hands the new UUID to the RNG. This is
> described in 
> <https://git.qemu.org/?p=qemu.git;a=blob;f=docs/specs/vmgenid.txt>.
> And there are hooks for this in libvirt as well, described in
> <https://libvirt.org/formatdomain.html#general-metadata>.
>
> Note, however, that the treatment of this as a UUID is considered to be
> an accidental QEMU nuance, per
> <https://github.com/libguestfs/virt-v2v/blob/master/docs/vm-generation-id-across-hypervisors.txt>,
> so this driver simply treats these bytes as an opaque 128-bit binary
> blob, as per the spec. This doesn't really make a difference anyway,
> considering that's how it ends up when handed to the RNG in the end.
>
> This driver builds on prior work from Adrian Catangiu at Amazon, and it
> is my hope that that team can resume maintenance of this driver.
>
> Cc: Adrian Catangiu 
> Cc: Laszlo Ersek 
> Cc: Daniel P. Berrangé 
> Cc: Dominik Brodowski 
> Cc: Ard Biesheuvel 
> Signed-off-by: Jason A. Donenfeld 
> ---
>  drivers/virt/Kconfig   |   9 +++
>  drivers/virt/Makefile  |   1 +
>  drivers/virt/vmgenid.c | 121 +
>  3 files changed, 131 insertions(+)
>  create mode 100644 drivers/virt/vmgenid.c
>
> diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
> index 8061e8ef449f..d3276dc2095c 100644
> --- a/drivers/virt/Kconfig
> +++ b/drivers/virt/Kconfig

drivers/virt does not have a maintainer and this code needs one.

> @@ -13,6 +13,15 @@ menuconfig VIRT_DRIVERS
>
>  if VIRT_DRIVERS
>
> +config VMGENID
> +   tristate "Virtual Machine Generation ID driver"
> +   default y

Please make this default m - this code can run as a module and the
feature it relies on is discoverable by udev

> +   depends on ACPI
> +   help
> + Say Y here to use the hypervisor-provided Virtual Machine 
> Generation ID
> + to reseed the RNG when the VM is cloned. This is highly recommended 
> if
> + you intend to do any rollback / cloning / snapshotting of VMs.
> +
>  config FSL_HV_MANAGER
> tristate "Freescale hypervisor management driver"
> depends on FSL_SOC
> diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
> index 3e272ea60cd9..108d0ffcc9aa 100644
> --- a/drivers/virt/Makefile
> +++ b/drivers/virt/Makefile
> @@ -4,6 +4,7 @@
>  #
>
>  obj-$(CONFIG_FSL_HV_MANAGER)   += fsl_hypervisor.o
> +obj-$(CONFIG_VMGENID)  += vmgenid.o
>  obj-y  += vboxguest/
>
>  obj-$(CONFIG_NITRO_ENCLAVES)   += nitro_enclaves/
> diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
> new file mode 100644
> index ..5da4dc8f25e3
> --- /dev/null
> +++ b/drivers/virt/vmgenid.c
> @@ -0,0 +1,121 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Virtual Machine Generation ID driver
> + *
> + * Copyright (C) 2022 Jason A. Donenfeld . All Rights 
> Reserved.
> + * Copyright (C) 2020 Amazon. All rights reserved.
> + * Copyright (C) 2018 Red Hat Inc. All rights reserved.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#incl
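
For context, since the driver body is truncated above, here is a minimal sketch (assumed names, not the patch verbatim) of the notify path the commit message describes: when the ACPI notification fires after a fork/restore, the driver re-reads the 16-byte generation ID from the buffer whose address the device's ADDR object exposed, and hands it to the RNG via add_vmfork_randomness().

#include <linux/acpi.h>
#include <linux/random.h>
#include <linux/string.h>

#define VMGENID_SIZE 16

struct vmgenid_state {
    u8 *next_id;                 /* memremap()ed generation-ID buffer */
    u8 this_id[VMGENID_SIZE];    /* last value handed to the RNG */
};

static void vmgenid_notify(struct acpi_device *device, u32 event)
{
    struct vmgenid_state *state = acpi_driver_data(device);
    u8 old_id[VMGENID_SIZE];

    memcpy(old_id, state->this_id, sizeof(old_id));
    memcpy(state->this_id, state->next_id, sizeof(state->this_id));
    if (memcmp(old_id, state->this_id, sizeof(old_id)))
        add_vmfork_randomness(state->this_id, sizeof(state->this_id));
}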

Re: [PATCH v2] target/arm/cpu64: Use 32-bit GDBstub when running in 32-bit KVM mode

2022-01-11 Thread Ard Biesheuvel
On Tue, 11 Jan 2022 at 15:11, Peter Maydell  wrote:
>
> On Sat, 8 Jan 2022 at 15:10, Ard Biesheuvel  wrote:
> >
> > When running under KVM, we may decide to run the CPU in 32-bit mode, by
> > setting the 'aarch64=off' CPU option. In this case, we need to switch to
> > the 32-bit version of the GDB stub too, so that GDB has the correct view
> > of the CPU state. Without this, GDB debugging does not work at all, and
> > errors out upon connecting to the target with a mysterious 'g' packet
> > length error.
> >
> > Cc: Richard Henderson 
> > Cc: Peter Maydell 
> > Cc: Alex Bennee 
> > Signed-off-by: Ard Biesheuvel 
> > ---
> > v2: refactor existing CPUClass::gdb_... member assignments for the
> > 32-bit code so we can reuse it for the 64-bit code
> >
> >  target/arm/cpu.c   | 16 +++-
> >  target/arm/cpu.h   |  2 ++
> >  target/arm/cpu64.c |  3 +++
> >  3 files changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> > index a211804fd3df..ae8e78fc1472 100644
> > --- a/target/arm/cpu.c
> > +++ b/target/arm/cpu.c
> > @@ -2049,6 +2049,15 @@ static const struct TCGCPUOps arm_tcg_ops = {
> >  };
> >  #endif /* CONFIG_TCG */
> >
> > +void arm_cpu_class_gdb_init(CPUClass *cc)
> > +{
> > +cc->gdb_read_register = arm_cpu_gdb_read_register;
> > +cc->gdb_write_register = arm_cpu_gdb_write_register;
> > +cc->gdb_num_core_regs = 26;
> > +cc->gdb_core_xml_file = "arm-core.xml";
> > +cc->gdb_arch_name = arm_gdb_arch_name;
> > +}
>
> Most of these fields are not used by the gdbstub until
> runtime, but cc->gdb_num_core_regs is used earlier.
> In particular, in cpu_common_initfn() we copy that value
> into cpu->gdb_num_regs and cpu->gdb_num_g_regs (this happens
> at the CPU object's instance_init time, ie before the
> aarch64_cpu_set_aarch64 function is called), and these are the
> values that are then used when registering dynamic sysreg
> XML, coprocessor registers, etc.
>

Right.
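
For reference, the copy Peter refers to looks roughly like this (paraphrased from QEMU's common CPU instance_init of that era, heavily trimmed); it runs when the CPU object is created, i.e. before the aarch64 property setter has had a chance to run:

static void cpu_common_initfn(Object *obj)
{
    CPUState *cpu = CPU(obj);
    CPUClass *cc = CPU_GET_CLASS(obj);

    /* ... */
    /* The class value is snapshotted into the instance here. */
    cpu->gdb_num_regs = cpu->gdb_num_g_regs = cc->gdb_num_core_regs;
    /* ... */
}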

> > --- a/target/arm/cpu64.c
> > +++ b/target/arm/cpu64.c
> > @@ -906,6 +906,7 @@ static bool aarch64_cpu_get_aarch64(Object *obj, Error 
> > **errp)
> >  static void aarch64_cpu_set_aarch64(Object *obj, bool value, Error **errp)
> >  {
> >  ARMCPU *cpu = ARM_CPU(obj);
> > +CPUClass *cc = CPU_GET_CLASS(obj);
>
> This is called to change the property for a specific CPU
> object -- you can't change the values of the *class* here.
> (Consider a system with 2 CPUs, one of which has aarch64=yes
> and one of which has aarch64=no.)
>

So how is this fundamentally going to work then? Which GDB stub should
we choose in such a case?


> >  /* At this time, this property is only allowed if KVM is enabled.  This
> >   * restriction allows us to avoid fixing up functionality that assumes 
> > a
> > @@ -919,6 +920,8 @@ static void aarch64_cpu_set_aarch64(Object *obj, bool 
> > value, Error **errp)
> >  return;
> >  }
> >  unset_feature(&cpu->env, ARM_FEATURE_AARCH64);
> > +
> > +arm_cpu_class_gdb_init(cc)
>
> This fails to compile because of the missing semicolon...
>

Oops, my bad. I spotted this locally as well but failed to fold it
into the patch.

> >  } else {
> >  set_feature(&cpu->env, ARM_FEATURE_AARCH64);
>
> If the user (admittedly slightly perversely) toggles the
> aarch64 flag from on to off to on again, we should reset the
> gdb function pointers to the aarch64 versions again.
>

Ack.
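
For illustration only, the "toggle back to aarch64=on" half would need to restore the 64-bit hooks that aarch64_cpu_class_init() installs; a sketch follows (the helper name is made up, and the class-vs-instance problem raised above still applies to it):

static void arm_cpu_class_gdb_init_aa64(CPUClass *cc)
{
    /* Mirror image of arm_cpu_class_gdb_init() for the 64-bit case. */
    cc->gdb_read_register = aarch64_cpu_gdb_read_register;
    cc->gdb_write_register = aarch64_cpu_gdb_write_register;
    cc->gdb_num_core_regs = 34;
    cc->gdb_core_xml_file = "aarch64-core.xml";
    cc->gdb_arch_name = aarch64_gdb_arch_name;
}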

So I can fix most of these issues, but the fundamental one remains, so
I'll hold off on a v3 until we can settle that.

Thanks,
Ard.



[PATCH v2] target/arm/cpu64: Use 32-bit GDBstub when running in 32-bit KVM mode

2022-01-08 Thread Ard Biesheuvel
When running under KVM, we may decide to run the CPU in 32-bit mode, by
setting the 'aarch64=off' CPU option. In this case, we need to switch to
the 32-bit version of the GDB stub too, so that GDB has the correct view
of the CPU state. Without this, GDB debugging does not work at all, and
errors out upon connecting to the target with a mysterious 'g' packet
length error.

Cc: Richard Henderson 
Cc: Peter Maydell 
Cc: Alex Bennee 
Signed-off-by: Ard Biesheuvel 
---
v2: refactor existing CPUClass::gdb_... member assignments for the
32-bit code so we can reuse it for the 64-bit code

 target/arm/cpu.c   | 16 +++-
 target/arm/cpu.h   |  2 ++
 target/arm/cpu64.c |  3 +++
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index a211804fd3df..ae8e78fc1472 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -2049,6 +2049,15 @@ static const struct TCGCPUOps arm_tcg_ops = {
 };
 #endif /* CONFIG_TCG */
 
+void arm_cpu_class_gdb_init(CPUClass *cc)
+{
+cc->gdb_read_register = arm_cpu_gdb_read_register;
+cc->gdb_write_register = arm_cpu_gdb_write_register;
+cc->gdb_num_core_regs = 26;
+cc->gdb_core_xml_file = "arm-core.xml";
+cc->gdb_arch_name = arm_gdb_arch_name;
+}
+
 static void arm_cpu_class_init(ObjectClass *oc, void *data)
 {
 ARMCPUClass *acc = ARM_CPU_CLASS(oc);
@@ -2061,18 +2070,15 @@ static void arm_cpu_class_init(ObjectClass *oc, void 
*data)
 device_class_set_props(dc, arm_cpu_properties);
 device_class_set_parent_reset(dc, arm_cpu_reset, &acc->parent_reset);
 
+arm_cpu_class_gdb_init(cc);
+
 cc->class_by_name = arm_cpu_class_by_name;
 cc->has_work = arm_cpu_has_work;
 cc->dump_state = arm_cpu_dump_state;
 cc->set_pc = arm_cpu_set_pc;
-cc->gdb_read_register = arm_cpu_gdb_read_register;
-cc->gdb_write_register = arm_cpu_gdb_write_register;
 #ifndef CONFIG_USER_ONLY
 cc->sysemu_ops = &arm_sysemu_ops;
 #endif
-cc->gdb_num_core_regs = 26;
-cc->gdb_core_xml_file = "arm-core.xml";
-cc->gdb_arch_name = arm_gdb_arch_name;
 cc->gdb_get_dynamic_xml = arm_gdb_get_dynamic_xml;
 cc->gdb_stop_before_watchpoint = true;
 cc->disas_set_info = arm_disas_set_info;
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index e33f37b70ada..208da8e35697 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1064,6 +1064,8 @@ int arm_gen_dynamic_svereg_xml(CPUState *cpu, int 
base_reg);
  */
 const char *arm_gdb_get_dynamic_xml(CPUState *cpu, const char *xmlname);
 
+void arm_cpu_class_gdb_init(CPUClass *cc);
+
 int arm_cpu_write_elf64_note(WriteCoreDumpFunction f, CPUState *cs,
  int cpuid, void *opaque);
 int arm_cpu_write_elf32_note(WriteCoreDumpFunction f, CPUState *cs,
diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 15245a60a8c7..df7667864e11 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -906,6 +906,7 @@ static bool aarch64_cpu_get_aarch64(Object *obj, Error 
**errp)
 static void aarch64_cpu_set_aarch64(Object *obj, bool value, Error **errp)
 {
 ARMCPU *cpu = ARM_CPU(obj);
+CPUClass *cc = CPU_GET_CLASS(obj);
 
 /* At this time, this property is only allowed if KVM is enabled.  This
  * restriction allows us to avoid fixing up functionality that assumes a
@@ -919,6 +920,8 @@ static void aarch64_cpu_set_aarch64(Object *obj, bool 
value, Error **errp)
 return;
 }
 unset_feature(&cpu->env, ARM_FEATURE_AARCH64);
+
+arm_cpu_class_gdb_init(cc)
 } else {
 set_feature(&cpu->env, ARM_FEATURE_AARCH64);
 }
-- 
2.30.2




[PATCH] target/arm/cpu64: Use 32-bit GDBstub when running in 32-bit KVM mode

2022-01-07 Thread Ard Biesheuvel
When running under KVM, we may decide to run the CPU in 32-bit mode, by
setting the 'aarch64=off' CPU option. In this case, we need to switch to
the 32-bit version of the GDB stub too, so that GDB has the correct view
of the CPU state. Without this, GDB debugging does not work at all, and
errors out upon connecting to the target with a mysterious 'g' packet
length error.

Cc: Richard Henderson 
Cc: Peter Maydell 
Cc: Alex Bennee 
Signed-off-by: Ard Biesheuvel 
---
 target/arm/cpu64.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 15245a60a8c7..3dede9e2ec31 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -903,9 +903,15 @@ static bool aarch64_cpu_get_aarch64(Object *obj, Error 
**errp)
 return arm_feature(>env, ARM_FEATURE_AARCH64);
 }
 
+static gchar *arm_gdb_arch_name(CPUState *cs)
+{
+return g_strdup("arm");
+}
+
 static void aarch64_cpu_set_aarch64(Object *obj, bool value, Error **errp)
 {
 ARMCPU *cpu = ARM_CPU(obj);
+CPUClass *cc = CPU_GET_CLASS(obj);
 
 /* At this time, this property is only allowed if KVM is enabled.  This
  * restriction allows us to avoid fixing up functionality that assumes a
@@ -919,6 +925,12 @@ static void aarch64_cpu_set_aarch64(Object *obj, bool 
value, Error **errp)
 return;
 }
 unset_feature(&cpu->env, ARM_FEATURE_AARCH64);
+
+cc->gdb_read_register = arm_cpu_gdb_read_register;
+cc->gdb_write_register = arm_cpu_gdb_write_register;
+cc->gdb_num_core_regs = 26;
+cc->gdb_core_xml_file = "arm-core.xml";
+cc->gdb_arch_name = arm_gdb_arch_name;
 } else {
 set_feature(&cpu->env, ARM_FEATURE_AARCH64);
 }
-- 
2.30.2




Re: [PATCH v11 06/10] hvf: arm: Implement -cpu host

2021-09-22 Thread Ard Biesheuvel
On Wed, 22 Sept 2021 at 14:45, Peter Maydell  wrote:
>
> On Wed, 22 Sept 2021 at 12:41, Ard Biesheuvel  wrote:
> >
> > On Thu, 16 Sept 2021 at 18:17, Peter Maydell  
> > wrote:
> > >
> > > On Thu, 16 Sept 2021 at 17:05, Ard Biesheuvel  wrote:
> > > > I'd argue that compliance with the architecture means that the
> > > > software should not clear RES1 bits
> > >
> > > Architecturally, RES1 means that "software
> > >  * Must not rely on the bit reading as 1.
> > >  * Must use an SBOP policy to write to the bit."
> > > (SBOP=="should be 1 or preserved", ie you can preserve the existing value,
> > > as in "read register, change some bits, write back", or you can write a 
> > > 1.)
> > >
> >
> > OVMF preserves the bit, and does not reason or care about its value.
> > So in this sense, it is compliant.
>
> Hmm. Alex, can you give more details about what fails here ?
>

It seems that EDK2 ends up setting EL0 r/o or r/w permissions on some
mappings, even though it never runs anything at EL0. So any execution
that gets initiated via the timer interrupt causing a EL1->EL1 IRQ
exception will run with PAN enabled and lose access to those mappings.

So it seems like a definite bug that is unrelated to reset state of
the registers and assumptions related to that.
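
To make the failure mode concrete: SCTLR_EL1.SPAN is bit 23, and while it is 0 the CPU sets PSTATE.PAN on every exception taken to EL1, which is what locks the firmware out of the mappings it marked EL0-accessible once the timer IRQ fires. A firmware-side fix amounts to setting the bit explicitly early on (a sketch, not the actual EDK2 change):

#define SCTLR_EL1_SPAN  (1UL << 23)  /* 1: leave PSTATE.PAN unchanged on exception entry */

static inline unsigned long sctlr_el1_fixup(unsigned long sctlr)
{
    /* Keep whatever else firmware set up; only force SPAN to 1. */
    return sctlr | SCTLR_EL1_SPAN;
}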



> > > > but I don't think we can blame it
> > > > for not touching bits that were in an invalid state upon entry.
> > >
> > > SCTLR_EL1.SPAN == 0 is perfectly valid for a CPU that supports the
> > > PAN feature. It's just not the value OVMF wants, so OVMF should
> > > be setting it to what it does want. Also, as the first thing to
> > > run after reset (ie firmware) OVMF absolutely is responsible for
> > > dealing with system registers which have UNKNOWN values out of
> > > reset.
> > >
> >
> > Fair enough. But I'd still suggest fixing this at both ends.
>
> Yes, the version of this code that we committed sets SPAN to 1.
> (This argument is mostly about what the comment justifying that
> value should say :-))
>

OK, that makes sense. But I'd like to get EDK2 fixed as well, obviously.



Re: [PATCH v11 06/10] hvf: arm: Implement -cpu host

2021-09-22 Thread Ard Biesheuvel
On Thu, 16 Sept 2021 at 18:17, Peter Maydell  wrote:
>
> On Thu, 16 Sept 2021 at 17:05, Ard Biesheuvel  wrote:
> > I'd argue that compliance with the architecture means that the
> > software should not clear RES1 bits
>
> Architecturally, RES1 means that "software
>  * Must not rely on the bit reading as 1.
>  * Must use an SBOP policy to write to the bit."
> (SBOP=="should be 1 or preserved", ie you can preserve the existing value,
> as in "read register, change some bits, write back", or you can write a 1.)
>

OVMF preserves the bit, and does not reason or care about its value.
So in this sense, it is compliant.

> > but I don't think we can blame it
> > for not touching bits that were in an invalid state upon entry.
>
> SCTLR_EL1.SPAN == 0 is perfectly valid for a CPU that supports the
> PAN feature. It's just not the value OVMF wants, so OVMF should
> be setting it to what it does want. Also, as the first thing to
> run after reset (ie firmware) OVMF absolutely is responsible for
> dealing with system registers which have UNKNOWN values out of
> reset.
>

Fair enough. But I'd still suggest fixing this at both ends.
> Put another way, if you want to argue that this is an "invalid
> state" you need to point to the specification that defines
> the valid state that OVMF should see when it is handed control.
>
> -- PMM



Re: [PATCH v11 06/10] hvf: arm: Implement -cpu host

2021-09-16 Thread Ard Biesheuvel
On Thu, 16 Sept 2021 at 17:56, Peter Maydell  wrote:
>
> On Thu, 16 Sept 2021 at 16:30, Alexander Graf  wrote:
> >
> >
> > On 16.09.21 14:24, Peter Maydell wrote:
> > > On Wed, 15 Sept 2021 at 19:10, Alexander Graf  wrote:
> > >> Now that we have working system register sync, we push more target CPU
> > >> properties into the virtual machine. That might be useful in some
> > >> situations, but is not the typical case that users want.
> > >>
> > >> So let's add a -cpu host option that allows them to explicitly pass all
> > >> CPU capabilities of their host CPU into the guest.
> > >>
> > >> Signed-off-by: Alexander Graf 
> > >> Acked-by: Roman Bolshakov 
> > >> Reviewed-by: Sergio Lopez 
> > >>
> > >> +/*
> > >> + * A scratch vCPU returns SCTLR 0, so let's fill our default with 
> > >> the M1
> > >> + * boot SCTLR from https://github.com/AsahiLinux/m1n1/issues/97
>
> Side note: SCTLR_EL1 is a 64-bit register, do you have anything that
> prints the full 64-bits to confirm that [63:32] are indeed all 0?
>
> > >> + */
> > >> +ahcf->reset_sctlr = 0x30100180;
> > >> +/* OVMF chokes on boot if SPAN is not set, so default it to on */
> > >> +ahcf->reset_sctlr |= 0x00800000;
> > > Isn't that just an OVMF bug ? If you want this then you need to
> > > convince me why this isn't just a workaround for a buggy guest.
> >
> >
> > I couldn't find anything in the ARMv8 spec that explicitly says "If you
> > support PAN, SCTLR.SPAN should be 1 by default". It is RES1 for CPUs
> > that do not implement PAN. Beware that for SPAN, "1" means disabled and
> > "0" means enabled.
>
> It's UNKNOWN on reset. So unless OVMF is relying on whatever
> is launching it to set SCTLR correctly (ie there is some part of
> the "firmware-to-OVMF" contract it is relying on) then it seems to
> me that it's OVMF's job to initialize it to what it needs. (Lots of
> SCTLR is like that.)
>
> Linux does this here:
>  
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/head.S?h=v5.15-rc1#n485
>  
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/sysreg.h?h=v5.15-rc1#n695
> because the INIT_SCTLR_EL1_MMU_OFF constant includes forcing
> all "this kernel expects these to be RES0/RES1 because that's all
> the architectural features we know about at this time" bits to
> their RESn values.
>
> But we can probably construct an argument for why having it set
> makes sense, yes.
>

I'd argue that compliance with the architecture means that the
software should not clear RES1 bits, but I don't think we can blame it
for not touching bits that were in an invalid state upon entry.



Re: [PATCH for-6.2] hw/arm/virt_acpi_build: Generate DBG2 table

2021-08-10 Thread Ard Biesheuvel
On Tue, 10 Aug 2021 at 15:11, Samer El-Haj-Mahmoud
 wrote:
>
>
>
> > -Original Message-
> > From: Eric Auger 
> > Sent: Tuesday, August 10, 2021 6:25 AM
> > To: Ard Biesheuvel 
> > Cc: eric.auger@gmail.com; Michael S. Tsirkin ; Igor
> > Mammedov ; Philippe Mathieu-Daudé
> > ; Peter Maydell ; Shannon
> > Zhao ; qemu-arm ;
> > qemu-devel@nongnu.org; Andrew Jones ;
> > gs...@redhat.com; Samer El-Haj-Mahmoud  > mahm...@arm.com>; Al Stone ; j...@redhat.com
> > Subject: Re: [PATCH for-6.2] hw/arm/virt_acpi_build: Generate DBG2 table
> >
> > Hello Ard,
> > On 8/10/21 11:36 AM, Ard Biesheuvel wrote:
> > > On Tue, 10 Aug 2021 at 10:31, Eric Auger  wrote:
> > >> The ARM SBBR specification mandates a DBG2 table (Debug Port Table 2),
> > >> which allows one or more debug ports to be described.
> > >>
> > >> Generate a DBG2 table featuring a single debug port, the PL011.
> > >>
> > >> The DBG2 specification can be found at:
> > >> https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/acpi-
> > debug-port-table?redirectedfrom=MSDN
> > >>
> > > Have the legal issues around this table been resolved in the mean
> > > time?
> > I don't know exactly what they are. Adding Al and Jon in the loop they
> > have more information about this.
> > How did you resolve the issue for EDK2
> > (DynamicTablesPkg/Library/Acpi/Arm/AcpiDbg2LibArm/Dbg2Generator.c)?
> > >  Also, any clue why this table is mandatory to begin with? The
> > > SBBR has been very trigger happy lately with making things mandatory
> > > that aren't truly required from a functional perspective.
> > It seems there are kernel FW test suites that check all mandated tables
> > are available and they currently fail for ARM virt.
> > Indeed from a function pov, I don't know much about its usage on ARM.
> >
> > Maybe the SBBR spec should not flag the DBG2 as mandatory and test
> > suites shall be updated. I think this should be clarified at ARM then,
> > all the more so if there are legal issues as its spec is owned by Microsoft?
> >
>
> DBG2 has been required in SBBR since SBBR ver 1.0 (published 2016, with the 
> 0.9 draft since 2014)
> https://developer.arm.com/documentation/den0044/b/?lang=en
>
> SBBR requires DBG2 because Windows requires it on all systems: 
> https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/acpi-system-description-tables#debug-port-table-2-dbg2
>  , and Windows is one of the key OSes targeted by SBBR.
>
> The DBG2 (and SPCR) spec license issue has been resolved since August 2015. 
> Microsoft updated both specs with identical license language, giving patent 
> rights for implementations under the Microsoft Community Promise, and the 
> Open Web Foundation (OWF) 1.0 agreement.
>
> DBG2: 
> https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/acpi-debug-port-table
> SPCR: 
> https://docs.microsoft.com/en-us/windows-hardware/drivers/serports/serial-port-console-redirection-table
>

Thanks Samer, for stating this on record here - and apologies for
suggesting that this was another frivolous addition to a recent SBBR
revision.

As for the difference between the two: SPCR describes the serial
console, an actual interactive console used for maintenance, which
exists in addition to the full-blown Windows GUI (always the primary
interface).

DBG2 describes a debug port, which is used by the kernel debugger,
if I am not mistaken. So SPCR and DBG2 are complementary, and it does
make sense to have both.



Re: [PATCH for-6.2] hw/arm/virt_acpi_build: Generate DBG2 table

2021-08-10 Thread Ard Biesheuvel
On Tue, 10 Aug 2021 at 10:31, Eric Auger  wrote:
>
> The ARM SBBR specification mandates a DBG2 table (Debug Port Table 2),
> which allows one or more debug ports to be described.
>
> Generate a DBG2 table featuring a single debug port, the PL011.
>
> The DBG2 specification can be found at:
> https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/acpi-debug-port-table?redirectedfrom=MSDN
>

Have the legal issues around this table been resolved in the mean
time? Also, any clue why this table is mandatory to begin with? The
SBBR has been very trigger happy lately with making things mandatory
that aren't truly required from a functional perspective.


> Signed-off-by: Eric Auger 
>
> ---
>
> Tested by comparing the content with the table generated
> by EDK2 along with the SBSA-REF machine (code generated by
> DynamicTablesPkg/Library/Acpi/Arm/AcpiDbg2LibArm/Dbg2Generator.c).
>
> I reused the Generic Address Structure filled by QEMU in the SPCR, i.e.
> bit_width = 8 and byte access, while EDK2 sets bit_width = 32 and
> dword access. Also the name exposed by the acpica tools is different:
> 'COM0' in my case, whereas it is '\_SB.COM0' in the SBSA-REF case?
> ---
>  hw/arm/virt-acpi-build.c| 77 -
>  include/hw/acpi/acpi-defs.h | 50 
>  2 files changed, 126 insertions(+), 1 deletion(-)
>
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 037cc1fd82..35f27b41df 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -563,6 +563,78 @@ build_gtdt(GArray *table_data, BIOSLinker *linker, 
> VirtMachineState *vms)
>   vms->oem_table_id);
>  }
>
> +#define ACPI_DBG2_PL011_UART_LENGTH 0x1000
> +
> +/* DBG2 */
> +static void
> +build_dbg2(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> +{
> +int addr_offset, addrsize_offset, namespace_offset, namespace_length;
> +const MemMapEntry *uart_memmap = &vms->memmap[VIRT_UART];
> +struct AcpiGenericAddress *base_address;
> +int dbg2_start = table_data->len;
> +AcpiDbg2Device *dbg2dev;
> +char name[] = "COM0";
> +AcpiDbg2Table *dbg2;
> +uint32_t *addr_size;
> +uint8_t *namespace;
> +
> +dbg2 = acpi_data_push(table_data, sizeof *dbg2);
> +dbg2->info_offset = sizeof *dbg2;
> +dbg2->info_count = 1;
> +
> +/* debug device info structure */
> +
> +dbg2dev = acpi_data_push(table_data, sizeof(AcpiDbg2Device));
> +
> +dbg2dev->revision = 0;
> +namespace_length = sizeof name;
> +dbg2dev->length = sizeof *dbg2dev + sizeof(struct AcpiGenericAddress) +
> +  4 + namespace_length;
> +dbg2dev->register_count = 1;
> +
> +addr_offset = sizeof *dbg2dev;
> +addrsize_offset = addr_offset + sizeof(struct AcpiGenericAddress);
> +namespace_offset = addrsize_offset + 4;
> +
> +dbg2dev->namepath_length = cpu_to_le16(namespace_length);
> +dbg2dev->namepath_offset = cpu_to_le16(namespace_offset);
> +dbg2dev->oem_data_length = cpu_to_le16(0);
> +dbg2dev->oem_data_offset = cpu_to_le16(0); /* No OEM data is present */
> +dbg2dev->port_type = cpu_to_le16(ACPI_DBG2_SERIAL_PORT);
> +dbg2dev->port_subtype = cpu_to_le16(ACPI_DBG2_ARM_PL011);
> +
> +dbg2dev->base_address_offset = cpu_to_le16(addr_offset);
> +dbg2dev->address_size_offset = cpu_to_le16(addrsize_offset);
> +
> +/*
> + * variable length content:
> + * BaseAddressRegister[1]
> + * AddressSize[1]
> + * NamespaceString[1]
> + */
> +
> +base_address = acpi_data_push(table_data,
> +  sizeof(struct AcpiGenericAddress));
> +
> +base_address->space_id = AML_SYSTEM_MEMORY;
> +base_address->bit_width = 8;
> +base_address->bit_offset = 0;
> +base_address->access_width = 1;
> +base_address->address = cpu_to_le64(uart_memmap->base);
> +
> +addr_size = acpi_data_push(table_data, sizeof *addr_size);
> +*addr_size = cpu_to_le32(ACPI_DBG2_PL011_UART_LENGTH);
> +
> +namespace = acpi_data_push(table_data, namespace_length);
> +memcpy(namespace, name, namespace_length);
> +
> +build_header(linker, table_data,
> + (void *)(table_data->data + dbg2_start), "DBG2",
> + table_data->len - dbg2_start, 3, vms->oem_id,
> + vms->oem_table_id);
> +}
> +
>  /* MADT */
>  static void
>  build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> @@ -790,7 +862,7 @@ void virt_acpi_build(VirtMachineState *vms, 
> AcpiBuildTables *tables)
>  dsdt = tables_blob->len;
>  build_dsdt(tables_blob, tables->linker, vms);
>
> -/* FADT MADT GTDT MCFG SPCR pointed to by RSDT */
> +/* FADT MADT GTDT MCFG SPCR DBG2 pointed to by RSDT */
>  acpi_add_table(table_offsets, tables_blob);
>  build_fadt_rev5(tables_blob, tables->linker, vms, dsdt);
>
> @@ -813,6 +885,9 @@ void virt_acpi_build(VirtMachineState *vms, 
> AcpiBuildTables *tables)
>  

Re: Windows on ARM64 not able to use attached TPM 2

2021-08-02 Thread Ard Biesheuvel
On Mon, 2 Aug 2021 at 11:51, Eric Auger  wrote:
>
> and also adding Ard if he is aware of any limitation the TPM2
> integration may suffer for Windows support. On my end I am only able to
> test on Linux atm.
>

I never tested Windows with the TPM2 support, so I cannot answer this,
unfortunately.

>
> On 8/2/21 11:04 AM, Philippe Mathieu-Daudé wrote:
> > Cc'ing Marc-André who is your EDK2 co-maintainer.
> >
> > On 8/1/21 2:28 AM, Stefan Berger wrote:
> >> Hello!
> >>
> >>  I maintain the TPM support in QEMU and the TPM emulator (swtpm). I have
> >> a report from a user who would like to use QEMU on ARM64 (aarch64) with
> >> EDK2 and use an attached TPM 2 but it doesn't seem to work for him. We
> >> know that Windows on x86_64 works with EDK2 and can use an attached TPM
> >> 2 (using swtpm). I don't have an aarch64 host myself nor a Microsoft
> >> account to be able to access the Windows ARM64 version, so maybe someone
> >> here has the necessary background, credentials, and hardware to run QEMU
> >> under KVM and investigate what the problems may be on that
> >> platform.
> >>
> >> https://github.com/stefanberger/swtpm/issues/493
> >>
> >> On Linux it seems to access the TPM emulator with the normal tpm_tis
> >> driver.
> >>
> >> Regards,
> >>
> >>Stefan
> >>
> >>
> >>
>



Re: aarch64 efi boot failures with qemu 6.0+

2021-07-28 Thread Ard Biesheuvel
On Wed, 28 Jul 2021 at 15:11, Michael S. Tsirkin  wrote:
>
> On Tue, Jul 27, 2021 at 12:36:03PM +0200, Igor Mammedov wrote:
> > On Tue, 27 Jul 2021 05:01:23 -0400
> > "Michael S. Tsirkin"  wrote:
> >
> > > On Mon, Jul 26, 2021 at 10:12:38PM -0700, Guenter Roeck wrote:
> > > > On 7/26/21 9:45 PM, Michael S. Tsirkin wrote:
> > > > > On Mon, Jul 26, 2021 at 06:00:57PM +0200, Ard Biesheuvel wrote:
> > > > > > (cc Bjorn)
> > > > > >
> > > > > > On Mon, 26 Jul 2021 at 11:08, Philippe Mathieu-Daudé 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On 7/26/21 12:56 AM, Guenter Roeck wrote:
> > > > > > > > On 7/25/21 3:14 PM, Michael S. Tsirkin wrote:
> > > > > > > > > On Sat, Jul 24, 2021 at 11:52:34AM -0700, Guenter Roeck wrote:
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > starting with qemu v6.0, some of my aarch64 efi boot tests 
> > > > > > > > > > no longer
> > > > > > > > > > work. Analysis shows that PCI devices with IO ports do not 
> > > > > > > > > > instantiate
> > > > > > > > > > in qemu v6.0 (or v6.1-rc0) when booting through efi. The 
> > > > > > > > > > problem affects
> > > > > > > > > > (at least) ne2k_pci, tulip, dc390, and am53c974. The 
> > > > > > > > > > problem only
> > > > > > > > > > affects
> > > > > > > > > > aarch64, not x86/x86_64.
> > > > > > > > > >
> > > > > > > > > > I bisected the problem to commit 0cf8882fd0 ("acpi/gpex: 
> > > > > > > > > > Inform os to
> > > > > > > > > > keep firmware resource map"). Since this commit, PCI device 
> > > > > > > > > > BAR
> > > > > > > > > > allocation has changed. Taking tulip as example, the kernel 
> > > > > > > > > > reports
> > > > > > > > > > the following PCI bar assignments when running qemu v5.2.
> > > > > > > > > >
> > > > > > > > > > [3.921801] pci :00:01.0: [1011:0019] type 00 class 
> > > > > > > > > > 0x02
> > > > > > > > > > [3.922207] pci :00:01.0: reg 0x10: [io  
> > > > > > > > > > 0x-0x007f]
> > > > > > > > > > [3.922505] pci :00:01.0: reg 0x14: [mem 
> > > > > > > > > > 0x1000-0x107f]
> > > > > >
> > > > > > IIUC, these lines are read back from the BARs
> > > > > >
> > > > > > > > > > [3.927111] pci :00:01.0: BAR 0: assigned [io  
> > > > > > > > > > 0x1000-0x107f]
> > > > > > > > > > [3.927455] pci :00:01.0: BAR 1: assigned [mem
> > > > > > > > > > 0x1000-0x107f]
> > > > > > > > > >
> > > > > >
> > > > > > ... and this is the assignment created by the kernel.
> > > > > >
> > > > > > > > > > With qemu v6.0, the assignment is reported as follows.
> > > > > > > > > >
> > > > > > > > > > [3.922887] pci :00:01.0: [1011:0019] type 00 class 
> > > > > > > > > > 0x02
> > > > > > > > > > [3.923278] pci :00:01.0: reg 0x10: [io  
> > > > > > > > > > 0x-0x007f]
> > > > > > > > > > [3.923451] pci :00:01.0: reg 0x14: [mem 
> > > > > > > > > > 0x1000-0x107f]
> > > > > > > > > >
> > > > > >
> > > > > > The problem here is that Linux, for legacy reasons, does not support
> > > > > > I/O ports <= 0x1000 on PCI, so the I/O assignment created by EFI is
> > > > > > rejected.
> > > > > >
> > > > > > This might make sense on x86, where legacy I/O ports may exist, but 
> > > > > > on
> > > > > > other architectures, this makes no sense.

Re: aarch64 efi boot failures with qemu 6.0+

2021-07-27 Thread Ard Biesheuvel
(+ Lorenzo)

On Tue, 27 Jul 2021 at 12:07, Michael S. Tsirkin  wrote:
>
> On Tue, Jul 27, 2021 at 11:50:23AM +0200, Ard Biesheuvel wrote:
> > On Tue, 27 Jul 2021 at 11:30, Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Jul 27, 2021 at 09:04:20AM +0200, Ard Biesheuvel wrote:
> > > > On Tue, 27 Jul 2021 at 07:12, Guenter Roeck  wrote:
> > > > >
> > > > > On 7/26/21 9:45 PM, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jul 26, 2021 at 06:00:57PM +0200, Ard Biesheuvel wrote:
> > > > > >> (cc Bjorn)
> > > > > >>
> > > > > >> On Mon, 26 Jul 2021 at 11:08, Philippe Mathieu-Daudé 
> > > > > >>  wrote:
> > > > > >>>
> > > > > >>> On 7/26/21 12:56 AM, Guenter Roeck wrote:
> > > > > >>>> On 7/25/21 3:14 PM, Michael S. Tsirkin wrote:
> > > > > >>>>> On Sat, Jul 24, 2021 at 11:52:34AM -0700, Guenter Roeck wrote:
> > > > > >>>>>> Hi all,
> > > > > >>>>>>
> > > > > >>>>>> starting with qemu v6.0, some of my aarch64 efi boot tests no 
> > > > > >>>>>> longer
> > > > > >>>>>> work. Analysis shows that PCI devices with IO ports do not 
> > > > > >>>>>> instantiate
> > > > > >>>>>> in qemu v6.0 (or v6.1-rc0) when booting through efi. The 
> > > > > >>>>>> problem affects
> > > > > >>>>>> (at least) ne2k_pci, tulip, dc390, and am53c974. The problem 
> > > > > >>>>>> only
> > > > > >>>>>> affects
> > > > > >>>>>> aarch64, not x86/x86_64.
> > > > > >>>>>>
> > > > > >>>>>> I bisected the problem to commit 0cf8882fd0 ("acpi/gpex: 
> > > > > >>>>>> Inform os to
> > > > > >>>>>> keep firmware resource map"). Since this commit, PCI device BAR
> > > > > >>>>>> allocation has changed. Taking tulip as example, the kernel 
> > > > > >>>>>> reports
> > > > > >>>>>> the following PCI bar assignments when running qemu v5.2.
> > > > > >>>>>>
> > > > > >>>>>> [3.921801] pci :00:01.0: [1011:0019] type 00 class 
> > > > > >>>>>> 0x02
> > > > > >>>>>> [3.922207] pci :00:01.0: reg 0x10: [io  0x-0x007f]
> > > > > >>>>>> [3.922505] pci :00:01.0: reg 0x14: [mem 
> > > > > >>>>>> 0x1000-0x107f]
> > > > > >>
> > > > > >> IIUC, these lines are read back from the BARs
> > > > > >>
> > > > > >>>>>> [3.927111] pci :00:01.0: BAR 0: assigned [io  
> > > > > >>>>>> 0x1000-0x107f]
> > > > > >>>>>> [3.927455] pci :00:01.0: BAR 1: assigned [mem
> > > > > >>>>>> 0x1000-0x107f]
> > > > > >>>>>>
> > > > > >>
> > > > > >> ... and this is the assignment created by the kernel.
> > > > > >>
> > > > > >>>>>> With qemu v6.0, the assignment is reported as follows.
> > > > > >>>>>>
> > > > > >>>>>> [3.922887] pci :00:01.0: [1011:0019] type 00 class 
> > > > > >>>>>> 0x02
> > > > > >>>>>> [3.923278] pci :00:01.0: reg 0x10: [io  0x-0x007f]
> > > > > >>>>>> [3.923451] pci :00:01.0: reg 0x14: [mem 
> > > > > >>>>>> 0x1000-0x107f]
> > > > > >>>>>>
> > > > > >>
> > > > > >> The problem here is that Linux, for legacy reasons, does not 
> > > > > >> support
> > > > > >> I/O ports <= 0x1000 on PCI, so the I/O assignment created by EFI is
> > > > > >> rejected.
> > > > > >>
> > > > > >> This might make sense on x86, where legacy I/O ports may exist, 
> > > > > >> but on other architectures, this makes no sense.

Re: aarch64 efi boot failures with qemu 6.0+

2021-07-27 Thread Ard Biesheuvel
On Tue, 27 Jul 2021 at 11:30, Michael S. Tsirkin  wrote:
>
> On Tue, Jul 27, 2021 at 09:04:20AM +0200, Ard Biesheuvel wrote:
> > On Tue, 27 Jul 2021 at 07:12, Guenter Roeck  wrote:
> > >
> > > On 7/26/21 9:45 PM, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 26, 2021 at 06:00:57PM +0200, Ard Biesheuvel wrote:
> > > >> (cc Bjorn)
> > > >>
> > > >> On Mon, 26 Jul 2021 at 11:08, Philippe Mathieu-Daudé 
> > > >>  wrote:
> > > >>>
> > > >>> On 7/26/21 12:56 AM, Guenter Roeck wrote:
> > > >>>> On 7/25/21 3:14 PM, Michael S. Tsirkin wrote:
> > > >>>>> On Sat, Jul 24, 2021 at 11:52:34AM -0700, Guenter Roeck wrote:
> > > >>>>>> Hi all,
> > > >>>>>>
> > > >>>>>> starting with qemu v6.0, some of my aarch64 efi boot tests no 
> > > >>>>>> longer
> > > >>>>>> work. Analysis shows that PCI devices with IO ports do not 
> > > >>>>>> instantiate
> > > >>>>>> in qemu v6.0 (or v6.1-rc0) when booting through efi. The problem 
> > > >>>>>> affects
> > > >>>>>> (at least) ne2k_pci, tulip, dc390, and am53c974. The problem only
> > > >>>>>> affects
> > > >>>>>> aarch64, not x86/x86_64.
> > > >>>>>>
> > > >>>>>> I bisected the problem to commit 0cf8882fd0 ("acpi/gpex: Inform os 
> > > >>>>>> to
> > > >>>>>> keep firmware resource map"). Since this commit, PCI device BAR
> > > >>>>>> allocation has changed. Taking tulip as example, the kernel reports
> > > >>>>>> the following PCI bar assignments when running qemu v5.2.
> > > >>>>>>
> > > >>>>>> [3.921801] pci :00:01.0: [1011:0019] type 00 class 0x02
> > > >>>>>> [3.922207] pci :00:01.0: reg 0x10: [io  0x-0x007f]
> > > >>>>>> [3.922505] pci :00:01.0: reg 0x14: [mem 
> > > >>>>>> 0x1000-0x107f]
> > > >>
> > > >> IIUC, these lines are read back from the BARs
> > > >>
> > > >>>>>> [3.927111] pci :00:01.0: BAR 0: assigned [io  
> > > >>>>>> 0x1000-0x107f]
> > > >>>>>> [3.927455] pci :00:01.0: BAR 1: assigned [mem
> > > >>>>>> 0x1000-0x107f]
> > > >>>>>>
> > > >>
> > > >> ... and this is the assignment created by the kernel.
> > > >>
> > > >>>>>> With qemu v6.0, the assignment is reported as follows.
> > > >>>>>>
> > > >>>>>> [3.922887] pci :00:01.0: [1011:0019] type 00 class 0x02
> > > >>>>>> [3.923278] pci :00:01.0: reg 0x10: [io  0x-0x007f]
> > > >>>>>> [3.923451] pci :00:01.0: reg 0x14: [mem 
> > > >>>>>> 0x1000-0x107f]
> > > >>>>>>
> > > >>
> > > >> The problem here is that Linux, for legacy reasons, does not support
> > > >> I/O ports <= 0x1000 on PCI, so the I/O assignment created by EFI is
> > > >> rejected.
> > > >>
> > > >> This might make sense on x86, where legacy I/O ports may exist, but on
> > > >> other architectures, this makes no sense.
> > > >
> > > >
> > > > Fixing Linux makes sense but OTOH EFI probably shouldn't create mappings
> > > > that trip up existing guests, right?
> > > >
> > >
> > > I think it is difficult to draw a line. Sure, maybe EFI should not create
> > > such mappings, but then maybe qemu should not suddenly start to enforce
> > > those mappings for existing guests either.
> > >
> >
> > EFI creates the mappings primarily for itself, and up until DSM #5
> > started to be enforced, all PCI resource allocations that existed at
> > boot were ignored by Linux and recreated from scratch.
> >
> > Also, the commit in question looks dubious to me. I don't think it is
> > likely that Linux would fail to create a resource tree. What does
> > happen is that BARs get moved around, which may cause trouble in some
> > cases: for instance, we had to add special code to the EFI framebuffer
> > driver to cope with framebuffer BARs being relocated.

Re: aarch64 efi boot failures with qemu 6.0+

2021-07-27 Thread Ard Biesheuvel
On Tue, 27 Jul 2021 at 07:12, Guenter Roeck  wrote:
>
> On 7/26/21 9:45 PM, Michael S. Tsirkin wrote:
> > On Mon, Jul 26, 2021 at 06:00:57PM +0200, Ard Biesheuvel wrote:
> >> (cc Bjorn)
> >>
> >> On Mon, 26 Jul 2021 at 11:08, Philippe Mathieu-Daudé  
> >> wrote:
> >>>
> >>> On 7/26/21 12:56 AM, Guenter Roeck wrote:
> >>>> On 7/25/21 3:14 PM, Michael S. Tsirkin wrote:
> >>>>> On Sat, Jul 24, 2021 at 11:52:34AM -0700, Guenter Roeck wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> starting with qemu v6.0, some of my aarch64 efi boot tests no longer
> >>>>>> work. Analysis shows that PCI devices with IO ports do not instantiate
> >>>>>> in qemu v6.0 (or v6.1-rc0) when booting through efi. The problem 
> >>>>>> affects
> >>>>>> (at least) ne2k_pci, tulip, dc390, and am53c974. The problem only
> >>>>>> affects
> >>>>>> aarch64, not x86/x86_64.
> >>>>>>
> >>>>>> I bisected the problem to commit 0cf8882fd0 ("acpi/gpex: Inform os to
> >>>>>> keep firmware resource map"). Since this commit, PCI device BAR
> >>>>>> allocation has changed. Taking tulip as example, the kernel reports
> >>>>>> the following PCI bar assignments when running qemu v5.2.
> >>>>>>
> >>>>>> [3.921801] pci :00:01.0: [1011:0019] type 00 class 0x02
> >>>>>> [3.922207] pci :00:01.0: reg 0x10: [io  0x-0x007f]
> >>>>>> [3.922505] pci :00:01.0: reg 0x14: [mem 0x1000-0x107f]
> >>
> >> IIUC, these lines are read back from the BARs
> >>
> >>>>>> [3.927111] pci :00:01.0: BAR 0: assigned [io  0x1000-0x107f]
> >>>>>> [3.927455] pci :00:01.0: BAR 1: assigned [mem
> >>>>>> 0x1000-0x107f]
> >>>>>>
> >>
> >> ... and this is the assignment created by the kernel.
> >>
> >>>>>> With qemu v6.0, the assignment is reported as follows.
> >>>>>>
> >>>>>> [3.922887] pci :00:01.0: [1011:0019] type 00 class 0x02
> >>>>>> [3.923278] pci :00:01.0: reg 0x10: [io  0x-0x007f]
> >>>>>> [3.923451] pci :00:01.0: reg 0x14: [mem 0x1000-0x107f]
> >>>>>>
> >>
> >> The problem here is that Linux, for legacy reasons, does not support
> >> I/O ports <= 0x1000 on PCI, so the I/O assignment created by EFI is
> >> rejected.
> >>
> >> This might make sense on x86, where legacy I/O ports may exist, but on
> >> other architectures, this makes no sense.
> >
> >
> > Fixing Linux makes sense but OTOH EFI probably shouldn't create mappings
> > that trip up existing guests, right?
> >
>
> I think it is difficult to draw a line. Sure, maybe EFI should not create
> such mappings, but then maybe qemu should not suddenly start to enforce
> those mappings for existing guests either.
>

EFI creates the mappings primarily for itself, and up until DSM #5
started to be enforced, all PCI resource allocations that existed at
boot were ignored by Linux and recreated from scratch.

Also, the commit in question looks dubious to me. I don't think it is
likely that Linux would fail to create a resource tree. What does
happen is that BARs get moved around, which may cause trouble in some
cases: for instance, we had to add special code to the EFI framebuffer
driver to cope with framebuffer BARs being relocated.

> For my own testing, I simply reverted commit 0cf8882fd0 in my copy of
> qemu. That solves my immediate problem, giving us time to find a solution
> that is acceptable for everyone. After all, it doesn't look like anyone
> else has noticed the problem, so there is no real urgency.
>

I would argue that it is better to revert that commit. DSM #5 has a
long history of debate and misinterpretation, and while I think we
ended up with something sane, I don't think we should be using it in
this particular case.



Re: aarch64 efi boot failures with qemu 6.0+

2021-07-26 Thread Ard Biesheuvel
(cc Bjorn)

On Mon, 26 Jul 2021 at 11:08, Philippe Mathieu-Daudé  wrote:
>
> On 7/26/21 12:56 AM, Guenter Roeck wrote:
> > On 7/25/21 3:14 PM, Michael S. Tsirkin wrote:
> >> On Sat, Jul 24, 2021 at 11:52:34AM -0700, Guenter Roeck wrote:
> >>> Hi all,
> >>>
> >>> starting with qemu v6.0, some of my aarch64 efi boot tests no longer
> >>> work. Analysis shows that PCI devices with IO ports do not instantiate
> >>> in qemu v6.0 (or v6.1-rc0) when booting through efi. The problem affects
> >>> (at least) ne2k_pci, tulip, dc390, and am53c974. The problem only
> >>> affects
> >>> aarch64, not x86/x86_64.
> >>>
> >>> I bisected the problem to commit 0cf8882fd0 ("acpi/gpex: Inform os to
> >>> keep firmware resource map"). Since this commit, PCI device BAR
> >>> allocation has changed. Taking tulip as example, the kernel reports
> >>> the following PCI bar assignments when running qemu v5.2.
> >>>
> >>> [3.921801] pci :00:01.0: [1011:0019] type 00 class 0x02
> >>> [3.922207] pci :00:01.0: reg 0x10: [io  0x-0x007f]
> >>> [3.922505] pci :00:01.0: reg 0x14: [mem 0x1000-0x107f]

IIUC, these lines are read back from the BARs

> >>> [3.927111] pci :00:01.0: BAR 0: assigned [io  0x1000-0x107f]
> >>> [3.927455] pci :00:01.0: BAR 1: assigned [mem
> >>> 0x1000-0x107f]
> >>>

... and this is the assignment created by the kernel.

> >>> With qemu v6.0, the assignment is reported as follows.
> >>>
> >>> [3.922887] pci :00:01.0: [1011:0019] type 00 class 0x02
> >>> [3.923278] pci :00:01.0: reg 0x10: [io  0x-0x007f]
> >>> [3.923451] pci :00:01.0: reg 0x14: [mem 0x1000-0x107f]
> >>>

The problem here is that Linux, for legacy reasons, does not support
I/O ports <= 0x1000 on PCI, so the I/O assignment created by EFI is
rejected.

This might make sense on x86, where legacy I/O ports may exist, but on
other architectures, this makes no sense.
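
The legacy floor comes from PCIBIOS_MIN_IO, which arm64 defines as 0x1000 even though it has no ISA ports, and which the resource assignment path uses as the minimum for I/O BARs. A simplified sketch of the effect (not the kernel's exact code):

#include <stdbool.h>
#include <stdint.h>

#define PCIBIOS_MIN_IO 0x1000   /* arm64 inherits this x86-style floor */

/*
 * When Linux (re)assigns an I/O BAR it clamps the allocation to
 * PCIBIOS_MIN_IO, so a firmware assignment at port 0x0000 can never be
 * reproduced and the device ends up without usable I/O resources.
 */
static bool io_bar_is_reachable(uint64_t start)
{
    return start >= PCIBIOS_MIN_IO;
}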


> >>> and the controller does not instantiate. The problem disapears after
> >>> reverting commit 0cf8882fd0.
> >>>
> >>> Attached is a summary of test runs with various devices and qemu v5.2
> >>> as well as qemu v6.0, and the command line I use for efi boots.
> >>>
> >>> Did commit 0cf8882fd0 introduce a bug, do I now need need some different
> >>> command line to instantiate PCI devices with io ports, or are such
> >>> devices
> >>> simply no longer supported if the system is booted with efi support ?
> >>>
> >>> Thanks,
> >>> Guenter
> >>
> >>
> >> So that commit basically just says don't ignore what efi did.
> >>
> >> The issue's thus likely efi.
> >>
> >
> > I don't see the problem with efi boots on x86 and x86_64.
> > Any idea why that might be the case ?
> >
> > Thanks,
> > Guenter
> >
> >> Cc the maintainer. Philippe can you comment pls?
>
> I'll have a look. Cc'ing Ard for EDK2/Aarch64.
>

So a potential workaround would be to use a different I/O resource
window for ArmVirtPkg, one that starts at 0x1000. But I would prefer to
fix Linux instead.


> >>
> >>> ---
> >>> Command line (tulip network interface):
> >>>
> >>> CMDLINE="root=/dev/vda console=ttyAMA0"
> >>> ROOTFS="rootfs.ext2"
> >>>
> >>> qemu-system-aarch64 -M virt -kernel arch/arm64/boot/Image -no-reboot \
> >>>  -m 512 -cpu cortex-a57 -no-reboot \
> >>>  -device tulip,netdev=net0 -netdev user,id=net0 \
> >>>  -bios QEMU_EFI-aarch64.fd \
> >>>  -snapshot \
> >>>  -device virtio-blk-device,drive=d0 \
> >>>  -drive file=${ROOTFS},if=none,id=d0,format=raw \
> >>>  -nographic -serial stdio -monitor none \
> >>>  --append "${CMDLINE}"
> >>>
> >>> ---
> >>> Boot tests with various devices known to work in qemu v5.2.
> >>>
> >>>                                   v5.2    v6.0       v6.0
> >>>                                   efi     non-efi    efi
> >>> e1000                             pass    pass       pass
> >>> e1000-82544gc                     pass    pass       pass
> >>> e1000-82545em                     pass    pass       pass
> >>> e1000e                            pass    pass       pass
> >>> i82550                            pass    pass       pass
> >>> i82557a                           pass    pass       pass
> >>> i82557b                           pass    pass       pass
> >>> i82557c                           pass    pass       pass
> >>> i82558a                           pass    pass       pass
> >>> i82559b                           pass    pass       pass
> >>> i82559c                           pass    pass       pass
> >>> i82559er                          pass    pass       pass
> >>> i82562                            pass    pass       pass
> >>> i82801                            pass    pass       pass
> >>> ne2k_pci                          pass    pass       fail    <--
> >>> pcnet                             pass    pass       pass
> >>> rtl8139                           pass    pass       pass
> >>> tulip                             pass    pass       fail    <--
> >>> usb-net                           pass    pass       pass
> >>> virtio-net-device                 pass    pass       pass
> >>> virtio-net-pci                    pass    pass       pass
> >>> virtio-net-pci-non-transitional   pass    pass       pass
> >>>
> >>> usb-xhci                          pass    pass       pass
> >>> usb-ehci                          pass    pass       pass
> >>> usb-ohci                          pass    pass       pass
> >>> usb-uas-xhci                      pass    pass       pass
> >>> virtio                            pass    pass       pass
> >>> 

Re: [PATCH] acpi/gpex: Inform os to keep firmware resource map

2020-12-17 Thread Ard Biesheuvel
On 12/17/20 6:23 PM, Laszlo Ersek wrote:
> On 12/17/20 14:52, Jiahui Cen wrote:
>> +Laszlo
>>
>> On 2020/12/17 21:29, Jiahui Cen wrote:
>>> There may be some differences in pci resource assignment between guest os
>>> and firmware.
>>>
>>> Eg. A Bridge with Bus [d2]
>>> -+-[:d2]---01.0-[d3]01.0
>>>
>>> where [d2:01.00] is a pcie-pci-bridge with BAR0 (mem, 64-bit, non-pref) 
>>> [size=256]
>>>   [d3:01.00] is a PCI Device with BAR0 (mem, 64-bit, pref) 
>>> [size=128K]
>>>   BAR4 (mem, 64-bit, pref) 
>>> [size=64M]
>>>
>>> In EDK2, the Resource Map would be:
>>> PciBus: Resource Map for Bridge [D2|01|00]
>>> Type = PMem64; Base = 0x800400; Length = 0x410; 
>>> Alignment = 0x3FF
>>>Base = 0x800400; Length = 0x400; Alignment = 
>>> 0x3FF;  Owner = PCI [D3|01|00:20]
>>>Base = 0x800800; Length = 0x2;   Alignment = 
>>> 0x1;Owner = PCI [D3|01|00:10]
>>> Type =  Mem64; Base = 0x800810; Length = 0x100; Alignment = 
>>> 0xFFF
>>>
>>> While in Linux, kernel will use 0x2FF as the alignment to calculate
>>> the PMem64 size, which would be 0x600.
>>>
>>> The diffences could result in resource assignment failure.
>>>
>>> Using _DSM #5 method to inform guest os not to ignore the PCI configuration
>>> that firmware has done at boot time could handle the differences.
>>>
>>> Signed-off-by: Jiahui Cen 
>>> ---
>>>  hw/pci-host/gpex-acpi.c | 11 ++-
>>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
>>> index 071aa11b5c..2b490f3379 100644
>>> --- a/hw/pci-host/gpex-acpi.c
>>> +++ b/hw/pci-host/gpex-acpi.c
>>> @@ -112,10 +112,19 @@ static void acpi_dsdt_add_pci_osc(Aml *dev)
>>>  UUID = aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D");
>>>  ifctx = aml_if(aml_equal(aml_arg(0), UUID));
>>>  ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(0)));
>>> -uint8_t byte_list[1] = {1};
>>> +uint8_t byte_list[1] = {0x21};
>>>  buf = aml_buffer(1, byte_list);
>>>  aml_append(ifctx1, aml_return(buf));
>>>  aml_append(ifctx, ifctx1);
>>> +
>>> +/* PCI Firmware Specification 3.2
>>> + * 4.6.5. _DSM for Ignoring PCI Boot Configurations
>>> + * The UUID in _DSM in this context is
>>> + * {E5C937D0-3553-4D7A-9117-EA4D19C3434D}
>>> + */
>>> +ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(5)));
>>> +aml_append(ifctx1, aml_return(aml_int(0)));
>>> +aml_append(ifctx, ifctx1);
>>>  aml_append(method, ifctx);
>>>  
>>>  byte_list[0] = 0;
>>>
>>
> 
> Seems to make sense to me (I didn't realize we already had the _DSM
> method with this GUID!), but now I'm not sure what to expect of the
> guest kernel, in light of what Ard said. So if it works now, is that by
> accident, or is it an intentional, fresh commit in the kernel? Like
> a78cf9657ba5 ("PCI/ACPI: Evaluate PCI Boot Configuration _DSM", 2019-06-21)?
> 
> Benjamin: can you please tell us something about this Linux commit? What
> was the motivation for it?
> 
> Hmm this commit seems to be a part of the following series:
> 
> a78cf9657ba5 PCI/ACPI: Evaluate PCI Boot Configuration _DSM
> 7ac0d094fbe9 PCI: Don't auto-realloc if we're preserving firmware config
> 3e8ba9686600 arm64: PCI: Allow resource reallocation if necessary
> 85dc04136e86 arm64: PCI: Preserve firmware configuration when desired
> 
> OK, after reading through the commit messages in those commits (esp.
> 7ac0d094fbe9), I think the Linux change was made exactly for the purpose
> that we want it for -- stick with the firmware assignments.
> 
> Ard, does that seem right, or am I misunderstanding the kernel series?
> 

Hmm, I had no recollection of those changes going in, but I was clearly
aware at the time, given my acks.

So we clearly do support DSM #5 to prevent resource reallocation, and it
makes sense to use it in the proposed way.
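
For reference, the guest side added by a78cf9657ba5 consumes the _DSM roughly as sketched below (paraphrased from drivers/acpi/pci_root.c, so treat the exact shape as approximate): a return value of 0 for function 5 means "do not ignore the boot configuration" and sets preserve_config on the host bridge, which in turn suppresses reallocation.

#include <linux/acpi.h>
#include <linux/pci.h>
#include <linux/pci-acpi.h>

/* Paraphrased from acpi_pci_root_create(); names approximate. */
static void preserve_boot_config_if_requested(struct pci_bus *bus,
                                              struct pci_host_bridge *host_bridge)
{
    union acpi_object *obj;

    obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1,
                            DSM_PCI_PRESERVE_BOOT_CONFIG, NULL);
    if (obj && obj->type == ACPI_TYPE_INTEGER && obj->integer.value == 0)
        host_bridge->preserve_config = 1;
    ACPI_FREE(obj);
}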




Re: Question on UEFI ACPI tables setup and probing on arm64

2020-11-04 Thread Ard Biesheuvel

On 11/4/20 10:46 PM, Laszlo Ersek wrote:
...


(9) (Ard, please correct the below if necessary; thanks.)

The UEFI stub of the guest kernel (which is a UEFI application) uses a
device tree as its main communication channel to the (later-started)
kernel entry point, AIUI.

The UEFI stub basically inverts the importance of the UEFI system table
versus the device tree -- the UEFI stub *converts* the UEFI system table
(the multitude of UEFI config tables) into a device tree. This is my
understanding anyway.



Not entirely. The UEFI stub uses DT to communicate with the kernel 
proper, just like a non-EFI bootloader does. There are two pieces of 
information regarding EFI that the stub passes via the device tree:

- the EFI system table address
- the EFI memory map (address, size, descriptor size etc)

(Aside: unfortunately, we cannot pass the latter information via an EFI 
configuration table, given that we call SetVirtualAddressMap() in the 
stub, which causes the config_tables member of the system table to be 
converted into a virtual address. That virtual address can only be 
converted into a physical address if we have access to the EFI memory map.)


All other information passed between the EFI stub and the kernel proper 
is passed via Linux-specific EFI configuration tables.



(9a) If ACPI was disabled on the QEMU command line, then the guest
kernel *adopts* the device tree that was forwarded to it in (6), via the
UEFI config table marked with DEVICE_TREE_GUID.



Yes, although the EFI stub updates/augments it with the two data items 
mentioned above, as well as the kernel command line, initrd base and 
size and a KASLR seed [if enabled].



(9b) If ACPI was enabled on the QEMU command line, then the UEFI stub
creates a brand new (empty) device tree (AIUI).



... unless GRUB executed first and loaded an initrd, and passed this 
information via the device tree. In this case, GRUB creates an empty DT 
(Note that I posted the GRUB patches to implement LoadFile2 based initrd 
loading just a week or so ago)



Either way, the UEFI system table is linked *under* the -- adopted or
new -- device tree, through the "chosen" node. And so, if ACPI was
enabled, the ACPI RSD PTR (coming from step (7)) becomes visible to the
kernel proper as well, through the UEFI config table with
ACPI_20_TABLE_GUID.

I believe this is implemented under "drivers/firmware/efi/libstub" in
the kernel tree.

Thanks,
Laszlo






Re: [PATCH v7 0/3] vTPM/aarch64 ACPI support

2020-06-24 Thread Ard Biesheuvel
On Mon, 22 Jun 2020 at 16:06, Eric Auger  wrote:
>
> Those patches bring MMIO TPM TIS ACPI support in machvirt.
>
> On ARM, the TPM2 table is added when the TPM TIS sysbus
> device is dynamically instantiated in machvirt.
>
> Also the TPM2 device object is described in the DSDT.
>
> Many thanks to Ard for his support.
>
> Tested with LUKS partition automatic decryption. Also
> tested with new bios-tables-test dedicated tests,
> sent separately.
>
> Best Regards
>
> Eric
>
> This series can be found at:
> https://github.com/eauger/qemu/tree/v5.0-tpm-acpi-v7
>
> History:
>
> v6 -> v7:
> - Collected Stefan and Igor's R-bs
> - Eventually removed Acpi20TPM2 struct
> - Updated the reference to the spec v1.2 rev8
>
> v5 -> v6:
> - added reference to the spec
> - add some comments about LAML and LASA fields which are
>   strangely undocumented in the spec for TPM2.0. So I kept
>   the decision to keep the Acpi20TPM2 struct for documentation
>   purpose.
>
> v4 -> v5:
> - Move of build_tpm2() in the generic acpi code was upstreamed
>   but this does not correspond to latest proposed version.
> - Rebase on top of edfcb1f21a
>
> v3 -> v4:
> - some rework in build_tpm2() as suggested by Igor
> - Restored tpm presence check in acpi_dsdt_add_tpm()
> - add the doc related patch
>
> v2 -> v3:
> - Rebase on top of Stefan's
>   "acpi: tpm: Do not build TCPA table for TPM 2"
> - brings conversion to build_append
>
> v1 -> v2:
> - move build_tpm2() in the generic code (Michael)
> - collect Stefan's R-b on 3/3
>
> Eric Auger (3):
>   acpi: Some build_tpm2() code reshape
>   arm/acpi: Add the TPM2.0 device under the DSDT
>   docs/specs/tpm: ACPI boot now supported for TPM/ARM
>

For the series

Tested-by: Ard Biesheuvel 

Thanks!

>  docs/specs/tpm.rst  |  2 --
>  include/hw/acpi/acpi-defs.h | 18 -
>  hw/acpi/aml-build.c | 51 +++--
>  hw/arm/virt-acpi-build.c| 34 +
>  4 files changed, 66 insertions(+), 39 deletions(-)
>
> --
> 2.20.1
>



Re: [PATCH v2 3/3] arm/acpi: Add the TPM2.0 device under the DSDT

2020-05-08 Thread Ard Biesheuvel
On Fri, 8 May 2020 at 17:24, Shannon Zhao  wrote:
>
> Hi,
>
> On 2020/5/5 22:44, Eric Auger wrote:
> > +static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState *vms)
> > +{
> > +hwaddr pbus_base = vms->memmap[VIRT_PLATFORM_BUS].base;
> > +PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
> > +MemoryRegion *sbdev_mr;
> > +SysBusDevice *sbdev;
> > +hwaddr tpm_base;
> > +
> > +sbdev = (SysBusDevice *)object_dynamic_cast(OBJECT(tpm_find()),
> > +TYPE_SYS_BUS_DEVICE);
>
> Does it need to check the tpm version like you do in previous patch?
>
> tpm_get_version(tpm_find()) == TPM_VERSION_2_0
>

I don't think so. The device node could in theory be used to describe
a TPM 1.2/1.3 as well, even though we never actually do that.
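
For readers without the full patch in view, the remainder of such a DSDT entry, built with QEMU's aml_* helpers, looks roughly like the sketch below; the _HID string and the MMIO size used here are assumptions for illustration, not quoted from the committed code.

    /* Continuation sketch of acpi_dsdt_add_tpm(): describe the MMIO TIS
     * frame so the guest OS can bind its TPM TIS driver to it. */
    Aml *dev = aml_device("TPM0");
    Aml *crs = aml_resource_template();

    aml_append(dev, aml_name_decl("_HID", aml_string("MSFT0101")));
    aml_append(dev, aml_name_decl("_UID", aml_int(0)));
    aml_append(crs, aml_memory32_fixed(tpm_base, 0x5000, AML_READ_WRITE));
    aml_append(dev, aml_name_decl("_CRS", crs));
    aml_append(scope, dev);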



Re: [PATCH v2 0/3] vTPM/aarch64 ACPI support

2020-05-05 Thread Ard Biesheuvel
On Tue, 5 May 2020 at 16:44, Eric Auger  wrote:
>
> Those patches bring MMIO TPM TIS ACPI support in machvirt.
> The first patch moves the TPM2 ACPI table generation code
> in the generic code. Then the table is added if the TPM2
> sysbus device is dynamically instantiated in machvirt.
> Also the TPM2 device object is described in the DSDT.
>
> Many thanks to Ard for his support.
>
> Tested with LUKS partition automatic decryption.
>

Thanks Eric

Tested-by: Ard Biesheuvel 


