Re: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5]

Andrew Dinn Mon, 24 Feb 2025 08:24:57 -0800

On Fri, 21 Feb 2025 10:23:37 GMT, Ferenc Rakoczi <d...@openjdk.org> wrote:


>> Hi. Here is the test result of our CI.
>> 
>> ### copyright year
>> 
>> the following files should update the copyright year to 2025.
>> 
>> 
>> src/hotspot/cpu/aarch64/assembler_aarch64.hpp
>> src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp
>> src/hotspot/share/runtime/globals.hpp
>> src/java.base/share/classes/sun/security/provider/ML_DSA.java
>> src/java.base/share/classes/sun/security/provider/SHA3Parallel.java
>> test/micro/org/openjdk/bench/java/security/MLDSA.java
>> 
>> 
>> ### cross-build failure
>> 
>> Cross build for riscv64/s390/ppc64 failed.
>> 
>> Here shows the error msg for ppc64
>> 
>> 
>> === Output from failing command(s) repeated here ===
>> * For target support_interim-jmods_support__create_java.base.jmod_exec:
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), 
>> pid=72752, tid=72769
>> #  assert(allocates2(pc)) failed: not in CodeBuffer memory: 
>> 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0
>> #
>> # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 
>> 25-internal-git-1e01c6deec3)
>> # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, 
>> mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, 
>> linux-aarch64)
>> # Problematic frame:
>> # V  [libjvm.so+0x3b391c]  Instruction_aarch64::~Instruction_aarch64()+0xbc
>> #
>> # Core dump will be written. Default location: Core dumps may be processed 
>> with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or 
>> dumping to /tmp/ci-scripts/jdk-src/make/
>> #
>> # An error report file with more information is saved as:
>> # /tmp/jdk-src/make/hs_err_pid72752.log
>>    ... (rest of output omitted)
>> 
>> * All command lines available in 
>> /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs.
>> === End of repeated output ===
>> 
>> 
>> I suppose we should make the similar update at file 
>> `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` to other platforms
>
> @shqking, I changed the copyright years, but I don't really understand how 
> the aarch64-specific code can overflow buffers on other architectures. As far 
> as I understand, Instruction_aarch64 should not have been there in a ppc 
> build.
> Was this a build attempted on an aarch64 for the other architectures?

@ferakocz I have indicated a few places where I think you should add comments 
to clarify the relationship to the original Java code or just clarify what data 
is being used. I think the code is ok to go in as it is but I would really like 
to investigate a better structuring of the generator code. This can be done as 
a follow-up rather than delay getting this version committed.

There are two things I still see as problematic with the current code.

1) There are lots of places in your auxiliary generator methods and also in 
their client methods where you generate distinct sequences of calls to the 
assembler sharing essentially the same code shape i.e. the same instructions 
but with different vector register arguments. For example, in 
`dilithium_montmul32` you generate the multiply sequence to montgomery multiply 
4x4s registers in v0..v3 by 4x4s registers in v16..v19 and then repeat exactly 
the same code in exactly the same sequence to multiply the 4x4s registers in 
v4..v7 by 4x4s registers in v20..v23.

Likewise, `dilithium_sub_add_montmul16` generates that same shape code but uses 
the montmul sequence with odd registers v1..v7 paired against the compact 
sequence v16..19. As another example, you generate various 4 or 8 long 
sequences of subv and addv operations at various points, including in some of 
the top level methods.

I appreciate that you have folded one of the montmult cases into the other by 
adding the `bool by_constant` parameter to `dilithium_montmul32`. However, I 
think it would be worth investigating an alternative that would allow more use 
more, systematic use of auxiliary methods.

2) Your current auxiliary generator methods rely on a fixed mapping of input, 
output and scratch registers to specific registers. This is part of why the 
reason why you cannot always call your auxiliaries (or smaller pieces of them) 
from other locations where the same code shape is generated -- the input and 
output mappings of data to registers expected by the auxiliary do not match the 
register sequences in which the relevant data are (transiently) located.

This same fact also means that the repeated code sections heavily depend on 
naming exactly the right register on each generator line. That makes it harder 
for a maintainer to recognize how, essentially, what is really just one common, 
abstract operation is, at each different occurrence, consuming, combining and 
updating several input sequences of related registers to generate one or more 
output sequences. That also means that it would be very easy to introduce an 
error if the code ever needed to be changed.

I would like to investigate an alternative approach where your auxiliary 
generator methods and their callers pass arguments that identify the vector 
register sequences to be consumed as inputs, used as temporaries and written as 
outputs. In cases where the routines operate on sequences of 4 or 8 successive 
vectors then, at the very least, that would involve specifying the first 
register for each input, temporary or output e.g. for the montmult32 multiply 
v0+ by v16+ using v24+ as temporaries and v30+ as constants and output the 
results to v16+. However, that leaves it implicit that the first two inputs 
involve 8 registers while the temporaries involves 4 and the constants 2. The 
more general requirement is not just to specify the vector sequence length (2, 
4 or 8) but also allow the default stride of one (e.g. v0, v1, ...) to be 
varied to allow for skip sequences (e.g. v0, v2, ...) or constant sequences 
(v28, v28, ... as would be needed for multiply constant).

I have prototyped a simple vector sequence type `VRSeq<N>` that models an 
indexable sequence of FloatRegisters and allows many of your higher level 
routines to simply declare register sets they operate on and then pass them as 
arguments to a range of simply auxiliary generator functions that can be used 
in many places where you currently have a lot of inline calls to the assembler 
-- see attachment:
[vseq.zip](https://github.com/user-attachments/files/18946470/vseq.zip)

I'll raise a JIRA to cover recoding the current implementation using this type 
and post a follow-up PR that uses it to see how far it helps simplify the code. 
I believe it will make it easier for maintainers to understand the structure of 
the generated code and observe/verify the use of registers to store specific 
values. It should also allow assertions about the use of registers to be added 
to the code to ensure that values are not being overwritten (expect in 
circumstances where that is legitimate).

Meanwhile I'll approve this PR modulo the commenting I suggested.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2678977770

Re: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5]

Reply via email to