On Tue, 20 May 2025 11:51:49 GMT, Ferenc Rakoczi <d...@openjdk.org> wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 250:
>> 
>>> 248: static void montmul(int outputRegs[], int inputRegs1[], int 
>>> inputRegs2[],
>>> 249:              int scratchRegs1[], int scratchRegs2[], MacroAssembler 
>>> *_masm) {
>>> 250:    for (int i = 0; i < 4; i++) {
>> 
>> In the intrinsic for montMul we are treating as if MONT_R_BITS is 16 and 
>> MONT_Q_INV_MOD_R is 0xF301 whereas in the Java code MONT_R_BITS is 20 and 
>> MONT_Q_INT_MOD_R is 0x8F301. Are these equivalent?
>
> As used in this case, they are equivalent.  For z = montmul(a,b), z will be  
> between -q and q and congruent to a * b * R^-1 mod q, where R > 2 * q, R is a 
> power of 2, -R/2 * q <= a * b < R/2 * q. For the Java code, we use R = 2^20 
> and for the intrinsic, R = 2^16. In our computations, b is always c * R mod 
> q, so the montmul() really  computes a * c mod q. In the Java code, we use 
> 32-bit numbers for the computations, and we use R = 2^20 because that way the 
> a * b numbers that occur during all computations stay in the required range 
> (the inverse NTT computation is where they can grow the most), so we don't 
> have to do Barrett reductions during that computation. For the intrinsics, we 
> use R = 2^16, because this way we can do twice as much work in parallel, but 
> we have to do Barrett reduction after levels 2 and 4 in the inverse NTT 
> computation.

Thanks a lot for the explanation. It would be good to add it as a comment in 
the stubGenerator_x86_64_kyber.cpp.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2098491524

Reply via email to