On Thu, 22 Sep 2022 20:40:08 GMT, Xue-Lei Andrew Fan <xue...@openjdk.org> wrote:

> Hi,
> 
> Please review this performance improvement for Secp256R1 implementation in 
> OpenJDK.  With this update, there is an about 20% performance improvement for 
> Secp256R1 key generation and signature.
> 
> Basically, 256 bits EC curves could use 9 integer limbs for the computation.  
> The current implementation use 10 limbs instead.  By reducing the number of 
> limbs, the implementation could benefit from less integer computation 
> (add/sub/multiply/square/inverse/mod/pow, etc), and thus improve the 
> performance.
> 
> Here are the benchmark numbers without the patch:
> 
> Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
> Signatures.sign               64  thrpt   15  1.414 ± 0.022  ops/ms
> Signatures.sign              512  thrpt   15  1.418 ± 0.004  ops/ms
> Signatures.sign             2048  thrpt   15  1.419 ± 0.005  ops/ms
> Signatures.sign            16384  thrpt   15  1.395 ± 0.003  ops/ms
> 
> KeyGenerators.keyPairGen          thrpt   15  1.475 ± 0.043  ops/ms
> 
> 
> And here are the numbers with the patch applied:
> 
> Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
> ECSignature.sign               64  thrpt   15  1.719 ± 0.010  ops/ms
> ECSignature.sign              512  thrpt   15  1.704 ± 0.012  ops/ms
> ECSignature.sign             2048  thrpt   15  1.699 ± 0.018  ops/ms
> ECSignature.sign            16384  thrpt   15  1.681 ± 0.006  ops/ms
> 
> KeyGenerators.keyPairGen           thrpt   15  1.881 ± 0.008  ops/ms
> 
> 
> Thanks,
> Xuelei

I check with the real bytes for the following computation.  For Secp256R1, the 
p is defined.

p = FFFFFFFF 00000001 00000000 00000000 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
  = 2^224(2^32 −1) + 2^192 + 2^96 − 1
  = 2^256 - 2^224 + 2^192 + 2^96 − 1


Let's use a full 256 bits integer for following computation, which is bigger 
than p:


FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF

> starting with 29 bit limbs

With 29 bit limbs, it could be expressed in 9 29-bit words in little-endian 
order, as the following:


a[] = 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 
00FFFFFF 


>From a[0] to a[7], 29 bits for each limbs, and a[8] is 24 bits.  29 * 8 + 24 = 
>256

> after 2 additions (maxAdds) we have up to 31 bits

after 2 additions, the 9-29 bits words are:

a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 
03FFFFFF

>From a[0] to a[7],  31 bits for each limbs, and a[8] is 26 bits.  This is the 
>biggest number we could get from the finite filed computation.

> then we multiply limbs (62 bits) and add up to numLimbs (9) results together 
> = 65 bits, which overflows long
> and we didn't start reducing yet.

Let's multiply two limbs:

a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 
03FFFFFF
b[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 
03FFFFFF 


The overflow we cared about is the following computation, the longest additions.

        long c8 = (a[0] * b[8]) + (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) 
+ (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) + (a[8] * b[0]);

Now we have a[0]-a[7] are 31-bit words, and b[0]-b[7] are 31-bit words. a[8] 
and b[8] is 26-bit words.

Then, we know that (a[0] * b[8]) and (a[8] * b[0]) are multiplying between 
31-bits words and 26-bits words.  So we have the equivalent computation for the 
a and b values above:

a[1]=a[2]=a[3]=a[4]=a[5]=a[6]=a[7]=0x7FFFFFFF;
b[1]=b[2]=b[3]=b[4]=b[5]=b[6]=b[7]=0x7FFFFFFF;
long i1x7 = (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + 
(a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) 
              = 0xBFFFFFF900000007L
long i0x8 =  (a[0] * b[8]) +  (a[8] * b[0]);
              = 0x03FFFFFEF8000002L

long c8 = i1x7 + i0x8
             = 0xC3FFFFF7F8000009L


c8 fills the 64 bits fully, but it does not overflow.

@djelinski does it sound right to you?

Although there is not overflow if I get the above computation right, it is 
still not a straightforward thing we can know without detailed computation.  
Instead, this kind of computation and checking could be performed in FieldGen 
code automatically, or just reduce limbs operations result (up to no more than 
twice the modulus) and remove the limit of maxAdd.

This RFE, JDK-8294248, is one of a few performance improvement of the EC 
implementation in JDK 
([JDK-8294188](https://bugs.openjdk.org/browse/JDK-8294188)).  I would like to 
address the checking or/and reducing improvement in a separately RFE.

-------------

PR: https://git.openjdk.org/jdk/pull/10398

Reply via email to