Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le

2024-05-16 Thread Segher Boessenkool
On Wed, May 15, 2024 at 10:29:56AM +0200, Andy Polyakov wrote:
> >+static void cswap(fe51 p, fe51 q, unsigned int bit)
> 
> The "c" in cswap stands for "constant-time," and the problem is that 
> contemporary compilers have exhibited the ability to produce 
> non-constant-time machine code as result of compilation of the above 
> kind of technique.

This can happen with *any* compiler, on *any* platform.  In general,
you have to write machine code if you want to be sure what machine code
will eventually be executed.

>  The outcome is platform-specific and ironically some 
> of PPC code generators were observed to generate "most" 
> non-constant-time code. "Most" in sense that execution time variations 
> would be most easy to catch. One way to work around the problem, at 
> least for the time being, is to add 'asm volatile("" : "+r"(c))' after 
> you calculate 'c'. But there is no guarantee that the next compiler 
> version won't see through it, hence the permanent solution is to do it 
> in assembly. I can put together something...

Such tricks can help ameliorate the problem, sure.  But it is not a
solution ever.


Segher


Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le

2024-05-15 Thread Andy Polyakov
> Thanks for the info.  I should be able to do it.  I was hoping an 
> assembly guru like you can show me some tricks here if there is :)


No tricks in cswap, it's as straightforward as it gets, so go ahead :-)



Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le

2024-05-15 Thread Danny Tsen

Hi Andy,

Thanks for the info.  I should be able to do it.  I was hoping an 
assembly guru like you can show me some tricks here if there is :)


Thanks.

-Danny

On 5/15/24 8:33 AM, Andy Polyakov wrote:

+static void cswap(fe51 p, fe51 q, unsigned int bit)
+{
+    u64 t, i;
+    u64 c = 0 - (u64) bit;
+
+    for (i = 0; i < 5; ++i) {
+        t = c & (p[i] ^ q[i]);
+        p[i] ^= t;
+        q[i] ^= t;
+    }
+}


The "c" in cswap stands for "constant-time," and the problem is that 
contemporary compilers have exhibited the ability to produce 
non-constant-time machine code as result of compilation of the above 
kind of technique. The outcome is platform-specific and ironically 
some of PPC code generators were observed to generate "most" 
non-constant-time code. "Most" in sense that execution time 
variations would be most easy to catch.


Just to substantiate the point, consider 
https://godbolt.org/z/faYnEcPT7, and note the conditional branch in 
the middle of the loop, which flies in the face of constant-time-ness. 
In case you object 'bit &= 1' on line 7 in the C code. Indeed, if you 
comment it out, the generated code will be fine. But the point is that 
the compiler is capable of and was in fact observed to figure out that 
the caller passes either one or zero and generate the machine code in 
the assembly window. In other words 'bit &= 1' is just a reflection of 
what the caller does.


... the permanent solution is to do it in assembly. I can put 
together something...


Though you should be able to do this just as well :-) So should I or 
would you?


Cheers.



Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le

2024-05-15 Thread Andy Polyakov

+static void cswap(fe51 p, fe51 q, unsigned int bit)
+{
+    u64 t, i;
+    u64 c = 0 - (u64) bit;
+
+    for (i = 0; i < 5; ++i) {
+        t = c & (p[i] ^ q[i]);
+        p[i] ^= t;
+        q[i] ^= t;
+    }
+}


The "c" in cswap stands for "constant-time," and the problem is that 
contemporary compilers have exhibited the ability to produce 
non-constant-time machine code as result of compilation of the above 
kind of technique. The outcome is platform-specific and ironically some 
of PPC code generators were observed to generate "most" 
non-constant-time code. "Most" in sense that execution time variations 
would be most easy to catch.


Just to substantiate the point, consider 
https://godbolt.org/z/faYnEcPT7, and note the conditional branch in the 
middle of the loop, which flies in the face of constant-time-ness. You 
may object to the 'bit &= 1' on line 7 of the C code. Indeed, if you 
comment it out, the generated code will be fine. But the point is that 
the compiler is capable of figuring out, and was in fact observed to 
figure out, that the caller passes either one or zero, and of then 
generating the machine code shown in the assembly window. In other 
words, 'bit &= 1' is just a reflection of what the caller does.
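
For readers who cannot open the link, the pattern being described can be
sketched roughly as below. This is an assumption about the shape of the
godbolt source, not a copy of it: it is the cswap from the patch with
the 'bit &= 1' hint added, and the name cswap_hinted is made up for
illustration.

```c
#include <stdint.h>

typedef uint64_t fe51[5];

/*
 * Hypothetical reduction of the case discussed above: cswap from the
 * patch, plus an explicit statement that 'bit' is either 0 or 1.
 * Functionally this is still a conditional swap; whether the generated
 * machine code is branch-free depends on the compiler and target.
 */
static void cswap_hinted(fe51 p, fe51 q, unsigned int bit)
{
	uint64_t t, c;
	int i;

	bit &= 1;			/* mirrors what the caller guarantees */
	c = 0 - (uint64_t)bit;		/* all-ones if bit == 1, else zero */

	for (i = 0; i < 5; ++i) {
		t = c & (p[i] ^ q[i]);	/* zero when no swap is wanted */
		p[i] ^= t;
		q[i] ^= t;
	}
}
```

The C semantics are the same with or without the hint, so only the
compiler's assembly output for the target in question (for instance
what -O2 -S produces) can show whether a conditional branch ended up
inside the loop.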


... the permanent solution is to do it 
in assembly. I can put together something...


Though you should be able to do this just as well :-) So should I or 
would you?


Cheers.



Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le

2024-05-15 Thread Danny Tsen

Hi Andy,

Points taken.  Your help is much appreciated.

Thanks.

-Danny

On 5/15/24 3:29 AM, Andy Polyakov wrote:

Hi,


+static void cswap(fe51 p, fe51 q, unsigned int bit)
+{
+    u64 t, i;
+    u64 c = 0 - (u64) bit;
+
+    for (i = 0; i < 5; ++i) {
+        t = c & (p[i] ^ q[i]);
+        p[i] ^= t;
+        q[i] ^= t;
+    }
+}


The "c" in cswap stands for "constant-time," and the problem is that 
contemporary compilers have exhibited the ability to produce 
non-constant-time machine code as result of compilation of the above 
kind of technique. The outcome is platform-specific and ironically 
some of PPC code generators were observed to generate "most" 
non-constant-time code. "Most" in sense that execution time variations 
would be most easy to catch. One way to work around the problem, at 
least for the time being, is to add 'asm volatile("" : "+r"(c))' after 
you calculate 'c'. But there is no guarantee that the next compiler 
version won't see through it, hence the permanent solution is to do it 
in assembly. I can put together something...


Cheers.



Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le

2024-05-15 Thread Andy Polyakov

Hi,


+static void cswap(fe51 p, fe51 q, unsigned int bit)
+{
+   u64 t, i;
+   u64 c = 0 - (u64) bit;
+
+   for (i = 0; i < 5; ++i) {
+       t = c & (p[i] ^ q[i]);
+       p[i] ^= t;
+       q[i] ^= t;
+   }
+}


The "c" in cswap stands for "constant-time," and the problem is that 
contemporary compilers have exhibited the ability to produce 
non-constant-time machine code when compiling the above kind of 
technique. The outcome is platform-specific and, ironically, some of 
the PPC code generators were observed to generate the "most" 
non-constant-time code, "most" in the sense that the execution time 
variations would be the easiest to catch. One way to work around the 
problem, at least for the time being, is to add 
'asm volatile("" : "+r"(c))' after you calculate 'c'. But there is no 
guarantee that the next compiler version won't see through it, hence 
the permanent solution is to do it in assembly. I can put together 
something...
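
As a rough illustration of the workaround, the function from the patch
with the empty asm statement added might look like the sketch below.
The name cswap_barrier is hypothetical, the code relies on the
GCC/Clang extended asm extension, and (as said above) it is a stopgap
rather than a guarantee.

```c
#include <stdint.h>

typedef uint64_t fe51[5];

/*
 * Sketch only: cswap from the patch plus the suggested empty asm.
 * __asm__ is the same extension as 'asm', spelled so that it also
 * builds in strict ISO C modes.
 */
static void cswap_barrier(fe51 p, fe51 q, unsigned int bit)
{
	uint64_t t;
	uint64_t c = 0 - (uint64_t)bit;	/* all-ones if bit == 1, else zero */
	int i;

	/*
	 * Emits no instructions, but "+r"(c) tells the compiler that c
	 * is read and possibly rewritten here, so it can no longer
	 * assume that c is either 0 or ~0 and specialise the loop.
	 */
	__asm__ volatile("" : "+r"(c));

	for (i = 0; i < 5; ++i) {
		t = c & (p[i] ^ q[i]);
		p[i] ^= t;
		q[i] ^= t;
	}
}
```

The visible behaviour is unchanged: the limbs are swapped when bit is 1
and left alone when it is 0.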


Cheers.



[PATCH 2/3] crypto: X25519 core functions for ppc64le

2024-05-14 Thread Danny Tsen
X25519 core functions to handle scalar multiplication for ppc64le.

Signed-off-by: Danny Tsen 
---
 arch/powerpc/crypto/curve25519-ppc64le-core.c | 299 ++
 1 file changed, 299 insertions(+)
 create mode 100644 arch/powerpc/crypto/curve25519-ppc64le-core.c

diff --git a/arch/powerpc/crypto/curve25519-ppc64le-core.c 
b/arch/powerpc/crypto/curve25519-ppc64le-core.c
new file mode 100644
index ..6a8b5efc40ce
--- /dev/null
+++ b/arch/powerpc/crypto/curve25519-ppc64le-core.c
@@ -0,0 +1,299 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2024- IBM Corp. All rights reserved.
+ *
+ * X25519 scalar multiplication with 51 bits limbs for PPC64le.
+ *   Based on RFC7748 and AArch64 optimized implementation for X25519
+ * - Algorithm 1 Scalar multiplication of a variable point
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+typedef uint64_t fe51[5];
+
+asmlinkage void x25519_fe51_mul(fe51 h, const fe51 f, const fe51 g);
+asmlinkage void x25519_fe51_sqr(fe51 h, const fe51 f);
+asmlinkage void x25519_fe51_mul121666(fe51 h, fe51 f);
+asmlinkage void x25519_fe51_sqr_times(fe51 h, const fe51 f, int n);
+asmlinkage void x25519_fe51_frombytes(fe51 h, const uint8_t *s);
+asmlinkage void x25519_fe51_tobytes(uint8_t *s, const fe51 h);
+
+#define fmul x25519_fe51_mul
+#define fsqr x25519_fe51_sqr
+#define fmul121666 x25519_fe51_mul121666
+#define fe51_tobytes x25519_fe51_tobytes
+#define fe51_frombytes x25519_fe51_frombytes
+
+static void cswap(fe51 p, fe51 q, unsigned int bit)
+{
+   u64 t, i;
+   u64 c = 0 - (u64) bit;
+
+   for (i = 0; i < 5; ++i) {
+       t = c & (p[i] ^ q[i]);
+       p[i] ^= t;
+       q[i] ^= t;
+   }
+}
+
+static void fadd(fe51 h, const fe51 f, const fe51 g)
+{
+   h[0] = f[0] + g[0];
+   h[1] = f[1] + g[1];
+   h[2] = f[2] + g[2];
+   h[3] = f[3] + g[3];
+   h[4] = f[4] + g[4];
+}
+
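+/*
+ * Prime = 2 ** 255 - 19, 255 bits
+ *    (0x7fffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffed)
+ *
+ * Prime in 5 51-bit limbs
+ */
+static fe51 prime51 = { 0x7ffffffffffed, 0x7ffffffffffff, 0x7ffffffffffff, 0x7ffffffffffff, 0x7ffffffffffff};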
+
+static void fsub(fe51 h, const fe51 f, const fe51 g)
+{
+   h[0] = (f[0] + ((prime51[0] * 2))) - g[0];
+   h[1] = (f[1] + ((prime51[1] * 2))) - g[1];
+   h[2] = (f[2] + ((prime51[2] * 2))) - g[2];
+   h[3] = (f[3] + ((prime51[3] * 2))) - g[3];
+   h[4] = (f[4] + ((prime51[4] * 2))) - g[4];
+}
+
+static void finv(fe51 o, const fe51 i)
+{
+   fe51 a0, b, c, t00;
+
+   fsqr(a0, i);
+   x25519_fe51_sqr_times(t00, a0, 2);
+
+   fmul(b, t00, i);
+   fmul(a0, b, a0);
+
+   fsqr(t00, a0);
+
+   fmul(b, t00, b);
+   x25519_fe51_sqr_times(t00, b, 5);
+
+   fmul(b, t00, b);
+   x25519_fe51_sqr_times(t00, b, 10);
+
+   fmul(c, t00, b);
+   x25519_fe51_sqr_times(t00, c, 20);
+
+   fmul(t00, t00, c);
+   x25519_fe51_sqr_times(t00, t00, 10);
+
+   fmul(b, t00, b);
+   x25519_fe51_sqr_times(t00, b, 50);
+
+   fmul(c, t00, b);
+   x25519_fe51_sqr_times(t00, c, 100);
+
+   fmul(t00, t00, c);
+   x25519_fe51_sqr_times(t00, t00, 50);
+
+   fmul(t00, t00, b);
+   x25519_fe51_sqr_times(t00, t00, 5);
+
+   fmul(o, t00, a0);
+}
+
+static void curve25519_fe51(uint8_t out[32], const uint8_t scalar[32],
+   const uint8_t point[32])
+{
+   fe51 x1, x2, z2, x3, z3;
+   uint8_t s[32];
+   unsigned int swap = 0;
+   int i;
+
+   memcpy(s, scalar, 32);
+   s[0]  &= 0xf8;
+   s[31] &= 0x7f;
+   s[31] |= 0x40;
+   fe51_frombytes(x1, point);
+
+   z2[0] = z2[1] = z2[2] = z2[3] = z2[4] = 0;
+   x3[0] = x1[0];
+   x3[1] = x1[1];
+   x3[2] = x1[2];
+   x3[3] = x1[3];
+   x3[4] = x1[4];
+
+   x2[0] = z3[0] = 1;
+   x2[1] = z3[1] = 0;
+   x2[2] = z3[2] = 0;
+   x2[3] = z3[3] = 0;
+   x2[4] = z3[4] = 0;
+
+   for (i = 254; i >= 0; --i) {
+       unsigned int k_t = 1 & (s[i / 8] >> (i & 7));
+       fe51 a, b, c, d, e;
+       fe51 da, cb, aa, bb;
+       fe51 dacb_p, dacb_m;
+
+       swap ^= k_t;
+       cswap(x2, x3, swap);
+       cswap(z2, z3, swap);
+       swap = k_t;
+
+       fsub(b, x2, z2);        // B = x_2 - z_2
+       fadd(a, x2, z2);        // A = x_2 + z_2
+       fsub(d, x3, z3);        // D = x_3 - z_3
+       fadd(c, x3, z3);        // C = x_3 + z_3
+
+       fsqr(bb, b);            // BB = B^2
+       fsqr(aa, a);            // AA = A^2
+       fmul(da, d, a);         // DA = D * A
+       fmul(cb, c, b);         // CB = C * B
+
+       fsub(e, aa, bb);        // E = AA - BB
+