Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le
On Wed, May 15, 2024 at 10:29:56AM +0200, Andy Polyakov wrote:
> >+static void cswap(fe51 p, fe51 q, unsigned int bit)
>
> The "c" in cswap stands for "constant-time," and the problem is that
> contemporary compilers have exhibited the ability to produce
> non-constant-time machine code as result of compilation of the above
> kind of technique.

This can happen with *any* compiler, on *any* platform.  In general, you
have to write machine code if you want to be sure what machine code will
eventually be executed.

> The outcome is platform-specific and ironically some
> of PPC code generators were observed to generate "most"
> non-constant-time code. "Most" in sense that execution time variations
> would be most easy to catch. One way to work around the problem, at
> least for the time being, is to add 'asm volatile("" : "+r"(c))' after
> you calculate 'c'. But there is no guarantee that the next compiler
> version won't see through it, hence the permanent solution is to do it
> in assembly. I can put together something...

Such tricks can help ameliorate the problem, sure.  But it is not a
solution ever.

Segher
Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le
> Thanks for the info.  I should be able to do it.  I was hoping an
> assembly guru like you can show me some tricks here if there is :)

No tricks in cswap, it's as straightforward as it gets, so go ahead :-)
Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le
Hi Andy,

Thanks for the info.  I should be able to do it.  I was hoping an
assembly guru like you can show me some tricks here if there is :)

Thanks.

-Danny

On 5/15/24 8:33 AM, Andy Polyakov wrote:
>>> +static void cswap(fe51 p, fe51 q, unsigned int bit)
>>> +{
>>> +	u64 t, i;
>>> +	u64 c = 0 - (u64) bit;
>>> +
>>> +	for (i = 0; i < 5; ++i) {
>>> +		t = c & (p[i] ^ q[i]);
>>> +		p[i] ^= t;
>>> +		q[i] ^= t;
>>> +	}
>>> +}
>>
>> The "c" in cswap stands for "constant-time," and the problem is that
>> contemporary compilers have exhibited the ability to produce
>> non-constant-time machine code as result of compilation of the above
>> kind of technique. The outcome is platform-specific and ironically
>> some of PPC code generators were observed to generate "most"
>> non-constant-time code. "Most" in sense that execution time
>> variations would be most easy to catch.
>
> Just to substantiate the point, consider
> https://godbolt.org/z/faYnEcPT7, and note the conditional branch in
> the middle of the loop, which flies in the face of constant-time-ness.
> In case you object 'bit &= 1' on line 7 in the C code. Indeed, if you
> comment it out, the generated code will be fine. But the point is that
> the compiler is capable of and was in fact observed to figure out that
> the caller passes either one or zero and generate the machine code in
> the assembly window. In other words 'bit &= 1' is just a reflection of
> what the caller does.
>
>> ... the permanent solution is to do it in assembly. I can put
>> together something...
>
> Though you should be able to do this just as well :-) So should I or
> would you?
>
> Cheers.
Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le
>> +static void cswap(fe51 p, fe51 q, unsigned int bit)
>> +{
>> +	u64 t, i;
>> +	u64 c = 0 - (u64) bit;
>> +
>> +	for (i = 0; i < 5; ++i) {
>> +		t = c & (p[i] ^ q[i]);
>> +		p[i] ^= t;
>> +		q[i] ^= t;
>> +	}
>> +}
>
> The "c" in cswap stands for "constant-time," and the problem is that
> contemporary compilers have exhibited the ability to produce
> non-constant-time machine code as result of compilation of the above
> kind of technique. The outcome is platform-specific and ironically
> some of PPC code generators were observed to generate "most"
> non-constant-time code. "Most" in sense that execution time
> variations would be most easy to catch.

Just to substantiate the point, consider
https://godbolt.org/z/faYnEcPT7, and note the conditional branch in the
middle of the loop, which flies in the face of constant-time-ness. In
case you object 'bit &= 1' on line 7 in the C code. Indeed, if you
comment it out, the generated code will be fine. But the point is that
the compiler is capable of and was in fact observed to figure out that
the caller passes either one or zero and generate the machine code in
the assembly window. In other words 'bit &= 1' is just a reflection of
what the caller does.

> ... the permanent solution is to do it in assembly. I can put
> together something...

Though you should be able to do this just as well :-) So should I or
would you?

Cheers.
Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le
Hi Andy,

Points taken.  And much appreciate for the help.

Thanks.

-Danny

On 5/15/24 3:29 AM, Andy Polyakov wrote:
> Hi,
>
>> +static void cswap(fe51 p, fe51 q, unsigned int bit)
>> +{
>> +	u64 t, i;
>> +	u64 c = 0 - (u64) bit;
>> +
>> +	for (i = 0; i < 5; ++i) {
>> +		t = c & (p[i] ^ q[i]);
>> +		p[i] ^= t;
>> +		q[i] ^= t;
>> +	}
>> +}
>
> The "c" in cswap stands for "constant-time," and the problem is that
> contemporary compilers have exhibited the ability to produce
> non-constant-time machine code as result of compilation of the above
> kind of technique. The outcome is platform-specific and ironically
> some of PPC code generators were observed to generate "most"
> non-constant-time code. "Most" in sense that execution time
> variations would be most easy to catch. One way to work around the
> problem, at least for the time being, is to add
> 'asm volatile("" : "+r"(c))' after you calculate 'c'. But there is no
> guarantee that the next compiler version won't see through it, hence
> the permanent solution is to do it in assembly. I can put together
> something...
>
> Cheers.
Re: [PATCH 2/3] crypto: X25519 core functions for ppc64le
Hi,

> +static void cswap(fe51 p, fe51 q, unsigned int bit)
> +{
> +	u64 t, i;
> +	u64 c = 0 - (u64) bit;
> +
> +	for (i = 0; i < 5; ++i) {
> +		t = c & (p[i] ^ q[i]);
> +		p[i] ^= t;
> +		q[i] ^= t;
> +	}
> +}

The "c" in cswap stands for "constant-time," and the problem is that
contemporary compilers have exhibited the ability to produce
non-constant-time machine code as result of compilation of the above
kind of technique. The outcome is platform-specific and ironically some
of PPC code generators were observed to generate "most"
non-constant-time code. "Most" in sense that execution time variations
would be most easy to catch. One way to work around the problem, at
least for the time being, is to add 'asm volatile("" : "+r"(c))' after
you calculate 'c'. But there is no guarantee that the next compiler
version won't see through it, hence the permanent solution is to do it
in assembly. I can put together something...

Cheers.
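For reference, a minimal sketch of the stop-gap described in the message above: the original cswap with an empty asm statement placed right after 'c' is computed. This is an illustration only and not part of the patch. The names cswap_barrier and cswap_selftest are hypothetical, uint64_t stands in for the kernel's u64 so the snippet builds in user space, and the asm syntax assumes GCC or Clang. As the thread stresses, the barrier only makes the unwanted transformation less likely; it does not guarantee constant-time machine code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t fe51[5];

/*
 * Masked conditional swap: exchanges p and q when bit == 1 and leaves
 * them untouched when bit == 0, without branching in the C source.
 */
static void cswap_barrier(fe51 p, fe51 q, unsigned int bit)
{
	uint64_t t;
	uint64_t c = 0 - (uint64_t)bit;	/* all-zeros or all-ones mask */
	int i;

	/*
	 * Empty asm with c as an in/out register operand: the compiler
	 * must assume c may have been changed here, so it can no longer
	 * reason that c is only ever 0 or ~0.
	 */
	__asm__ volatile("" : "+r"(c));

	for (i = 0; i < 5; ++i) {
		t = c & (p[i] ^ q[i]);
		p[i] ^= t;
		q[i] ^= t;
	}
}

/* Functional check only; it says nothing about timing behaviour. */
static void cswap_selftest(void)
{
	fe51 a = { 1, 2, 3, 4, 5 };
	fe51 b = { 6, 7, 8, 9, 10 };
	int i;

	cswap_barrier(a, b, 0);	/* bit = 0: nothing moves */
	for (i = 0; i < 5; ++i)
		assert(a[i] == (uint64_t)(i + 1) && b[i] == (uint64_t)(i + 6));

	cswap_barrier(a, b, 1);	/* bit = 1: limbs are exchanged */
	for (i = 0; i < 5; ++i)
		assert(a[i] == (uint64_t)(i + 6) && b[i] == (uint64_t)(i + 1));
}
```

Whether the barrier actually had the intended effect can only be judged by inspecting the generated assembly for the target compiler and flags, which is why the thread settles on writing cswap in assembly.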
[PATCH 2/3] crypto: X25519 core functions for ppc64le
X25519 core functions to handle scalar multiplication for ppc64le.

Signed-off-by: Danny Tsen
---
 arch/powerpc/crypto/curve25519-ppc64le-core.c | 299 ++
 1 file changed, 299 insertions(+)
 create mode 100644 arch/powerpc/crypto/curve25519-ppc64le-core.c

diff --git a/arch/powerpc/crypto/curve25519-ppc64le-core.c b/arch/powerpc/crypto/curve25519-ppc64le-core.c
new file mode 100644
index ..6a8b5efc40ce
--- /dev/null
+++ b/arch/powerpc/crypto/curve25519-ppc64le-core.c
@@ -0,0 +1,299 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2024- IBM Corp. All rights reserved.
+ *
+ * X25519 scalar multiplication with 51 bits limbs for PPC64le.
+ * Based on RFC7748 and AArch64 optimized implementation for X25519
+ * - Algorithm 1 Scalar multiplication of a variable point
+ */
+
+#include
+#include
+
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+typedef uint64_t fe51[5];
+
+asmlinkage void x25519_fe51_mul(fe51 h, const fe51 f, const fe51 g);
+asmlinkage void x25519_fe51_sqr(fe51 h, const fe51 f);
+asmlinkage void x25519_fe51_mul121666(fe51 h, fe51 f);
+asmlinkage void x25519_fe51_sqr_times(fe51 h, const fe51 f, int n);
+asmlinkage void x25519_fe51_frombytes(fe51 h, const uint8_t *s);
+asmlinkage void x25519_fe51_tobytes(uint8_t *s, const fe51 h);
+
+#define fmul x25519_fe51_mul
+#define fsqr x25519_fe51_sqr
+#define fmul121666 x25519_fe51_mul121666
+#define fe51_tobytes x25519_fe51_tobytes
+#define fe51_frombytes x25519_fe51_frombytes
+
+static void cswap(fe51 p, fe51 q, unsigned int bit)
+{
+	u64 t, i;
+	u64 c = 0 - (u64) bit;
+
+	for (i = 0; i < 5; ++i) {
+		t = c & (p[i] ^ q[i]);
+		p[i] ^= t;
+		q[i] ^= t;
+	}
+}
+
+static void fadd(fe51 h, const fe51 f, const fe51 g)
+{
+	h[0] = f[0] + g[0];
+	h[1] = f[1] + g[1];
+	h[2] = f[2] + g[2];
+	h[3] = f[3] + g[3];
+	h[4] = f[4] + g[4];
+}
+
+/*
+ * Prime = 2 ** 255 - 19, 255 bits
+ *    (0x7fffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffed)
+ *
+ * Prime in 5 51-bit limbs
+ */
+static fe51 prime51 = { 0x7ffffffffffed, 0x7ffffffffffff, 0x7ffffffffffff, 0x7ffffffffffff, 0x7ffffffffffff};
+
+static void fsub(fe51 h, const fe51 f, const fe51 g)
+{
+	h[0] = (f[0] + ((prime51[0] * 2))) - g[0];
+	h[1] = (f[1] + ((prime51[1] * 2))) - g[1];
+	h[2] = (f[2] + ((prime51[2] * 2))) - g[2];
+	h[3] = (f[3] + ((prime51[3] * 2))) - g[3];
+	h[4] = (f[4] + ((prime51[4] * 2))) - g[4];
+}
+
+static void finv(fe51 o, const fe51 i)
+{
+	fe51 a0, b, c, t00;
+
+	fsqr(a0, i);
+	x25519_fe51_sqr_times(t00, a0, 2);
+
+	fmul(b, t00, i);
+	fmul(a0, b, a0);
+
+	fsqr(t00, a0);
+
+	fmul(b, t00, b);
+	x25519_fe51_sqr_times(t00, b, 5);
+
+	fmul(b, t00, b);
+	x25519_fe51_sqr_times(t00, b, 10);
+
+	fmul(c, t00, b);
+	x25519_fe51_sqr_times(t00, c, 20);
+
+	fmul(t00, t00, c);
+	x25519_fe51_sqr_times(t00, t00, 10);
+
+	fmul(b, t00, b);
+	x25519_fe51_sqr_times(t00, b, 50);
+
+	fmul(c, t00, b);
+	x25519_fe51_sqr_times(t00, c, 100);
+
+	fmul(t00, t00, c);
+	x25519_fe51_sqr_times(t00, t00, 50);
+
+	fmul(t00, t00, b);
+	x25519_fe51_sqr_times(t00, t00, 5);
+
+	fmul(o, t00, a0);
+}
+
+static void curve25519_fe51(uint8_t out[32], const uint8_t scalar[32],
+			    const uint8_t point[32])
+{
+	fe51 x1, x2, z2, x3, z3;
+	uint8_t s[32];
+	unsigned int swap = 0;
+	int i;
+
+	memcpy(s, scalar, 32);
+	s[0] &= 0xf8;
+	s[31] &= 0x7f;
+	s[31] |= 0x40;
+	fe51_frombytes(x1, point);
+
+	z2[0] = z2[1] = z2[2] = z2[3] = z2[4] = 0;
+	x3[0] = x1[0];
+	x3[1] = x1[1];
+	x3[2] = x1[2];
+	x3[3] = x1[3];
+	x3[4] = x1[4];
+
+	x2[0] = z3[0] = 1;
+	x2[1] = z3[1] = 0;
+	x2[2] = z3[2] = 0;
+	x2[3] = z3[3] = 0;
+	x2[4] = z3[4] = 0;
+
+	for (i = 254; i >= 0; --i) {
+		unsigned int k_t = 1 & (s[i / 8] >> (i & 7));
+		fe51 a, b, c, d, e;
+		fe51 da, cb, aa, bb;
+		fe51 dacb_p, dacb_m;
+
+		swap ^= k_t;
+		cswap(x2, x3, swap);
+		cswap(z2, z3, swap);
+		swap = k_t;
+
+		fsub(b, x2, z2);	// B = x_2 - z_2
+		fadd(a, x2, z2);	// A = x_2 + z_2
+		fsub(d, x3, z3);	// D = x_3 - z_3
+		fadd(c, x3, z3);	// C = x_3 + z_3
+
+		fsqr(bb, b);		// BB = B^2
+		fsqr(aa, a);		// AA = A^2
+		fmul(da, d, a);		// DA = D * A
+		fmul(cb, c, b);		// CB = C * B
+
+		fsub(e, aa, bb);	// E = AA - BB
+