In preparation for merging the gcm-aes "stitched" implementation, I'm
reviewing the existing ghash code, on the WIP branch "ppc-ghash-macros".

I've introduced a macro GHASH_REDUCE for the reduction logic. Besides
that, I've been able to improve the scheduling of the reduction
instructions (adding in the result of vpmsumd last seems to improve
parallelism, giving some 3% speedup of gcm_update on power10,
benchmarked on cfarm120). I've also streamlined the way load offsets
are used, and slightly trimmed the number of vector registers needed.
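
To illustrate the scheduling idea outside of assembly, here's a
simplified C sketch (hypothetical helper and variable names, not the
actual GHASH_REDUCE macro): the point is just that vpmsumd has a long
latency, so independent xor work is done while it is in flight and its
result is added in last.

  #include <altivec.h>

  typedef vector unsigned long long vdu;

  /* Hypothetical illustration of the scheduling only: start the
     long-latency vpmsumd, do the independent xor while it is in
     flight, and fold its result into the accumulator last.  */
  static inline vdu
  reduce_step (vdu acc, vdu lo, vdu hi, vdu poly)
  {
    vdu t = __builtin_crypto_vpmsumd (lo, poly); /* long-latency multiply */
    acc = vec_xor (acc, hi);                     /* independent work overlaps */
    return vec_xor (acc, t);                     /* add vpmsumd result last */
  }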

For the AES code, I've merged the new macros (I settled on the names
OPN_XXY and OPN_XXXY); no change in speed is expected from that.

I've also tried to understand the difference between AES encrypt and
decrypt, where decrypt is much slower and uses an extra xor instruction
in the round loop. I think the reason is that other AES implementations
(including the x86_64 and arm64 instructions, and Nettle's C
implementation) expect the decryption subkeys to be transformed via the
AES "MIX_COLUMN" operation, see
https://gitlab.com/gnutls/nettle/-/blob/master/aes-invert-internal.c?ref_type=heads#L163

The powerpc64 vncipher instruction, on the other hand, really wants the
original subkeys, not the transformed ones. So on power, it would be
better to have a _nettle_aes_invert that is essentially a memcpy; the
aes decrypt assembly code could then be reworked without the xors, and
run at exactly the same speed as encryption. The current
_nettle_aes_invert also changes the order of the subkeys, with a FIXME
comment suggesting that it would be better to instead update the order
in which keys are accessed in the aes decryption functions.
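
As a rough sketch of what the power-specific variant could look like
(assuming the signature used in aes-invert-internal.c; just an
illustration, not a tested patch):

  #include <string.h>
  #include <stdint.h>

  /* Hypothetical power variant: vncipher wants the original round
     keys, so "inversion" reduces to copying the 4*(rounds + 1) subkey
     words unchanged; the decrypt assembly would then access them in
     reverse order instead.  */
  void
  _nettle_aes_invert (unsigned rounds, uint32_t *dst, const uint32_t *src)
  {
    memcpy (dst, src, 4 * (rounds + 1) * sizeof (*dst));
  }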

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
