> Date: Mon, 22 Jun 2020 23:43:20 +0000
> From: Taylor R Campbell <riastr...@netbsd.org>
>
> There is some more room for improvement -- SSSE3 provides PSHUFB which
> can substantially speed up parts of AES, and is supported by a good
> number of amd64 CPUs starting around 14 years ago that lack AES-NI --
> but there are diminishing returns for increasing implementation and
> maintenance effort, so I'd like to focus on making an impact on
> systems that matter. (That includes non-x86 CPUs -- e.g., we could
> probably easily adapt the Intel SSE2 logic to ARM NEON -- but I would
> like to focus on systems where there is demand.)

I drafted derivatives of Mike Hamburg's vpaes code using Intel SSSE3
and using ARM NEON / aarch64 SIMD. In principle the ARM NEON code
should work on armv7, but I have only compile-tested it there, and
there are a few kinks to be worked out before it can be used in the
kernel on armv7.

I pushed it to the riastradh-kernelcrypto topic on hg src-draft, and I
updated the userland aestest utility if you want to get a rough idea
of the performance without updating your kernel (see previous message
for usage instructions):

https://www.NetBSD.org/~riastradh/tmp/20200627/aestest.tgz

The summary of the patch set now is (kernel only -- no userland
changes):

- every architecture gets constant-time AES, with BearSSL's aes_ct
  32-bit bitsliced implementation -- there is no more vulnerable AES
  code in the NetBSD kernel, although there is a substantial
  performance hit on many platforms

- every architecture gets new cgd(4) support for Adiantum, which is
  generally as fast as or faster than AES-CBC and AES-XTS were before
  and provides better security (and has lots of room to be sped up;
  any speedups would also be applicable to other purposes too, like
  Wireguard)

- most high-end x86 of the past decade gets much much faster AES with
  AES-NI CPU support (no 32-bit yet)

- almost all x86 of the past decade gets faster or much faster AES
  with a vpaes-style SSSE3-based implementation (32-bit included)

- most x86 of the past two decades, including all amd64, mitigates the
  performance hit with a bitsliced SSE2-based implementation (32-bit
  included)

- VIA gets much faster AES with VIA ACE (for all users in the kernel,
  including cgd, not just those that use opencrypto as we had before
  with the via_padlock.c driver)

- almost all aarch64 (except rpi) gets much much faster AES with
  ARMv8.0-AES CPU support

- 64-bit rpi (and, with a little more work, armv7 with NEON) mitigates
  the performance hit -- and may get faster -- with a vpaes-style
  NEON-based implementation

Some other CPUs like modern POWER have AES CPU instructions these days
too. The vpaes approach could probably be adapted to PowerPC Altivec,
and maybe some other vector units I'm not as familiar with (MIPS SIMD
Architecture, MSA?). BearSSL's aes_ct64 64-bit bitsliced
implementation might be worth adopting for 64-bit CPUs without a
vector unit, if anyone cares -- maybe alpha or mips64. But I think
I'm at the limit of what I'm willing to do for fun with the hardware I
have easy access to.
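
For illustration only: the x86 tiers in the summary above amount to
picking the most capable implementation the CPU supports and falling
back to the portable constant-time code otherwise. A minimal sketch in
C, assuming hypothetical names throughout (struct cpu_features,
aes_impl_select, aes_impl_name and the AES_IMPL_* values are made up
for this example and are not the identifiers used in the patch set):

```c
#include <stdbool.h>

/*
 * Hypothetical feature flags.  A real kernel would fill these in from
 * CPUID at boot; here they are plain booleans for illustration.
 */
struct cpu_features {
	bool is_amd64;		/* 64-bit kernel */
	bool has_aesni;		/* AES-NI instructions */
	bool has_ssse3;		/* SSSE3 (PSHUFB), for vpaes-style code */
	bool has_sse2;		/* SSE2, for the bitsliced code */
};

/* Hypothetical implementation tiers, slowest fallback first. */
enum aes_impl {
	AES_IMPL_BEARSSL_CT,	/* portable 32-bit bitsliced aes_ct */
	AES_IMPL_SSE2,		/* bitsliced SSE2 */
	AES_IMPL_SSSE3,		/* vpaes-style SSSE3 */
	AES_IMPL_AESNI,		/* AES-NI */
};

/* Pick the most capable tier available on this CPU. */
static enum aes_impl
aes_impl_select(const struct cpu_features *cpu)
{
	/* The summary notes AES-NI has no 32-bit support yet. */
	if (cpu->has_aesni && cpu->is_amd64)
		return AES_IMPL_AESNI;
	if (cpu->has_ssse3)
		return AES_IMPL_SSSE3;
	if (cpu->has_sse2)
		return AES_IMPL_SSE2;
	/* Every architecture gets the constant-time fallback. */
	return AES_IMPL_BEARSSL_CT;
}

/* Human-readable name for a tier, e.g. for a boot message. */
static const char *
aes_impl_name(enum aes_impl impl)
{
	switch (impl) {
	case AES_IMPL_AESNI:	return "AES-NI";
	case AES_IMPL_SSSE3:	return "SSSE3 vpaes";
	case AES_IMPL_SSE2:	return "SSE2 bitsliced";
	default:		return "BearSSL aes_ct";
	}
}
```

With all four flags set, aes_impl_select() returns AES_IMPL_AESNI; the
same CPU running a 32-bit kernel falls through to AES_IMPL_SSSE3, and a
CPU with no vector unit at all gets AES_IMPL_BEARSSL_CT. The VIA ACE
and ARM tiers would slot into the same kind of ladder.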