Re: AES leaks, cgd ciphers, and vector units in the kernel

Taylor R Campbell Sat, 27 Jun 2020 20:28:34 -0700

> Date: Mon, 22 Jun 2020 23:43:20 +0000
> From: Taylor R Campbell <riastr...@netbsd.org>
> 
> There is some more room for improvement -- SSSE3 provides PSHUFB which
> can substantially speed up parts of AES, and is supported by a good
> number of amd64 CPUs starting around 14 years ago that lack AES-NI --
> but there are diminishing returns for increasing implementation and
> maintenance effort, so I'd like to focus on making an impact on
> systems that matter.  (That includes non-x86 CPUs -- e.g., we could
> probably easily adapt the Intel SSE2 logic to ARM NEON -- but I would
> like to focus on systems where there is demand.)


I drafted derivatives of Mike Hamburg's vpaes code using Intel SSSE3
and using ARM NEON / aarch64 SIMD.  In principle the ARM NEON code
should work on armv7, but I have only compile-tested it there, and
there are a few kinks to be worked out before it can be used in the
kernel on armv7.
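Roughly, the idea behind vpaes is to use PSHUFB (or the NEON TBL/VTBL equivalent) as a 16-entry table lookup whose table lives in a register, not in memory. Each byte is split into nibbles so that every table needs only 16 entries. The sketch below is a portable scalar model of that trick applied to one GF(2)-linear byte map, xtime (multiplication by {02} in AES's field). It is not vpaes itself, which also computes the S-box inversion with further nibble-table steps. The names `pshufb_model`, `xtime16`, and `xtime16_selftest` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by x (i.e., by {02}) in AES's field GF(2^8), reduced modulo
 * x^8 + x^4 + x^3 + x + 1.  This map is linear over GF(2). */
static uint8_t
xtime(uint8_t b)
{
	return (uint8_t)((b << 1) ^ ((b >> 7) * 0x1b));
}

/* Scalar model of PSHUFB semantics on one 16-byte vector:
 * out[i] = 0 if the high bit of idx[i] is set, else tab[idx[i] & 0xf].
 * This C loop only models the semantics; it indexes memory with the
 * data and is NOT constant-time, whereas the real instruction performs
 * all 16 lookups inside a register. */
static void
pshufb_model(uint8_t out[16], const uint8_t tab[16], const uint8_t idx[16])
{
	for (int i = 0; i < 16; i++)
		out[i] = (idx[i] & 0x80) ? 0 : tab[idx[i] & 0x0f];
}

/* Apply xtime to 16 bytes with two 16-entry tables, one per nibble:
 * xtime(x) = lo_tab[x & 0xf] ^ hi_tab[x >> 4], which holds because
 * xtime(a ^ b) = xtime(a) ^ xtime(b). */
static void
xtime16(uint8_t out[16], const uint8_t in[16])
{
	uint8_t lo_tab[16], hi_tab[16], lo[16], hi[16], a[16], b[16];

	for (int n = 0; n < 16; n++) {
		lo_tab[n] = xtime((uint8_t)n);
		hi_tab[n] = xtime((uint8_t)(n << 4));
	}
	for (int i = 0; i < 16; i++) {
		lo[i] = in[i] & 0x0f;	/* low nibble, always < 16 */
		hi[i] = in[i] >> 4;	/* high nibble, always < 16 */
	}
	pshufb_model(a, lo_tab, lo);
	pshufb_model(b, hi_tab, hi);
	for (int i = 0; i < 16; i++)
		out[i] = a[i] ^ b[i];
}

/* Exhaustively check the nibble-table path against scalar xtime. */
static void
xtime16_selftest(void)
{
	uint8_t in[16], out[16];

	for (int base = 0; base < 256; base += 16) {
		for (int i = 0; i < 16; i++)
			in[i] = (uint8_t)(base + i);
		xtime16(out, in);
		for (int i = 0; i < 16; i++)
			assert(out[i] == xtime(in[i]));
	}
}
```

With SSSE3 intrinsics, each `pshufb_model` call would become a single `_mm_shuffle_epi8` on a 128-bit register. The nibble split would be a mask plus a 16-bit lane shift and mask.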

I pushed it to the riastradh-kernelcrypto topic on hg src-draft, and I
updated the userland aestest utility if you want to get a rough idea
of the performance without updating your kernel (see previous message
for usage instructions):

https://www.NetBSD.org/~riastradh/tmp/20200627/aestest.tgz

The summary of the patch set now is (kernel only -- no userland
changes):

- every architecture gets constant-time AES, with BearSSL's aes_ct
  32-bit bitsliced implementation -- no AES code vulnerable to
  cache-timing leaks remains in the NetBSD kernel, although there is a
  substantial performance hit on many platforms

- every architecture gets new cgd(4) support for Adiantum, which is
  generally as fast as or faster than AES-CBC and AES-XTS were before
  and provides better security (and has lots of room to be sped up;
  any speedups would also apply elsewhere, e.g., to WireGuard)

- most high-end x86 of the past decade gets much, much faster AES with
  AES-NI CPU support (no 32-bit yet)

- almost all x86 of the past decade gets faster or much faster AES
  with a vpaes-style SSSE3-based implementation (32-bit included)

- most x86 of the past two decades, including all amd64, mitigates the
  performance hit with a bitsliced SSE2-based implementation (32-bit
  included)

- VIA gets much faster AES with VIA ACE (for all users in the kernel,
  including cgd, not just those that use opencrypto as we had before
  with the via_padlock.c driver)

- almost all aarch64 (except rpi) gets much, much faster AES with
  ARMv8.0-AES CPU support

- 64-bit rpi (and, with a little more work, armv7 with NEON) mitigates
  the performance hit -- and may get faster -- with a vpaes-style
  NEON-based implementation
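In effect, the list above picks the best AES implementation each CPU supports and uses the portable constant-time code as the floor. The following is a minimal sketch of that selection. It assumes a fastest-first preference order inferred from the list, and it assumes the feature flags are filled in elsewhere (e.g., from CPUID or aarch64 ID registers). `struct cpu_features`, `struct aes_impl_desc`, `aes_select`, and the probe functions are hypothetical names, not the actual NetBSD kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CPU feature flags, assumed to be detected elsewhere. */
struct cpu_features {
	bool aesni, via_ace, armv8_aes, ssse3, neon, sse2;
};

/* Hypothetical descriptor for one AES implementation. */
struct aes_impl_desc {
	const char *name;
	bool (*probe)(const struct cpu_features *);
};

static bool probe_aesni(const struct cpu_features *f) { return f->aesni; }
static bool probe_via(const struct cpu_features *f) { return f->via_ace; }
static bool probe_armv8(const struct cpu_features *f) { return f->armv8_aes; }
static bool probe_ssse3(const struct cpu_features *f) { return f->ssse3; }
static bool probe_neon(const struct cpu_features *f) { return f->neon; }
static bool probe_sse2(const struct cpu_features *f) { return f->sse2; }

/* The portable bitsliced fallback runs everywhere. */
static bool probe_ct(const struct cpu_features *f) { (void)f; return true; }

/* Assumed preference order, fastest first; the constant-time fallback
 * is last and always probes true, so selection cannot fail. */
static const struct aes_impl_desc aes_impls[] = {
	{ "AES-NI",		probe_aesni },
	{ "VIA ACE",		probe_via },
	{ "ARMv8.0-AES",	probe_armv8 },
	{ "SSSE3 vpaes",	probe_ssse3 },
	{ "NEON vpaes",		probe_neon },
	{ "SSE2 bitsliced",	probe_sse2 },
	{ "BearSSL aes_ct",	probe_ct },
};

/* Return the first (most preferred) implementation the CPU supports. */
static const struct aes_impl_desc *
aes_select(const struct cpu_features *f)
{
	for (size_t i = 0; i < sizeof(aes_impls) / sizeof(aes_impls[0]); i++) {
		if (aes_impls[i].probe(f))
			return &aes_impls[i];
	}
	assert(!"unreachable: the portable fallback always probes true");
	return NULL;
}
```

Because the always-available constant-time entry comes last in this sketch, no configuration falls through to a table-based implementation.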

Some other CPUs like modern POWER have AES CPU instructions these days
too.  The vpaes approach could probably be adapted to PowerPC Altivec,
and maybe some other vector units I'm not as familiar with (MIPS SIMD
Architecture, MSA?).  BearSSL's aes_ct64 64-bit bitsliced
implementation might be worth adopting for 64-bit CPUs without a
vector unit, if anyone cares -- maybe alpha or mips64.  But I think
I'm at the limit of what I'm willing to do for fun with the hardware I
have easy access to.
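On a CPU with 64-bit integer registers but no vector unit, the bitslicing idea behind aes_ct and aes_ct64 stores bit i of many independent bytes in one machine word. Byte-wise field operations then become a fixed sequence of word-wide logic operations, with no table lookups and no data-dependent addresses. The sketch below is a simplified illustration that applies xtime (multiplication by {02} in GF(2^8), as used in MixColumns) to 64 bytes at once. It assumes a naive one-byte-per-lane layout, which is NOT BearSSL's actual aes_ct64 packing. `bitslice64`, `unbitslice64`, `xtime_bitsliced`, and `bitsliced_selftest` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Transpose 64 bytes into 8 bit planes: bit j of q[i] is bit i of x[j]. */
static void
bitslice64(uint64_t q[8], const uint8_t x[64])
{
	for (int i = 0; i < 8; i++) {
		q[i] = 0;
		for (int j = 0; j < 64; j++)
			q[i] |= (uint64_t)((x[j] >> i) & 1) << j;
	}
}

/* Inverse of bitslice64. */
static void
unbitslice64(uint8_t x[64], const uint64_t q[8])
{
	for (int j = 0; j < 64; j++) {
		x[j] = 0;
		for (int i = 0; i < 8; i++)
			x[j] |= (uint8_t)(((q[i] >> j) & 1) << i);
	}
}

/* xtime on all 64 lanes at once: multiplying by x moves each bit up one
 * position, and the bit shifted out of position 7 is reduced by XORing
 * it into positions {0,1,3,4} (0x1b).  Eight word moves and three XORs,
 * the same instruction sequence whatever the data. */
static void
xtime_bitsliced(uint64_t q[8])
{
	uint64_t hi = q[7];

	q[7] = q[6];
	q[6] = q[5];
	q[5] = q[4];
	q[4] = q[3] ^ hi;
	q[3] = q[2] ^ hi;
	q[2] = q[1];
	q[1] = q[0] ^ hi;
	q[0] = hi;
}

/* Check all 256 byte values, 64 lanes at a time, against scalar xtime. */
static void
bitsliced_selftest(void)
{
	uint8_t in[64], out[64];
	uint64_t q[8];

	for (int base = 0; base < 256; base += 64) {
		for (int j = 0; j < 64; j++)
			in[j] = (uint8_t)(base + j);
		bitslice64(q, in);
		xtime_bitsliced(q);
		unbitslice64(out, q);
		for (int j = 0; j < 64; j++)
			assert(out[j] == (uint8_t)((in[j] << 1) ^
			    ((in[j] >> 7) * 0x1b)));
	}
}
```

The transposition loops are written for clarity rather than speed. Real bitsliced code does the conversion with a few shift-and-mask swap steps and keeps data in bitsliced form across all the rounds.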
