[PATCH] crypto: adiantum - adjust some comments to match latest paper

2018-12-06 Thread Eric Biggers
From: Eric Biggers 

The 2018-11-28 revision of the Adiantum paper has revised some notation:

- 'M' was replaced with 'L' (meaning "Left", for the left-hand part of
  the message) in the definition of Adiantum hashing, to avoid confusion
  with the full message
- ε-almost-∆-universal is now abbreviated as ε-∆U instead of εA∆U
- "block" is now used only to mean block cipher and Poly1305 blocks

Also, Adiantum hashing was moved from the appendix to the main paper.

To avoid confusion, update relevant comments in the code to match.

Signed-off-by: Eric Biggers 
---
 crypto/adiantum.c   | 35 +++
 crypto/nhpoly1305.c |  8 
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/crypto/adiantum.c b/crypto/adiantum.c
index ca27e0dc2958c..e62e34f5e389b 100644
--- a/crypto/adiantum.c
+++ b/crypto/adiantum.c
@@ -9,7 +9,7 @@
  * Adiantum is a tweakable, length-preserving encryption mode designed for fast
  * and secure disk encryption, especially on CPUs without dedicated crypto
  * instructions.  Adiantum encrypts each sector using the XChaCha12 stream
- * cipher, two passes of an ε-almost-∆-universal (εA∆U) hash function based on
+ * cipher, two passes of an ε-almost-∆-universal (ε-∆U) hash function based on
  * NH and Poly1305, and an invocation of the AES-256 block cipher on a single
  * 16-byte block.  See the paper for details:
  *
@@ -21,12 +21,12 @@
  * - Stream cipher: XChaCha12 or XChaCha20
  * - Block cipher: any with a 128-bit block size and 256-bit key
  *
- * This implementation doesn't currently allow other εA∆U hash functions, i.e.
+ * This implementation doesn't currently allow other ε-∆U hash functions, i.e.
 * HPolyC is not supported.  This is because Adiantum is ~20% faster than HPolyC
- * but still provably as secure, and also the εA∆U hash function of HBSH is
+ * but still provably as secure, and also the ε-∆U hash function of HBSH is
 * formally defined to take two inputs (tweak, message) which makes it difficult
  * to wrap with the crypto_shash API.  Rather, some details need to be handled
- * here.  Nevertheless, if needed in the future, support for other εA∆U hash
+ * here.  Nevertheless, if needed in the future, support for other ε-∆U hash
  * functions could be added here.
  */
 
@@ -41,7 +41,7 @@
 #include "internal.h"
 
 /*
- * Size of right-hand block of input data, in bytes; also the size of the block
+ * Size of right-hand part of input data, in bytes; also the size of the block
  * cipher's block size and the hash function's output.
  */
 #define BLOCKCIPHER_BLOCK_SIZE 16
@@ -77,7 +77,7 @@ struct adiantum_tfm_ctx {
 struct adiantum_request_ctx {
 
/*
-* Buffer for right-hand block of data, i.e.
+* Buffer for right-hand part of data, i.e.
 *
 *P_L => P_M => C_M => C_R when encrypting, or
 *C_R => C_M => P_M => P_L when decrypting.
@@ -93,8 +93,8 @@ struct adiantum_request_ctx {
bool enc; /* true if encrypting, false if decrypting */
 
/*
-* The result of the Poly1305 εA∆U hash function applied to
-* (message length, tweak).
+* The result of the Poly1305 ε-∆U hash function applied to
+* (bulk length, tweak)
 */
le128 header_hash;
 
@@ -213,13 +213,16 @@ static inline void le128_sub(le128 *r, const le128 *v1, const le128 *v2)
 }
 
 /*
- * Apply the Poly1305 εA∆U hash function to (message length, tweak) and save the
- * result to rctx->header_hash.
+ * Apply the Poly1305 ε-∆U hash function to (bulk length, tweak) and save the
+ * result to rctx->header_hash.  This is the calculation
  *
- * This value is reused in both the first and second hash steps.  Specifically,
- * it's added to the result of an independently keyed εA∆U hash function (for
- * equal length inputs only) taken over the message.  This gives the overall
- * Adiantum hash of the (tweak, message) pair.
+ * H_T ← Poly1305_{K_T}(bin_{128}(|L|) || T)
+ *
+ * from the procedure in section 6.4 of the Adiantum paper.  The resulting value
+ * is reused in both the first and second hash steps.  Specifically, it's added
+ * to the result of an independently keyed ε-∆U hash function (for equal length
+ * inputs only) taken over the left-hand part (the "bulk") of the message, to
+ * give the overall Adiantum hash of the (tweak, left-hand part) pair.
  */
 static void adiantum_hash_header(struct skcipher_request *req)
 {
@@ -248,7 +251,7 @@ static void adiantum_hash_header(struct skcipher_request *req)
	poly1305_core_emit(&state, &rctx->header_hash);
 }
 
-/* Hash the left-hand block (the "bulk") of the message using NHPoly1305 */
+/* Hash the left-hand part (the "bulk") of the message using NHPoly1305 */
 static int adiantum_hash_message(struct skcipher_request *req,
 struct scatterlist *sgl, le128 *digest)
 {
@@ -550,7

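In the Adiantum construction described above, the header hash and the NHPoly1305 hash of the left-hand part are combined by adding 128-bit values interpreted as little-endian integers.  A minimal userspace sketch of that combining step, using a hypothetical u128_le type (the kernel's version is le128_add() in crypto/adiantum.c):

    #include <stdint.h>

    struct u128_le { uint64_t lo, hi; };    /* 128-bit value, little-endian halves */

    /* r = a + b (mod 2^128) */
    static void u128_le_add(struct u128_le *r, const struct u128_le *a,
                            const struct u128_le *b)
    {
            uint64_t lo = a->lo + b->lo;

            r->hi = a->hi + b->hi + (lo < a->lo);   /* carry out of the low half */
            r->lo = lo;
    }

The sum of the header hash and the NHPoly1305 hash of the left-hand part is the overall Adiantum hash, which is added to (and later subtracted from) the 16-byte right-hand block around the block cipher invocation.
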
[PATCH] crypto: xchacha20 - fix comments for test vectors

2018-12-06 Thread Eric Biggers
From: Eric Biggers 

The kernel's ChaCha20 uses the RFC7539 convention of the nonce being 12
bytes rather than 8, so actually I only appended 12 random bytes (not
16) to its test vectors to form 24-byte nonces for the XChaCha20 test
vectors.  The other 4 bytes were just from zero-padding the stream
position to 8 bytes.  Fix the comments above the test vectors.

Signed-off-by: Eric Biggers 
---
 crypto/testmgr.h | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 357cf4cbcbb1c..e8f47d7b92cdd 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -32281,8 +32281,9 @@ static const struct cipher_testvec 
xchacha20_tv_template[] = {
  "\x57\x78\x8e\x6f\xae\x90\xfc\x31"
  "\x09\x7c\xfc",
.len= 91,
-   }, { /* Taken from the ChaCha20 test vectors, appended 16 random bytes
-   to nonce, and recomputed the ciphertext with libsodium */
+   }, { /* Taken from the ChaCha20 test vectors, appended 12 random bytes
+   to the nonce, zero-padded the stream position from 4 to 8 bytes,
+   and recomputed the ciphertext using libsodium's XChaCha20 */
.key= "\x00\x00\x00\x00\x00\x00\x00\x00"
  "\x00\x00\x00\x00\x00\x00\x00\x00"
  "\x00\x00\x00\x00\x00\x00\x00\x00"
@@ -32309,8 +32310,7 @@ static const struct cipher_testvec 
xchacha20_tv_template[] = {
  "\x03\xdc\xf8\x2b\xc1\xe1\x75\x67"
  "\x23\x7b\xe6\xfc\xd4\x03\x86\x54",
.len= 64,
-   }, { /* Taken from the ChaCha20 test vectors, appended 16 random bytes
-   to nonce, and recomputed the ciphertext with libsodium */
+   }, { /* Derived from a ChaCha20 test vector, via the process above */
.key= "\x00\x00\x00\x00\x00\x00\x00\x00"
  "\x00\x00\x00\x00\x00\x00\x00\x00"
  "\x00\x00\x00\x00\x00\x00\x00\x00"
@@ -32419,8 +32419,7 @@ static const struct cipher_testvec 
xchacha20_tv_template[] = {
.np = 3,
.tap= { 375 - 20, 4, 16 },
 
-   }, { /* Taken from the ChaCha20 test vectors, appended 16 random bytes
-   to nonce, and recomputed the ciphertext with libsodium */
+   }, { /* Derived from a ChaCha20 test vector, via the process above */
.key= "\x1c\x92\x40\xa5\xeb\x55\xd3\x8a"
  "\xf3\x33\x88\x86\x04\xf6\xb5\xf0"
  "\x47\x39\x17\xc1\x40\x2b\x80\x09"
@@ -32463,8 +32462,7 @@ static const struct cipher_testvec 
xchacha20_tv_template[] = {
  "\x65\x03\xfa\x45\xf7\x9e\x53\x7a"
  "\x99\xf1\x82\x25\x4f\x8d\x07",
.len= 127,
-   }, { /* Taken from the ChaCha20 test vectors, appended 16 random bytes
-   to nonce, and recomputed the ciphertext with libsodium */
+   }, { /* Derived from a ChaCha20 test vector, via the process above */
.key= "\x1c\x92\x40\xa5\xeb\x55\xd3\x8a"
  "\xf3\x33\x88\x86\x04\xf6\xb5\xf0"
  "\x47\x39\x17\xc1\x40\x2b\x80\x09"
-- 
2.20.0.rc2.403.gdbc3b29805-goog

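Concretely, each 24-byte XChaCha20 test nonce was built as described above: the original 12-byte RFC 7539 nonce followed by 12 random bytes, with the stream position zero-padded separately.  A sketch of that construction (the helper and parameter names are made up for illustration, not taken from the patch):

    #include <stdint.h>
    #include <string.h>

    /* Build a 24-byte XChaCha20 nonce from a 12-byte ChaCha20 nonce plus
     * 12 freshly generated random bytes. */
    static void make_xchacha20_nonce(uint8_t xnonce[24],
                                     const uint8_t chacha20_nonce[12],
                                     const uint8_t random_bytes[12])
    {
            memcpy(xnonce, chacha20_nonce, 12);     /* original RFC 7539 nonce */
            memcpy(xnonce + 12, random_bytes, 12);  /* the 12 appended bytes   */
    }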


[PATCH] crypto: xchacha - add test vector from XChaCha20 draft RFC

2018-12-06 Thread Eric Biggers
From: Eric Biggers 

There is a draft specification for XChaCha20 being worked on.  Add the
XChaCha20 test vector from the appendix so that we can be extra sure the
kernel's implementation is compatible.

I also recomputed the ciphertext with XChaCha12 and added it there too,
to keep the tests for XChaCha20 and XChaCha12 in sync.

Signed-off-by: Eric Biggers 
---
 crypto/testmgr.h | 178 ++-
 1 file changed, 176 insertions(+), 2 deletions(-)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index e7e56a8febbca..357cf4cbcbb1c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -32800,7 +32800,94 @@ static const struct cipher_testvec 
xchacha20_tv_template[] = {
.also_non_np = 1,
.np = 3,
.tap= { 1200, 1, 80 },
-   },
+   }, { /* test vector from 
https://tools.ietf.org/html/draft-arciszewski-xchacha-02#appendix-A.3.2 */
+   .key= "\x80\x81\x82\x83\x84\x85\x86\x87"
+ "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f"
+ "\x90\x91\x92\x93\x94\x95\x96\x97"
+ "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f",
+   .klen   = 32,
+   .iv = "\x40\x41\x42\x43\x44\x45\x46\x47"
+ "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f"
+ "\x50\x51\x52\x53\x54\x55\x56\x58"
+ "\x00\x00\x00\x00\x00\x00\x00\x00",
+   .ptext  = "\x54\x68\x65\x20\x64\x68\x6f\x6c"
+ "\x65\x20\x28\x70\x72\x6f\x6e\x6f"
+ "\x75\x6e\x63\x65\x64\x20\x22\x64"
+ "\x6f\x6c\x65\x22\x29\x20\x69\x73"
+ "\x20\x61\x6c\x73\x6f\x20\x6b\x6e"
+ "\x6f\x77\x6e\x20\x61\x73\x20\x74"
+ "\x68\x65\x20\x41\x73\x69\x61\x74"
+ "\x69\x63\x20\x77\x69\x6c\x64\x20"
+ "\x64\x6f\x67\x2c\x20\x72\x65\x64"
+ "\x20\x64\x6f\x67\x2c\x20\x61\x6e"
+ "\x64\x20\x77\x68\x69\x73\x74\x6c"
+ "\x69\x6e\x67\x20\x64\x6f\x67\x2e"
+ "\x20\x49\x74\x20\x69\x73\x20\x61"
+ "\x62\x6f\x75\x74\x20\x74\x68\x65"
+ "\x20\x73\x69\x7a\x65\x20\x6f\x66"
+ "\x20\x61\x20\x47\x65\x72\x6d\x61"
+ "\x6e\x20\x73\x68\x65\x70\x68\x65"
+ "\x72\x64\x20\x62\x75\x74\x20\x6c"
+ "\x6f\x6f\x6b\x73\x20\x6d\x6f\x72"
+ "\x65\x20\x6c\x69\x6b\x65\x20\x61"
+ "\x20\x6c\x6f\x6e\x67\x2d\x6c\x65"
+ "\x67\x67\x65\x64\x20\x66\x6f\x78"
+ "\x2e\x20\x54\x68\x69\x73\x20\x68"
+ "\x69\x67\x68\x6c\x79\x20\x65\x6c"
+ "\x75\x73\x69\x76\x65\x20\x61\x6e"
+ "\x64\x20\x73\x6b\x69\x6c\x6c\x65"
+ "\x64\x20\x6a\x75\x6d\x70\x65\x72"
+ "\x20\x69\x73\x20\x63\x6c\x61\x73"
+ "\x73\x69\x66\x69\x65\x64\x20\x77"
+ "\x69\x74\x68\x20\x77\x6f\x6c\x76"
+ "\x65\x73\x2c\x20\x63\x6f\x79\x6f"
+ "\x74\x65\x73\x2c\x20\x6a\x61\x63"
+ "\x6b\x61\x6c\x73\x2c\x20\x61\x6e"
+ "\x64\x20\x66\x6f\x78\x65\x73\x20"
+ "\x69\x6e\x20\x74\x68\x65\x20\x74"
+ "\x61\x78\x6f\x6e\x6f\x6d\x69\x63"
+ "\x20\x66\x61\x6d\x69\x6c\x79\x20"
+ "\x43\x61\x6e\x69\x64\x61\x65\x2e",
+   .ctext  = "\x45\x59\xab\xba\x4e\x48\xc1\x61"
+ "\x02\xe8\xbb\x2c\x05\xe6\x94\x7f"
+ "\x50\xa7\x86\xde\x16\x2f\x9b\x0b"
+ "\x7e\x59\x2a\x9b\x53\xd0\xd4\xe9"
+ "\x8d\x8d\x64\x10\xd5\x40\xa1\xa6"
+ "\x37\x5b\x26\xd8\x0d\xac\xe4\xfa"
+ "\xb5\x23\x84\xc7\x31\xac\xbf\x16"
+ "\xa5\x92\x3c\x0c\x48\xd3\x57\x5d"
+ "\x4d\x0d\x2c\x67\x3b\x66\x6f\xaa"
+ "\x73\x10\x61\x27\x77\x01\x09\x3a"
+ "\x6b\xf7\xa1

[PATCH] crypto: adiantum - propagate CRYPTO_ALG_ASYNC flag to instance

2018-12-04 Thread Eric Biggers
From: Eric Biggers 

If the stream cipher implementation is asynchronous, then the Adiantum
instance must be flagged as asynchronous as well.  Otherwise someone
asking for a synchronous algorithm can get an asynchronous algorithm.

There are no asynchronous xchacha12 or xchacha20 implementations yet,
which makes this largely a theoretical issue, but it should be fixed.

Fixes: 059c2a4d8e16 ("crypto: adiantum - add Adiantum support")
Signed-off-by: Eric Biggers 
---
 crypto/adiantum.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/crypto/adiantum.c b/crypto/adiantum.c
index 2dfcf12fd4529..ca27e0dc2958c 100644
--- a/crypto/adiantum.c
+++ b/crypto/adiantum.c
@@ -590,6 +590,8 @@ static int adiantum_create(struct crypto_template *tmpl, 
struct rtattr **tb)
 hash_alg->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
goto out_drop_hash;
 
+   inst->alg.base.cra_flags = streamcipher_alg->base.cra_flags &
+  CRYPTO_ALG_ASYNC;
inst->alg.base.cra_blocksize = BLOCKCIPHER_BLOCK_SIZE;
inst->alg.base.cra_ctxsize = sizeof(struct adiantum_tfm_ctx);
inst->alg.base.cra_alignmask = streamcipher_alg->base.cra_alignmask |
-- 
2.20.0.rc1.387.gf8505762e3-goog

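For context, a caller that needs a synchronous transform requests one by setting CRYPTO_ALG_ASYNC in the mask while leaving it clear in the type, roughly as in this illustrative fragment (not from the patch); without the fix above, such a request could still return an asynchronous Adiantum instance if the underlying stream cipher implementation were asynchronous:

    struct crypto_skcipher *tfm;

    /* "adiantum(xchacha12,aes)", and it must be synchronous */
    tfm = crypto_alloc_skcipher("adiantum(xchacha12,aes)", 0, CRYPTO_ALG_ASYNC);
    if (IS_ERR(tfm))
            return PTR_ERR(tfm);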


Re: [PATCH] fscrypt: remove CRYPTO_CTR dependency

2018-12-04 Thread Eric Biggers
On Thu, Sep 06, 2018 at 12:43:41PM +0200, Ard Biesheuvel wrote:
> On 5 September 2018 at 21:24, Eric Biggers  wrote:
> > From: Eric Biggers 
> >
> > fscrypt doesn't use the CTR mode of operation for anything, so there's
> > no need to select CRYPTO_CTR.  It was added by commit 71dea01ea2ed
> > ("ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is
> > enabled").  But, I've been unable to identify the arm64 crypto bug it
> > was supposedly working around.
> >
> > I suspect the issue was seen only on some old Android device kernel
> > (circa 3.10?).  So if the fix wasn't mistaken, the real bug is probably
> > already fixed.  Or maybe it was actually a bug in a non-upstream crypto
> > driver.
> >
> > So, remove the dependency.  If it turns out there's actually still a
> > bug, we'll fix it properly.
> >
> > Signed-off-by: Eric Biggers 
> 
> Acked-by: Ard Biesheuvel 
> 
> This may be related to
> 
> 11e3b725cfc2 crypto: arm64/aes-blk - honour iv_out requirement in CBC
> and CTR modes
> 
> given that the commit in question mentions CTS. How it actually works
> around the issue is unclear to me, though.
> 
> 
> 
> 
> > ---
> >  fs/crypto/Kconfig | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/fs/crypto/Kconfig b/fs/crypto/Kconfig
> > index 02b7d91c92310..284b589b4774d 100644
> > --- a/fs/crypto/Kconfig
> > +++ b/fs/crypto/Kconfig
> > @@ -6,7 +6,6 @@ config FS_ENCRYPTION
> > select CRYPTO_ECB
> > select CRYPTO_XTS
> > select CRYPTO_CTS
> > -   select CRYPTO_CTR
> > select CRYPTO_SHA256
> > select KEYS
> > help
> > --
> > 2.19.0.rc2.392.g5ba43deb5a-goog
> >

Ping.  Ted, can you consider applying this to the fscrypt tree for 4.21?

Thanks,

- Eric


[PATCH] crypto: drop mask=CRYPTO_ALG_ASYNC from 'shash' tfm allocations

2018-11-14 Thread Eric Biggers
From: Eric Biggers 

'shash' algorithms are always synchronous, so passing CRYPTO_ALG_ASYNC
in the mask to crypto_alloc_shash() has no effect.  Many users therefore
already don't pass it, but some still do.  This inconsistency can cause
confusion, especially since the way the 'mask' argument works is
somewhat counterintuitive.

Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 drivers/block/drbd/drbd_receiver.c  | 2 +-
 drivers/md/dm-integrity.c   | 2 +-
 drivers/net/wireless/intersil/orinoco/mic.c | 6 ++
 fs/ubifs/auth.c | 5 ++---
 net/bluetooth/smp.c | 2 +-
 security/apparmor/crypto.c  | 2 +-
 security/integrity/evm/evm_crypto.c | 3 +--
 security/keys/encrypted-keys/encrypted.c| 4 ++--
 security/keys/trusted.c | 4 ++--
 9 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 61c392752fe4b..ccfcf00f2798d 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -3623,7 +3623,7 @@ static int receive_protocol(struct drbd_connection 
*connection, struct packet_in
 * change.
 */
 
-   peer_integrity_tfm = crypto_alloc_shash(integrity_alg, 0, 
CRYPTO_ALG_ASYNC);
+   peer_integrity_tfm = crypto_alloc_shash(integrity_alg, 0, 0);
if (IS_ERR(peer_integrity_tfm)) {
peer_integrity_tfm = NULL;
drbd_err(connection, "peer data-integrity-alg %s not 
supported\n",
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index bb3096bf2cc6b..d4ad0bfee2519 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -2804,7 +2804,7 @@ static int get_mac(struct crypto_shash **hash, struct 
alg_spec *a, char **error,
int r;
 
if (a->alg_string) {
-   *hash = crypto_alloc_shash(a->alg_string, 0, CRYPTO_ALG_ASYNC);
+   *hash = crypto_alloc_shash(a->alg_string, 0, 0);
if (IS_ERR(*hash)) {
*error = error_alg;
r = PTR_ERR(*hash);
diff --git a/drivers/net/wireless/intersil/orinoco/mic.c 
b/drivers/net/wireless/intersil/orinoco/mic.c
index 08bc7822f8209..709d9ab3e7bcb 100644
--- a/drivers/net/wireless/intersil/orinoco/mic.c
+++ b/drivers/net/wireless/intersil/orinoco/mic.c
@@ -16,8 +16,7 @@
 //
 int orinoco_mic_init(struct orinoco_private *priv)
 {
-   priv->tx_tfm_mic = crypto_alloc_shash("michael_mic", 0,
- CRYPTO_ALG_ASYNC);
+   priv->tx_tfm_mic = crypto_alloc_shash("michael_mic", 0, 0);
if (IS_ERR(priv->tx_tfm_mic)) {
printk(KERN_DEBUG "orinoco_mic_init: could not allocate "
   "crypto API michael_mic\n");
@@ -25,8 +24,7 @@ int orinoco_mic_init(struct orinoco_private *priv)
return -ENOMEM;
}
 
-   priv->rx_tfm_mic = crypto_alloc_shash("michael_mic", 0,
- CRYPTO_ALG_ASYNC);
+   priv->rx_tfm_mic = crypto_alloc_shash("michael_mic", 0, 0);
if (IS_ERR(priv->rx_tfm_mic)) {
printk(KERN_DEBUG "orinoco_mic_init: could not allocate "
   "crypto API michael_mic\n");
diff --git a/fs/ubifs/auth.c b/fs/ubifs/auth.c
index 124e965a28b30..5bf5fd08879e6 100644
--- a/fs/ubifs/auth.c
+++ b/fs/ubifs/auth.c
@@ -269,8 +269,7 @@ int ubifs_init_authentication(struct ubifs_info *c)
goto out;
}
 
-   c->hash_tfm = crypto_alloc_shash(c->auth_hash_name, 0,
-CRYPTO_ALG_ASYNC);
+   c->hash_tfm = crypto_alloc_shash(c->auth_hash_name, 0, 0);
if (IS_ERR(c->hash_tfm)) {
err = PTR_ERR(c->hash_tfm);
ubifs_err(c, "Can not allocate %s: %d",
@@ -286,7 +285,7 @@ int ubifs_init_authentication(struct ubifs_info *c)
goto out_free_hash;
}
 
-   c->hmac_tfm = crypto_alloc_shash(hmac_name, 0, CRYPTO_ALG_ASYNC);
+   c->hmac_tfm = crypto_alloc_shash(hmac_name, 0, 0);
if (IS_ERR(c->hmac_tfm)) {
err = PTR_ERR(c->hmac_tfm);
ubifs_err(c, "Can not allocate %s: %d", hmac_name, err);
diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
index 1f94a25beef69..621146d04c038 100644
--- a/net/bluetooth/smp.c
+++ b/net/bluetooth/smp.c
@@ -3912,7 +3912,7 @@ int __init bt_selftest_smp(void)
return PTR_ERR(tfm_aes);
}
 
-   tfm_cmac = crypto_alloc_shash("cmac

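The counterintuitive part of the mask semantics mentioned above: loosely speaking, crypto_alloc_shash(name, type, mask) only considers algorithms whose flags agree with 'type' on the bits set in 'mask'.  Since no shash implementation ever sets CRYPTO_ALG_ASYNC, the two calls in this sketch select from exactly the same set of algorithms, which is why dropping the flag changes nothing:

    struct crypto_shash *a, *b;

    a = crypto_alloc_shash("sha256", 0, CRYPTO_ALG_ASYNC); /* redundant mask bit */
    b = crypto_alloc_shash("sha256", 0, 0);                /* equivalent request */
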
[PATCH] crypto: drop mask=CRYPTO_ALG_ASYNC from 'cipher' tfm allocations

2018-11-14 Thread Eric Biggers
From: Eric Biggers 

'cipher' algorithms (single block ciphers) are always synchronous, so
passing CRYPTO_ALG_ASYNC in the mask to crypto_alloc_cipher() has no
effect.  Many users therefore already don't pass it, but some still do.
This inconsistency can cause confusion, especially since the way the
'mask' argument works is somewhat counterintuitive.

Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 arch/s390/crypto/aes_s390.c   | 2 +-
 drivers/crypto/amcc/crypto4xx_alg.c   | 3 +--
 drivers/crypto/ccp/ccp-crypto-aes-cmac.c  | 4 +---
 drivers/crypto/geode-aes.c| 2 +-
 drivers/md/dm-crypt.c | 2 +-
 drivers/net/wireless/cisco/airo.c | 2 +-
 drivers/staging/rtl8192e/rtllib_crypt_ccmp.c  | 2 +-
 drivers/staging/rtl8192u/ieee80211/ieee80211_crypt_ccmp.c | 2 +-
 drivers/usb/wusbcore/crypto.c | 2 +-
 net/bluetooth/smp.c   | 6 +++---
 net/mac80211/wep.c| 4 ++--
 net/wireless/lib80211_crypt_ccmp.c| 2 +-
 net/wireless/lib80211_crypt_tkip.c| 4 ++--
 net/wireless/lib80211_crypt_wep.c | 4 ++--
 14 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/s390/crypto/aes_s390.c b/arch/s390/crypto/aes_s390.c
index 812d9498d97be..dd456725189f2 100644
--- a/arch/s390/crypto/aes_s390.c
+++ b/arch/s390/crypto/aes_s390.c
@@ -137,7 +137,7 @@ static int fallback_init_cip(struct crypto_tfm *tfm)
struct s390_aes_ctx *sctx = crypto_tfm_ctx(tfm);
 
sctx->fallback.cip = crypto_alloc_cipher(name, 0,
-   CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK);
+CRYPTO_ALG_NEED_FALLBACK);
 
if (IS_ERR(sctx->fallback.cip)) {
pr_err("Allocating AES fallback algorithm %s failed\n",
diff --git a/drivers/crypto/amcc/crypto4xx_alg.c 
b/drivers/crypto/amcc/crypto4xx_alg.c
index f5c07498ea4f0..4092c2aad8e21 100644
--- a/drivers/crypto/amcc/crypto4xx_alg.c
+++ b/drivers/crypto/amcc/crypto4xx_alg.c
@@ -520,8 +520,7 @@ static int crypto4xx_compute_gcm_hash_key_sw(__le32 
*hash_start, const u8 *key,
uint8_t src[16] = { 0 };
int rc = 0;
 
-   aes_tfm = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC |
- CRYPTO_ALG_NEED_FALLBACK);
+   aes_tfm = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_NEED_FALLBACK);
if (IS_ERR(aes_tfm)) {
rc = PTR_ERR(aes_tfm);
pr_warn("could not load aes cipher driver: %d\n", rc);
diff --git a/drivers/crypto/ccp/ccp-crypto-aes-cmac.c 
b/drivers/crypto/ccp/ccp-crypto-aes-cmac.c
index 3c6fe57f91f8c..9108015e56cc5 100644
--- a/drivers/crypto/ccp/ccp-crypto-aes-cmac.c
+++ b/drivers/crypto/ccp/ccp-crypto-aes-cmac.c
@@ -346,9 +346,7 @@ static int ccp_aes_cmac_cra_init(struct crypto_tfm *tfm)
 
crypto_ahash_set_reqsize(ahash, sizeof(struct ccp_aes_cmac_req_ctx));
 
-   cipher_tfm = crypto_alloc_cipher("aes", 0,
-CRYPTO_ALG_ASYNC |
-CRYPTO_ALG_NEED_FALLBACK);
+   cipher_tfm = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_NEED_FALLBACK);
if (IS_ERR(cipher_tfm)) {
pr_warn("could not load aes cipher driver\n");
return PTR_ERR(cipher_tfm);
diff --git a/drivers/crypto/geode-aes.c b/drivers/crypto/geode-aes.c
index eb2a0a73cbed1..b4c24a35b3d08 100644
--- a/drivers/crypto/geode-aes.c
+++ b/drivers/crypto/geode-aes.c
@@ -261,7 +261,7 @@ static int fallback_init_cip(struct crypto_tfm *tfm)
struct geode_aes_op *op = crypto_tfm_ctx(tfm);
 
op->fallback.cip = crypto_alloc_cipher(name, 0,
-   CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK);
+  CRYPTO_ALG_NEED_FALLBACK);
 
if (IS_ERR(op->fallback.cip)) {
printk(KERN_ERR "Error allocating fallback algo %s\n", name);
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index b8eec515a003c..a7195eb5b8d89 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -377,7 +377,7 @@ static struct crypto_cipher *alloc_essiv_cipher(struct 
crypt_config *cc,
int err;
 
/* Setup the essiv_tfm with the given salt */
-   essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, CRYPTO_ALG_ASYNC);
+   essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, 0);
if (IS_ERR(essiv_tfm)) {
ti->error = "Error allocating crypto tfm for ESSIV";
return essiv_tfm;
diff --git a/drivers/net/wireles

[PATCH] crypto: remove useless initializations of cra_list

2018-11-14 Thread Eric Biggers
From: Eric Biggers 

Some algorithms initialize their .cra_list prior to registration.
But this is unnecessary since crypto_register_alg() will overwrite
.cra_list when adding the algorithm to the 'crypto_alg_list'.
Apparently the useless assignment has just been copy+pasted around.

So, remove the useless assignments.

Exception: paes_s390.c uses cra_list to check whether the algorithm is
registered or not, so I left that as-is for now.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 arch/sparc/crypto/aes_glue.c  | 5 -
 arch/sparc/crypto/camellia_glue.c | 5 -
 arch/sparc/crypto/des_glue.c  | 5 -
 crypto/lz4.c  | 1 -
 crypto/lz4hc.c| 1 -
 drivers/crypto/bcm/cipher.c   | 2 --
 drivers/crypto/omap-aes.c | 2 --
 drivers/crypto/omap-des.c | 1 -
 drivers/crypto/qce/ablkcipher.c   | 1 -
 drivers/crypto/qce/sha.c  | 1 -
 drivers/crypto/sahara.c   | 1 -
 11 files changed, 25 deletions(-)

diff --git a/arch/sparc/crypto/aes_glue.c b/arch/sparc/crypto/aes_glue.c
index 3cd4f6b198b65..a9b8b0b94a8d4 100644
--- a/arch/sparc/crypto/aes_glue.c
+++ b/arch/sparc/crypto/aes_glue.c
@@ -476,11 +476,6 @@ static bool __init sparc64_has_aes_opcode(void)
 
 static int __init aes_sparc64_mod_init(void)
 {
-   int i;
-
-   for (i = 0; i < ARRAY_SIZE(algs); i++)
-   INIT_LIST_HEAD(&algs[i].cra_list);
-
if (sparc64_has_aes_opcode()) {
pr_info("Using sparc64 aes opcodes optimized AES 
implementation\n");
return crypto_register_algs(algs, ARRAY_SIZE(algs));
diff --git a/arch/sparc/crypto/camellia_glue.c 
b/arch/sparc/crypto/camellia_glue.c
index 561a84d93cf68..900d5c617e83b 100644
--- a/arch/sparc/crypto/camellia_glue.c
+++ b/arch/sparc/crypto/camellia_glue.c
@@ -299,11 +299,6 @@ static bool __init sparc64_has_camellia_opcode(void)
 
 static int __init camellia_sparc64_mod_init(void)
 {
-   int i;
-
-   for (i = 0; i < ARRAY_SIZE(algs); i++)
-   INIT_LIST_HEAD(&algs[i].cra_list);
-
if (sparc64_has_camellia_opcode()) {
pr_info("Using sparc64 camellia opcodes optimized CAMELLIA 
implementation\n");
return crypto_register_algs(algs, ARRAY_SIZE(algs));
diff --git a/arch/sparc/crypto/des_glue.c b/arch/sparc/crypto/des_glue.c
index 61af794aa2d31..56499ea39fd36 100644
--- a/arch/sparc/crypto/des_glue.c
+++ b/arch/sparc/crypto/des_glue.c
@@ -510,11 +510,6 @@ static bool __init sparc64_has_des_opcode(void)
 
 static int __init des_sparc64_mod_init(void)
 {
-   int i;
-
-   for (i = 0; i < ARRAY_SIZE(algs); i++)
-   INIT_LIST_HEAD(&algs[i].cra_list);
-
if (sparc64_has_des_opcode()) {
pr_info("Using sparc64 des opcodes optimized DES 
implementation\n");
return crypto_register_algs(algs, ARRAY_SIZE(algs));
diff --git a/crypto/lz4.c b/crypto/lz4.c
index 2ce2660d3519e..c160dfdbf2e07 100644
--- a/crypto/lz4.c
+++ b/crypto/lz4.c
@@ -122,7 +122,6 @@ static struct crypto_alg alg_lz4 = {
.cra_flags  = CRYPTO_ALG_TYPE_COMPRESS,
.cra_ctxsize= sizeof(struct lz4_ctx),
.cra_module = THIS_MODULE,
-   .cra_list   = LIST_HEAD_INIT(alg_lz4.cra_list),
.cra_init   = lz4_init,
.cra_exit   = lz4_exit,
.cra_u  = { .compress = {
diff --git a/crypto/lz4hc.c b/crypto/lz4hc.c
index 2be14f054dafd..583b5e013d7a5 100644
--- a/crypto/lz4hc.c
+++ b/crypto/lz4hc.c
@@ -123,7 +123,6 @@ static struct crypto_alg alg_lz4hc = {
.cra_flags  = CRYPTO_ALG_TYPE_COMPRESS,
.cra_ctxsize= sizeof(struct lz4hc_ctx),
.cra_module = THIS_MODULE,
-   .cra_list   = LIST_HEAD_INIT(alg_lz4hc.cra_list),
.cra_init   = lz4hc_init,
.cra_exit   = lz4hc_exit,
.cra_u  = { .compress = {
diff --git a/drivers/crypto/bcm/cipher.c b/drivers/crypto/bcm/cipher.c
index 2d1f1db9f8074..8808eacc65801 100644
--- a/drivers/crypto/bcm/cipher.c
+++ b/drivers/crypto/bcm/cipher.c
@@ -4605,7 +4605,6 @@ static int spu_register_ablkcipher(struct iproc_alg_s 
*driver_alg)
crypto->cra_priority = cipher_pri;
crypto->cra_alignmask = 0;
crypto->cra_ctxsize = sizeof(struct iproc_ctx_s);
-   INIT_LIST_HEAD(&crypto->cra_list);
 
crypto->cra_init = ablkcipher_cra_init;
crypto->cra_exit = generic_cra_exit;
@@ -4687,7 +4686,6 @@ static int spu_register_aead(struct iproc_alg_s 
*driver_alg)
aead->base.cra_priority = aead_pri;
aead->base.cra_alignmask = 0;
aead->base.cra_ctxsize = sizeof(struct iproc_ctx_s);
-   INIT_LIST_HEAD(&aead->base.cra_list);
 
aead->base.cra_flags |= CRYPTO_ALG_ASYNC;
/* setkey set in alg initializati

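In one line, the reason the INIT_LIST_HEAD() calls are dead code: registration splices the algorithm into the global list and overwrites cra_list's next/prev pointers regardless of how (or whether) they were initialized.  Simplified from the registration path in crypto/algapi.c, with locking and checks omitted:

    /* inside __crypto_register_alg(), roughly */
    list_add(&alg->cra_list, &crypto_alg_list);
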
[PATCH] crypto: inside-secure - remove useless setting of type flags

2018-11-14 Thread Eric Biggers
From: Eric Biggers 

Remove the unnecessary setting of CRYPTO_ALG_TYPE_SKCIPHER.
Commit 2c95e6d97892 ("crypto: skcipher - remove useless setting of type
flags") took care of this everywhere else, but a few more instances made
it into the tree at about the same time.  Squash them before they get
copy+pasted around again.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 drivers/crypto/inside-secure/safexcel_cipher.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/inside-secure/safexcel_cipher.c 
b/drivers/crypto/inside-secure/safexcel_cipher.c
index 3aef1d43e4351..d531c14020dcb 100644
--- a/drivers/crypto/inside-secure/safexcel_cipher.c
+++ b/drivers/crypto/inside-secure/safexcel_cipher.c
@@ -970,7 +970,7 @@ struct safexcel_alg_template safexcel_alg_cbc_des = {
.cra_name = "cbc(des)",
.cra_driver_name = "safexcel-cbc-des",
.cra_priority = 300,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | 
CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
 CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = DES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct safexcel_cipher_ctx),
@@ -1010,7 +1010,7 @@ struct safexcel_alg_template safexcel_alg_ecb_des = {
.cra_name = "ecb(des)",
.cra_driver_name = "safexcel-ecb-des",
.cra_priority = 300,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | 
CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
 CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = DES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct safexcel_cipher_ctx),
@@ -1074,7 +1074,7 @@ struct safexcel_alg_template safexcel_alg_cbc_des3_ede = {
.cra_name = "cbc(des3_ede)",
.cra_driver_name = "safexcel-cbc-des3_ede",
.cra_priority = 300,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | 
CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
 CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = DES3_EDE_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct safexcel_cipher_ctx),
@@ -1114,7 +1114,7 @@ struct safexcel_alg_template safexcel_alg_ecb_des3_ede = {
.cra_name = "ecb(des3_ede)",
.cra_driver_name = "safexcel-ecb-des3_ede",
.cra_priority = 300,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER | 
CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
 CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = DES3_EDE_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct safexcel_cipher_ctx),
-- 
2.19.1.930.g4563a0d9d0-goog

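For reference, the type flag is redundant because the skcipher registration path forces the type bits itself, along these lines (simplified from skcipher_prepare_alg() in crypto/skcipher.c; surrounding setup omitted):

    base->cra_flags &= ~CRYPTO_ALG_TYPE_MASK;
    base->cra_flags |= CRYPTO_ALG_TYPE_SKCIPHER;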


Re: [PATCH v3 2/2] crypto: arm/aes - add some hardening against cache-timing attacks

2018-10-19 Thread Eric Biggers
On Fri, Oct 19, 2018 at 05:54:12PM +0800, Ard Biesheuvel wrote:
> On 19 October 2018 at 13:41, Ard Biesheuvel  wrote:
> > On 18 October 2018 at 12:37, Eric Biggers  wrote:
> >> From: Eric Biggers 
> >>
> >> Make the ARM scalar AES implementation closer to constant-time by
> >> disabling interrupts and prefetching the tables into L1 cache.  This is
> >> feasible because due to ARM's "free" rotations, the main tables are only
> >> 1024 bytes instead of the usual 4096 used by most AES implementations.
> >>
> >> On ARM Cortex-A7, the speed loss is only about 5%.  The resulting code
> >> is still over twice as fast as aes_ti.c.  Responsiveness is potentially
> >> a concern, but interrupts are only disabled for a single AES block.
> >>
> >
> > So that would be in the order of 700 cycles, based on the numbers you
> > shared in v1 of the aes_ti.c patch. Does that sound about right? So
> > that would be around 1 microsecond, which is really not a number to
> > obsess about imo.
> >
> > I considered another option, which is to detect whether an interrupt
> > has been taken (by writing some canary value below that stack pointer
> > in the location where the exception handler will preserve the value of
> > sp, and checking at the end whether it has been modified) and doing a
> > usleep_range(x, y) if that is the case.
> >
> > But this is much simpler so let's only go there if we must.
> >
> 
> I played around a bit and implemented it for discussion purposes, but
> restarting the operation if it gets interrupted, as suggested in the
> paper (whitespace corruption courtesy of Gmail)
> 
> 
> diff --git a/arch/arm/crypto/aes-cipher-core.S
> b/arch/arm/crypto/aes-cipher-core.S
> index 184d6c2d15d5..2e8a84a47784 100644
> --- a/arch/arm/crypto/aes-cipher-core.S
> +++ b/arch/arm/crypto/aes-cipher-core.S
> @@ -10,6 +10,7 @@
>   */
> 
>  #include 
> +#include 
>  #include 
> 
>   .text
> @@ -139,6 +140,34 @@
> 
>   __adrl ttab, \ttab
> 
> + /*
> + * Set a canary that will allow us to tell whether any
> + * interrupts were taken while this function was executing.
> + * The zero value will be overwritten with the program counter
> + * value at the point where the IRQ exception is taken.
> + */
> + mov t0, #0
> + str t0, [sp, #-(SVC_REGS_SIZE - S_PC)]
> +
> + /*
> + * Prefetch the 1024-byte 'ft' or 'it' table into L1 cache,
> + * assuming cacheline size >= 32.  This is a hardening measure
> + * intended to make cache-timing attacks more difficult.
> + * They may not be fully prevented, however; see the paper
> + * https://cr.yp.to/antiforgery/cachetiming-20050414.pdf
> + * ("Cache-timing attacks on AES") for a discussion of the many
> + * difficulties involved in writing truly constant-time AES
> + * software.
> + */
> + .set i, 0
> + .rept 1024 / 128
> + ldr r8, [ttab, #i + 0]
> + ldr r9, [ttab, #i + 32]
> + ldr r10, [ttab, #i + 64]
> + ldr r11, [ttab, #i + 96]
> + .set i, i + 128
> + .endr
> +
>   tst rounds, #2
>   bne 1f
> 
> @@ -154,6 +183,8 @@
>  2: __adrl ttab, \ltab
>   \round r4, r5, r6, r7, r8, r9, r10, r11, \bsz, b
> 
> + ldr r0, [sp, #-(SVC_REGS_SIZE - S_PC)] // check canary
> +
>  #ifdef CONFIG_CPU_BIG_ENDIAN
>   __rev r4, r4
>   __rev r5, r5
> diff --git a/arch/arm/crypto/aes-cipher-glue.c
> b/arch/arm/crypto/aes-cipher-glue.c
> index c222f6e072ad..de8f32121511 100644
> --- a/arch/arm/crypto/aes-cipher-glue.c
> +++ b/arch/arm/crypto/aes-cipher-glue.c
> @@ -11,28 +11,39 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
> 
> -asmlinkage void __aes_arm_encrypt(u32 *rk, int rounds, const u8 *in, u8 
> *out);
> +asmlinkage int __aes_arm_encrypt(u32 *rk, int rounds, const u8 *in, u8 *out);
>  EXPORT_SYMBOL(__aes_arm_encrypt);
> 
> -asmlinkage void __aes_arm_decrypt(u32 *rk, int rounds, const u8 *in, u8 
> *out);
> +asmlinkage int __aes_arm_decrypt(u32 *rk, int rounds, const u8 *in, u8 *out);
>  EXPORT_SYMBOL(__aes_arm_decrypt);
> 
>  static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
>  {
>   struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
>   int rounds = 6 + ctx->key_length / 4;
> + u8 buf[AES_BLOCK_SIZE];
> 
> - __aes_arm_encrypt(ctx->key_enc, rounds, in, out);
> + if (out == in)
> +   in = memcpy(buf, in, AES_BLOCK_SIZE);
> +
> + while (unlikely(__aes_arm_encrypt(ctx->key_enc, rounds, in, out)))
> +   cpu_relax();
>  }
> 
>  static void aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
>  {
>   struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
>   int ro

Re: [PATCH v3 2/2] crypto: arm/aes - add some hardening against cache-timing attacks

2018-10-19 Thread Eric Biggers
On Fri, Oct 19, 2018 at 01:41:35PM +0800, Ard Biesheuvel wrote:
> On 18 October 2018 at 12:37, Eric Biggers  wrote:
> > From: Eric Biggers 
> >
> > Make the ARM scalar AES implementation closer to constant-time by
> > disabling interrupts and prefetching the tables into L1 cache.  This is
> > feasible because due to ARM's "free" rotations, the main tables are only
> > 1024 bytes instead of the usual 4096 used by most AES implementations.
> >
> > On ARM Cortex-A7, the speed loss is only about 5%.  The resulting code
> > is still over twice as fast as aes_ti.c.  Responsiveness is potentially
> > a concern, but interrupts are only disabled for a single AES block.
> >
> 
> So that would be in the order of 700 cycles, based on the numbers you
> shared in v1 of the aes_ti.c patch. Does that sound about right? So
> that would be around 1 microsecond, which is really not a number to
> obsess about imo.
> 

Correct, on ARM Cortex-A7 I'm seeing slightly over 700 cycles per block
encrypted or decrypted, including the prefetching.

- Eric


[PATCH v3 2/2] crypto: arm/aes - add some hardening against cache-timing attacks

2018-10-17 Thread Eric Biggers
From: Eric Biggers 

Make the ARM scalar AES implementation closer to constant-time by
disabling interrupts and prefetching the tables into L1 cache.  This is
feasible because due to ARM's "free" rotations, the main tables are only
1024 bytes instead of the usual 4096 used by most AES implementations.

On ARM Cortex-A7, the speed loss is only about 5%.  The resulting code
is still over twice as fast as aes_ti.c.  Responsiveness is potentially
a concern, but interrupts are only disabled for a single AES block.

Note that even after these changes, the implementation still isn't
necessarily guaranteed to be constant-time; see
https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion
of the many difficulties involved in writing truly constant-time AES
software.  But it's valuable to make such attacks more difficult.

Much of this patch is based on patches suggested by Ard Biesheuvel.

Suggested-by: Ard Biesheuvel 
Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/Kconfig   |  9 +
 arch/arm/crypto/aes-cipher-core.S | 62 ++-
 crypto/aes_generic.c  |  9 +++--
 3 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index ef0c7feea6e29..0473a8f683896 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -69,6 +69,15 @@ config CRYPTO_AES_ARM
help
  Use optimized AES assembler routines for ARM platforms.
 
+ On ARM processors without the Crypto Extensions, this is the
+ fastest AES implementation for single blocks.  For multiple
+ blocks, the NEON bit-sliced implementation is usually faster.
+
+ This implementation may be vulnerable to cache timing attacks,
+ since it uses lookup tables.  However, as countermeasures it
+ disables IRQs and preloads the tables; it is hoped this makes
+ such attacks very difficult.
+
 config CRYPTO_AES_ARM_BS
tristate "Bit sliced AES using NEON instructions"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm/crypto/aes-cipher-core.S 
b/arch/arm/crypto/aes-cipher-core.S
index 184d6c2d15d5e..f2d67c095e596 100644
--- a/arch/arm/crypto/aes-cipher-core.S
+++ b/arch/arm/crypto/aes-cipher-core.S
@@ -10,6 +10,7 @@
  */
 
 #include 
+#include 
 #include 
 
.text
@@ -41,7 +42,7 @@
.endif
.endm
 
-   .macro  __hround, out0, out1, in0, in1, in2, in3, t3, t4, enc, 
sz, op
+   .macro  __hround, out0, out1, in0, in1, in2, in3, t3, t4, enc, 
sz, op, oldcpsr
__select\out0, \in0, 0
__selectt0, \in1, 1
__load  \out0, \out0, 0, \sz, \op
@@ -73,6 +74,14 @@
__load  t0, t0, 3, \sz, \op
__load  \t4, \t4, 3, \sz, \op
 
+   .ifnb   \oldcpsr
+   /*
+* This is the final round and we're done with all data-dependent table
+* lookups, so we can safely re-enable interrupts.
+*/
+   restore_irqs\oldcpsr
+   .endif
+
eor \out1, \out1, t1, ror #24
eor \out0, \out0, t2, ror #16
ldm rk!, {t1, t2}
@@ -83,14 +92,14 @@
eor \out1, \out1, t2
.endm
 
-   .macro  fround, out0, out1, out2, out3, in0, in1, in2, in3, 
sz=2, op
+   .macro  fround, out0, out1, out2, out3, in0, in1, in2, in3, 
sz=2, op, oldcpsr
__hround\out0, \out1, \in0, \in1, \in2, \in3, \out2, \out3, 1, 
\sz, \op
-   __hround\out2, \out3, \in2, \in3, \in0, \in1, \in1, \in2, 1, 
\sz, \op
+   __hround\out2, \out3, \in2, \in3, \in0, \in1, \in1, \in2, 1, 
\sz, \op, \oldcpsr
.endm
 
-   .macro  iround, out0, out1, out2, out3, in0, in1, in2, in3, 
sz=2, op
+   .macro  iround, out0, out1, out2, out3, in0, in1, in2, in3, 
sz=2, op, oldcpsr
__hround\out0, \out1, \in0, \in3, \in2, \in1, \out2, \out3, 0, 
\sz, \op
-   __hround\out2, \out3, \in2, \in1, \in0, \in3, \in1, \in0, 0, 
\sz, \op
+   __hround\out2, \out3, \in2, \in1, \in0, \in3, \in1, \in0, 0, 
\sz, \op, \oldcpsr
.endm
 
.macro  __rev, out, in
@@ -118,13 +127,14 @@
.macro  do_crypt, round, ttab, ltab, bsz
push{r3-r11, lr}
 
+   // Load keys first, to reduce latency in case they're not cached yet.
+   ldm rk!, {r8-r11}
+
ldr r4, [in]
ldr r5, [in, #4]
ldr r6, [in, #8]
ldr r7, [in, #12]
 
-   ldm rk!, {r8-r11}
-
 #ifdef CONFIG_CPU_BIG_ENDIAN
__rev   r4, r4
__rev   r5, r5
@@ -138,6 +148,25 @@
eor r7, r7, r11
 
__adrl  ttab, \ttab
+   /*
+* Disable interrupts and prefetch the 1024-byte 'ft' or 'it' table into
+* L1 cache

[PATCH v3 1/2] crypto: aes_ti - disable interrupts while accessing S-box

2018-10-17 Thread Eric Biggers
From: Eric Biggers 

In the "aes-fixed-time" AES implementation, disable interrupts while
accessing the S-box, in order to make cache-timing attacks more
difficult.  Previously it was possible for the CPU to be interrupted
while the S-box was loaded into L1 cache, potentially evicting the
cachelines and causing later table lookups to be time-variant.

In tests I did on x86 and ARM, this doesn't affect performance
significantly.  Responsiveness is potentially a concern, but interrupts
are only disabled for a single AES block.

Note that even after this change, the implementation still isn't
necessarily guaranteed to be constant-time; see
https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion
of the many difficulties involved in writing truly constant-time AES
software.  But it's valuable to make such attacks more difficult.

Reviewed-by: Ard Biesheuvel 
Signed-off-by: Eric Biggers 
---
 crypto/Kconfig  |  3 ++-
 crypto/aes_ti.c | 18 ++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index f2c19cc63c778..f6db916f1b760 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1006,7 +1006,8 @@ config CRYPTO_AES_TI
  8 for decryption), this implementation only uses just two S-boxes of
  256 bytes each, and attempts to eliminate data dependent latencies by
  prefetching the entire table into the cache at the start of each
- block.
+ block. Interrupts are also disabled to avoid races where cachelines
+ are evicted when the CPU is interrupted to do something else.
 
 config CRYPTO_AES_586
tristate "AES cipher algorithms (i586)"
diff --git a/crypto/aes_ti.c b/crypto/aes_ti.c
index 03023b2290e8e..1ff9785b30f55 100644
--- a/crypto/aes_ti.c
+++ b/crypto/aes_ti.c
@@ -269,6 +269,7 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
const u32 *rkp = ctx->key_enc + 4;
int rounds = 6 + ctx->key_length / 4;
u32 st0[4], st1[4];
+   unsigned long flags;
int round;
 
st0[0] = ctx->key_enc[0] ^ get_unaligned_le32(in);
@@ -276,6 +277,12 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
st0[2] = ctx->key_enc[2] ^ get_unaligned_le32(in + 8);
st0[3] = ctx->key_enc[3] ^ get_unaligned_le32(in + 12);
 
+   /*
+* Temporarily disable interrupts to avoid races where cachelines are
+* evicted when the CPU is interrupted to do something else.
+*/
+   local_irq_save(flags);
+
st0[0] ^= __aesti_sbox[ 0] ^ __aesti_sbox[128];
st0[1] ^= __aesti_sbox[32] ^ __aesti_sbox[160];
st0[2] ^= __aesti_sbox[64] ^ __aesti_sbox[192];
@@ -300,6 +307,8 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
put_unaligned_le32(subshift(st1, 1) ^ rkp[5], out + 4);
put_unaligned_le32(subshift(st1, 2) ^ rkp[6], out + 8);
put_unaligned_le32(subshift(st1, 3) ^ rkp[7], out + 12);
+
+   local_irq_restore(flags);
 }
 
 static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
@@ -308,6 +317,7 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
const u32 *rkp = ctx->key_dec + 4;
int rounds = 6 + ctx->key_length / 4;
u32 st0[4], st1[4];
+   unsigned long flags;
int round;
 
st0[0] = ctx->key_dec[0] ^ get_unaligned_le32(in);
@@ -315,6 +325,12 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
st0[2] = ctx->key_dec[2] ^ get_unaligned_le32(in + 8);
st0[3] = ctx->key_dec[3] ^ get_unaligned_le32(in + 12);
 
+   /*
+* Temporarily disable interrupts to avoid races where cachelines are
+* evicted when the CPU is interrupted to do something else.
+*/
+   local_irq_save(flags);
+
st0[0] ^= __aesti_inv_sbox[ 0] ^ __aesti_inv_sbox[128];
st0[1] ^= __aesti_inv_sbox[32] ^ __aesti_inv_sbox[160];
st0[2] ^= __aesti_inv_sbox[64] ^ __aesti_inv_sbox[192];
@@ -339,6 +355,8 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
put_unaligned_le32(inv_subshift(st1, 1) ^ rkp[5], out + 4);
put_unaligned_le32(inv_subshift(st1, 2) ^ rkp[6], out + 8);
put_unaligned_le32(inv_subshift(st1, 3) ^ rkp[7], out + 12);
+
+   local_irq_restore(flags);
 }
 
 static struct crypto_alg aes_alg = {
-- 
2.19.1



[PATCH v3 0/2] crypto: some hardening against AES cache-timing attacks

2018-10-17 Thread Eric Biggers
This series makes the "aes-fixed-time" and "aes-arm" implementations of
AES more resistant to cache-timing attacks.

Note that even after these changes, the implementations still aren't
necessarily guaranteed to be constant-time; see
https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion
of the many difficulties involved in writing truly constant-time AES
software.  But it's valuable to make such attacks more difficult.

Changed since v2:
- In aes-arm, move the IRQ disable/enable into the assembly file.
- Other aes-arm tweaks.
- Add Kconfig help text.

Thanks to Ard Biesheuvel for the suggestions.

Eric Biggers (2):
  crypto: aes_ti - disable interrupts while accessing S-box
  crypto: arm/aes - add some hardening against cache-timing attacks

 arch/arm/crypto/Kconfig   |  9 +
 arch/arm/crypto/aes-cipher-core.S | 62 ++-
 crypto/Kconfig|  3 +-
 crypto/aes_generic.c  |  9 +++--
 crypto/aes_ti.c   | 18 +
 5 files changed, 86 insertions(+), 15 deletions(-)

-- 
2.19.1



[PATCH v2 2/2] crypto: arm/aes - add some hardening against cache-timing attacks

2018-10-17 Thread Eric Biggers
From: Eric Biggers 

Make the ARM scalar AES implementation closer to constant-time by
disabling interrupts and prefetching the tables into L1 cache.  This is
feasible because due to ARM's "free" rotations, the main tables are only
1024 bytes instead of the usual 4096 used by most AES implementations.

On ARM Cortex-A7, the speed loss is only about 5%.  The resulting
implementation is still over twice as fast as aes_ti.c.

Note that even after these changes, the implementation still isn't
necessarily guaranteed to be constant-time; see
https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion
of the many difficulties involved in writing truly constant-time AES
software.  But it's valuable to make such attacks more difficult.

Suggested-by: Ard Biesheuvel 
Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/aes-cipher-core.S | 26 ++
 arch/arm/crypto/aes-cipher-glue.c | 13 +
 crypto/aes_generic.c  |  9 +
 3 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/arm/crypto/aes-cipher-core.S 
b/arch/arm/crypto/aes-cipher-core.S
index 184d6c2d15d5..ba9d4aefe585 100644
--- a/arch/arm/crypto/aes-cipher-core.S
+++ b/arch/arm/crypto/aes-cipher-core.S
@@ -138,6 +138,23 @@
eor r7, r7, r11
 
__adrl  ttab, \ttab
+   /*
+* Prefetch the 1024-byte 'ft' or 'it' table into L1 cache, assuming
+* cacheline size >= 32.  This, along with the caller disabling
+* interrupts, is a hardening measure intended to make cache-timing
+* attacks more difficult.  They may not be fully prevented, however;
+* see the paper https://cr.yp.to/antiforgery/cachetiming-20050414.pdf
+* ("Cache-timing attacks on AES") for a discussion of the many
+* difficulties involved in writing truly constant-time AES software.
+*/
+   .set i, 0
+.rept 1024 / 128
+   ldr r8, [ttab, #i + 0]
+   ldr r9, [ttab, #i + 32]
+   ldr r10, [ttab, #i + 64]
+   ldr r11, [ttab, #i + 96]
+   .set i, i + 128
+.endr
 
tst rounds, #2
bne 1f
@@ -152,6 +169,15 @@
b   0b
 
 2: __adrl  ttab, \ltab
+.if \bsz == 0
+   /* Prefetch the 256-byte inverse S-box; see explanation above */
+   .set i, 0
+.rept 256 / 64
+   ldr t0, [ttab, #i + 0]
+   ldr t1, [ttab, #i + 32]
+   .set i, i + 64
+.endr
+.endif
\round  r4, r5, r6, r7, r8, r9, r10, r11, \bsz, b
 
 #ifdef CONFIG_CPU_BIG_ENDIAN
diff --git a/arch/arm/crypto/aes-cipher-glue.c 
b/arch/arm/crypto/aes-cipher-glue.c
index c222f6e072ad..f40e35eb22e4 100644
--- a/arch/arm/crypto/aes-cipher-glue.c
+++ b/arch/arm/crypto/aes-cipher-glue.c
@@ -23,16 +23,29 @@ static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
 {
struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
int rounds = 6 + ctx->key_length / 4;
+   unsigned long flags;
 
+   /*
+* This AES implementation prefetches the lookup table into L1 cache to
+* try to make timing attacks on the table lookups more difficult.
+* Temporarily disable interrupts to avoid races where cachelines are
+* evicted when the CPU is interrupted to do something else.
+*/
+   local_irq_save(flags);
__aes_arm_encrypt(ctx->key_enc, rounds, in, out);
+   local_irq_restore(flags);
 }
 
 static void aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 {
struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
int rounds = 6 + ctx->key_length / 4;
+   unsigned long flags;
 
+   /* Disable interrupts to help mitigate timing attacks, see above */
+   local_irq_save(flags);
__aes_arm_decrypt(ctx->key_dec, rounds, in, out);
+   local_irq_restore(flags);
 }
 
 static struct crypto_alg aes_alg = {
diff --git a/crypto/aes_generic.c b/crypto/aes_generic.c
index ca554d57d01e..13df33aca463 100644
--- a/crypto/aes_generic.c
+++ b/crypto/aes_generic.c
@@ -63,7 +63,8 @@ static inline u8 byte(const u32 x, const unsigned n)
 
 static const u32 rco_tab[10] = { 1, 2, 4, 8, 16, 32, 64, 128, 27, 54 };
 
-__visible const u32 crypto_ft_tab[4][256] = {
+/* cacheline-aligned to facilitate prefetching into cache */
+__visible const u32 crypto_ft_tab[4][256] __cacheline_aligned = {
{
	0xa56363c6, 0x847c7cf8, 0x997777ee, 0x8d7b7bf6,
0x0df2f2ff, 0xbd6b6bd6, 0xb16f6fde, 0x54c5c591,
@@ -327,7 +328,7 @@ __visible const u32 crypto_ft_tab[4][256] = {
}
 };
 
-__visible const u32 crypto_fl_tab[4][256] = {
+__visible const u32 crypto_fl_tab[4][256] __cacheline_aligned = {
{
	0x00000063, 0x0000007c, 0x00000077, 0x0000007b,
	0x000000f2, 0x0000006b, 0x0000006f, 0x000000c5,
@@ -591,7 +592,7 @@ __vis

[PATCH v2 0/2] crypto: some hardening against AES cache-timing attacks

2018-10-17 Thread Eric Biggers
This series makes the "aes-fixed-time" and "aes-arm" implementations of
AES more resistant to cache-timing attacks.

Note that even after these changes, the implementations still aren't
necessarily guaranteed to be constant-time; see
https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion
of the many difficulties involved in writing truly constant-time AES
software.  But it's valuable to make such attacks more difficult.

Eric Biggers (2):
  crypto: aes_ti - disable interrupts while accessing S-box
  crypto: arm/aes - add some hardening against cache-timing attacks

 arch/arm/crypto/aes-cipher-core.S | 26 ++
 arch/arm/crypto/aes-cipher-glue.c | 13 +
 crypto/aes_generic.c  |  9 +
 crypto/aes_ti.c   | 18 ++
 4 files changed, 62 insertions(+), 4 deletions(-)

-- 
2.19.1



[PATCH v2 1/2] crypto: aes_ti - disable interrupts while accessing S-box

2018-10-17 Thread Eric Biggers
From: Eric Biggers 

In the "aes-fixed-time" AES implementation, disable interrupts while
accessing the S-box, in order to make cache-timing attacks more
difficult.  Previously it was possible for the CPU to be interrupted
while the S-box was loaded into L1 cache, potentially evicting the
cachelines and causing later table lookups to be time-variant.

In tests I did on x86 and ARM, this doesn't affect performance
significantly.  Responsiveness is potentially a concern, but interrupts
are only disabled for a single AES block.

Note that even after this change, the implementation still isn't
necessarily guaranteed to be constant-time; see
https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion
of the many difficulties involved in writing truly constant-time AES
software.  But it's valuable to make such attacks more difficult.

Signed-off-by: Eric Biggers 
---
 crypto/aes_ti.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/crypto/aes_ti.c b/crypto/aes_ti.c
index 03023b2290e8..1ff9785b30f5 100644
--- a/crypto/aes_ti.c
+++ b/crypto/aes_ti.c
@@ -269,6 +269,7 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
const u32 *rkp = ctx->key_enc + 4;
int rounds = 6 + ctx->key_length / 4;
u32 st0[4], st1[4];
+   unsigned long flags;
int round;
 
st0[0] = ctx->key_enc[0] ^ get_unaligned_le32(in);
@@ -276,6 +277,12 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
st0[2] = ctx->key_enc[2] ^ get_unaligned_le32(in + 8);
st0[3] = ctx->key_enc[3] ^ get_unaligned_le32(in + 12);
 
+   /*
+* Temporarily disable interrupts to avoid races where cachelines are
+* evicted when the CPU is interrupted to do something else.
+*/
+   local_irq_save(flags);
+
st0[0] ^= __aesti_sbox[ 0] ^ __aesti_sbox[128];
st0[1] ^= __aesti_sbox[32] ^ __aesti_sbox[160];
st0[2] ^= __aesti_sbox[64] ^ __aesti_sbox[192];
@@ -300,6 +307,8 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
put_unaligned_le32(subshift(st1, 1) ^ rkp[5], out + 4);
put_unaligned_le32(subshift(st1, 2) ^ rkp[6], out + 8);
put_unaligned_le32(subshift(st1, 3) ^ rkp[7], out + 12);
+
+   local_irq_restore(flags);
 }
 
 static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
@@ -308,6 +317,7 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
const u32 *rkp = ctx->key_dec + 4;
int rounds = 6 + ctx->key_length / 4;
u32 st0[4], st1[4];
+   unsigned long flags;
int round;
 
st0[0] = ctx->key_dec[0] ^ get_unaligned_le32(in);
@@ -315,6 +325,12 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
st0[2] = ctx->key_dec[2] ^ get_unaligned_le32(in + 8);
st0[3] = ctx->key_dec[3] ^ get_unaligned_le32(in + 12);
 
+   /*
+* Temporarily disable interrupts to avoid races where cachelines are
+* evicted when the CPU is interrupted to do something else.
+*/
+   local_irq_save(flags);
+
st0[0] ^= __aesti_inv_sbox[ 0] ^ __aesti_inv_sbox[128];
st0[1] ^= __aesti_inv_sbox[32] ^ __aesti_inv_sbox[160];
st0[2] ^= __aesti_inv_sbox[64] ^ __aesti_inv_sbox[192];
@@ -339,6 +355,8 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
put_unaligned_le32(inv_subshift(st1, 1) ^ rkp[5], out + 4);
put_unaligned_le32(inv_subshift(st1, 2) ^ rkp[6], out + 8);
put_unaligned_le32(inv_subshift(st1, 3) ^ rkp[7], out + 12);
+
+   local_irq_restore(flags);
 }
 
 static struct crypto_alg aes_alg = {
-- 
2.19.1



Re: [PATCH] crypto: aes_ti - disable interrupts while accessing sbox

2018-10-16 Thread Eric Biggers
Hi Ard,

On Thu, Oct 04, 2018 at 08:55:14AM +0200, Ard Biesheuvel wrote:
> Hi Eric,
> 
> On 4 October 2018 at 06:07, Eric Biggers  wrote:
> > From: Eric Biggers 
> >
> > The generic constant-time AES implementation is supposed to preload the
> > AES S-box into the CPU's L1 data cache.  But, an interrupt handler can
> > run on the CPU and muck with the cache.  Worse, on preemptible kernels
> > the process can even be preempted and moved to a different CPU.  So the
> > implementation may actually still be vulnerable to cache-timing attacks.
> >
> > Make it more robust by disabling interrupts while the sbox is used.
> >
> > In some quick tests on x86 and ARM, this doesn't affect performance
> > significantly.  Responsiveness is also a concern, but interrupts are
> > only disabled for a single AES block which even on ARM Cortex-A7 is
> > "only" ~1500 cycles to encrypt or ~2600 cycles to decrypt.
> >
> 
> I share your concern, but that is quite a big hammer.
> 
> Also, does it really take ~100 cycles per byte? That is terrible :-)
> 
> Given that the full lookup table is only 1024 bytes (or 1024+256 bytes
> for decryption), I wonder if something like the below would be a
> better option for A7 (apologies for the mangled whitespace)
> 
> diff --git a/arch/arm/crypto/aes-cipher-core.S
> b/arch/arm/crypto/aes-cipher-core.S
> index 184d6c2d15d5..83e893f7e581 100644
> --- a/arch/arm/crypto/aes-cipher-core.S
> +++ b/arch/arm/crypto/aes-cipher-core.S
> @@ -139,6 +139,13 @@
> 
>   __adrl ttab, \ttab
> 
> + .irpc r, 01234567
> + ldr r8, [ttab, #(32 * \r)]
> + ldr r9, [ttab, #(32 * \r) + 256]
> + ldr r10, [ttab, #(32 * \r) + 512]
> + ldr r11, [ttab, #(32 * \r) + 768]
> + .endr
> +
>   tst rounds, #2
>   bne 1f
> 
> @@ -180,6 +187,12 @@ ENDPROC(__aes_arm_encrypt)
> 
>   .align 5
>  ENTRY(__aes_arm_decrypt)
> + __adrl ttab, __aes_arm_inverse_sbox
> +
> + .irpc r, 01234567
> + ldr r8, [ttab, #(32 * \r)]
> + .endr
> +
>   do_crypt iround, crypto_it_tab, __aes_arm_inverse_sbox, 0
>  ENDPROC(__aes_arm_decrypt)
> 
> diff --git a/arch/arm/crypto/aes-cipher-glue.c
> b/arch/arm/crypto/aes-cipher-glue.c
> index c222f6e072ad..630e1a436f1d 100644
> --- a/arch/arm/crypto/aes-cipher-glue.c
> +++ b/arch/arm/crypto/aes-cipher-glue.c
> @@ -23,16 +23,22 @@ static void aes_encrypt(struct crypto_tfm *tfm, u8
> *out, const u8 *in)
>  {
>   struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
>   int rounds = 6 + ctx->key_length / 4;
> + unsigned long flags;
> 
> + local_irq_save(flags);
>   __aes_arm_encrypt(ctx->key_enc, rounds, in, out);
> + local_irq_restore(flags);
>  }
> 
>  static void aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
>  {
>   struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
>   int rounds = 6 + ctx->key_length / 4;
> + unsigned long flags;
> 
> + local_irq_save(flags);
>   __aes_arm_decrypt(ctx->key_dec, rounds, in, out);
> + local_irq_restore(flags);
>  }
> 
>  static struct crypto_alg aes_alg = {
> diff --git a/crypto/aes_generic.c b/crypto/aes_generic.c
> index ca554d57d01e..82fa860c9cb9 100644
> --- a/crypto/aes_generic.c
> +++ b/crypto/aes_generic.c
> @@ -63,7 +63,7 @@ static inline u8 byte(const u32 x, const unsigned n)
> 
>  static const u32 rco_tab[10] = { 1, 2, 4, 8, 16, 32, 64, 128, 27, 54 };
> 
> -__visible const u32 crypto_ft_tab[4][256] = {
> +__visible const u32 crypto_ft_tab[4][256] __cacheline_aligned = {
>   {
>   0xa56363c6, 0x847c7cf8, 0x997777ee, 0x8d7b7bf6,
>   0x0df2f2ff, 0xbd6b6bd6, 0xb16f6fde, 0x54c5c591,
> @@ -327,7 +327,7 @@ __visible const u32 crypto_ft_tab[4][256] = {
>   }
>  };
> 
> -__visible const u32 crypto_fl_tab[4][256] = {
> +__visible const u32 crypto_fl_tab[4][256] __cacheline_aligned = {
>   {
>   0x00000063, 0x0000007c, 0x00000077, 0x0000007b,
>   0x000000f2, 0x0000006b, 0x0000006f, 0x000000c5,
> @@ -591,7 +591,7 @@ __visible const u32 crypto_fl_tab[4][256] = {
>   }
>  };
> 
> -__visible const u32 crypto_it_tab[4][256] = {
> +__visible const u32 crypto_it_tab[4][256] __cacheline_aligned = {
>   {
>   0x50a7f451, 0x5365417e, 0xc3a4171a, 0x965e273a,
>   0xcb6bab3b, 0xf1459d1f, 0xab58faac, 0x9303e34b,
> @@ -855,7 +855,7 @@ __visible const u32 crypto_it_tab[4][256] = {
>   }
>  };
> 
> -__visible const u32 crypto_il_tab[4][256] = {
> +__visible const u32 crypto_il_tab[4][256] __cacheline_aligned = {
>   {
>   0x00000052, 0x00000009, 0x0000006a, 0x000000d5,
>   0x00000030, 0x00000036, 0x000000a5, 0x00000038,
> 

Thanks for the suggestion -- this turns out to work pretty well.  At least in a
microbenchmark, loading the larger 

Re: [PATCH 2/3] crypto: crypto_xor - use unaligned accessors for aligned fast path

2018-10-08 Thread Eric Biggers
Hi Ard,

On Mon, Oct 08, 2018 at 11:15:53PM +0200, Ard Biesheuvel wrote:
> On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> because the ordinary load/store instructions (ldr, ldrh, ldrb) can
> tolerate any misalignment of the memory address. However, load/store
> double and load/store multiple instructions (ldrd, ldm) may still only
> be used on memory addresses that are 32-bit aligned, and so we have to
> use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we
> may end up with a severe performance hit due to alignment traps that
> require fixups by the kernel.
> 
> Fortunately, the get_unaligned() accessors do the right thing: when
> building for ARMv6 or later, the compiler will emit unaligned accesses
> using the ordinary load/store instructions (but avoid the ones that
> require 32-bit alignment). When building for older ARM, those accessors
> will emit the appropriate sequence of ldrb/mov/orr instructions. And on
> architectures that can truly tolerate any kind of misalignment, the
> get_unaligned() accessors resolve to the leXX_to_cpup accessors that
> operate on aligned addresses.
> 
> So switch to the unaligned accessors for the aligned fast path. This
> will create the exact same code on architectures that can really
> tolerate any kind of misalignment, and generate code for ARMv6+ that
> avoids load/store instructions that trigger alignment faults.
> 
> Signed-off-by: Ard Biesheuvel 
> ---
>  crypto/algapi.c |  7 +++
>  include/crypto/algapi.h | 11 +--
>  2 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/crypto/algapi.c b/crypto/algapi.c
> index 2545c5f89c4c..52ce3c5a0499 100644
> --- a/crypto/algapi.c
> +++ b/crypto/algapi.c
> @@ -988,11 +988,10 @@ void crypto_inc(u8 *a, unsigned int size)
>   __be32 *b = (__be32 *)(a + size);
>   u32 c;
>  
> - if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
> - IS_ALIGNED((unsigned long)b, __alignof__(*b)))
> + if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
>   for (; size >= 4; size -= 4) {
> - c = be32_to_cpu(*--b) + 1;
> - *b = cpu_to_be32(c);
> + c = get_unaligned_be32(--b) + 1;
> + put_unaligned_be32(c, b);
>   if (likely(c))
>   return;
>   }
> diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
> index 4a5ad10e75f0..86267c232f34 100644
> --- a/include/crypto/algapi.h
> +++ b/include/crypto/algapi.h
> @@ -17,6 +17,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  /*
>   * Maximum values for blocksize and alignmask, used to allocate
>   * static buffers that are big enough for any combination of
> @@ -212,7 +214,9 @@ static inline void crypto_xor(u8 *dst, const u8 *src, 
> unsigned int size)
>   unsigned long *s = (unsigned long *)src;
>  
>   while (size > 0) {
> - *d++ ^= *s++;
> + put_unaligned(get_unaligned(d) ^ get_unaligned(s), d);
> + d++;
> + s++;
>   size -= sizeof(unsigned long);
>   }
>   } else {
> @@ -231,7 +235,10 @@ static inline void crypto_xor_cpy(u8 *dst, const u8 
> *src1, const u8 *src2,
>   unsigned long *s2 = (unsigned long *)src2;
>  
>   while (size > 0) {
> - *d++ = *s1++ ^ *s2++;
> + put_unaligned(get_unaligned(s1) ^ get_unaligned(s2), d);
> + d++;
> + s1++;
> + s2++;
>   size -= sizeof(unsigned long);
>   }
>   } else {
> -- 
> 2.11.0
> 

Doesn't __crypto_xor() have the same problem too?

- Eric


Re: [PATCH 1/3] crypto: memneq - use unaligned accessors for aligned fast path

2018-10-08 Thread Eric Biggers
Hi Ard,

On Mon, Oct 08, 2018 at 11:15:52PM +0200, Ard Biesheuvel wrote:
> On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> because the ordinary load/store instructions (ldr, ldrh, ldrb) can
> tolerate any misalignment of the memory address. However, load/store
> double and load/store multiple instructions (ldrd, ldm) may still only
> be used on memory addresses that are 32-bit aligned, and so we have to
> use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we
> may end up with a severe performance hit due to alignment traps that
> require fixups by the kernel.
> 
> Fortunately, the get_unaligned() accessors do the right thing: when
> building for ARMv6 or later, the compiler will emit unaligned accesses
> using the ordinary load/store instructions (but avoid the ones that
> require 32-bit alignment). When building for older ARM, those accessors
> will emit the appropriate sequence of ldrb/mov/orr instructions. And on
> architectures that can truly tolerate any kind of misalignment, the
> get_unaligned() accessors resolve to the leXX_to_cpup accessors that
> operate on aligned addresses.
> 
> So switch to the unaligned accessors for the aligned fast path. This
> will create the exact same code on architectures that can really
> tolerate any kind of misalignment, and generate code for ARMv6+ that
> avoids load/store instructions that trigger alignment faults.
> 
> Signed-off-by: Ard Biesheuvel 
> ---
>  crypto/memneq.c | 24 ++--
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/crypto/memneq.c b/crypto/memneq.c
> index afed1bd16aee..0f46a6150f22 100644
> --- a/crypto/memneq.c
> +++ b/crypto/memneq.c
> @@ -60,6 +60,7 @@
>   */
>  
>  #include 
> +#include 
>  
>  #ifndef __HAVE_ARCH_CRYPTO_MEMNEQ
>  
> @@ -71,7 +72,10 @@ __crypto_memneq_generic(const void *a, const void *b, 
> size_t size)
>  
>  #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
>   while (size >= sizeof(unsigned long)) {
> - neq |= *(unsigned long *)a ^ *(unsigned long *)b;
> + unsigned long const *p = a;
> + unsigned long const *q = b;
> +
> + neq |= get_unaligned(p) ^ get_unaligned(q);
>   OPTIMIZER_HIDE_VAR(neq);
>   a += sizeof(unsigned long);
>   b += sizeof(unsigned long);
> @@ -95,18 +99,24 @@ static inline unsigned long __crypto_memneq_16(const void 
> *a, const void *b)
>  
>  #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>   if (sizeof(unsigned long) == 8) {
> - neq |= *(unsigned long *)(a)   ^ *(unsigned long *)(b);
> + unsigned long const *p = a;
> + unsigned long const *q = b;
> +
> + neq |= get_unaligned(p++) ^ get_unaligned(q++);
>   OPTIMIZER_HIDE_VAR(neq);
> - neq |= *(unsigned long *)(a+8) ^ *(unsigned long *)(b+8);
> + neq |= get_unaligned(p) ^ get_unaligned(q);
>   OPTIMIZER_HIDE_VAR(neq);
>   } else if (sizeof(unsigned int) == 4) {
> - neq |= *(unsigned int *)(a)^ *(unsigned int *)(b);
> + unsigned int const *p = a;
> + unsigned int const *q = b;
> +
> + neq |= get_unaligned(p++) ^ get_unaligned(q++);
>   OPTIMIZER_HIDE_VAR(neq);
> - neq |= *(unsigned int *)(a+4)  ^ *(unsigned int *)(b+4);
> + neq |= get_unaligned(p++) ^ get_unaligned(q++);
>   OPTIMIZER_HIDE_VAR(neq);
> - neq |= *(unsigned int *)(a+8)  ^ *(unsigned int *)(b+8);
> + neq |= get_unaligned(p++) ^ get_unaligned(q++);
>   OPTIMIZER_HIDE_VAR(neq);
> - neq |= *(unsigned int *)(a+12) ^ *(unsigned int *)(b+12);
> + neq |= get_unaligned(p) ^ get_unaligned(q);
>   OPTIMIZER_HIDE_VAR(neq);
>   } else
>  #endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */

This looks good, but maybe now we should get rid of the
!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS path too?
At least for the 16-byte case:

static inline unsigned long __crypto_memneq_16(const void *a, const void *b)
{
        const unsigned long *p = a, *q = b;
        unsigned long neq = 0;

        BUILD_BUG_ON(sizeof(*p) != 4 && sizeof(*p) != 8);
        neq |= get_unaligned(p++) ^ get_unaligned(q++);
        OPTIMIZER_HIDE_VAR(neq);
        neq |= get_unaligned(p++) ^ get_unaligned(q++);
        OPTIMIZER_HIDE_VAR(neq);
        if (sizeof(*p) == 4) {
                neq |= get_unaligned(p++) ^ get_unaligned(q++);
                OPTIMIZER_HIDE_VAR(neq);
                neq |= get_unaligned(p++) ^ get_unaligned(q++);
                OPTIMIZER_HIDE_VAR(neq);
        }
        return neq;
}


Re: [PATCH] crypto: x86/aes-ni - fix build error following fpu template removal

2018-10-05 Thread Eric Biggers
On Fri, Oct 05, 2018 at 07:16:13PM +0200, Ard Biesheuvel wrote:
> On 5 October 2018 at 19:13, Eric Biggers  wrote:
> > From: Eric Biggers 
> >
> > aesni-intel_glue.c still calls crypto_fpu_init() and crypto_fpu_exit()
> > to register/unregister the "fpu" template.  But these functions don't
> > exist anymore, causing a build error.  Remove the calls to them.
> >
> > Fixes: 944585a64f5e ("crypto: x86/aes-ni - remove special handling of AES 
> > in PCBC mode")
> > Signed-off-by: Eric Biggers 
> 
> Thanks for spotting that.
> 
> I had actually noticed myself, but wasn't really expecting this RFC
> patch to be picked up without discussion.
> 

The patch seems reasonable to me -- we shouldn't maintain a special FPU template
just for AES-PCBC when possibly no one is even using that algorithm.

- Eric


[PATCH] crypto: x86/aes-ni - fix build error following fpu template removal

2018-10-05 Thread Eric Biggers
From: Eric Biggers 

aesni-intel_glue.c still calls crypto_fpu_init() and crypto_fpu_exit()
to register/unregister the "fpu" template.  But these functions don't
exist anymore, causing a build error.  Remove the calls to them.

Fixes: 944585a64f5e ("crypto: x86/aes-ni - remove special handling of AES in 
PCBC mode")
Signed-off-by: Eric Biggers 
---
 arch/x86/crypto/aesni-intel_glue.c | 13 +
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_glue.c 
b/arch/x86/crypto/aesni-intel_glue.c
index 89bae64eef4f9..661f7daf43da9 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -102,9 +102,6 @@ asmlinkage void aesni_cbc_enc(struct crypto_aes_ctx *ctx, 
u8 *out,
 asmlinkage void aesni_cbc_dec(struct crypto_aes_ctx *ctx, u8 *out,
  const u8 *in, unsigned int len, u8 *iv);
 
-int crypto_fpu_init(void);
-void crypto_fpu_exit(void);
-
 #define AVX_GEN2_OPTSIZE 640
 #define AVX_GEN4_OPTSIZE 4096
 
@@ -1449,13 +1446,9 @@ static int __init aesni_init(void)
 #endif
 #endif
 
-   err = crypto_fpu_init();
-   if (err)
-   return err;
-
err = crypto_register_algs(aesni_algs, ARRAY_SIZE(aesni_algs));
if (err)
-   goto fpu_exit;
+   return err;
 
err = crypto_register_skciphers(aesni_skciphers,
ARRAY_SIZE(aesni_skciphers));
@@ -1489,8 +1482,6 @@ static int __init aesni_init(void)
ARRAY_SIZE(aesni_skciphers));
 unregister_algs:
crypto_unregister_algs(aesni_algs, ARRAY_SIZE(aesni_algs));
-fpu_exit:
-   crypto_fpu_exit();
return err;
 }
 
@@ -1501,8 +1492,6 @@ static void __exit aesni_exit(void)
crypto_unregister_skciphers(aesni_skciphers,
ARRAY_SIZE(aesni_skciphers));
crypto_unregister_algs(aesni_algs, ARRAY_SIZE(aesni_algs));
-
-   crypto_fpu_exit();
 }
 
 late_initcall(aesni_init);
-- 
2.19.0.605.g01d371f741-goog



[PATCH] crypto: aes_ti - disable interrupts while accessing sbox

2018-10-03 Thread Eric Biggers
From: Eric Biggers 

The generic constant-time AES implementation is supposed to preload the
AES S-box into the CPU's L1 data cache.  But, an interrupt handler can
run on the CPU and muck with the cache.  Worse, on preemptible kernels
the process can even be preempted and moved to a different CPU.  So the
implementation may actually still be vulnerable to cache-timing attacks.

Make it more robust by disabling interrupts while the sbox is used.

In some quick tests on x86 and ARM, this doesn't affect performance
significantly.  Responsiveness is also a concern, but interrupts are
only disabled for a single AES block which even on ARM Cortex-A7 is
"only" ~1500 cycles to encrypt or ~2600 cycles to decrypt.

Fixes: b5e0b032b6c3 ("crypto: aes - add generic time invariant AES cipher")
Signed-off-by: Eric Biggers 
---
 crypto/aes_ti.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/crypto/aes_ti.c b/crypto/aes_ti.c
index 03023b2290e8e..81b604419ee0e 100644
--- a/crypto/aes_ti.c
+++ b/crypto/aes_ti.c
@@ -269,6 +269,7 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
const u32 *rkp = ctx->key_enc + 4;
int rounds = 6 + ctx->key_length / 4;
u32 st0[4], st1[4];
+   unsigned long flags;
int round;
 
st0[0] = ctx->key_enc[0] ^ get_unaligned_le32(in);
@@ -276,6 +277,12 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
st0[2] = ctx->key_enc[2] ^ get_unaligned_le32(in + 8);
st0[3] = ctx->key_enc[3] ^ get_unaligned_le32(in + 12);
 
+   /*
+* Disable interrupts (including preemption) while the sbox is loaded
+* into L1 cache and used for encryption on this CPU.
+*/
+   local_irq_save(flags);
+
st0[0] ^= __aesti_sbox[ 0] ^ __aesti_sbox[128];
st0[1] ^= __aesti_sbox[32] ^ __aesti_sbox[160];
st0[2] ^= __aesti_sbox[64] ^ __aesti_sbox[192];
@@ -300,6 +307,8 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
put_unaligned_le32(subshift(st1, 1) ^ rkp[5], out + 4);
put_unaligned_le32(subshift(st1, 2) ^ rkp[6], out + 8);
put_unaligned_le32(subshift(st1, 3) ^ rkp[7], out + 12);
+
+   local_irq_restore(flags);
 }
 
 static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
@@ -308,6 +317,7 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
const u32 *rkp = ctx->key_dec + 4;
int rounds = 6 + ctx->key_length / 4;
u32 st0[4], st1[4];
+   unsigned long flags;
int round;
 
st0[0] = ctx->key_dec[0] ^ get_unaligned_le32(in);
@@ -315,6 +325,12 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
st0[2] = ctx->key_dec[2] ^ get_unaligned_le32(in + 8);
st0[3] = ctx->key_dec[3] ^ get_unaligned_le32(in + 12);
 
+   /*
+* Disable interrupts (including preemption) while the sbox is loaded
+* into L1 cache and used for decryption on this CPU.
+*/
+   local_irq_save(flags);
+
st0[0] ^= __aesti_inv_sbox[ 0] ^ __aesti_inv_sbox[128];
st0[1] ^= __aesti_inv_sbox[32] ^ __aesti_inv_sbox[160];
st0[2] ^= __aesti_inv_sbox[64] ^ __aesti_inv_sbox[192];
@@ -339,6 +355,8 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, 
const u8 *in)
put_unaligned_le32(inv_subshift(st1, 1) ^ rkp[5], out + 4);
put_unaligned_le32(inv_subshift(st1, 2) ^ rkp[6], out + 8);
put_unaligned_le32(inv_subshift(st1, 3) ^ rkp[7], out + 12);
+
+   local_irq_restore(flags);
 }
 
 static struct crypto_alg aes_alg = {
-- 
2.19.0



[PATCH] crypto: arm64/aes - fix handling sub-block CTS-CBC inputs

2018-10-02 Thread Eric Biggers
From: Eric Biggers 

In the new arm64 CTS-CBC implementation, return an error code rather
than crashing on inputs shorter than AES_BLOCK_SIZE bytes.  Also set
cra_blocksize to AES_BLOCK_SIZE (as is done in the cts template) to
indicate the minimum input size.

Fixes: dd597fb33ff0 ("crypto: arm64/aes-blk - add support for CTS-CBC mode")
Signed-off-by: Eric Biggers 
---
 arch/arm64/crypto/aes-glue.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 26d2b0263ba63..1e676625ef33f 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -243,8 +243,11 @@ static int cts_cbc_encrypt(struct skcipher_request *req)
 
skcipher_request_set_tfm(&rctx->subreq, tfm);
 
-   if (req->cryptlen == AES_BLOCK_SIZE)
+   if (req->cryptlen <= AES_BLOCK_SIZE) {
+   if (req->cryptlen < AES_BLOCK_SIZE)
+   return -EINVAL;
cbc_blocks = 1;
+   }
 
if (cbc_blocks > 0) {
unsigned int blocks;
@@ -305,8 +308,11 @@ static int cts_cbc_decrypt(struct skcipher_request *req)
 
skcipher_request_set_tfm(&rctx->subreq, tfm);
 
-   if (req->cryptlen == AES_BLOCK_SIZE)
+   if (req->cryptlen <= AES_BLOCK_SIZE) {
+   if (req->cryptlen < AES_BLOCK_SIZE)
+   return -EINVAL;
cbc_blocks = 1;
+   }
 
if (cbc_blocks > 0) {
unsigned int blocks;
@@ -486,14 +492,13 @@ static struct skcipher_alg aes_algs[] = { {
.cra_driver_name= "__cts-cbc-aes-" MODE,
.cra_priority   = PRIO,
.cra_flags  = CRYPTO_ALG_INTERNAL,
-   .cra_blocksize  = 1,
+   .cra_blocksize  = AES_BLOCK_SIZE,
.cra_ctxsize= sizeof(struct crypto_aes_ctx),
.cra_module = THIS_MODULE,
},
.min_keysize= AES_MIN_KEY_SIZE,
.max_keysize= AES_MAX_KEY_SIZE,
.ivsize = AES_BLOCK_SIZE,
-   .chunksize  = AES_BLOCK_SIZE,
.walksize   = 2 * AES_BLOCK_SIZE,
.setkey = skcipher_aes_setkey,
.encrypt= cts_cbc_encrypt,
-- 
2.19.0



Re: [PATCH] crypto: chacha20 - Fix chacha20_block() keystream alignment (again)

2018-09-14 Thread Eric Biggers
Hi Yann,

On Wed, Sep 12, 2018 at 11:50:00AM +0200, Yann Droneaud wrote:
> Hi,
> 
> On Tuesday, September 11, 2018 at 20:05 -0700, Eric Biggers wrote:
> > From: Eric Biggers 
> > 
> > In commit 9f480faec58c ("crypto: chacha20 - Fix keystream alignment for
> > chacha20_block()"), I had missed that chacha20_block() can be called
> > directly on the buffer passed to get_random_bytes(), which can have any
> > alignment.  So, while my commit didn't break anything, it didn't fully
> > solve the alignment problems.
> > 
> > Revert my solution and just update chacha20_block() to use
> > put_unaligned_le32(), so the output buffer need not be aligned.
> > This is simpler, and on many CPUs it's the same speed.
> > 
> > But, I kept the 'tmp' buffers in extract_crng_user() and
> > _get_random_bytes() 4-byte aligned, since that alignment is actually
> > needed for _crng_backtrack_protect() too.
> > 
> > Reported-by: Stephan Müller 
> > Cc: Theodore Ts'o 
> > Signed-off-by: Eric Biggers 
> > ---
> >  crypto/chacha20_generic.c |  7 ---
> >  drivers/char/random.c | 24 
> >  include/crypto/chacha20.h |  3 +--
> >  lib/chacha20.c|  6 +++---
> >  4 files changed, 20 insertions(+), 20 deletions(-)
> > 
> > diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
> > index e451c3cb6a56..3ae96587caf9 100644
> > --- a/crypto/chacha20_generic.c
> > +++ b/crypto/chacha20_generic.c
> > @@ -18,20 +18,21 @@
> >  static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
> >  unsigned int bytes)
> >  {
> > -   u32 stream[CHACHA20_BLOCK_WORDS];
> > +   /* aligned to potentially speed up crypto_xor() */
> > +   u8 stream[CHACHA20_BLOCK_SIZE] __aligned(sizeof(long));
> >  
> > if (dst != src)
> > memcpy(dst, src, bytes);
> >  
> > while (bytes >= CHACHA20_BLOCK_SIZE) {
> > chacha20_block(state, stream);
> > -   crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE);
> > +   crypto_xor(dst, stream, CHACHA20_BLOCK_SIZE);
> > bytes -= CHACHA20_BLOCK_SIZE;
> > dst += CHACHA20_BLOCK_SIZE;
> > }
> > if (bytes) {
> > chacha20_block(state, stream);
> > -   crypto_xor(dst, (const u8 *)stream, bytes);
> > +   crypto_xor(dst, stream, bytes);
> > }
> >  }
> >  
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index bf5f99fc36f1..d22d967c50f0 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -1003,7 +1003,7 @@ static void extract_crng(__u32 
> > out[CHACHA20_BLOCK_WORDS])
> >   * enough) to mutate the CRNG key to provide backtracking protection.
> >   */
> >  static void _crng_backtrack_protect(struct crng_state *crng,
> > -   __u32 tmp[CHACHA20_BLOCK_WORDS], int used)
> > +   __u8 tmp[CHACHA20_BLOCK_SIZE], int used)
> >  {
> > unsigned long   flags;
> > __u32   *s, *d;
> > @@ -1015,14 +1015,14 @@ static void _crng_backtrack_protect(struct 
> > crng_state *crng,
> > used = 0;
> > }
> > spin_lock_irqsave(&crng->lock, flags);
> > -   s = &tmp[used / sizeof(__u32)];
> > +   s = (__u32 *) &tmp[used];
> 
> This introduces a alignment issue: tmp is not aligned for __u32, but is
> dereferenced as such later.
> 
> > d = &crng->state[4];
> > for (i=0; i < 8; i++)
> > *d++ ^= *s++;
> > spin_unlock_irqrestore(&crng->lock, flags);
> >  }
> >  
> 

I explained this in the patch; the callers ensure the buffer is aligned.

- Eric


[PATCH] crypto: chacha20 - Fix chacha20_block() keystream alignment (again)

2018-09-11 Thread Eric Biggers
From: Eric Biggers 

In commit 9f480faec58c ("crypto: chacha20 - Fix keystream alignment for
chacha20_block()"), I had missed that chacha20_block() can be called
directly on the buffer passed to get_random_bytes(), which can have any
alignment.  So, while my commit didn't break anything, it didn't fully
solve the alignment problems.

Revert my solution and just update chacha20_block() to use
put_unaligned_le32(), so the output buffer need not be aligned.
This is simpler, and on many CPUs it's the same speed.

But, I kept the 'tmp' buffers in extract_crng_user() and
_get_random_bytes() 4-byte aligned, since that alignment is actually
needed for _crng_backtrack_protect() too.

Reported-by: Stephan Müller 
Cc: Theodore Ts'o 
Signed-off-by: Eric Biggers 
---
 crypto/chacha20_generic.c |  7 ---
 drivers/char/random.c | 24 
 include/crypto/chacha20.h |  3 +--
 lib/chacha20.c|  6 +++---
 4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
index e451c3cb6a56..3ae96587caf9 100644
--- a/crypto/chacha20_generic.c
+++ b/crypto/chacha20_generic.c
@@ -18,20 +18,21 @@
 static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
 unsigned int bytes)
 {
-   u32 stream[CHACHA20_BLOCK_WORDS];
+   /* aligned to potentially speed up crypto_xor() */
+   u8 stream[CHACHA20_BLOCK_SIZE] __aligned(sizeof(long));
 
if (dst != src)
memcpy(dst, src, bytes);
 
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block(state, stream);
-   crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE);
+   crypto_xor(dst, stream, CHACHA20_BLOCK_SIZE);
bytes -= CHACHA20_BLOCK_SIZE;
dst += CHACHA20_BLOCK_SIZE;
}
if (bytes) {
chacha20_block(state, stream);
-   crypto_xor(dst, (const u8 *)stream, bytes);
+   crypto_xor(dst, stream, bytes);
}
 }
 
diff --git a/drivers/char/random.c b/drivers/char/random.c
index bf5f99fc36f1..d22d967c50f0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -433,9 +433,9 @@ static int crng_init_cnt = 0;
 static unsigned long crng_global_init_time = 0;
 #define CRNG_INIT_CNT_THRESH (2*CHACHA20_KEY_SIZE)
 static void _extract_crng(struct crng_state *crng,
- __u32 out[CHACHA20_BLOCK_WORDS]);
+ __u8 out[CHACHA20_BLOCK_SIZE]);
 static void _crng_backtrack_protect(struct crng_state *crng,
-   __u32 tmp[CHACHA20_BLOCK_WORDS], int used);
+   __u8 tmp[CHACHA20_BLOCK_SIZE], int used);
 static void process_random_ready_list(void);
 static void _get_random_bytes(void *buf, int nbytes);
 
@@ -921,7 +921,7 @@ static void crng_reseed(struct crng_state *crng, struct 
entropy_store *r)
unsigned long   flags;
int i, num;
union {
-   __u32   block[CHACHA20_BLOCK_WORDS];
+   __u8    block[CHACHA20_BLOCK_SIZE];
__u32   key[8];
} buf;
 
@@ -968,7 +968,7 @@ static void crng_reseed(struct crng_state *crng, struct 
entropy_store *r)
 }
 
 static void _extract_crng(struct crng_state *crng,
- __u32 out[CHACHA20_BLOCK_WORDS])
+ __u8 out[CHACHA20_BLOCK_SIZE])
 {
unsigned long v, flags;
 
@@ -985,7 +985,7 @@ static void _extract_crng(struct crng_state *crng,
spin_unlock_irqrestore(&crng->lock, flags);
 }
 
-static void extract_crng(__u32 out[CHACHA20_BLOCK_WORDS])
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
 {
struct crng_state *crng = NULL;
 
@@ -1003,7 +1003,7 @@ static void extract_crng(__u32 out[CHACHA20_BLOCK_WORDS])
  * enough) to mutate the CRNG key to provide backtracking protection.
  */
 static void _crng_backtrack_protect(struct crng_state *crng,
-   __u32 tmp[CHACHA20_BLOCK_WORDS], int used)
+   __u8 tmp[CHACHA20_BLOCK_SIZE], int used)
 {
unsigned long   flags;
__u32   *s, *d;
@@ -1015,14 +1015,14 @@ static void _crng_backtrack_protect(struct crng_state 
*crng,
used = 0;
}
spin_lock_irqsave(&crng->lock, flags);
-   s = &tmp[used / sizeof(__u32)];
+   s = (__u32 *) &tmp[used];
d = &crng->state[4];
for (i=0; i < 8; i++)
*d++ ^= *s++;
spin_unlock_irqrestore(&crng->lock, flags);
 }
 
-static void crng_backtrack_protect(__u32 tmp[CHACHA20_BLOCK_WORDS], int used)
+static void crng_backtrack_protect(__u8 tmp[CHACHA20_BLOCK_SIZE], int used)
 {
struct crng_state *crng = NULL;
 
@@ -1038,7 +1038,7 @@ static void crng_backtrack_protect(__u32 
tmp[CHACHA20_BLOCK_WORDS], int used)
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
ssize_t ret 

Re: random: ensure use of aligned buffers with ChaCha20

2018-09-11 Thread Eric Biggers
To revive this...

On Fri, Aug 10, 2018 at 08:27:58AM +0200, Stephan Mueller wrote:
> Am Donnerstag, 9. August 2018, 21:40:12 CEST schrieb Eric Biggers:
> 
> Hi Eric,
> 
> > while (bytes >= CHACHA20_BLOCK_SIZE) {
> > chacha20_block(state, stream);
> > -   crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE);
> > +   crypto_xor(dst, stream, CHACHA20_BLOCK_SIZE);
> 
> If we are at it, I am wondering whether we should use crypto_xor. At this 
> point we exactly know that the data is CHACHA20_BLOCK_SIZE bytes in length 
> which is divisible by u32. Hence, shouldn't we disregard crypto_xor in favor 
> of a loop iterating in 32 bits words? crypto_xor contains some checks for 
> trailing bytes which we could spare.

crypto_xor() here is fine.  It already meets the conditions for the inlined
version that XOR's a long at a time:

if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
__builtin_constant_p(size) &&
(size % sizeof(unsigned long)) == 0) {
unsigned long *d = (unsigned long *)dst;
unsigned long *s = (unsigned long *)src;

while (size > 0) {
*d++ ^= *s++;
size -= sizeof(unsigned long);
}
}

And regardless, it's better to optimize crypto_xor() once, than all the callers.

> 
> > bytes -= CHACHA20_BLOCK_SIZE;
> > dst += CHACHA20_BLOCK_SIZE;
> > }
> > if (bytes) {
> > chacha20_block(state, stream);
> > -   crypto_xor(dst, (const u8 *)stream, bytes);
> > +   crypto_xor(dst, stream, bytes);
> 
> Same here.

'bytes' need not be a multiple of sizeof(u32) or sizeof(long), and 'dst' can
have any alignment...  So we should just call crypto_xor() which does the right
thing, and is intended to do so as efficiently as possible.

> 
> > @@ -1006,14 +1006,14 @@ static void _crng_backtrack_protect(struct
> > crng_state *crng, used = 0;
> > }
> > spin_lock_irqsave(&crng->lock, flags);
> > -   s = &tmp[used / sizeof(__u32)];
> > +   s = (__u32 *) &tmp[used];
> 
> As Yann said, wouldn't you have the alignment problem here again?
> 
> Somehow, somebody must check the provided input buffer at one time.
> 

I guess we should just explicitly align the 'tmp' buffers in _get_random_bytes()
and extract_crng_user().
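
Roughly, I mean something like this (just a sketch; the same change would go
into extract_crng_user() too, and the exact alignment annotation is open to
bikeshedding):

        __u8 tmp[CHACHA20_BLOCK_SIZE] __aligned(__alignof__(__u32));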

- Eric


[PATCH] fscrypt: remove CRYPTO_CTR dependency

2018-09-05 Thread Eric Biggers
From: Eric Biggers 

fscrypt doesn't use the CTR mode of operation for anything, so there's
no need to select CRYPTO_CTR.  It was added by commit 71dea01ea2ed
("ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is
enabled").  But, I've been unable to identify the arm64 crypto bug it
was supposedly working around.

I suspect the issue was seen only on some old Android device kernel
(circa 3.10?).  So if the fix wasn't mistaken, the real bug is probably
already fixed.  Or maybe it was actually a bug in a non-upstream crypto
driver.

So, remove the dependency.  If it turns out there's actually still a
bug, we'll fix it properly.

Signed-off-by: Eric Biggers 
---
 fs/crypto/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/crypto/Kconfig b/fs/crypto/Kconfig
index 02b7d91c92310..284b589b4774d 100644
--- a/fs/crypto/Kconfig
+++ b/fs/crypto/Kconfig
@@ -6,7 +6,6 @@ config FS_ENCRYPTION
select CRYPTO_ECB
select CRYPTO_XTS
select CRYPTO_CTS
-   select CRYPTO_CTR
select CRYPTO_SHA256
select KEYS
help
-- 
2.19.0.rc2.392.g5ba43deb5a-goog



[PATCH v2] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-09-01 Thread Eric Biggers
From: Eric Biggers 

Optimize ChaCha20 NEON performance by:

- Implementing the 8-bit rotations using the 'vtbl.8' instruction.
- Streamlining the part that adds the original state and XORs the data.
- Making some other small tweaks.

On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
about 12.08 cycles per byte to about 11.37 -- a 5.9% improvement.

There is a tradeoff involved with the 'vtbl.8' rotation method since
there is at least one CPU (Cortex-A53) where it's not fastest.  But it
seems to be a better default; see the added comment.  Overall, this
patch reduces Cortex-A53 performance by less than 0.5%.

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/chacha20-neon-core.S | 277 ++-
 1 file changed, 143 insertions(+), 134 deletions(-)

diff --git a/arch/arm/crypto/chacha20-neon-core.S 
b/arch/arm/crypto/chacha20-neon-core.S
index 451a849ad5186..50e7b98968189 100644
--- a/arch/arm/crypto/chacha20-neon-core.S
+++ b/arch/arm/crypto/chacha20-neon-core.S
@@ -18,6 +18,34 @@
  * (at your option) any later version.
  */
 
+ /*
+  * NEON doesn't have a rotate instruction.  The alternatives are, more or 
less:
+  *
+  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
+  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
+  * (c)  vrev32.16 (16-bit rotations only)
+  * (d)  vtbl.8 + vtbl.8   (multiple of 8 bits rotations only,
+  * needs index vector)
+  *
+  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
+  * rotations, the only choices are (a) and (b).  We use (a) since it takes
+  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
+  *
+  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
+  * and doesn't need a temporary register.
+  *
+  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this 
sequence
+  * is twice as fast as (a), even when doing (a) on multiple registers
+  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
+  * parallelizes better when temporary registers are scarce.
+  *
+  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed 
as
+  * (a), so the need to load the rotation table actually makes the vtbl method
+  * slightly slower overall on that CPU (~1.3% slower ChaCha20).  Still, it
+  * seems to be a good compromise to get a more significant speed boost on some
+  * CPUs, e.g. ~4.8% faster ChaCha20 on Cortex-A7.
+  */
+
 #include 
 
.text
@@ -46,7 +74,9 @@ ENTRY(chacha20_block_xor_neon)
vmovq10, q2
vmovq11, q3
 
+   adr ip, .Lrol8_table
mov r3, #10
+   vld1.8  {d10}, [ip, :64]
 
 .Ldoubleround:
// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
@@ -62,9 +92,9 @@ ENTRY(chacha20_block_xor_neon)
 
// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
vadd.i32q0, q0, q1
-   veorq4, q3, q0
-   vshl.u32q3, q4, #8
-   vsri.u32q3, q4, #24
+   veorq3, q3, q0
+   vtbl.8  d6, {d6}, d10
+   vtbl.8  d7, {d7}, d10
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
vadd.i32q2, q2, q3
@@ -92,9 +122,9 @@ ENTRY(chacha20_block_xor_neon)
 
// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
vadd.i32q0, q0, q1
-   veorq4, q3, q0
-   vshl.u32q3, q4, #8
-   vsri.u32q3, q4, #24
+   veorq3, q3, q0
+   vtbl.8  d6, {d6}, d10
+   vtbl.8  d7, {d7}, d10
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
vadd.i32q2, q2, q3
@@ -139,13 +169,17 @@ ENTRY(chacha20_block_xor_neon)
bx  lr
 ENDPROC(chacha20_block_xor_neon)
 
+   .align  4
+.Lctrinc:  .word   0, 1, 2, 3
+.Lrol8_table:  .byte   3, 0, 1, 2, 7, 4, 5, 6
+
.align  5
 ENTRY(chacha20_4block_xor_neon)
-   push{r4-r6, lr}
-   mov ip, sp  // preserve the stack pointer
-   sub r3, sp, #0x20   // allocate a 32 byte buffer
-   bic r3, r3, #0x1f   // aligned to 32 bytes
-   mov sp, r3
+   push{r4-r5}
+   mov r4, sp  // preserve the stack pointer
+   sub ip, sp, #0x20   // allocate a 32 byte buffer
+   bic ip, ip, #0x1f   // aligned to 32 bytes
+   mov sp, ip
 
// r0: Input state matrix, s
// r1: 4 data blocks output, o
@@ -155,25 +189,24 @@ ENTRY(chacha20_4block_xor_neon)
// This function encrypts four consecutive ChaCha20 blocks by loading
// the state matrix in NEON registers four times. The algorithm performs
// each operation on the corresponding word of each state matrix, hence
-   // requires no word

Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-09-01 Thread Eric Biggers
On Fri, Aug 31, 2018 at 06:51:34PM +0200, Ard Biesheuvel wrote:
> >>
> >> +   adr ip, .Lrol8_table
> >> mov r3, #10
> >>
> >>  .Ldoubleround4:
> >> @@ -238,24 +268,25 @@ ENTRY(chacha20_4block_xor_neon)
> >> // x1 += x5, x13 = rotl32(x13 ^ x1, 8)
> >> // x2 += x6, x14 = rotl32(x14 ^ x2, 8)
> >> // x3 += x7, x15 = rotl32(x15 ^ x3, 8)
> >> +   vld1.8  {d16}, [ip, :64]
> 
> Also, would it perhaps be more efficient to keep the rotation vector
> in a pair of GPRs, and use something like
> 
> vmov d16, r4, r5
> 
> here?
> 

I tried that, but it doesn't help on either Cortex-A7 or Cortex-A53.
In fact it's very slightly worse.

- Eric


Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-09-01 Thread Eric Biggers
Hi Ard,

On Fri, Aug 31, 2018 at 05:56:24PM +0200, Ard Biesheuvel wrote:
> Hi Eric,
> 
> On 31 August 2018 at 10:01, Eric Biggers  wrote:
> > From: Eric Biggers 
> >
> > Optimize ChaCha20 NEON performance by:
> >
> > - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
> > - Streamlining the part that adds the original state and XORs the data.
> > - Making some other small tweaks.
> >
> > On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
> > about 11.9 cycles per byte to 11.3.
> >
> > There is a tradeoff involved with the 'vtbl.8' rotation method since
> > there is at least one CPU where it's not fastest.  But it seems to be a
> > better default; see the added comment.
> >
> > Signed-off-by: Eric Biggers 
> > ---
> >  arch/arm/crypto/chacha20-neon-core.S | 289 ++-
> >  1 file changed, 147 insertions(+), 142 deletions(-)
> >
> > diff --git a/arch/arm/crypto/chacha20-neon-core.S 
> > b/arch/arm/crypto/chacha20-neon-core.S
> > index 3fecb2124c35a..d381cebaba31d 100644
> > --- a/arch/arm/crypto/chacha20-neon-core.S
> > +++ b/arch/arm/crypto/chacha20-neon-core.S
> > @@ -18,6 +18,33 @@
> >   * (at your option) any later version.
> >   */
> >
> > + /*
> > +  * NEON doesn't have a rotate instruction.  The alternatives are, more or 
> > less:
> > +  *
> > +  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
> > +  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
> > +  * (c)  vrev32.16 (16-bit rotations only)
> > +  * (d)  vtbl.8 + vtbl.8   (multiple of 8 bits rotations only,
> > +  * needs index vector)
> > +  *
> > +  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
> > +  * rotations, the only choices are (a) and (b).  We use (a) since it takes
> > +  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
> > +  *
> > +  * For the 16-bit rotation, we use vrev32.16 since it's consistently 
> > fastest
> > +  * and doesn't need a temporary register.
> > +  *
> > +  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this 
> > sequence
> > +  * is twice as fast as (a), even when doing (a) on multiple registers
> > +  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
> > +  * parallelizes better when temporary registers are scarce.
> > +  *
> > +  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same 
> > speed as
> > +  * (a), so the need to load the rotation table actually makes the vtbl 
> > method
> > +  * slightly slower overall on that CPU.  Still, it seems to be a good
> > +  * compromise to get a significant speed boost on some CPUs.
> > +  */
> > +
> 
> Thanks for sharing these results. I have been working on 32-bit ARM
> code under the assumption that the A53 pipeline more or less resembles
> the A7 one, but this is obviously not the case looking at your
> results. My contributions to arch/arm/crypto mainly involved Crypto
> Extensions code, which the A7 does not support in the first place, so
> it does not really matter, but I will keep this in mind going forward.
> 
> >  #include 
> >
> > .text
> > @@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
> > vmovq10, q2
> > vmovq11, q3
> >
> > +   ldr ip, =.Lrol8_table
> > +   vld1.8  {d10}, [ip, :64]
> > +
> 
> I usually try to avoid the =literal ldr notation, because it involves
> an additional load via the D-cache. Could you use a 64-bit literal
> instead of a byte array and use vldr instead? Or switch to adr? (and
> move the literal in range, I suppose)

'adr' works if I move rol8_table to between chacha20_block_xor_neon() and
chacha20_4block_xor_neon().

> >  ENTRY(chacha20_4block_xor_neon)
> > -   push{r4-r6, lr}
> > -   mov ip, sp  // preserve the stack 
> > pointer
> > -   sub r3, sp, #0x20   // allocate a 32 byte buffer
> > -   bic r3, r3, #0x1f   // aligned to 32 bytes
> > -   mov sp, r3
> > +   push{r4}
> 
> The ARM EABI mandates 8 byte stack alignment, and if you take an
> interrupt right at this point, you will enter the interrupt handler
> with a misaligned stack. Whether this could actually cause any
> problems is a different question, but it is better to keep it 8-byte
> aligned to

[PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-08-31 Thread Eric Biggers
From: Eric Biggers 

Optimize ChaCha20 NEON performance by:

- Implementing the 8-bit rotations using the 'vtbl.8' instruction.
- Streamlining the part that adds the original state and XORs the data.
- Making some other small tweaks.

On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
about 11.9 cycles per byte to 11.3.

There is a tradeoff involved with the 'vtbl.8' rotation method since
there is at least one CPU where it's not fastest.  But it seems to be a
better default; see the added comment.

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/chacha20-neon-core.S | 289 ++-
 1 file changed, 147 insertions(+), 142 deletions(-)

diff --git a/arch/arm/crypto/chacha20-neon-core.S 
b/arch/arm/crypto/chacha20-neon-core.S
index 3fecb2124c35a..d381cebaba31d 100644
--- a/arch/arm/crypto/chacha20-neon-core.S
+++ b/arch/arm/crypto/chacha20-neon-core.S
@@ -18,6 +18,33 @@
  * (at your option) any later version.
  */
 
+ /*
+  * NEON doesn't have a rotate instruction.  The alternatives are, more or 
less:
+  *
+  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
+  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
+  * (c)  vrev32.16 (16-bit rotations only)
+  * (d)  vtbl.8 + vtbl.8   (multiple of 8 bits rotations only,
+  * needs index vector)
+  *
+  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
+  * rotations, the only choices are (a) and (b).  We use (a) since it takes
+  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
+  *
+  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
+  * and doesn't need a temporary register.
+  *
+  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this 
sequence
+  * is twice as fast as (a), even when doing (a) on multiple registers
+  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
+  * parallelizes better when temporary registers are scarce.
+  *
+  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed 
as
+  * (a), so the need to load the rotation table actually makes the vtbl method
+  * slightly slower overall on that CPU.  Still, it seems to be a good
+  * compromise to get a significant speed boost on some CPUs.
+  */
+
 #include 
 
.text
@@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
vmovq10, q2
vmovq11, q3
 
+   ldr ip, =.Lrol8_table
+   vld1.8  {d10}, [ip, :64]
+
mov r3, #10
 
 .Ldoubleround:
@@ -63,9 +93,9 @@ ENTRY(chacha20_block_xor_neon)
 
// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
vadd.i32q0, q0, q1
-   veorq4, q3, q0
-   vshl.u32q3, q4, #8
-   vsri.u32q3, q4, #24
+   veorq3, q3, q0
+   vtbl.8  d6, {d6}, d10
+   vtbl.8  d7, {d7}, d10
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
vadd.i32q2, q2, q3
@@ -94,9 +124,9 @@ ENTRY(chacha20_block_xor_neon)
 
// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
vadd.i32q0, q0, q1
-   veorq4, q3, q0
-   vshl.u32q3, q4, #8
-   vsri.u32q3, q4, #24
+   veorq3, q3, q0
+   vtbl.8  d6, {d6}, d10
+   vtbl.8  d7, {d7}, d10
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
vadd.i32q2, q2, q3
@@ -143,11 +173,11 @@ ENDPROC(chacha20_block_xor_neon)
 
.align  5
 ENTRY(chacha20_4block_xor_neon)
-   push{r4-r6, lr}
-   mov ip, sp  // preserve the stack pointer
-   sub r3, sp, #0x20   // allocate a 32 byte buffer
-   bic r3, r3, #0x1f   // aligned to 32 bytes
-   mov sp, r3
+   push{r4}
+   mov r4, sp  // preserve the stack pointer
+   sub ip, sp, #0x20   // allocate a 32 byte buffer
+   bic ip, ip, #0x1f   // aligned to 32 bytes
+   mov sp, ip
 
// r0: Input state matrix, s
// r1: 4 data blocks output, o
@@ -157,25 +187,24 @@ ENTRY(chacha20_4block_xor_neon)
// This function encrypts four consecutive ChaCha20 blocks by loading
// the state matrix in NEON registers four times. The algorithm performs
// each operation on the corresponding word of each state matrix, hence
-   // requires no word shuffling. For final XORing step we transpose the
-   // matrix by interleaving 32- and then 64-bit words, which allows us to
-   // do XOR in NEON registers.
+   // requires no word shuffling. The words are re-interleaved before the
+   // final addition of the original state and the XORing step.
//
 
-   // x0..15[0-3] = s0..3[0..3]
-   add r3, r0, #0x20

Re: random: ensure use of aligned buffers with ChaCha20

2018-08-09 Thread Eric Biggers
On Thu, Aug 09, 2018 at 12:07:18PM -0700, Eric Biggers wrote:
> On Thu, Aug 09, 2018 at 08:38:56PM +0200, Stephan Müller wrote:
> > The function extract_crng invokes the ChaCha20 block operation directly
> > on the user-provided buffer. The block operation operates on u32 words.
> > Thus the extract_crng function expects the buffer to be aligned to u32
> > as it is visible with the parameter type of extract_crng. However,
> > get_random_bytes uses a void pointer which may or may not be aligned.
> > Thus, an alignment check is necessary and the temporary buffer must be
> > used if the alignment to u32 is not ensured.
> > 
> > Cc:  # v4.16+
> > Cc: Ted Tso 
> > Signed-off-by: Stephan Mueller 
> > ---
> >  drivers/char/random.c | 10 --
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index bd449ad52442..23f336872426 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -1617,8 +1617,14 @@ static void _get_random_bytes(void *buf, int nbytes)
> > trace_get_random_bytes(nbytes, _RET_IP_);
> >  
> > while (nbytes >= CHACHA20_BLOCK_SIZE) {
> > -   extract_crng(buf);
> > -   buf += CHACHA20_BLOCK_SIZE;
> > +   if (likely((unsigned long)buf & (sizeof(tmp[0]) - 1))) {
> > +   extract_crng(buf);
> > +   buf += CHACHA20_BLOCK_SIZE;
> > +   } else {
> > +   extract_crng(tmp);
> > +   memcpy(buf, tmp, CHACHA20_BLOCK_SIZE);
> > +   }
> > +
> > nbytes -= CHACHA20_BLOCK_SIZE;
> > }
> >  
> > -- 
> > 2.17.1
> 
> This patch is backwards: the temporary buffer is used when the buffer is
> *aligned*, not misaligned.  And more problematically, 'buf' is never 
> incremented
> in one of the cases...
> 
> Note that I had tried to fix the chacha20_block() alignment bugs in commit
> 9f480faec58cd6197a ("crypto: chacha20 - Fix keystream alignment for
> chacha20_block()"), but I had missed this case.  I don't like seeing the
> alignment requirement being worked around with a temporary buffer; it's
> error-prone, and inefficient on common platforms.  How about we instead make 
> the
> output of chacha20_block() a u8 array and output the 16 32-bit words using
> put_unaligned_le32()?  In retrospect I probably should have just done that, 
> but
> at the time I didn't know of any case where the alignment would be a problem.
> 
> - Eric

For example:

-8<-

From: Eric Biggers 
Subject: [PATCH] crypto: chacha20 - Fix keystream alignment for 
chacha20_block() (again)

In commit 9f480faec58cd6 ("crypto: chacha20 - Fix keystream alignment
for chacha20_block()") I had missed that chacha20_block() can end up
being called on the buffer passed to get_random_bytes(), which can have
any alignment.  So, while my commit didn't break anything since
chacha20_block() has actually always had a u32-alignment requirement for
the output, it didn't fully solve the alignment problems.

Revert my solution and just update chacha20_block() to use
put_unaligned_le32(), so the output buffer doesn't have to be aligned.

This is simpler, and on many CPUs it's the same speed.

Reported-by: Stephan Müller 
Signed-off-by: Eric Biggers 
---
 crypto/chacha20_generic.c |  7 ---
 drivers/char/random.c | 24 
 include/crypto/chacha20.h |  3 +--
 lib/chacha20.c|  6 +++---
 4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
index e451c3cb6a56..3ae96587caf9 100644
--- a/crypto/chacha20_generic.c
+++ b/crypto/chacha20_generic.c
@@ -18,20 +18,21 @@
 static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
 unsigned int bytes)
 {
-   u32 stream[CHACHA20_BLOCK_WORDS];
+   /* aligned to potentially speed up crypto_xor() */
+   u8 stream[CHACHA20_BLOCK_SIZE] __aligned(sizeof(long));
 
if (dst != src)
memcpy(dst, src, bytes);
 
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block(state, stream);
-   crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE);
+   crypto_xor(dst, stream, CHACHA20_BLOCK_SIZE);
bytes -= CHACHA20_BLOCK_SIZE;
dst += CHACHA20_BLOCK_SIZE;
}
if (bytes) {
chacha20_block(state, stream);
-   crypto_xor(dst, (const u8 *)stream, bytes);
+   crypto_xor(dst, stream, bytes);
}
 }
 
diff --git a/drivers/char/random.c b/drivers/char/random.c
index bd449ad52442..b8f4345a50f4 1006

Re: random: ensure use of aligned buffers with ChaCha20

2018-08-09 Thread Eric Biggers
On Thu, Aug 09, 2018 at 08:38:56PM +0200, Stephan Müller wrote:
> The function extract_crng invokes the ChaCha20 block operation directly
> on the user-provided buffer. The block operation operates on u32 words.
> Thus the extract_crng function expects the buffer to be aligned to u32
> as it is visible with the parameter type of extract_crng. However,
> get_random_bytes uses a void pointer which may or may not be aligned.
> Thus, an alignment check is necessary and the temporary buffer must be
> used if the alignment to u32 is not ensured.
> 
> Cc:  # v4.16+
> Cc: Ted Tso 
> Signed-off-by: Stephan Mueller 
> ---
>  drivers/char/random.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index bd449ad52442..23f336872426 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1617,8 +1617,14 @@ static void _get_random_bytes(void *buf, int nbytes)
>   trace_get_random_bytes(nbytes, _RET_IP_);
>  
>   while (nbytes >= CHACHA20_BLOCK_SIZE) {
> - extract_crng(buf);
> - buf += CHACHA20_BLOCK_SIZE;
> + if (likely((unsigned long)buf & (sizeof(tmp[0]) - 1))) {
> + extract_crng(buf);
> + buf += CHACHA20_BLOCK_SIZE;
> + } else {
> + extract_crng(tmp);
> + memcpy(buf, tmp, CHACHA20_BLOCK_SIZE);
> + }
> +
>   nbytes -= CHACHA20_BLOCK_SIZE;
>   }
>  
> -- 
> 2.17.1

This patch is backwards: the temporary buffer is used when the buffer is
*aligned*, not misaligned.  And more problematically, 'buf' is never incremented
in one of the cases...
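
Concretely, just to spell out both issues (and not as an endorsement of the
temporary-buffer approach), the loop would have to look something like:

        while (nbytes >= CHACHA20_BLOCK_SIZE) {
                /* the *aligned* case is the one that can write directly to 'buf' */
                if (likely(((unsigned long)buf & (sizeof(tmp[0]) - 1)) == 0)) {
                        extract_crng(buf);
                } else {
                        extract_crng(tmp);
                        memcpy(buf, tmp, CHACHA20_BLOCK_SIZE);
                }
                /* ... and 'buf' must advance in both branches */
                buf += CHACHA20_BLOCK_SIZE;
                nbytes -= CHACHA20_BLOCK_SIZE;
        }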

Note that I had tried to fix the chacha20_block() alignment bugs in commit
9f480faec58cd6197a ("crypto: chacha20 - Fix keystream alignment for
chacha20_block()"), but I had missed this case.  I don't like seeing the
alignment requirement being worked around with a temporary buffer; it's
error-prone, and inefficient on common platforms.  How about we instead make the
output of chacha20_block() a u8 array and output the 16 32-bit words using
put_unaligned_le32()?  In retrospect I probably should have just done that, but
at the time I didn't know of any case where the alignment would be a problem.

- Eric


[PATCH v2 2/2] crypto: dh - make crypto_dh_encode_key() more robust

2018-07-27 Thread Eric Biggers
From: Eric Biggers 

Make it return -EINVAL if crypto_dh_key_len() is incorrect rather than
overflowing the buffer.

Signed-off-by: Eric Biggers 
---
 crypto/dh_helper.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/crypto/dh_helper.c b/crypto/dh_helper.c
index db9b2d9c58f04..edacda5f6a4d3 100644
--- a/crypto/dh_helper.c
+++ b/crypto/dh_helper.c
@@ -16,8 +16,10 @@
 
 #define DH_KPP_SECRET_MIN_SIZE (sizeof(struct kpp_secret) + 4 * sizeof(int))
 
-static inline u8 *dh_pack_data(void *dst, const void *src, size_t size)
+static inline u8 *dh_pack_data(u8 *dst, u8 *end, const void *src, size_t size)
 {
+   if (!dst || size > end - dst)
+   return NULL;
memcpy(dst, src, size);
return dst + size;
 }
@@ -42,27 +44,27 @@ EXPORT_SYMBOL_GPL(crypto_dh_key_len);
 int crypto_dh_encode_key(char *buf, unsigned int len, const struct dh *params)
 {
u8 *ptr = buf;
+   u8 * const end = ptr + len;
struct kpp_secret secret = {
.type = CRYPTO_KPP_SECRET_TYPE_DH,
.len = len
};
 
-   if (unlikely(!buf))
+   if (unlikely(!len))
return -EINVAL;
 
-   if (len != crypto_dh_key_len(params))
+   ptr = dh_pack_data(ptr, end, &secret, sizeof(secret));
+   ptr = dh_pack_data(ptr, end, &params->key_size,
+  sizeof(params->key_size));
+   ptr = dh_pack_data(ptr, end, &params->p_size, sizeof(params->p_size));
+   ptr = dh_pack_data(ptr, end, &params->q_size, sizeof(params->q_size));
+   ptr = dh_pack_data(ptr, end, &params->g_size, sizeof(params->g_size));
+   ptr = dh_pack_data(ptr, end, params->key, params->key_size);
+   ptr = dh_pack_data(ptr, end, params->p, params->p_size);
+   ptr = dh_pack_data(ptr, end, params->q, params->q_size);
+   ptr = dh_pack_data(ptr, end, params->g, params->g_size);
+   if (ptr != end)
return -EINVAL;
-
-   ptr = dh_pack_data(ptr, &secret, sizeof(secret));
-   ptr = dh_pack_data(ptr, &params->key_size, sizeof(params->key_size));
-   ptr = dh_pack_data(ptr, &params->p_size, sizeof(params->p_size));
-   ptr = dh_pack_data(ptr, &params->q_size, sizeof(params->q_size));
-   ptr = dh_pack_data(ptr, &params->g_size, sizeof(params->g_size));
-   ptr = dh_pack_data(ptr, params->key, params->key_size);
-   ptr = dh_pack_data(ptr, params->p, params->p_size);
-   ptr = dh_pack_data(ptr, params->q, params->q_size);
-   dh_pack_data(ptr, params->g, params->g_size);
-
return 0;
 }
 EXPORT_SYMBOL_GPL(crypto_dh_encode_key);
-- 
2.18.0.345.g5c9ce644c3-goog



[PATCH v2 1/2] crypto: dh - fix calculating encoded key size

2018-07-27 Thread Eric Biggers
From: Eric Biggers 

DH_KPP_SECRET_MIN_SIZE was not increased to account for 'q_size', causing an
out-of-bounds write of 4 bytes in crypto_dh_encode_key(), and
an out-of-bounds read of 4 bytes in crypto_dh_decode_key().  Fix it, and
fix the lengths of the test vectors to match this.

Reported-by: syzbot+6d38d558c25b53b8f...@syzkaller.appspotmail.com
Fixes: e3fe0ae12962 ("crypto: dh - add public key verification test")
Signed-off-by: Eric Biggers 
---
 crypto/dh_helper.c |  2 +-
 crypto/testmgr.h   | 12 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/crypto/dh_helper.c b/crypto/dh_helper.c
index a7de3d9ce5ace..db9b2d9c58f04 100644
--- a/crypto/dh_helper.c
+++ b/crypto/dh_helper.c
@@ -14,7 +14,7 @@
 #include 
 #include 
 
-#define DH_KPP_SECRET_MIN_SIZE (sizeof(struct kpp_secret) + 3 * sizeof(int))
+#define DH_KPP_SECRET_MIN_SIZE (sizeof(struct kpp_secret) + 4 * sizeof(int))
 
 static inline u8 *dh_pack_data(void *dst, const void *src, size_t size)
 {
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 759462d65f412..173111c70746e 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -641,14 +641,14 @@ static const struct kpp_testvec dh_tv_template[] = {
.secret =
 #ifdef __LITTLE_ENDIAN
"\x01\x00" /* type */
-   "\x11\x02" /* len */
+   "\x15\x02" /* len */
"\x00\x01\x00\x00" /* key_size */
"\x00\x01\x00\x00" /* p_size */
"\x00\x00\x00\x00" /* q_size */
"\x01\x00\x00\x00" /* g_size */
 #else
"\x00\x01" /* type */
-   "\x02\x11" /* len */
+   "\x02\x15" /* len */
"\x00\x00\x01\x00" /* key_size */
"\x00\x00\x01\x00" /* p_size */
"\x00\x00\x00\x00" /* q_size */
@@ -741,7 +741,7 @@ static const struct kpp_testvec dh_tv_template[] = {
"\xd3\x34\x49\xad\x64\xa6\xb1\xc0\x59\x28\x75\x60\xa7\x8a\xb0\x11"
"\x56\x89\x42\x74\x11\xf5\xf6\x5e\x6f\x16\x54\x6a\xb1\x76\x4d\x50"
"\x8a\x68\xc1\x5b\x82\xb9\x0d\x00\x32\x50\xed\x88\x87\x48\x92\x17",
-   .secret_size = 529,
+   .secret_size = 533,
.b_public_size = 256,
.expected_a_public_size = 256,
.expected_ss_size = 256,
@@ -750,14 +750,14 @@ static const struct kpp_testvec dh_tv_template[] = {
.secret =
 #ifdef __LITTLE_ENDIAN
"\x01\x00" /* type */
-   "\x11\x02" /* len */
+   "\x15\x02" /* len */
"\x00\x01\x00\x00" /* key_size */
"\x00\x01\x00\x00" /* p_size */
"\x00\x00\x00\x00" /* q_size */
"\x01\x00\x00\x00" /* g_size */
 #else
"\x00\x01" /* type */
-   "\x02\x11" /* len */
+   "\x02\x15" /* len */
"\x00\x00\x01\x00" /* key_size */
"\x00\x00\x01\x00" /* p_size */
"\x00\x00\x00\x00" /* q_size */
@@ -850,7 +850,7 @@ static const struct kpp_testvec dh_tv_template[] = {
"\x5e\x5a\x64\xbd\xf6\x85\x04\xe8\x28\x6a\xac\xef\xce\x19\x8e\x9a"
"\xfe\x75\xc0\x27\x69\xe3\xb3\x7b\x21\xa7\xb1\x16\xa4\x85\x23\xee"
"\xb0\x1b\x04\x6e\xbd\xab\x16\xde\xfd\x86\x6b\xa9\x95\xd7\x0b\xfd",
-   .secret_size = 529,
+   .secret_size = 533,
.b_public_size = 256,
.expected_a_public_size = 256,
.expected_ss_size = 256,
-- 
2.18.0.345.g5c9ce644c3-goog



Re: 答复: [PATCH 1/3] crypto: skcipher - fix crash flushing dcache in error path

2018-07-25 Thread Eric Biggers
Hi GaoKui,

On Thu, Jul 26, 2018 at 02:44:30AM +, gaokui (A) wrote:
> Hi, Eric,
>   Thanks for your reply.
> 
>   I have run your program on an original kernel and it reproduced the
> crash. I also ran the program on a kernel with our patch, and there was
> no crash.
> 
>   I think the reason for the crash is that the parameter buffer is aligned
> with the page.  So the address of the parameter buffer starts at the
> beginning of the page, which makes "walk->offset = 0" and generates the
> crash. I added some logs in "scatterwalk_pagedone()" to print the value of
> walk->offset, and the log before the crash shows that "walk->offset = 0".
> 
>   And I do not understand why "walk->offset = 0" means no data to be
> processed. In the structure "scatterlist", the member "offset" represents
> the offset of the buffer in the page, and the member "length" represents
> the length of the buffer. In the function "af_alg_make_sg()", if a buffer
> occupies more than one page, the offset will also be set to 0 in the second
> and following pages. And in the function scatterwalk_done(), walk->offset = 0
> will also allow calling "scatterwalk_pagedone()". So I think that when
> "walk->offset = 0" the page needs to be flushed as well.
> 
> BRs
> GaoKui
> 

Did you test my patches or just yours?  Your patch fixes the crash, but I don't
agree that it's the best fix.  What you're missing is that walk->offset has
already been increased by scatterwalk_advance() to the offset of the *end* of
the data region processed.  Hence, walk->offset = 0 implies that 0 bytes were
processed (as walk->offset must have been 0 initially, then had 0 added to it),
which I think isn't meant to be a valid case.  And in particular it does *not*
make sense to flush any page when 0 bytes were processed.
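
For reference, scatterwalk_advance() boils down to just:

        static inline void scatterwalk_advance(struct scatter_walk *walk,
                                               unsigned int nbytes)
        {
                walk->offset += nbytes;
        }

so after a nonzero advance the offset points just past the processed region,
and it can only still be 0 if nothing was advanced at all.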

Note that this could also be a problem for empty scatterlist elements, but
AFAICS the scatterlist walk code doesn't actually support those when the total
length isn't 0.  I think that needs improvement too, but AFAICS other changes
would be needed to properly fix that limitation, and you apparently cannot
generate empty scatterlist elements via AF_ALG anyway so only in-kernel users
would be affected.

- Eric


[PATCH] crypto: arm/chacha20 - always use vrev for 16-bit rotates

2018-07-24 Thread Eric Biggers
From: Eric Biggers 

The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
but the one-way code (used on remainder blocks) implements it with
vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/chacha20-neon-core.S | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/arm/crypto/chacha20-neon-core.S 
b/arch/arm/crypto/chacha20-neon-core.S
index 3fecb2124c35..451a849ad518 100644
--- a/arch/arm/crypto/chacha20-neon-core.S
+++ b/arch/arm/crypto/chacha20-neon-core.S
@@ -51,9 +51,8 @@ ENTRY(chacha20_block_xor_neon)
 .Ldoubleround:
// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
vadd.i32q0, q0, q1
-   veorq4, q3, q0
-   vshl.u32q3, q4, #16
-   vsri.u32q3, q4, #16
+   veorq3, q3, q0
+   vrev32.16   q3, q3
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
vadd.i32q2, q2, q3
@@ -82,9 +81,8 @@ ENTRY(chacha20_block_xor_neon)
 
// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
vadd.i32q0, q0, q1
-   veorq4, q3, q0
-   vshl.u32q3, q4, #16
-   vsri.u32q3, q4, #16
+   veorq3, q3, q0
+   vrev32.16   q3, q3
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
vadd.i32q2, q2, q3
-- 
2.18.0



[PATCH 3/3] crypto: ablkcipher - fix crash flushing dcache in error path

2018-07-23 Thread Eric Biggers
From: Eric Biggers 

Like the skcipher_walk and blkcipher_walk cases:

scatterwalk_done() is only meant to be called after a nonzero number of
bytes have been processed, since scatterwalk_pagedone() will flush the
dcache of the *previous* page.  But in the error case of
ablkcipher_walk_done(), e.g. if the input wasn't an integer number of
blocks, scatterwalk_done() was actually called after advancing 0 bytes.
This caused a crash ("BUG: unable to handle kernel paging request")
during '!PageSlab(page)' on architectures like arm and arm64 that define
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, provided that the input was
page-aligned as in that case walk->offset == 0.

Fix it by reorganizing ablkcipher_walk_done() to skip the
scatterwalk_advance() and scatterwalk_done() if an error has occurred.
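To illustrate how the invalid page pointer arises, here is a small userspace
sketch (simplified; PAGE_SHIFT assumed to be 12) of the index that
scatterwalk_pagedone() computes from walk->offset when it is 0:

#include <stdio.h>

int main(void)
{
        unsigned int offset = 0;        /* page-aligned input, 0 bytes advanced */

        /* Mirrors (walk->offset - 1) >> PAGE_SHIFT: the subtraction wraps
         * around to 0xffffffff, so the "previous page" index is huge and the
         * page pointer derived from it points far outside the scatterlist.
         */
        unsigned long idx = (offset - 1u) >> 12;

        printf("bogus page index: %lu\n", idx);
        return 0;
}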

Reported-by: Liu Chao 
Fixes: bf06099db18a ("crypto: skcipher - Add ablkcipher_walk interfaces")
Cc:  # v2.6.35+
Signed-off-by: Eric Biggers 
---
 crypto/ablkcipher.c | 57 +
 1 file changed, 26 insertions(+), 31 deletions(-)

diff --git a/crypto/ablkcipher.c b/crypto/ablkcipher.c
index 1edb5000d783..8882e90e868e 100644
--- a/crypto/ablkcipher.c
+++ b/crypto/ablkcipher.c
@@ -71,11 +71,9 @@ static inline u8 *ablkcipher_get_spot(u8 *start, unsigned 
int len)
return max(start, end_page);
 }
 
-static inline unsigned int ablkcipher_done_slow(struct ablkcipher_walk *walk,
-   unsigned int bsize)
+static inline void ablkcipher_done_slow(struct ablkcipher_walk *walk,
+   unsigned int n)
 {
-   unsigned int n = bsize;
-
for (;;) {
		unsigned int len_this_page = scatterwalk_pagelen(&walk->out);
 
@@ -87,17 +85,13 @@ static inline unsigned int ablkcipher_done_slow(struct 
ablkcipher_walk *walk,
n -= len_this_page;
		scatterwalk_start(&walk->out, sg_next(walk->out.sg));
}
-
-   return bsize;
 }
 
-static inline unsigned int ablkcipher_done_fast(struct ablkcipher_walk *walk,
-   unsigned int n)
+static inline void ablkcipher_done_fast(struct ablkcipher_walk *walk,
+   unsigned int n)
 {
	scatterwalk_advance(&walk->in, n);
	scatterwalk_advance(&walk->out, n);
-
-   return n;
 }
 
 static int ablkcipher_walk_next(struct ablkcipher_request *req,
@@ -107,39 +101,40 @@ int ablkcipher_walk_done(struct ablkcipher_request *req,
 struct ablkcipher_walk *walk, int err)
 {
struct crypto_tfm *tfm = req->base.tfm;
-   unsigned int nbytes = 0;
+   unsigned int n; /* bytes processed */
+   bool more;
 
-   if (likely(err >= 0)) {
-   unsigned int n = walk->nbytes - err;
+   if (unlikely(err < 0))
+   goto finish;
 
-   if (likely(!(walk->flags & ABLKCIPHER_WALK_SLOW)))
-   n = ablkcipher_done_fast(walk, n);
-   else if (WARN_ON(err)) {
-   err = -EINVAL;
-   goto err;
-   } else
-   n = ablkcipher_done_slow(walk, n);
+   n = walk->nbytes - err;
+   walk->total -= n;
+   more = (walk->total != 0);
 
-   nbytes = walk->total - n;
-   err = 0;
+   if (likely(!(walk->flags & ABLKCIPHER_WALK_SLOW))) {
+   ablkcipher_done_fast(walk, n);
+   } else {
+   if (WARN_ON(err)) {
+   /* unexpected case; didn't process all bytes */
+   err = -EINVAL;
+   goto finish;
+   }
+   ablkcipher_done_slow(walk, n);
}
 
-   scatterwalk_done(&walk->in, 0, nbytes);
-   scatterwalk_done(&walk->out, 1, nbytes);
-
-err:
-   walk->total = nbytes;
-   walk->nbytes = nbytes;
+   scatterwalk_done(&walk->in, 0, more);
+   scatterwalk_done(&walk->out, 1, more);
 
-   if (nbytes) {
+   if (more) {
crypto_yield(req->base.flags);
return ablkcipher_walk_next(req, walk);
}
-
+   err = 0;
+finish:
+   walk->nbytes = 0;
if (walk->iv != req->info)
memcpy(req->info, walk->iv, tfm->crt_ablkcipher.ivsize);
kfree(walk->iv_buffer);
-
return err;
 }
 EXPORT_SYMBOL_GPL(ablkcipher_walk_done);
-- 
2.18.0.233.g985f88cf7e-goog



[PATCH 2/3] crypto: blkcipher - fix crash flushing dcache in error path

2018-07-23 Thread Eric Biggers
From: Eric Biggers 

Like the skcipher_walk case:

scatterwalk_done() is only meant to be called after a nonzero number of
bytes have been processed, since scatterwalk_pagedone() will flush the
dcache of the *previous* page.  But in the error case of
blkcipher_walk_done(), e.g. if the input wasn't an integer number of
blocks, scatterwalk_done() was actually called after advancing 0 bytes.
This caused a crash ("BUG: unable to handle kernel paging request")
during '!PageSlab(page)' on architectures like arm and arm64 that define
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, provided that the input was
page-aligned as in that case walk->offset == 0.

Fix it by reorganizing blkcipher_walk_done() to skip the
scatterwalk_advance() and scatterwalk_done() if an error has occurred.

This bug was found by syzkaller fuzzing.

Reproducer, assuming ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE:

#include <linux/if_alg.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
struct sockaddr_alg addr = {
.salg_type = "skcipher",
.salg_name = "ecb(aes-generic)",
};
char buffer[4096] __attribute__((aligned(4096))) = { 0 };
int fd;

fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(fd, (void *)&addr, sizeof(addr));
setsockopt(fd, SOL_ALG, ALG_SET_KEY, buffer, 16);
fd = accept(fd, NULL, NULL);
write(fd, buffer, 15);
read(fd, buffer, 15);
}

Reported-by: Liu Chao 
Fixes: 5cde0af2a982 ("[CRYPTO] cipher: Added block cipher type")
Cc:  # v2.6.19+
Signed-off-by: Eric Biggers 
---
 crypto/blkcipher.c | 54 ++
 1 file changed, 26 insertions(+), 28 deletions(-)

diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c
index dd4dcab3766a..f93abf13b5d4 100644
--- a/crypto/blkcipher.c
+++ b/crypto/blkcipher.c
@@ -70,19 +70,18 @@ static inline u8 *blkcipher_get_spot(u8 *start, unsigned 
int len)
return max(start, end_page);
 }
 
-static inline unsigned int blkcipher_done_slow(struct blkcipher_walk *walk,
-  unsigned int bsize)
+static inline void blkcipher_done_slow(struct blkcipher_walk *walk,
+  unsigned int bsize)
 {
u8 *addr;
 
addr = (u8 *)ALIGN((unsigned long)walk->buffer, walk->alignmask + 1);
addr = blkcipher_get_spot(addr, bsize);
	scatterwalk_copychunks(addr, &walk->out, bsize, 1);
-   return bsize;
 }
 
-static inline unsigned int blkcipher_done_fast(struct blkcipher_walk *walk,
-  unsigned int n)
+static inline void blkcipher_done_fast(struct blkcipher_walk *walk,
+  unsigned int n)
 {
if (walk->flags & BLKCIPHER_WALK_COPY) {
blkcipher_map_dst(walk);
@@ -96,49 +95,48 @@ static inline unsigned int blkcipher_done_fast(struct 
blkcipher_walk *walk,
 
	scatterwalk_advance(&walk->in, n);
	scatterwalk_advance(&walk->out, n);
-
-   return n;
 }
 
 int blkcipher_walk_done(struct blkcipher_desc *desc,
struct blkcipher_walk *walk, int err)
 {
-   unsigned int nbytes = 0;
+   unsigned int n; /* bytes processed */
+   bool more;
 
-   if (likely(err >= 0)) {
-   unsigned int n = walk->nbytes - err;
+   if (unlikely(err < 0))
+   goto finish;
 
-   if (likely(!(walk->flags & BLKCIPHER_WALK_SLOW)))
-   n = blkcipher_done_fast(walk, n);
-   else if (WARN_ON(err)) {
-   err = -EINVAL;
-   goto err;
-   } else
-   n = blkcipher_done_slow(walk, n);
+   n = walk->nbytes - err;
+   walk->total -= n;
+   more = (walk->total != 0);
 
-   nbytes = walk->total - n;
-   err = 0;
+   if (likely(!(walk->flags & BLKCIPHER_WALK_SLOW))) {
+   blkcipher_done_fast(walk, n);
+   } else {
+   if (WARN_ON(err)) {
+   /* unexpected case; didn't process all bytes */
+   err = -EINVAL;
+   goto finish;
+   }
+   blkcipher_done_slow(walk, n);
}
 
-   scatterwalk_done(&walk->in, 0, nbytes);
-   scatterwalk_done(&walk->out, 1, nbytes);
+   scatterwalk_done(&walk->in, 0, more);
+   scatterwalk_done(&walk->out, 1, more);
 
-err:
-   walk->total = nbytes;
-   walk->nbytes = nbytes;
-
-   if (nbytes) {
+   if (more) {
crypto_yield(desc->flags);
return blkcipher_walk_next(desc, walk);
}
-
+   err = 0;
+finish:
+   walk->nbytes = 0;
if (walk->iv != desc->info)
memcpy(desc->info, walk->iv, walk->ivsize);

[PATCH 1/3] crypto: skcipher - fix crash flushing dcache in error path

2018-07-23 Thread Eric Biggers
From: Eric Biggers 

scatterwalk_done() is only meant to be called after a nonzero number of
bytes have been processed, since scatterwalk_pagedone() will flush the
dcache of the *previous* page.  But in the error case of
skcipher_walk_done(), e.g. if the input wasn't an integer number of
blocks, scatterwalk_done() was actually called after advancing 0 bytes.
This caused a crash ("BUG: unable to handle kernel paging request")
during '!PageSlab(page)' on architectures like arm and arm64 that define
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, provided that the input was
page-aligned as in that case walk->offset == 0.

Fix it by reorganizing skcipher_walk_done() to skip the
scatterwalk_advance() and scatterwalk_done() if an error has occurred.

This bug was found by syzkaller fuzzing.

Reproducer, assuming ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE:

#include <linux/if_alg.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
struct sockaddr_alg addr = {
.salg_type = "skcipher",
.salg_name = "cbc(aes-generic)",
};
char buffer[4096] __attribute__((aligned(4096))) = { 0 };
int fd;

fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(fd, (void *)&addr, sizeof(addr));
setsockopt(fd, SOL_ALG, ALG_SET_KEY, buffer, 16);
fd = accept(fd, NULL, NULL);
write(fd, buffer, 15);
read(fd, buffer, 15);
}

Reported-by: Liu Chao 
Fixes: b286d8b1a690 ("crypto: skcipher - Add skcipher walk interface")
Cc:  # v4.10+
Signed-off-by: Eric Biggers 
---
 crypto/skcipher.c | 53 ---
 1 file changed, 27 insertions(+), 26 deletions(-)

diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index 7d6a49fe3047..5f7017b36d75 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -95,7 +95,7 @@ static inline u8 *skcipher_get_spot(u8 *start, unsigned int 
len)
return max(start, end_page);
 }
 
-static int skcipher_done_slow(struct skcipher_walk *walk, unsigned int bsize)
+static void skcipher_done_slow(struct skcipher_walk *walk, unsigned int bsize)
 {
u8 *addr;
 
@@ -103,23 +103,24 @@ static int skcipher_done_slow(struct skcipher_walk *walk, 
unsigned int bsize)
addr = skcipher_get_spot(addr, bsize);
	scatterwalk_copychunks(addr, &walk->out, bsize,
   (walk->flags & SKCIPHER_WALK_PHYS) ? 2 : 1);
-   return 0;
 }
 
 int skcipher_walk_done(struct skcipher_walk *walk, int err)
 {
-   unsigned int n = walk->nbytes - err;
-   unsigned int nbytes;
-
-   nbytes = walk->total - n;
-
-   if (unlikely(err < 0)) {
-   nbytes = 0;
-   n = 0;
-   } else if (likely(!(walk->flags & (SKCIPHER_WALK_PHYS |
-  SKCIPHER_WALK_SLOW |
-  SKCIPHER_WALK_COPY |
-  SKCIPHER_WALK_DIFF)))) {
+   unsigned int n; /* bytes processed */
+   bool more;
+
+   if (unlikely(err < 0))
+   goto finish;
+
+   n = walk->nbytes - err;
+   walk->total -= n;
+   more = (walk->total != 0);
+
+   if (likely(!(walk->flags & (SKCIPHER_WALK_PHYS |
+   SKCIPHER_WALK_SLOW |
+   SKCIPHER_WALK_COPY |
+   SKCIPHER_WALK_DIFF)))) {
 unmap_src:
skcipher_unmap_src(walk);
} else if (walk->flags & SKCIPHER_WALK_DIFF) {
@@ -131,28 +132,28 @@ int skcipher_walk_done(struct skcipher_walk *walk, int 
err)
skcipher_unmap_dst(walk);
} else if (unlikely(walk->flags & SKCIPHER_WALK_SLOW)) {
if (WARN_ON(err)) {
+   /* unexpected case; didn't process all bytes */
err = -EINVAL;
-   nbytes = 0;
-   } else
-   n = skcipher_done_slow(walk, n);
+   goto finish;
+   }
+   skcipher_done_slow(walk, n);
+   goto already_advanced;
}
 
-   if (err > 0)
-   err = 0;
-
-   walk->total = nbytes;
-   walk->nbytes = nbytes;
-
	scatterwalk_advance(&walk->in, n);
	scatterwalk_advance(&walk->out, n);
-	scatterwalk_done(&walk->in, 0, nbytes);
-	scatterwalk_done(&walk->out, 1, nbytes);
+already_advanced:
+   scatterwalk_done(&walk->in, 0, more);
+   scatterwalk_done(&walk->out, 1, more);
 
-   if (nbytes) {
+   if (more) {
crypto_yield(walk->flags & SKCIPHER_WALK_SLEEP ?
 CRYPTO_TFM_REQ_MAY_SLEEP : 0);
return skcipher_walk_next(walk);
}
+   err = 0;
+finish:
+   walk->nbytes = 0;
 
 

[PATCH 0/3] crypto: fix crash in scatterwalk_pagedone()

2018-07-23 Thread Eric Biggers
From: Eric Biggers 

This series fixes the bug reported by Liu Chao (found using syzkaller)
where a crash occurs in scatterwalk_pagedone() on architectures such as
arm and arm64 that implement flush_dcache_page(), due to an invalid page
pointer when walk->offset == 0.  This series attempts to address the
underlying problem which is that scatterwalk_pagedone() shouldn't have
been called at all in that case.

Eric Biggers (3):
  crypto: skcipher - fix crash flushing dcache in error path
  crypto: blkcipher - fix crash flushing dcache in error path
  crypto: ablkcipher - fix crash flushing dcache in error path

 crypto/ablkcipher.c | 57 +
 crypto/blkcipher.c  | 54 +-
 crypto/skcipher.c   | 53 -
 3 files changed, 79 insertions(+), 85 deletions(-)

-- 
2.18.0.233.g985f88cf7e-goog



[PATCH] crypto: skcipher - remove unnecessary setting of walk->nbytes

2018-07-23 Thread Eric Biggers
From: Eric Biggers 

Setting 'walk->nbytes = walk->total' in skcipher_walk_first() doesn't
make sense because actually walk->nbytes needs to be set to the length
of the first step in the walk, which may be less than walk->total.  This
is done by skcipher_walk_next() which is called immediately afterwards.
Also walk->nbytes was already set to 0 in skcipher_walk_skcipher(),
which is a better default value in case it's forgotten to be set later.

Therefore, remove the unnecessary assignment to walk->nbytes.

Signed-off-by: Eric Biggers 
---
 crypto/skcipher.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index 7d6a49fe3047..9f7d229827b5 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -436,7 +436,6 @@ static int skcipher_walk_first(struct skcipher_walk *walk)
}
 
walk->page = NULL;
-   walk->nbytes = walk->total;
 
return skcipher_walk_next(walk);
 }
-- 
2.18.0.233.g985f88cf7e-goog



[PATCH] crypto: scatterwalk - remove scatterwalk_samebuf()

2018-07-23 Thread Eric Biggers
From: Eric Biggers 

scatterwalk_samebuf() is never used.  Remove it.

Signed-off-by: Eric Biggers 
---
 include/crypto/scatterwalk.h | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index eac72840a7d2..a66c127a20ed 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -30,13 +30,6 @@ static inline void scatterwalk_crypto_chain(struct 
scatterlist *head,
sg_mark_end(head);
 }
 
-static inline unsigned long scatterwalk_samebuf(struct scatter_walk *walk_in,
-   struct scatter_walk *walk_out)
-{
-   return !(((sg_page(walk_in->sg) - sg_page(walk_out->sg)) << PAGE_SHIFT) +
-            (int)(walk_in->offset - walk_out->offset));
-}
-
 static inline unsigned int scatterwalk_pagelen(struct scatter_walk *walk)
 {
unsigned int len = walk->sg->offset + walk->sg->length - walk->offset;
-- 
2.18.0.233.g985f88cf7e-goog



[PATCH] crypto: scatterwalk - remove 'chain' argument from scatterwalk_crypto_chain()

2018-07-23 Thread Eric Biggers
From: Eric Biggers 

All callers pass chain=0 to scatterwalk_crypto_chain().

Remove this unneeded parameter.

Signed-off-by: Eric Biggers 
---
 crypto/lrw.c  | 4 ++--
 crypto/scatterwalk.c  | 2 +-
 crypto/xts.c  | 4 ++--
 include/crypto/scatterwalk.h  | 8 +---
 net/tls/tls_device_fallback.c | 2 +-
 5 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/crypto/lrw.c b/crypto/lrw.c
index 954a7064a179..393a782679c7 100644
--- a/crypto/lrw.c
+++ b/crypto/lrw.c
@@ -188,7 +188,7 @@ static int post_crypt(struct skcipher_request *req)
if (rctx->dst != sg) {
rctx->dst[0] = *sg;
sg_unmark_end(rctx->dst);
-   scatterwalk_crypto_chain(rctx->dst, sg_next(sg), 0, 2);
+   scatterwalk_crypto_chain(rctx->dst, sg_next(sg), 2);
}
rctx->dst[0].length -= offset - sg->offset;
rctx->dst[0].offset = offset;
@@ -265,7 +265,7 @@ static int pre_crypt(struct skcipher_request *req)
if (rctx->src != sg) {
rctx->src[0] = *sg;
sg_unmark_end(rctx->src);
-   scatterwalk_crypto_chain(rctx->src, sg_next(sg), 0, 2);
+   scatterwalk_crypto_chain(rctx->src, sg_next(sg), 2);
}
rctx->src[0].length -= offset - sg->offset;
rctx->src[0].offset = offset;
diff --git a/crypto/scatterwalk.c b/crypto/scatterwalk.c
index c16c94f88733..d0b92c1cd6e9 100644
--- a/crypto/scatterwalk.c
+++ b/crypto/scatterwalk.c
@@ -91,7 +91,7 @@ struct scatterlist *scatterwalk_ffwd(struct scatterlist 
dst[2],
 
sg_init_table(dst, 2);
sg_set_page(dst, sg_page(src), src->length - len, src->offset + len);
-   scatterwalk_crypto_chain(dst, sg_next(src), 0, 2);
+   scatterwalk_crypto_chain(dst, sg_next(src), 2);
 
return dst;
 }
diff --git a/crypto/xts.c b/crypto/xts.c
index 12284183bd20..ccf55fbb8bc2 100644
--- a/crypto/xts.c
+++ b/crypto/xts.c
@@ -138,7 +138,7 @@ static int post_crypt(struct skcipher_request *req)
if (rctx->dst != sg) {
rctx->dst[0] = *sg;
sg_unmark_end(rctx->dst);
-   scatterwalk_crypto_chain(rctx->dst, sg_next(sg), 0, 2);
+   scatterwalk_crypto_chain(rctx->dst, sg_next(sg), 2);
}
rctx->dst[0].length -= offset - sg->offset;
rctx->dst[0].offset = offset;
@@ -204,7 +204,7 @@ static int pre_crypt(struct skcipher_request *req)
if (rctx->src != sg) {
rctx->src[0] = *sg;
sg_unmark_end(rctx->src);
-   scatterwalk_crypto_chain(rctx->src, sg_next(sg), 0, 2);
+   scatterwalk_crypto_chain(rctx->src, sg_next(sg), 2);
}
rctx->src[0].length -= offset - sg->offset;
rctx->src[0].offset = offset;
diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index 880e6be9e95e..eac72840a7d2 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -22,14 +22,8 @@
 #include 
 
 static inline void scatterwalk_crypto_chain(struct scatterlist *head,
-   struct scatterlist *sg,
-   int chain, int num)
+   struct scatterlist *sg, int num)
 {
-   if (chain) {
-   head->length += sg->length;
-   sg = sg_next(sg);
-   }
-
if (sg)
sg_chain(head, num, sg);
else
diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
index 748914abdb60..4e1ec12bc0fb 100644
--- a/net/tls/tls_device_fallback.c
+++ b/net/tls/tls_device_fallback.c
@@ -42,7 +42,7 @@ static void chain_to_walk(struct scatterlist *sg, struct 
scatter_walk *walk)
sg_set_page(sg, sg_page(src),
src->length - diff, walk->offset);
 
-   scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
+   scatterwalk_crypto_chain(sg, sg_next(src), 2);
 }
 
 static int tls_enc_record(struct aead_request *aead_req,
-- 
2.18.0.233.g985f88cf7e-goog



[PATCH] crypto: skcipher - fix aligning block size in skcipher_copy_iv()

2018-07-23 Thread Eric Biggers
From: Eric Biggers 

The ALIGN() macro needs to be passed the alignment, not the alignmask
(which is the alignment minus 1).
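For reference, ALIGN(x, a) rounds x up to a multiple of the power-of-two
alignment 'a'; passing the alignmask instead breaks that rounding.  A small
check with the macro open-coded (illustration only):

#include <assert.h>
#include <stdio.h>

/* Open-coded equivalent of the kernel's ALIGN(): round x up to a multiple of
 * the power-of-two alignment 'a'.
 */
#define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
        unsigned int alignmask = 7;     /* i.e. 8-byte alignment */

        assert(ALIGN_UP(9u, alignmask + 1) == 16);      /* correct: rounds up to 16 */

        /* Buggy form: passing the mask gives 9, which is not 8-byte aligned. */
        printf("ALIGN(9, alignmask) = %u\n", ALIGN_UP(9u, alignmask));
        return 0;
}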

Fixes: b286d8b1a690 ("crypto: skcipher - Add skcipher walk interface")
Cc:  # v4.10+
Signed-off-by: Eric Biggers 
---
 crypto/skcipher.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index 7d6a49fe3047..4f6b8dadaceb 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -398,7 +398,7 @@ static int skcipher_copy_iv(struct skcipher_walk *walk)
unsigned size;
u8 *iv;
 
-   aligned_bs = ALIGN(bs, alignmask);
+   aligned_bs = ALIGN(bs, alignmask + 1);
 
/* Minimum size to align buffer by alignmask. */
size = alignmask & ~a;
-- 
2.18.0.233.g985f88cf7e-goog



[PATCH] crypto: arm64/sha256 - increase cra_priority of scalar implementations

2018-07-17 Thread Eric Biggers
From: Eric Biggers 

Commit b73b7ac0a774 ("crypto: sha256_generic - add cra_priority") gave
sha256-generic and sha224-generic a cra_priority of 100, to match the
convention for generic implementations.  But sha256-arm64 and
sha224-arm64 also have priority 100, so their order relative to the
generic implementations became ambiguous.

Therefore, increase their priority to 125 so that they have higher
priority than the generic implementations but lower priority than the
NEON implementations which have priority 150.
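The core picks whichever registered implementation of a cra_name has the
highest cra_priority, so equal priorities leave the choice ambiguous.  A tiny
userspace model of that selection, using the priorities discussed here (the
NEON driver name below is illustrative):

#include <stddef.h>
#include <stdio.h>

struct impl {
        const char *driver;
        int priority;
};

int main(void)
{
        /* Priorities after this patch: generic 100, scalar arm64 125, NEON 150. */
        struct impl impls[] = {
                { "sha256-generic", 100 },
                { "sha256-arm64",   125 },
                { "sha256-neon",    150 },      /* illustrative driver name */
        };
        const struct impl *best = &impls[0];

        for (size_t i = 1; i < sizeof(impls) / sizeof(impls[0]); i++)
                if (impls[i].priority > best->priority)
                        best = &impls[i];

        printf("selected driver: %s\n", best->driver);
        return 0;
}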

Signed-off-by: Eric Biggers 
---
 arch/arm64/crypto/sha256-glue.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/crypto/sha256-glue.c b/arch/arm64/crypto/sha256-glue.c
index f1b4f4420ca1..4aedeaefd61f 100644
--- a/arch/arm64/crypto/sha256-glue.c
+++ b/arch/arm64/crypto/sha256-glue.c
@@ -67,7 +67,7 @@ static struct shash_alg algs[] = { {
.descsize   = sizeof(struct sha256_state),
.base.cra_name  = "sha256",
.base.cra_driver_name   = "sha256-arm64",
-   .base.cra_priority  = 100,
+   .base.cra_priority  = 125,
.base.cra_blocksize = SHA256_BLOCK_SIZE,
.base.cra_module= THIS_MODULE,
 }, {
@@ -79,7 +79,7 @@ static struct shash_alg algs[] = { {
.descsize   = sizeof(struct sha256_state),
.base.cra_name  = "sha224",
.base.cra_driver_name   = "sha224-arm64",
-   .base.cra_priority  = 100,
+   .base.cra_priority  = 125,
.base.cra_blocksize = SHA224_BLOCK_SIZE,
.base.cra_module= THIS_MODULE,
 } };
-- 
2.18.0.203.gfac676dfb9-goog



Re: [PATCH] crypto: Add 0 walk-offset check in scatterwalk_pagedone()

2018-07-15 Thread Eric Biggers
Hi Liu,

On Mon, Jul 09, 2018 at 05:10:19PM +0800, Liu Chao wrote:
> From: Luo Xinqiang 
> 
> In function scatterwalk_pagedone(), a kernel panic due to an invalid
> page will occur if walk->offset equals 0. This patch fixes the
> problem by setting the page address with sg_page(walk->sg)
> directly if walk->offset equals 0.
> 
> Panic call stack:
> [] blkcipher_walk_done+0x430/0x8dc
> [] blkcipher_walk_next+0x750/0x9e8
> [] blkcipher_walk_first+0x110/0x2c0
> [] blkcipher_walk_virt+0xcc/0xe0
> [] cbc_decrypt+0xdc/0x1a8
> [] ablk_decrypt+0x138/0x224
> [] skcipher_decrypt_ablkcipher+0x130/0x150
> [] skcipher_recvmsg_sync.isra.17+0x270/0x404
> [] skcipher_recvmsg+0x98/0xb8
> [] SyS_recvfrom+0x2ac/0x2fc
> [] el0_svc_naked+0x34/0x38
> 
> Test: do syzkaller fuzz test on 4.9 & 4.4
> 
> Signed-off-by: Gao Kui 
> Signed-off-by: Luo Xinqiang 
> ---
>  crypto/scatterwalk.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/crypto/scatterwalk.c b/crypto/scatterwalk.c
> index bc769c4..a265907 100644
> --- a/crypto/scatterwalk.c
> +++ b/crypto/scatterwalk.c
> @@ -53,7 +53,11 @@ static void scatterwalk_pagedone(struct scatter_walk 
> *walk, int out,
>   if (out) {
>   struct page *page;
>  
> - page = sg_page(walk->sg) + ((walk->offset - 1) >> PAGE_SHIFT);
> + if (likely(walk->offset))
> + page = sg_page(walk->sg) +
> + ((walk->offset - 1) >> PAGE_SHIFT);
> + else
> + page = sg_page(walk->sg);
>   /* Test ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE first as
>* PageSlab cannot be optimised away per se due to
>* use of volatile pointer.

Interesting, I guess the reason this wasn't found by syzbot yet is that syzbot
currently only runs on x86, where ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE isn't
defined.  Otherwise this crash reproduces on the latest kernel by running the
following program:

#include <linux/if_alg.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
struct sockaddr_alg addr = {
.salg_type = "skcipher",
.salg_name = "cbc(aes)",
};
int algfd, reqfd;
char buffer[4096] __attribute__((aligned(4096))) = { 0 };

algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(algfd, (void *)&addr, sizeof(addr));
setsockopt(algfd, SOL_ALG, ALG_SET_KEY, buffer, 32);
reqfd = accept(algfd, NULL, NULL);
write(reqfd, buffer, 15);
read(reqfd, buffer, 15);
}

I don't think your fix makes sense though, because if walk->offset = 0 then no
data was processed, so there would be no need to flush any page at all.  I think
the original intent was that scatterwalk_pagedone() only be called when a
nonzero length was processed.  So a better fix is probably to update
blkcipher_walk_done() (and skcipher_walk_done() and ablkcipher_walk_done()) to
avoid calling scatterwalk_pagedone() in the error case where no bytes were
processed.  I'm working on that fix but it's not ready quite yet.

Thanks!

- Eric


Re: [PATCH] crypto: dh - fix calculating encoded key size

2018-07-11 Thread Eric Biggers
On Wed, Jul 11, 2018 at 03:26:56PM +0800, Herbert Xu wrote:
> On Tue, Jul 10, 2018 at 08:59:05PM -0700, Eric Biggers wrote:
> > From: Eric Biggers 
> > 
> > It was forgotten to increase DH_KPP_SECRET_MIN_SIZE to include 'q_size',
> > causing an out-of-bounds write of 4 bytes in crypto_dh_encode_key(), and
> > an out-of-bounds read of 4 bytes in crypto_dh_decode_key().  Fix it.
> > Also add a BUG_ON() if crypto_dh_encode_key() doesn't exactly fill the
> > buffer, as that would have found this bug without resorting to KASAN.
> > 
> > Reported-by: syzbot+6d38d558c25b53b8f...@syzkaller.appspotmail.com
> > Fixes: e3fe0ae12962 ("crypto: dh - add public key verification test")
> > Signed-off-by: Eric Biggers 
> 
> Is it possible to return an error and use WARN_ON instead of BUG_ON?
> Or do the callers not bother to check for errors?
> 

The callers do check for errors, but at the point of the proposed BUG_ON() a
buffer overflow may have already occurred, so I think a BUG_ON() would be more
appropriate than a WARN_ON().  Of course, it would be better to prevent any
buffer overflow from occurring in the first place, but that's already the
purpose of the 'len != crypto_dh_key_len(params)' check; the issue was that
'crypto_dh_key_len()' calculated the wrong length.
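As a generic illustration of the invariant the proposed BUG_ON() enforces (a
toy sketch, not the actual crypto_dh_encode_key() code): the length helper and
the encoder must account for exactly the same fields, and checking that the
encoder ends exactly at buf + len catches a miscounted field immediately:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical three-field key, standing in for (key, p, q, g). */
struct toy_key {
        size_t a_size, b_size, c_size;
};

/* Length helper: must count every field the encoder writes. */
static size_t toy_key_len(const struct toy_key *k)
{
        return k->a_size + k->b_size + k->c_size;
}

static void toy_encode(const struct toy_key *k, unsigned char *buf, size_t len)
{
        unsigned char *p = buf;

        memset(p, 0xaa, k->a_size); p += k->a_size;
        memset(p, 0xbb, k->b_size); p += k->b_size;
        memset(p, 0xcc, k->c_size); p += k->c_size;

        /* The check being discussed: the encoder must land exactly at
         * buf + len, otherwise the length helper miscounted a field
         * (and a buffer overflow may already have happened).
         */
        assert(p == buf + len);
}

int main(void)
{
        struct toy_key k = { 8, 4, 4 };
        unsigned char buf[64];

        toy_encode(&k, buf, toy_key_len(&k));
        printf("encoded %zu bytes\n", toy_key_len(&k));
        return 0;
}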

- Eric


Re: [PATCH v2] crypto: DH - add public key verification test

2018-07-10 Thread Eric Biggers
Hi Stephan,

On Wed, Jun 27, 2018 at 08:15:31AM +0200, Stephan Müller wrote:
> Hi,
> 
> Changes v2:
> * addition of a check that mpi_alloc succeeds.
> 
> ---8<---
> 
> According to SP800-56A section 5.6.2.1, the public key to be processed
> for the DH operation shall be checked for appropriateness. The check
> shall cover the full verification test in case the domain parameter Q
> is provided as defined in SP800-56A section 5.6.2.3.1. If Q is not
> provided, the partial check according to SP800-56A section 5.6.2.3.2 is
> performed.
> 
> The full verification test requires the presence of the domain parameter
> Q. Thus, the patch adds the support to handle Q. It is permissible to
> not provide the Q value as part of the domain parameters. This implies
> that the interface is still backwards-compatible where so far only P and
> G are to be provided. However, if Q is provided, it is imported.
> 
> Without the test, the NIST ACVP testing fails. After adding this check,
> the NIST ACVP testing passes. Testing without providing the Q domain
> parameter has been performed to verify the interface has not changed.

You forgot to update the self-tests in the kernel, so they're failing now, as
you *did* change the interface (the "key" is encoded differently now).

- Eric


[PATCH 6/6] crypto: remove redundant type flags from tfm allocation

2018-06-30 Thread Eric Biggers
From: Eric Biggers 

Some crypto API users allocating a tfm with crypto_alloc_$FOO() are also
specifying the type flags for $FOO, e.g. crypto_alloc_shash() with
CRYPTO_ALG_TYPE_SHASH.  But, that's redundant since the crypto API will
override any specified type flag/mask with the correct ones.

So, remove the unneeded flags.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 Documentation/crypto/api-samples.rst | 2 +-
 drivers/crypto/atmel-sha.c   | 4 +---
 drivers/crypto/inside-secure/safexcel_hash.c | 3 +--
 drivers/crypto/marvell/hash.c| 3 +--
 drivers/crypto/qce/sha.c | 3 +--
 security/keys/dh.c   | 2 +-
 6 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/Documentation/crypto/api-samples.rst 
b/Documentation/crypto/api-samples.rst
index 006827e30d066..0f6ca8b7261e9 100644
--- a/Documentation/crypto/api-samples.rst
+++ b/Documentation/crypto/api-samples.rst
@@ -162,7 +162,7 @@ Code Example For Use of Operational State Memory With SHASH
 char *hash_alg_name = "sha1-padlock-nano";
 int ret;
 
-alg = crypto_alloc_shash(hash_alg_name, CRYPTO_ALG_TYPE_SHASH, 0);
+alg = crypto_alloc_shash(hash_alg_name, 0, 0);
 if (IS_ERR(alg)) {
 pr_info("can't alloc alg %s\n", hash_alg_name);
 return PTR_ERR(alg);
diff --git a/drivers/crypto/atmel-sha.c b/drivers/crypto/atmel-sha.c
index 4d43081120db1..8a19df2fba6a3 100644
--- a/drivers/crypto/atmel-sha.c
+++ b/drivers/crypto/atmel-sha.c
@@ -2316,9 +2316,7 @@ struct atmel_sha_authenc_ctx 
*atmel_sha_authenc_spawn(unsigned long mode)
goto error;
}
 
-   tfm = crypto_alloc_ahash(name,
-CRYPTO_ALG_TYPE_AHASH,
-CRYPTO_ALG_TYPE_AHASH_MASK);
+   tfm = crypto_alloc_ahash(name, 0, 0);
if (IS_ERR(tfm)) {
err = PTR_ERR(tfm);
goto error;
diff --git a/drivers/crypto/inside-secure/safexcel_hash.c 
b/drivers/crypto/inside-secure/safexcel_hash.c
index 188ba0734337a..2ebf8ff710813 100644
--- a/drivers/crypto/inside-secure/safexcel_hash.c
+++ b/drivers/crypto/inside-secure/safexcel_hash.c
@@ -949,8 +949,7 @@ int safexcel_hmac_setkey(const char *alg, const u8 *key, 
unsigned int keylen,
u8 *ipad, *opad;
int ret;
 
-   tfm = crypto_alloc_ahash(alg, CRYPTO_ALG_TYPE_AHASH,
-CRYPTO_ALG_TYPE_AHASH_MASK);
+   tfm = crypto_alloc_ahash(alg, 0, 0);
if (IS_ERR(tfm))
return PTR_ERR(tfm);
 
diff --git a/drivers/crypto/marvell/hash.c b/drivers/crypto/marvell/hash.c
index e34d80b6b7e58..99ff54cc8a15e 100644
--- a/drivers/crypto/marvell/hash.c
+++ b/drivers/crypto/marvell/hash.c
@@ -1183,8 +1183,7 @@ static int mv_cesa_ahmac_setkey(const char *hash_alg_name,
u8 *opad;
int ret;
 
-   tfm = crypto_alloc_ahash(hash_alg_name, CRYPTO_ALG_TYPE_AHASH,
-CRYPTO_ALG_TYPE_AHASH_MASK);
+   tfm = crypto_alloc_ahash(hash_alg_name, 0, 0);
if (IS_ERR(tfm))
return PTR_ERR(tfm);
 
diff --git a/drivers/crypto/qce/sha.c b/drivers/crypto/qce/sha.c
index 53227d70d3970..d8a5db11b7ea1 100644
--- a/drivers/crypto/qce/sha.c
+++ b/drivers/crypto/qce/sha.c
@@ -378,8 +378,7 @@ static int qce_ahash_hmac_setkey(struct crypto_ahash *tfm, 
const u8 *key,
else
return -EINVAL;
 
-   ahash_tfm = crypto_alloc_ahash(alg_name, CRYPTO_ALG_TYPE_AHASH,
-  CRYPTO_ALG_TYPE_AHASH_MASK);
+   ahash_tfm = crypto_alloc_ahash(alg_name, 0, 0);
if (IS_ERR(ahash_tfm))
return PTR_ERR(ahash_tfm);
 
diff --git a/security/keys/dh.c b/security/keys/dh.c
index b203f7758f976..711e89d8c4153 100644
--- a/security/keys/dh.c
+++ b/security/keys/dh.c
@@ -317,7 +317,7 @@ long __keyctl_dh_compute(struct keyctl_dh_params __user 
*params,
if (ret)
goto out3;
 
-   tfm = crypto_alloc_kpp("dh", CRYPTO_ALG_TYPE_KPP, 0);
+   tfm = crypto_alloc_kpp("dh", 0, 0);
if (IS_ERR(tfm)) {
ret = PTR_ERR(tfm);
goto out3;
-- 
2.18.0



[PATCH 1/6] crypto: shash - remove useless setting of type flags

2018-06-30 Thread Eric Biggers
From: Eric Biggers 

Many shash algorithms set .cra_flags = CRYPTO_ALG_TYPE_SHASH.  But this
is redundant with the C structure type ('struct shash_alg'), and
crypto_register_shash() already sets the type flag automatically,
clearing any type flag that was already there.  Apparently the useless
assignment has just been copy+pasted around.

So, remove the useless assignment from all the shash algorithms.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/ghash-ce-glue.c| 2 +-
 arch/arm/crypto/sha1-ce-glue.c | 1 -
 arch/arm/crypto/sha1_glue.c| 1 -
 arch/arm/crypto/sha1_neon_glue.c   | 1 -
 arch/arm/crypto/sha2-ce-glue.c | 2 --
 arch/arm/crypto/sha256_glue.c  | 2 --
 arch/arm/crypto/sha256_neon_glue.c | 2 --
 arch/arm/crypto/sha512-glue.c  | 2 --
 arch/arm/crypto/sha512-neon-glue.c | 2 --
 arch/arm64/crypto/aes-glue.c   | 3 ---
 arch/arm64/crypto/ghash-ce-glue.c  | 1 -
 arch/arm64/crypto/sha1-ce-glue.c   | 1 -
 arch/arm64/crypto/sha2-ce-glue.c   | 2 --
 arch/arm64/crypto/sha256-glue.c| 4 
 arch/arm64/crypto/sha3-ce-glue.c   | 4 
 arch/arm64/crypto/sha512-ce-glue.c | 2 --
 arch/arm64/crypto/sha512-glue.c| 2 --
 arch/arm64/crypto/sm3-ce-glue.c| 1 -
 arch/mips/cavium-octeon/crypto/octeon-md5.c| 1 -
 arch/mips/cavium-octeon/crypto/octeon-sha1.c   | 1 -
 arch/mips/cavium-octeon/crypto/octeon-sha256.c | 2 --
 arch/mips/cavium-octeon/crypto/octeon-sha512.c | 2 --
 arch/powerpc/crypto/md5-glue.c | 1 -
 arch/powerpc/crypto/sha1-spe-glue.c| 1 -
 arch/powerpc/crypto/sha1.c | 1 -
 arch/powerpc/crypto/sha256-spe-glue.c  | 2 --
 arch/s390/crypto/ghash_s390.c  | 1 -
 arch/s390/crypto/sha1_s390.c   | 1 -
 arch/s390/crypto/sha256_s390.c | 2 --
 arch/s390/crypto/sha512_s390.c | 2 --
 arch/sparc/crypto/md5_glue.c   | 1 -
 arch/sparc/crypto/sha1_glue.c  | 1 -
 arch/sparc/crypto/sha256_glue.c| 2 --
 arch/sparc/crypto/sha512_glue.c| 2 --
 arch/x86/crypto/ghash-clmulni-intel_glue.c | 3 +--
 arch/x86/crypto/poly1305_glue.c| 1 -
 arch/x86/crypto/sha1_ssse3_glue.c  | 4 
 arch/x86/crypto/sha256_ssse3_glue.c| 8 
 arch/x86/crypto/sha512_ssse3_glue.c| 6 --
 crypto/crypto_null.c   | 1 -
 crypto/ghash-generic.c | 1 -
 crypto/md4.c   | 1 -
 crypto/md5.c   | 1 -
 crypto/poly1305_generic.c  | 1 -
 crypto/rmd128.c| 1 -
 crypto/rmd160.c| 1 -
 crypto/rmd256.c| 1 -
 crypto/rmd320.c| 1 -
 crypto/sha1_generic.c  | 1 -
 crypto/sha256_generic.c| 2 --
 crypto/sha3_generic.c  | 4 
 crypto/sha512_generic.c| 2 --
 crypto/sm3_generic.c   | 1 -
 crypto/tgr192.c| 3 ---
 crypto/wp512.c | 3 ---
 drivers/crypto/nx/nx-aes-xcbc.c| 1 -
 drivers/crypto/nx/nx-sha256.c  | 1 -
 drivers/crypto/nx/nx-sha512.c  | 1 -
 drivers/crypto/padlock-sha.c   | 8 ++--
 drivers/crypto/vmx/ghash.c | 2 +-
 drivers/staging/skein/skein_generic.c  | 3 ---
 61 files changed, 5 insertions(+), 116 deletions(-)

diff --git a/arch/arm/crypto/ghash-ce-glue.c b/arch/arm/crypto/ghash-ce-glue.c
index d9bb52cae2ac9..f93c0761929d5 100644
--- a/arch/arm/crypto/ghash-ce-glue.c
+++ b/arch/arm/crypto/ghash-ce-glue.c
@@ -152,7 +152,7 @@ static struct shash_alg ghash_alg = {
.cra_name   = "__ghash",
.cra_driver_name = "__driver-ghash-ce",
.cra_priority   = 0,
-   .cra_flags  = CRYPTO_ALG_TYPE_SHASH | CRYPTO_ALG_INTERNAL,
+   .cra_flags  = CRYPTO_ALG_INTERNAL,
.cra_blocksize  = GHASH_BLOCK_SIZE,
.cra_ctxsize= sizeof(struct ghash_key),
.cra_module = THIS_MODULE,
diff --git a/arch/arm/crypto/sha1-ce-glue.c b/arch/arm/crypto/sha1-ce-glue.c
index 555f72b5e659b..b732522e20f80 100644
--- a/arch/arm/crypto/sha1-ce-glue.c
+++ b/arch/arm/crypto/sha1-ce-glue.c
@@ -75,7 +75,6 @@ static struct shash_alg alg = {
.cra_name   = "sha1",
.cra_driver_name= "sha1-ce",
 

[PATCH 2/6] crypto: ahash - remove useless setting of type flags

2018-06-30 Thread Eric Biggers
From: Eric Biggers 

Many ahash algorithms set .cra_flags = CRYPTO_ALG_TYPE_AHASH.  But this
is redundant with the C structure type ('struct ahash_alg'), and
crypto_register_ahash() already sets the type flag automatically,
clearing any type flag that was already there.  Apparently the useless
assignment has just been copy+pasted around.

So, remove the useless assignment from all the ahash algorithms.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/ghash-ce-glue.c|  2 +-
 arch/x86/crypto/ghash-clmulni-intel_glue.c |  2 +-
 arch/x86/crypto/sha1-mb/sha1_mb.c  |  7 ++---
 arch/x86/crypto/sha256-mb/sha256_mb.c  |  8 ++---
 arch/x86/crypto/sha512-mb/sha512_mb.c  |  8 ++---
 drivers/crypto/axis/artpec6_crypto.c   | 14 -
 drivers/crypto/bcm/cipher.c|  5 ++-
 drivers/crypto/caam/caamhash.c |  2 +-
 drivers/crypto/ccp/ccp-crypto-aes-cmac.c   |  2 +-
 drivers/crypto/ccp/ccp-crypto-sha.c|  2 +-
 drivers/crypto/ccree/cc_hash.c |  3 +-
 drivers/crypto/chelsio/chcr_algo.c |  3 +-
 drivers/crypto/n2_core.c   |  3 +-
 drivers/crypto/omap-sham.c | 36 --
 drivers/crypto/s5p-sss.c   |  9 ++
 drivers/crypto/sahara.c|  6 ++--
 drivers/crypto/stm32/stm32-hash.c  | 24 +--
 drivers/crypto/sunxi-ss/sun4i-ss-core.c|  2 --
 drivers/crypto/talitos.c   | 36 --
 drivers/crypto/ux500/hash/hash_core.c  | 12 +++-
 20 files changed, 67 insertions(+), 119 deletions(-)

diff --git a/arch/arm/crypto/ghash-ce-glue.c b/arch/arm/crypto/ghash-ce-glue.c
index f93c0761929d5..124fee03246e2 100644
--- a/arch/arm/crypto/ghash-ce-glue.c
+++ b/arch/arm/crypto/ghash-ce-glue.c
@@ -308,7 +308,7 @@ static struct ahash_alg ghash_async_alg = {
.cra_name   = "ghash",
.cra_driver_name = "ghash-ce",
.cra_priority   = 300,
-   .cra_flags  = CRYPTO_ALG_TYPE_AHASH | CRYPTO_ALG_ASYNC,
+   .cra_flags  = CRYPTO_ALG_ASYNC,
.cra_blocksize  = GHASH_BLOCK_SIZE,
.cra_type   = &crypto_ahash_type,
.cra_ctxsize= sizeof(struct ghash_async_ctx),
diff --git a/arch/x86/crypto/ghash-clmulni-intel_glue.c 
b/arch/x86/crypto/ghash-clmulni-intel_glue.c
index b1430e92e6382..a3de43b5e20a0 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_glue.c
+++ b/arch/x86/crypto/ghash-clmulni-intel_glue.c
@@ -314,7 +314,7 @@ static struct ahash_alg ghash_async_alg = {
.cra_driver_name= "ghash-clmulni",
.cra_priority   = 400,
.cra_ctxsize= sizeof(struct 
ghash_async_ctx),
-   .cra_flags  = CRYPTO_ALG_TYPE_AHASH | 
CRYPTO_ALG_ASYNC,
+   .cra_flags  = CRYPTO_ALG_ASYNC,
.cra_blocksize  = GHASH_BLOCK_SIZE,
.cra_type   = &crypto_ahash_type,
.cra_module = THIS_MODULE,
diff --git a/arch/x86/crypto/sha1-mb/sha1_mb.c 
b/arch/x86/crypto/sha1-mb/sha1_mb.c
index 4b2430274935b..f7929ba6cfb43 100644
--- a/arch/x86/crypto/sha1-mb/sha1_mb.c
+++ b/arch/x86/crypto/sha1-mb/sha1_mb.c
@@ -746,9 +746,8 @@ static struct ahash_alg sha1_mb_areq_alg = {
 * algo may not have completed before hashing thread
 * sleep
 */
-   .cra_flags  = CRYPTO_ALG_TYPE_AHASH |
-   CRYPTO_ALG_ASYNC |
-   CRYPTO_ALG_INTERNAL,
+   .cra_flags  = CRYPTO_ALG_ASYNC |
+ CRYPTO_ALG_INTERNAL,
.cra_blocksize  = SHA1_BLOCK_SIZE,
.cra_module = THIS_MODULE,
.cra_list   = LIST_HEAD_INIT
@@ -879,7 +878,7 @@ static struct ahash_alg sha1_mb_async_alg = {
 * priority at runtime using NETLINK_CRYPTO.
 */
.cra_priority   = 50,
-   .cra_flags  = CRYPTO_ALG_TYPE_AHASH | 
CRYPTO_ALG_ASYNC,
+   .cra_flags  = CRYPTO_ALG_ASYNC,
.cra_blocksize  = SHA1_BLOCK_SIZE,
.cra_type   = &crypto_ahash_type,
.cra_module = THIS_MODULE,
diff --git a/arch/x86/crypto/sha256-mb/sha256_mb.c 
b/arch/x86/crypto/sha256-mb/sha256_mb.c
index 4c07f6c12c37b..59a47048920ab 100644
--- a/arch/x86/crypto/sha256-mb/sha256_mb.c
+++ b/arch/x86/crypto/sha256-mb/sha256_mb.c
@@ -745,9 +745,8 @@ static struc

[PATCH 3/6] crypto: ahash - remove useless setting of cra_type

2018-06-30 Thread Eric Biggers
From: Eric Biggers 

Some ahash algorithms set .cra_type = &crypto_ahash_type.  But this is
redundant with the C structure type ('struct ahash_alg'), and
crypto_register_ahash() already sets the .cra_type automatically.
Apparently the useless assignment has just been copy+pasted around.

So, remove the useless assignment from all the ahash algorithms.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/ghash-ce-glue.c| 1 -
 arch/x86/crypto/ghash-clmulni-intel_glue.c | 1 -
 arch/x86/crypto/sha1-mb/sha1_mb.c  | 1 -
 arch/x86/crypto/sha256-mb/sha256_mb.c  | 1 -
 arch/x86/crypto/sha512-mb/sha512_mb.c  | 1 -
 drivers/crypto/bcm/cipher.c| 1 -
 drivers/crypto/caam/caamhash.c | 1 -
 drivers/crypto/ccp/ccp-crypto-aes-cmac.c   | 1 -
 drivers/crypto/ccp/ccp-crypto-sha.c| 1 -
 drivers/crypto/ccree/cc_hash.c | 1 -
 drivers/crypto/chelsio/chcr_algo.c | 1 -
 drivers/crypto/sunxi-ss/sun4i-ss-core.c| 2 --
 drivers/crypto/talitos.c   | 1 -
 drivers/crypto/ux500/hash/hash_core.c  | 3 ---
 14 files changed, 17 deletions(-)

diff --git a/arch/arm/crypto/ghash-ce-glue.c b/arch/arm/crypto/ghash-ce-glue.c
index 124fee03246e2..8930fc4e7c228 100644
--- a/arch/arm/crypto/ghash-ce-glue.c
+++ b/arch/arm/crypto/ghash-ce-glue.c
@@ -310,7 +310,6 @@ static struct ahash_alg ghash_async_alg = {
.cra_priority   = 300,
.cra_flags  = CRYPTO_ALG_ASYNC,
.cra_blocksize  = GHASH_BLOCK_SIZE,
-   .cra_type   = &crypto_ahash_type,
.cra_ctxsize= sizeof(struct ghash_async_ctx),
.cra_module = THIS_MODULE,
.cra_init   = ghash_async_init_tfm,
diff --git a/arch/x86/crypto/ghash-clmulni-intel_glue.c 
b/arch/x86/crypto/ghash-clmulni-intel_glue.c
index a3de43b5e20a0..3582ae885ee11 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_glue.c
+++ b/arch/x86/crypto/ghash-clmulni-intel_glue.c
@@ -316,7 +316,6 @@ static struct ahash_alg ghash_async_alg = {
.cra_ctxsize= sizeof(struct 
ghash_async_ctx),
.cra_flags  = CRYPTO_ALG_ASYNC,
.cra_blocksize  = GHASH_BLOCK_SIZE,
-   .cra_type   = &crypto_ahash_type,
.cra_module = THIS_MODULE,
.cra_init   = ghash_async_init_tfm,
.cra_exit   = ghash_async_exit_tfm,
diff --git a/arch/x86/crypto/sha1-mb/sha1_mb.c 
b/arch/x86/crypto/sha1-mb/sha1_mb.c
index f7929ba6cfb43..b93805664c1dd 100644
--- a/arch/x86/crypto/sha1-mb/sha1_mb.c
+++ b/arch/x86/crypto/sha1-mb/sha1_mb.c
@@ -880,7 +880,6 @@ static struct ahash_alg sha1_mb_async_alg = {
.cra_priority   = 50,
.cra_flags  = CRYPTO_ALG_ASYNC,
.cra_blocksize  = SHA1_BLOCK_SIZE,
-   .cra_type   = &crypto_ahash_type,
.cra_module = THIS_MODULE,
.cra_list   = 
LIST_HEAD_INIT(sha1_mb_async_alg.halg.base.cra_list),
.cra_init   = sha1_mb_async_init_tfm,
diff --git a/arch/x86/crypto/sha256-mb/sha256_mb.c 
b/arch/x86/crypto/sha256-mb/sha256_mb.c
index 59a47048920ab..97c5fc43e115d 100644
--- a/arch/x86/crypto/sha256-mb/sha256_mb.c
+++ b/arch/x86/crypto/sha256-mb/sha256_mb.c
@@ -879,7 +879,6 @@ static struct ahash_alg sha256_mb_async_alg = {
.cra_priority   = 50,
.cra_flags  = CRYPTO_ALG_ASYNC,
.cra_blocksize  = SHA256_BLOCK_SIZE,
-   .cra_type   = &crypto_ahash_type,
.cra_module = THIS_MODULE,
.cra_list   = LIST_HEAD_INIT
(sha256_mb_async_alg.halg.base.cra_list),
diff --git a/arch/x86/crypto/sha512-mb/sha512_mb.c 
b/arch/x86/crypto/sha512-mb/sha512_mb.c
index d3a758ac3ade0..26b85678012d0 100644
--- a/arch/x86/crypto/sha512-mb/sha512_mb.c
+++ b/arch/x86/crypto/sha512-mb/sha512_mb.c
@@ -913,7 +913,6 @@ static struct ahash_alg sha512_mb_async_alg = {
.cra_priority   = 50,
.cra_flags  = CRYPTO_ALG_ASYNC,
.cra_blocksize  = SHA512_BLOCK_SIZE,
-   .cra_type   = &crypto_ahash_type,
.cra_module = THIS_MODULE,
.cra_list   = LIST_HEAD_INIT
(sha512_mb_async_alg.halg.base.cra_list),
diff --git a/drivers/crypto/bcm/cipher.c b/drivers/crypto/bcm/cipher.c
index 2f85a989c4761..4e2babd6b89d7 100644
--- a/drivers/crypto/bcm/cipher.c
+++ b

[PATCH 0/6] crypto: remove redundant type specifications

2018-06-30 Thread Eric Biggers
Originally, algorithms had to declare their type in .cra_flags as a
CRYPTO_ALG_TYPE_* value.  Some types of algorithms such as "cipher"
still have to do this.  But now most algorithm types use different
top-level C data structures, and different registration and allocation
functions.  And for these, the core crypto API automatically sets the
.cra_flags type as well as .cra_type, mainly for its own use (users
shouldn't care about these).
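For example, registration of an shash forces the type bits regardless of what
the driver put in .cra_flags; a small userspace model of that masking (the flag
values below are for illustration only):

#include <stdio.h>

#define CRYPTO_ALG_TYPE_MASK    0x0000000f      /* values for illustration only */
#define CRYPTO_ALG_TYPE_SHASH   0x00000009
#define CRYPTO_ALG_ASYNC        0x00000080

/* Model of what registration does to cra_flags: the type bits are forced to
 * the value implied by the algorithm's C structure type, so any type flag set
 * by the driver is simply overwritten.
 */
static unsigned int prepare_flags(unsigned int cra_flags)
{
        cra_flags &= ~CRYPTO_ALG_TYPE_MASK;
        cra_flags |= CRYPTO_ALG_TYPE_SHASH;
        return cra_flags;
}

int main(void)
{
        unsigned int flags = CRYPTO_ALG_TYPE_SHASH | CRYPTO_ALG_ASYNC;

        printf("0x%08x -> 0x%08x\n", flags, prepare_flags(flags));
        return 0;
}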

Yet, many algorithms are still explicitly setting their .cra_flags type
and sometimes even .cra_type, which is confusing as this actually does
nothing.  Apparently, people are just copy-and-pasting this from
existing code without understanding it.

Therefore, this patchset removes the useless initializations, as well as
useless type flags passed to the strongly-typed tfm allocators.

This doesn't change any actual behavior, AFAIK.

For now I didn't bother with 'blkcipher' and 'ablkcipher' algorithms,
since those should eventually be migrated to 'skcipher' anyway.

Eric Biggers (6):
  crypto: shash - remove useless setting of type flags
  crypto: ahash - remove useless setting of type flags
  crypto: ahash - remove useless setting of cra_type
  crypto: aead - remove useless setting of type flags
  crypto: skcipher - remove useless setting of type flags
  crypto: remove redundant type flags from tfm allocation

 Documentation/crypto/api-samples.rst  |  2 +-
 arch/arm/crypto/ghash-ce-glue.c   |  5 +--
 arch/arm/crypto/sha1-ce-glue.c|  1 -
 arch/arm/crypto/sha1_glue.c   |  1 -
 arch/arm/crypto/sha1_neon_glue.c  |  1 -
 arch/arm/crypto/sha2-ce-glue.c|  2 -
 arch/arm/crypto/sha256_glue.c |  2 -
 arch/arm/crypto/sha256_neon_glue.c|  2 -
 arch/arm/crypto/sha512-glue.c |  2 -
 arch/arm/crypto/sha512-neon-glue.c|  2 -
 arch/arm64/crypto/aes-glue.c  |  3 --
 arch/arm64/crypto/ghash-ce-glue.c |  1 -
 arch/arm64/crypto/sha1-ce-glue.c  |  1 -
 arch/arm64/crypto/sha2-ce-glue.c  |  2 -
 arch/arm64/crypto/sha256-glue.c   |  4 --
 arch/arm64/crypto/sha3-ce-glue.c  |  4 --
 arch/arm64/crypto/sha512-ce-glue.c|  2 -
 arch/arm64/crypto/sha512-glue.c   |  2 -
 arch/arm64/crypto/sm3-ce-glue.c   |  1 -
 arch/mips/cavium-octeon/crypto/octeon-md5.c   |  1 -
 arch/mips/cavium-octeon/crypto/octeon-sha1.c  |  1 -
 .../mips/cavium-octeon/crypto/octeon-sha256.c |  2 -
 .../mips/cavium-octeon/crypto/octeon-sha512.c |  2 -
 arch/powerpc/crypto/md5-glue.c|  1 -
 arch/powerpc/crypto/sha1-spe-glue.c   |  1 -
 arch/powerpc/crypto/sha1.c|  1 -
 arch/powerpc/crypto/sha256-spe-glue.c |  2 -
 arch/s390/crypto/aes_s390.c   |  1 -
 arch/s390/crypto/ghash_s390.c |  1 -
 arch/s390/crypto/sha1_s390.c  |  1 -
 arch/s390/crypto/sha256_s390.c|  2 -
 arch/s390/crypto/sha512_s390.c|  2 -
 arch/sparc/crypto/md5_glue.c  |  1 -
 arch/sparc/crypto/sha1_glue.c |  1 -
 arch/sparc/crypto/sha256_glue.c   |  2 -
 arch/sparc/crypto/sha512_glue.c   |  2 -
 arch/x86/crypto/ghash-clmulni-intel_glue.c|  6 +--
 arch/x86/crypto/poly1305_glue.c   |  1 -
 arch/x86/crypto/sha1-mb/sha1_mb.c |  8 ++--
 arch/x86/crypto/sha1_ssse3_glue.c |  4 --
 arch/x86/crypto/sha256-mb/sha256_mb.c |  9 ++---
 arch/x86/crypto/sha256_ssse3_glue.c   |  8 
 arch/x86/crypto/sha512-mb/sha512_mb.c |  9 ++---
 arch/x86/crypto/sha512_ssse3_glue.c   |  6 ---
 crypto/aegis128.c |  1 -
 crypto/aegis128l.c|  1 -
 crypto/aegis256.c |  1 -
 crypto/crypto_null.c  |  1 -
 crypto/ghash-generic.c|  1 -
 crypto/md4.c  |  1 -
 crypto/md5.c  |  1 -
 crypto/morus1280.c|  1 -
 crypto/morus640.c |  1 -
 crypto/poly1305_generic.c |  1 -
 crypto/rmd128.c   |  1 -
 crypto/rmd160.c   |  1 -
 crypto/rmd256.c   |  1 -
 crypto/rmd320.c   |  1 -
 crypto/sha1_generic.c |  1 -
 crypto/sha256_generic.c   |  2 -
 crypto/sha3_generic.c |  4 --
 crypto/sha512_generic.c   |  2 -
 crypto/sm3_generic.c  |  1 -
 crypto/tgr192.c   |  3 --
 crypto/wp512.c|  3 --
 drivers/crypto/amcc/crypto4xx_core.c  | 18 +++--
 drivers/crypto/a

[PATCH 5/6] crypto: skcipher - remove useless setting of type flags

2018-06-30 Thread Eric Biggers
From: Eric Biggers 

Some skcipher algorithms set .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER.  But
this is redundant with the C structure type ('struct skcipher_alg'), and
crypto_register_skcipher() already sets the type flag automatically,
clearing any type flag that was already there.  Apparently the useless
assignment has just been copy+pasted around.

So, remove the useless assignment from all the skcipher algorithms.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 drivers/crypto/amcc/crypto4xx_core.c   | 18 ++
 drivers/crypto/axis/artpec6_crypto.c   | 12 
 drivers/crypto/ccree/cc_cipher.c   |  3 +--
 drivers/crypto/inside-secure/safexcel_cipher.c |  4 ++--
 drivers/crypto/sunxi-ss/sun4i-ss-core.c| 16 +---
 5 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/drivers/crypto/amcc/crypto4xx_core.c 
b/drivers/crypto/amcc/crypto4xx_core.c
index 05981ccd9901a..6eaec9ba0f68b 100644
--- a/drivers/crypto/amcc/crypto4xx_core.c
+++ b/drivers/crypto/amcc/crypto4xx_core.c
@@ -1132,8 +1132,7 @@ static struct crypto4xx_alg_common crypto4xx_alg[] = {
.cra_name = "cbc(aes)",
.cra_driver_name = "cbc-aes-ppc4xx",
.cra_priority = CRYPTO4XX_CRYPTO_PRIORITY,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER |
-   CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto4xx_ctx),
@@ -1153,8 +1152,7 @@ static struct crypto4xx_alg_common crypto4xx_alg[] = {
.cra_name = "cfb(aes)",
.cra_driver_name = "cfb-aes-ppc4xx",
.cra_priority = CRYPTO4XX_CRYPTO_PRIORITY,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER |
-   CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto4xx_ctx),
@@ -1174,8 +1172,7 @@ static struct crypto4xx_alg_common crypto4xx_alg[] = {
.cra_name = "ctr(aes)",
.cra_driver_name = "ctr-aes-ppc4xx",
.cra_priority = CRYPTO4XX_CRYPTO_PRIORITY,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER |
-   CRYPTO_ALG_NEED_FALLBACK |
+   .cra_flags = CRYPTO_ALG_NEED_FALLBACK |
CRYPTO_ALG_ASYNC |
CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = AES_BLOCK_SIZE,
@@ -1196,8 +1193,7 @@ static struct crypto4xx_alg_common crypto4xx_alg[] = {
.cra_name = "rfc3686(ctr(aes))",
.cra_driver_name = "rfc3686-ctr-aes-ppc4xx",
.cra_priority = CRYPTO4XX_CRYPTO_PRIORITY,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER |
-   CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto4xx_ctx),
@@ -1217,8 +1213,7 @@ static struct crypto4xx_alg_common crypto4xx_alg[] = {
.cra_name = "ecb(aes)",
.cra_driver_name = "ecb-aes-ppc4xx",
.cra_priority = CRYPTO4XX_CRYPTO_PRIORITY,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER |
-   CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto4xx_ctx),
@@ -1237,8 +1232,7 @@ static struct crypto4xx_alg_common crypto4xx_alg[] = {
.cra_name = "ofb(aes)",
.cra_driver_name = "ofb-aes-ppc4xx",
.cra_priority = CRYPTO4XX_CRYPTO_PRIORITY,
-   .cra_flags = CRYPTO_ALG_TYPE_SKCIPHER |
-   CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto4xx_ctx),
diff --git a/drivers/crypto/axis/artpec6_crypto.c 
b/drivers/crypto/axis/artpec6_crypto.c
index 5

[PATCH 4/6] crypto: aead - remove useless setting of type flags

2018-06-30 Thread Eric Biggers
From: Eric Biggers 

Some aead algorithms set .cra_flags = CRYPTO_ALG_TYPE_AEAD.  But this is
redundant with the C structure type ('struct aead_alg'), and
crypto_register_aead() already sets the type flag automatically,
clearing any type flag that was already there.  Apparently the useless
assignment has just been copy+pasted around.

So, remove the useless assignment from all the aead algorithms.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers 
---
 arch/s390/crypto/aes_s390.c|  1 -
 crypto/aegis128.c  |  1 -
 crypto/aegis128l.c |  1 -
 crypto/aegis256.c  |  1 -
 crypto/morus1280.c |  1 -
 crypto/morus640.c  |  1 -
 drivers/crypto/axis/artpec6_crypto.c   |  2 +-
 drivers/crypto/bcm/cipher.c|  2 +-
 drivers/crypto/chelsio/chcr_algo.c |  3 +--
 drivers/crypto/inside-secure/safexcel_cipher.c | 10 +-
 10 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/arch/s390/crypto/aes_s390.c b/arch/s390/crypto/aes_s390.c
index ad47abd086308..c54cb26eb7f50 100644
--- a/arch/s390/crypto/aes_s390.c
+++ b/arch/s390/crypto/aes_s390.c
@@ -1035,7 +1035,6 @@ static struct aead_alg gcm_aes_aead = {
.chunksize  = AES_BLOCK_SIZE,
 
.base   = {
-   .cra_flags  = CRYPTO_ALG_TYPE_AEAD,
.cra_blocksize  = 1,
.cra_ctxsize= sizeof(struct s390_aes_ctx),
.cra_priority   = 900,
diff --git a/crypto/aegis128.c b/crypto/aegis128.c
index 38271303ce16c..c22f4414856d9 100644
--- a/crypto/aegis128.c
+++ b/crypto/aegis128.c
@@ -429,7 +429,6 @@ static struct aead_alg crypto_aegis128_alg = {
.chunksize = AEGIS_BLOCK_SIZE,
 
.base = {
-   .cra_flags = CRYPTO_ALG_TYPE_AEAD,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct aegis_ctx),
.cra_alignmask = 0,
diff --git a/crypto/aegis128l.c b/crypto/aegis128l.c
index 64dc2654b863e..b6fb21ebdc3e8 100644
--- a/crypto/aegis128l.c
+++ b/crypto/aegis128l.c
@@ -493,7 +493,6 @@ static struct aead_alg crypto_aegis128l_alg = {
.chunksize = AEGIS128L_CHUNK_SIZE,
 
.base = {
-   .cra_flags = CRYPTO_ALG_TYPE_AEAD,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct aegis_ctx),
.cra_alignmask = 0,
diff --git a/crypto/aegis256.c b/crypto/aegis256.c
index a489d741d33ad..11f0f8ec9c7c2 100644
--- a/crypto/aegis256.c
+++ b/crypto/aegis256.c
@@ -444,7 +444,6 @@ static struct aead_alg crypto_aegis256_alg = {
.chunksize = AEGIS_BLOCK_SIZE,
 
.base = {
-   .cra_flags = CRYPTO_ALG_TYPE_AEAD,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct aegis_ctx),
.cra_alignmask = 0,
diff --git a/crypto/morus1280.c b/crypto/morus1280.c
index 6180b2557836a..d057cf5ac4a8b 100644
--- a/crypto/morus1280.c
+++ b/crypto/morus1280.c
@@ -514,7 +514,6 @@ static struct aead_alg crypto_morus1280_alg = {
.chunksize = MORUS1280_BLOCK_SIZE,
 
.base = {
-   .cra_flags = CRYPTO_ALG_TYPE_AEAD,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct morus1280_ctx),
.cra_alignmask = 0,
diff --git a/crypto/morus640.c b/crypto/morus640.c
index 5eede3749e646..1ca76e54281bf 100644
--- a/crypto/morus640.c
+++ b/crypto/morus640.c
@@ -511,7 +511,6 @@ static struct aead_alg crypto_morus640_alg = {
.chunksize = MORUS640_BLOCK_SIZE,
 
.base = {
-   .cra_flags = CRYPTO_ALG_TYPE_AEAD,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct morus640_ctx),
.cra_alignmask = 0,
diff --git a/drivers/crypto/axis/artpec6_crypto.c 
b/drivers/crypto/axis/artpec6_crypto.c
index 049af6de3cb69..59392178d8bc6 100644
--- a/drivers/crypto/axis/artpec6_crypto.c
+++ b/drivers/crypto/axis/artpec6_crypto.c
@@ -2964,7 +2964,7 @@ static struct aead_alg aead_algos[] = {
.cra_name = "gcm(aes)",
.cra_driver_name = "artpec-gcm-aes",
.cra_priority = 300,
-   .cra_flags = CRYPTO_ALG_TYPE_AEAD | CRYPTO_ALG_ASYNC |
+   .cra_flags = CRYPTO_ALG_ASYNC |
 CRYPTO_ALG_KERN_DRIVER_ONLY,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct artpec6_cryptotfm_context),
diff --git a/drivers/crypto/bcm/cipher.c b/drivers/crypto/bcm/cipher.c
index 4e2babd6b89d7..2d1f1db9f8074 100644
--- a/drivers/crypto/bcm/cipher.c
+++ b/drivers/crypto/bcm/cipher.c
@@ -4689,7 +4689,7 @@ static int spu_register_aead(struct iproc_alg_s 
*driver_alg)
aead->base.cra_

[PATCH 1/4] crypto: sha1_generic - add cra_priority

2018-06-29 Thread Eric Biggers
From: Eric Biggers 

sha1-generic had a cra_priority of 0, so it wasn't possible to have a
lower priority SHA-1 implementation, as is desired for sha1_mb which is
only useful under certain workloads and is otherwise extremely slow.
Change it to priority 100, which is the priority used for many of the
other generic algorithms.

Signed-off-by: Eric Biggers 
---
 crypto/sha1_generic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/crypto/sha1_generic.c b/crypto/sha1_generic.c
index 6877cbb9105f..a3d701632ca2 100644
--- a/crypto/sha1_generic.c
+++ b/crypto/sha1_generic.c
@@ -76,6 +76,7 @@ static struct shash_alg alg = {
.base   =   {
.cra_name   =   "sha1",
.cra_driver_name=   "sha1-generic",
+   .cra_priority   =   100,
.cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
.cra_blocksize  =   SHA1_BLOCK_SIZE,
.cra_module =   THIS_MODULE,
-- 
2.18.0.399.gad0ab374a1-goog



[PATCH 2/4] crypto: sha256_generic - add cra_priority

2018-06-29 Thread Eric Biggers
From: Eric Biggers 

sha256-generic and sha224-generic had a cra_priority of 0, so it wasn't
possible to have a lower priority SHA-256 or SHA-224 implementation, as
is desired for sha256_mb which is only useful under certain workloads
and is otherwise extremely slow.  Change them to priority 100, which is
the priority used for many of the other generic algorithms.

Signed-off-by: Eric Biggers 
---
 crypto/sha256_generic.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/crypto/sha256_generic.c b/crypto/sha256_generic.c
index 8f9c47e1a96e..dfcb7beb73a7 100644
--- a/crypto/sha256_generic.c
+++ b/crypto/sha256_generic.c
@@ -271,6 +271,7 @@ static struct shash_alg sha256_algs[2] = { {
.base   =   {
.cra_name   =   "sha256",
.cra_driver_name=   "sha256-generic",
+   .cra_priority   =   100,
.cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
.cra_blocksize  =   SHA256_BLOCK_SIZE,
.cra_module =   THIS_MODULE,
@@ -285,6 +286,7 @@ static struct shash_alg sha256_algs[2] = { {
.base   =   {
.cra_name   =   "sha224",
.cra_driver_name=   "sha224-generic",
+   .cra_priority   =   100,
.cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
.cra_blocksize  =   SHA224_BLOCK_SIZE,
.cra_module =   THIS_MODULE,
-- 
2.18.0.399.gad0ab374a1-goog



[PATCH 3/4] crypto: sha512_generic - add cra_priority

2018-06-29 Thread Eric Biggers
From: Eric Biggers 

sha512-generic and sha384-generic had a cra_priority of 0, so it wasn't
possible to have a lower priority SHA-512 or SHA-384 implementation, as
is desired for sha512_mb which is only useful under certain workloads
and is otherwise extremely slow.  Change them to priority 100, which is
the priority used for many of the other generic algorithms.

Signed-off-by: Eric Biggers 
---
 crypto/sha512_generic.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index eba965d18bfc..c92efac0f060 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -171,6 +171,7 @@ static struct shash_alg sha512_algs[2] = { {
.base   =   {
.cra_name   =   "sha512",
.cra_driver_name =  "sha512-generic",
+   .cra_priority   =   100,
.cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
.cra_blocksize  =   SHA512_BLOCK_SIZE,
.cra_module =   THIS_MODULE,
@@ -185,6 +186,7 @@ static struct shash_alg sha512_algs[2] = { {
.base   =   {
.cra_name   =   "sha384",
.cra_driver_name =  "sha384-generic",
+   .cra_priority   =   100,
.cra_flags  =   CRYPTO_ALG_TYPE_SHASH,
.cra_blocksize  =   SHA384_BLOCK_SIZE,
.cra_module =   THIS_MODULE,
-- 
2.18.0.399.gad0ab374a1-goog



[PATCH 4/4] crypto: x86/sha-mb - decrease priority of multibuffer algorithms

2018-06-29 Thread Eric Biggers
From: Eric Biggers 

With all the crypto modules enabled on x86, and with a CPU that supports
AVX-2 but not SHA-NI instructions (e.g. Haswell, Broadwell, Skylake),
the "multibuffer" implementations of SHA-1, SHA-256, and SHA-512 are the
highest priority.  However, these implementations only perform well when
many hash requests are being submitted concurrently, filling all 8 AVX-2
lanes.  Otherwise, they are incredibly slow, as they waste time waiting
for more requests to arrive before proceeding to execute each request.

For example, here are the speeds I see hashing 4096-byte buffers with a
single thread on a Haswell-based processor:

        generic     avx2        mb (multibuffer)
        -------     --------    ----------------
sha1    602 MB/s    997 MB/s    0.61 MB/s
sha256  228 MB/s    412 MB/s    0.61 MB/s
sha512  312 MB/s    559 MB/s    0.61 MB/s

So, the multibuffer implementation is 500 to 1000 times slower than the
other implementations.  Note that with smaller buffers or more update()s
per digest, the difference would be even greater.
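
One rough way to get a comparable single-threaded comparison from userspace
(not necessarily how the numbers above were produced) is an AF_ALG loop like
the sketch below; error handling is omitted, driver availability depends on
the kernel config, and the extra syscalls add overhead, so absolute figures
will differ:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

static void bench(const char *driver)
{
	struct sockaddr_alg addr = {
		.salg_family = AF_ALG,
		.salg_type = "hash",
	};
	unsigned char data[4096] = { 0 }, digest[64];
	struct timespec t0, t1;
	int tfmfd, reqfd, i;
	double secs;

	strncpy((char *)addr.salg_name, driver, sizeof(addr.salg_name) - 1);
	tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
	bind(tfmfd, (struct sockaddr *)&addr, sizeof(addr));
	reqfd = accept(tfmfd, NULL, NULL);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < 1000; i++) {
		write(reqfd, data, sizeof(data));	/* one digest per write */
		read(reqfd, digest, sizeof(digest));
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%-16s %8.2f MB/s\n", driver, i * sizeof(data) / secs / 1e6);

	close(reqfd);
	close(tfmfd);
}

int main(void)
{
	bench("sha256-generic");
	bench("sha256-avx2");
	bench("sha256_mb");
	return 0;
}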

I believe the vast majority of people are in the boat where the
multibuffer code is much slower, and only a small minority are doing the
highly parallel, hashing-intensive, latency-flexible workloads (maybe
IPsec on servers?) where the multibuffer code may be beneficial.  Yet,
people often aren't familiar with all the crypto config options and so
the multibuffer code may inadvertently be built into the kernel.

Also the multibuffer code apparently hasn't been very well tested,
seeing as it was sometimes computing the wrong SHA-256 digest.

So, let's make the multibuffer algorithms low priority.  Users who want
to use them can either request them explicitly by driver name, or use
NETLINK_CRYPTO (crypto_user) to increase their priority at runtime.
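
(A kernel user can opt in the same way, e.g. by allocating the transform with
crypto_alloc_ahash("sha1_mb", 0, 0), i.e. by cra_driver_name rather than by
the generic "sha1" cra_name; this is an illustration, not part of the patch.)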

Signed-off-by: Eric Biggers 
---
 arch/x86/crypto/sha1-mb/sha1_mb.c | 9 -
 arch/x86/crypto/sha256-mb/sha256_mb.c | 9 -
 arch/x86/crypto/sha512-mb/sha512_mb.c | 9 -
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/crypto/sha1-mb/sha1_mb.c 
b/arch/x86/crypto/sha1-mb/sha1_mb.c
index e17655ffde79..4b2430274935 100644
--- a/arch/x86/crypto/sha1-mb/sha1_mb.c
+++ b/arch/x86/crypto/sha1-mb/sha1_mb.c
@@ -871,7 +871,14 @@ static struct ahash_alg sha1_mb_async_alg = {
.base = {
.cra_name   = "sha1",
.cra_driver_name= "sha1_mb",
-   .cra_priority   = 200,
+   /*
+* Low priority, since with few concurrent hash requests
+* this is extremely slow due to the flush delay.  Users
+* whose workloads would benefit from this can request
+* it explicitly by driver name, or can increase its
+* priority at runtime using NETLINK_CRYPTO.
+*/
+   .cra_priority   = 50,
.cra_flags  = CRYPTO_ALG_TYPE_AHASH | 
CRYPTO_ALG_ASYNC,
.cra_blocksize  = SHA1_BLOCK_SIZE,
.cra_type   = &crypto_ahash_type,
diff --git a/arch/x86/crypto/sha256-mb/sha256_mb.c 
b/arch/x86/crypto/sha256-mb/sha256_mb.c
index 4c46ac1b6653..4c07f6c12c37 100644
--- a/arch/x86/crypto/sha256-mb/sha256_mb.c
+++ b/arch/x86/crypto/sha256-mb/sha256_mb.c
@@ -870,7 +870,14 @@ static struct ahash_alg sha256_mb_async_alg = {
.base = {
.cra_name   = "sha256",
.cra_driver_name= "sha256_mb",
-   .cra_priority   = 200,
+   /*
+* Low priority, since with few concurrent hash requests
+* this is extremely slow due to the flush delay.  Users
+* whose workloads would benefit from this can request
+* it explicitly by driver name, or can increase its
+* priority at runtime using NETLINK_CRYPTO.
+*/
+   .cra_priority   = 50,
.cra_flags  = CRYPTO_ALG_TYPE_AHASH |
CRYPTO_ALG_ASYNC,
.cra_blocksize  = SHA256_BLOCK_SIZE,
diff --git a/arch/x86/crypto/sha512-mb/sha512_mb.c 
b/arch/x86/crypto/sha512-mb/sha512_mb.c
index 39e2bbdc1836..6a8c31581604 100644
--- a/arch/x86/crypto/sha512-mb/sha512_mb.c
+++ b/arch/x86/crypto/sha512-mb/sha512_mb.c
@@ -904,7 +904,14 @@ static struct ahash_alg sha512_mb_async_alg = {
.base = {
.cra_name   = "sha512",

[PATCH 0/4] crypto: decrease priority of multibuffer SHA algorithms

2018-06-29 Thread Eric Biggers
From: Eric Biggers 

I found that not only was sha256_mb sometimes computing the wrong digest
(fixed by a separately sent patch), but under normal workloads it's
hundreds of times slower than sha256-avx2, due to the flush delay.  The
same applies to sha1_mb and sha512_mb.  Yet, currently these can be the
highest priority implementations and therefore used by default.
Therefore, this series decreases their priority so that users have to
more explicitly opt-in to using them.

Note that I don't believe the status quo of just having them behind
kernel config options is sufficient, since people often aren't familiar
with all the crypto options and err on the side of enabling too many.
And it's especially unexpected that enabling an "optimized"
implementation would actually make things 1000 times slower.

Eric Biggers (4):
  crypto: sha1_generic - add cra_priority
  crypto: sha256_generic - add cra_priority
  crypto: sha512_generic - add cra_priority
  crypto: x86/sha-mb - decrease priority of multibuffer algorithms

 arch/x86/crypto/sha1-mb/sha1_mb.c | 9 -
 arch/x86/crypto/sha256-mb/sha256_mb.c | 9 -
 arch/x86/crypto/sha512-mb/sha512_mb.c | 9 -
 crypto/sha1_generic.c | 1 +
 crypto/sha256_generic.c   | 2 ++
 crypto/sha512_generic.c   | 2 ++
 6 files changed, 29 insertions(+), 3 deletions(-)

-- 
2.18.0.399.gad0ab374a1-goog



[PATCH] crypto: MAINTAINERS - fix file path for SHA multibuffer code

2018-06-29 Thread Eric Biggers
From: Eric Biggers 

"arch/x86/crypto/sha*-mb" needs a trailing slash, since it refers to
directories.  Otherwise get_maintainer.pl doesn't find the entry.

Signed-off-by: Eric Biggers 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6cfd16790add..2a39dcaa79a0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7345,7 +7345,7 @@ M:Megha Dey 
 R: Tim Chen 
 L: linux-crypto@vger.kernel.org
 S: Supported
-F: arch/x86/crypto/sha*-mb
+F: arch/x86/crypto/sha*-mb/
 F: crypto/mcryptd.c
 
 INTEL TELEMETRY DRIVER
-- 
2.18.0.399.gad0ab374a1-goog



[PATCH] crypto: x86/sha256-mb - fix digest copy in sha256_mb_mgr_get_comp_job_avx2()

2018-06-29 Thread Eric Biggers
From: Eric Biggers 

There is a copy-paste error where sha256_mb_mgr_get_comp_job_avx2()
copies the SHA-256 digest state from sha256_mb_mgr::args::digest to
job_sha256::result_digest.  Consequently, the sha256_mb algorithm
sometimes calculates the wrong digest.  Fix it.

Reproducer using AF_ALG:

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

static const __u8 expected[32] =
"\xad\x7f\xac\xb2\x58\x6f\xc6\xe9\x66\xc0\x04\xd7\xd1\xd1\x6b\x02"
"\x4f\x58\x05\xff\x7c\xb4\x7c\x7a\x85\xda\xbd\x8b\x48\x89\x2c\xa7";

int main()
{
int fd;
struct sockaddr_alg addr = {
.salg_type = "hash",
.salg_name = "sha256_mb",
};
__u8 data[4096] = { 0 };
__u8 digest[32];
int ret;
int i;

fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(fd, (void *)&addr, sizeof(addr));
fork();
fd = accept(fd, 0, 0);
do {
ret = write(fd, data, 4096);
assert(ret == 4096);
ret = read(fd, digest, 32);
assert(ret == 32);
} while (memcmp(digest, expected, 32) == 0);

printf("wrong digest: ");
for (i = 0; i < 32; i++)
printf("%02x", digest[i]);
printf("\n");
}

Output was:

wrong digest: 
ad7facb2ffef7cb47c7a85dabd8b48892ca7

Fixes: 172b1d6b5a93 ("crypto: sha256-mb - fix ctx pointer and digest copy")
Cc:  # v4.8+
Signed-off-by: Eric Biggers 
---
 arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S 
b/arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S
index 16c4ccb1f154..d2364c55bbde 100644
--- a/arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S
+++ b/arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S
@@ -265,7 +265,7 @@ ENTRY(sha256_mb_mgr_get_comp_job_avx2)
vpinsrd $1, _args_digest+1*32(state, idx, 4), %xmm0, %xmm0
vpinsrd $2, _args_digest+2*32(state, idx, 4), %xmm0, %xmm0
vpinsrd $3, _args_digest+3*32(state, idx, 4), %xmm0, %xmm0
-   vmovd   _args_digest(state , idx, 4) , %xmm0
+   vmovd   _args_digest+4*32(state, idx, 4), %xmm1
vpinsrd $1, _args_digest+5*32(state, idx, 4), %xmm1, %xmm1
vpinsrd $2, _args_digest+6*32(state, idx, 4), %xmm1, %xmm1
vpinsrd $3, _args_digest+7*32(state, idx, 4), %xmm1, %xmm1
-- 
2.18.0.399.gad0ab374a1-goog



Re: [PATCH 3/5] crypto: testmgr - Improve compression/decompression test

2018-06-22 Thread Eric Biggers
Hi Jan,

On Fri, Jun 22, 2018 at 04:37:20PM +0200, Jan Glauber wrote:
> While commit 336073840a87 ("crypto: testmgr - Allow different compression 
> results")
> allowed to test non-generic compression algorithms there are some corner
> cases that would not be detected in test_comp().
> 
> For example if input -> compression -> decompression would all yield
> the same bytes the test would still pass.
> 
> Improve the compression test by using the generic variant (if available)
> to decompress the compressed test vector from the non-generic
> algorithm.
> 
> Suggested-by: Herbert Xu 
> Signed-off-by: Jan Glauber 
> ---
>  crypto/testmgr.c | 23 ++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/crypto/testmgr.c b/crypto/testmgr.c
> index d1d99843cce4..cfb5fe4c5ccf 100644
> --- a/crypto/testmgr.c
> +++ b/crypto/testmgr.c
> @@ -1346,6 +1346,7 @@ static int test_comp(struct crypto_comp *tfm,
>int ctcount, int dtcount)
>  {

Any particular reason for not updating test_acomp() too?

>   const char *algo = crypto_tfm_alg_driver_name(crypto_comp_tfm(tfm));
> + const char *name = crypto_tfm_alg_name(crypto_comp_tfm(tfm));
>   char *output, *decomp_output;
>   unsigned int i;
>   int ret;
> @@ -1363,6 +1364,8 @@ static int test_comp(struct crypto_comp *tfm,
>   for (i = 0; i < ctcount; i++) {
>   int ilen;
>   unsigned int dlen = COMP_BUF_SIZE;
> + struct crypto_comp *tfm_decomp = NULL;
> + char *gname;
>  
>   memset(output, 0, sizeof(COMP_BUF_SIZE));
>   memset(decomp_output, 0, sizeof(COMP_BUF_SIZE));
> @@ -1377,9 +1380,27 @@ static int test_comp(struct crypto_comp *tfm,
>   goto out;
>   }
>  
> + /*
> +  * If compression of a non-generic algorithm was tested try to
> +  * decompress using the generic variant.
> +  */
> + if (!strstr(algo, "generic")) {

That's a pretty sloppy string comparison.  It matches "generic" anywhere in the
string, like "foogenericbar".  It should just be "-generic" at the end, right?
Like:

static bool is_generic_driver(const char *driver_name)
{
size_t len = strlen(driver_name);

return len >= 8 && !strcmp(&driver_name[len - 8], "-generic");
}

> + /* Construct name from cra_name + "-generic" */
> + gname = kmalloc(strlen(name) + 9, GFP_KERNEL);
> + strncpy(gname, name, strlen(name));
> + strncpy(gname + strlen(name), "-generic", 9);
> +
> + tfm_decomp = crypto_alloc_comp(gname, 0, 0);
> + kfree(gname);

If you're going to allocate memory here you need to check for error (note:
kasprintf() would make building the string a bit cleaner).  But algorithm names
are limited anyway, so a better way may be:

char generic_name[CRYPTO_MAX_ALG_NAME];

if (snprintf(generic_name, sizeof(generic_name), "%s-generic",
	     name) < sizeof(generic_name))
	tfm_decomp = crypto_alloc_comp(generic_name, 0, 0);

> + }
> +
> + /* If there is no generic variant use the same tfm as before. */
> + if (!tfm_decomp || IS_ERR(tfm_decomp))
> + tfm_decomp = tfm;
> +

if (!IS_ERR_OR_NULL(tfm_decomp))

>   ilen = dlen;
>   dlen = COMP_BUF_SIZE;
> - ret = crypto_comp_decompress(tfm, output,
> + ret = crypto_comp_decompress(tfm_decomp, output,
>		ilen, decomp_output, &dlen);

Shouldn't you decompress with both tfms, not just the generic one?

It's also weird that each 'struct comp_testvec' in 'ctemplate[]' has an
'output', but it's never used.  The issue seems to be that there are separate
test vectors for compression and decompression, but you really only need one
set.  It would have the '.uncompressed' and '.compressed' data.  From that, you
could compress the '.uncompressed' data with the tfm under test, and decompress
that result with both the tfm under test and the generic tfm.  Then, you could
decompress the '.compressed' data with the tfm under test and verify it matches
the '.uncompressed' data.  (I did something similar for symmetric ciphers in
commit 92a4c9fef34c.)
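
For concreteness, the reshaped vector and test flow being suggested might look
roughly like this (field names are hypothetical, just to sketch the idea):

/* Hypothetical reshaped vector: one entry serves both directions. */
struct comp_testvec {
	const char	*uncompressed;
	const char	*compressed;
	unsigned int	uncomp_len;
	unsigned int	comp_len;
};

/*
 * Possible flow per vector:
 *  1. Compress .uncompressed with the tfm under test.
 *  2. Decompress that output with both the tfm under test and the
 *     "-generic" tfm (if one exists); both must give back .uncompressed.
 *  3. Decompress .compressed with the tfm under test; the result must
 *     equal .uncompressed.
 */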

Thanks,

- Eric


Re: [PATCH 4/5] crypto: testmgr - Add test vectors for LZS compression

2018-06-22 Thread Eric Biggers
Hi Jan,

On Fri, Jun 22, 2018 at 04:37:21PM +0200, Jan Glauber wrote:
> The test vectors were generated using the ThunderX ZIP coprocessor.
> 
> Signed-off-by: Jan Glauber 
> ---
>  crypto/testmgr.c |  9 ++
>  crypto/testmgr.h | 77 
>  2 files changed, 86 insertions(+)
> 
> diff --git a/crypto/testmgr.c b/crypto/testmgr.c
> index cfb5fe4c5ccf..8e9ff1229e93 100644
> --- a/crypto/testmgr.c
> +++ b/crypto/testmgr.c
> @@ -3238,6 +3238,15 @@ static const struct alg_test_desc alg_test_descs[] = {
>   .decomp = __VECS(lzo_decomp_tv_template)
>   }
>   }
> + }, {
> + .alg = "lzs",
> + .test = alg_test_comp,
> + .suite = {
> + .comp = {
> + .comp = __VECS(lzs_comp_tv_template),
> + .decomp = __VECS(lzs_decomp_tv_template)
> + }
> + }
>   }, {
>   .alg = "md4",
>   .test = alg_test_hash,
> diff --git a/crypto/testmgr.h b/crypto/testmgr.h
> index b950aa234e43..ae7fecadcade 100644
> --- a/crypto/testmgr.h
> +++ b/crypto/testmgr.h
> @@ -31699,6 +31699,83 @@ static const struct comp_testvec 
> lzo_decomp_tv_template[] = {
>   },
>  };
>  
> +/*
> + * LZS test vectors (null-terminated strings).
> + */
> +static const struct comp_testvec lzs_comp_tv_template[] = {
> + {
> + .inlen  = 70,
> + .outlen = 40,
> + .input  = "Join us now and share the software "
> + "Join us now and share the software ",
> + .output = "\x25\x1b\xcd\x26\xe1\x01\xd4\xe6"
> +   "\x20\x37\x1b\xce\xe2\x03\x09\xb8"
> +   "\xc8\x20\x39\x9a\x0c\x27\x23\x28"
> +   "\x80\xe8\x68\xc2\x07\x33\x79\x98"
> +   "\xe8\x77\xc6\xda\x3f\xfc\xc0\x00",
> + }, {
> + .inlen  = 184,
> + .outlen = 130,
> + .input  = "This document describes a compression method based 
> on the LZS "
> + "compression algorithm.  This document defines the 
> application of "
> + "the LZS algorithm to the IP Payload Compression 
> Protocol.",

Your comment claims that the test vectors (presumably the inputs) are
null-terminated strings, but the lengths of the inputs actually don't include
the null terminator.  The length of the first one, for example, would have to be
71 to include the null terminator, not 70.
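
A quick way to see the off-by-one (illustrative only):

#include <stdio.h>
#include <string.h>

int main(void)
{
	static const char s[] = "Join us now and share the software "
				"Join us now and share the software ";

	/* strlen() excludes the terminator, sizeof() includes it. */
	printf("strlen = %zu, sizeof = %zu\n", strlen(s), sizeof(s));
	/* prints: strlen = 70, sizeof = 71 */
	return 0;
}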

Eric


[PATCH] crypto: arm/speck - fix building in Thumb2 mode

2018-06-18 Thread Eric Biggers
Building the kernel with CONFIG_THUMB2_KERNEL=y and
CONFIG_CRYPTO_SPECK_NEON set fails with the following errors:

arch/arm/crypto/speck-neon-core.S: Assembler messages:

arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here -- `bic sp,#0xf'
arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here -- `bic sp,#0xf'
arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here -- `bic sp,#0xf'
arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here -- `bic sp,#0xf'

The problem is that the 'bic' instruction can't operate on the 'sp'
register in Thumb2 mode.  Fix it by using a temporary register.  This
isn't in the main loop, so the performance difference is negligible.
This also matches what aes-neonbs-core.S does.

Reported-by: Stefan Agner 
Fixes: ede9622162fa ("crypto: arm/speck - add NEON-accelerated implementation 
of Speck-XTS")
Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/speck-neon-core.S | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm/crypto/speck-neon-core.S 
b/arch/arm/crypto/speck-neon-core.S
index 3c1e203e53b9..57caa742016e 100644
--- a/arch/arm/crypto/speck-neon-core.S
+++ b/arch/arm/crypto/speck-neon-core.S
@@ -272,9 +272,11 @@
 * Allocate stack space to store 128 bytes worth of tweaks.  For
 * performance, this space is aligned to a 16-byte boundary so that we
 * can use the load/store instructions that declare 16-byte alignment.
+* For Thumb2 compatibility, don't do the 'bic' directly on 'sp'.
 */
-   sub sp, #128
-   bic sp, #0xf
+   sub r12, sp, #128
+   bic r12, #0xf
+   mov sp, r12
 
 .if \n == 64
// Load first tweak
-- 
2.18.0.rc1.244.gcf134e6275-goog



Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-18 Thread Eric Biggers
On Sun, Jun 17, 2018 at 01:10:41PM +0200, Ard Biesheuvel wrote:
> > +
> > + // One-time XTS preparation
> > +
> > + /*
> > +  * Allocate stack space to store 128 bytes worth of tweaks.  For
> > +  * performance, this space is aligned to a 16-byte boundary so 
> > that we
> > +  * can use the load/store instructions that declare 16-byte 
> > alignment.
> > +  */
> > + sub sp, #128
> > + bic sp, #0xf
> 
> 
>  This fails here when building with CONFIG_THUMB2_KERNEL=y
> 
>    AS  arch/arm/crypto/speck-neon-core.o
> 
>  arch/arm/crypto/speck-neon-core.S: Assembler messages:
> 
>  arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>  `bic sp,#0xf'
>  arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>  `bic sp,#0xf'
>  arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>  `bic sp,#0xf'
>  arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>  `bic sp,#0xf'
> 
>  In a quick hack this change seems to address it:
> 
> 
>  -   sub sp, #128
>  -   bic sp, #0xf
>  +   mov r6, sp
>  +   sub r6, #128
>  +   bic r6, #0xf
>  +   mov sp, r6
> 
>  But there is probably a better solution to address this.
> 
> >>>
> >>> Given that there is no NEON on M class cores, I recommend we put 
> >>> something like
> >>>
> >>> THUMB(bx pc)
> >>> THUMB(nop.w)
> >>> THUMB(.arm)
> >>>
> >>> at the beginning and be done with it.
> >>
> >> I mean nop.n or just nop, of course, and we may need a '.align 2' at
> >> the beginning as well.
> >
> > Wouldn't it be preferable to have it assemble it in Thumb2 too? It seems
> > that bic sp,#0xf is the only issue...
> >
> 
> Well, in general, yes. In the case of NEON code, not really, since the
> resulting code will not be smaller anyway, because the Thumb2 NEON
> opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON
> units, so all cores that this code can run on will be able to run in
> ARM mode.
> 
> So from a maintainability pov, having code that only assembles in one
> way is better than having code that must compile both to ARM and to
> Thumb2 opcodes.
> 
> Just my 2 cents, anyway.

I don't have too much of a preference, though Stefan's suggested 4 instructions
can be reduced to 3, which also matches what aes-neonbs-core.S does:

sub r12, sp, #128
bic r12, #0xf
mov sp, r12

Ard, is the following what you're suggesting instead?

diff --git a/arch/arm/crypto/speck-neon-core.S 
b/arch/arm/crypto/speck-neon-core.S
index 3c1e203e53b9..c989ce3dc057 100644
--- a/arch/arm/crypto/speck-neon-core.S
+++ b/arch/arm/crypto/speck-neon-core.S
@@ -8,6 +8,7 @@
  */
 
 #include <linux/linkage.h>
+#include <asm/assembler.h>
 
.text
	.fpu		neon
@@ -233,6 +234,12 @@
  * nonzero multiple of 128.
  */
 .macro _speck_xts_crypt	n, decrypting
+
+   .align  2
+   THUMB(bx pc)
+   THUMB(nop)
+   THUMB(.arm)
+
push{r4-r7}
mov r7, sp
 
@@ -413,6 +420,8 @@
mov sp, r7
pop {r4-r7}
bx  lr
+
+   THUMB(.thumb)
 .endm
 
 ENTRY(speck128_xts_encrypt_neon)


[PATCH 4/4] crypto: vmac - remove insecure version with hardcoded nonce

2018-06-18 Thread Eric Biggers
From: Eric Biggers 

Remove the original version of the VMAC template that had the nonce
hardcoded to 0 and produced a digest with the wrong endianness.  I'm
unsure whether this had users or not (there are no explicit in-kernel
references to it), but given that the hardcoded nonce made it wildly
insecure unless a unique key was used for each message, let's try
removing it and see if anyone complains.

Leave the new "vmac64" template that requires the nonce to be explicitly
specified as the first 16 bytes of data and uses the correct endianness
for the digest.

Signed-off-by: Eric Biggers 
---
 crypto/tcrypt.c  |   2 +-
 crypto/testmgr.c |   6 ---
 crypto/testmgr.h | 102 ---
 crypto/vmac.c|  84 --
 4 files changed, 8 insertions(+), 186 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index d5bcdd905007..078ec36007bf 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1939,7 +1939,7 @@ static int do_test(const char *alg, u32 type, u32 mask, 
int m, u32 num_mb)
break;
 
case 109:
-   ret += tcrypt_test("vmac(aes)");
+   ret += tcrypt_test("vmac64(aes)");
break;
 
case 111:
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 60a557b0f8d3..63f263fd1dae 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3477,12 +3477,6 @@ static const struct alg_test_desc alg_test_descs[] = {
.suite = {
.hash = __VECS(tgr192_tv_template)
}
-   }, {
-   .alg = "vmac(aes)",
-   .test = alg_test_hash,
-   .suite = {
-   .hash = __VECS(aes_vmac128_tv_template)
-   }
}, {
.alg = "vmac64(aes)",
.test = alg_test_hash,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 7b022c47a623..b6362169771a 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -4603,108 +4603,6 @@ static const struct hash_testvec 
aes_xcbc128_tv_template[] = {
}
 };
 
-static const char vmac_string1[128] = {'\x01', '\x01', '\x01', '\x01',
-  '\x02', '\x03', '\x02', '\x02',
-  '\x02', '\x04', '\x01', '\x07',
-  '\x04', '\x01', '\x04', '\x03',};
-static const char vmac_string2[128] = {'a', 'b', 'c',};
-static const char vmac_string3[128] = {'a', 'b', 'c', 'a', 'b', 'c',
-  'a', 'b', 'c', 'a', 'b', 'c',
-  'a', 'b', 'c', 'a', 'b', 'c',
-  'a', 'b', 'c', 'a', 'b', 'c',
-  'a', 'b', 'c', 'a', 'b', 'c',
-  'a', 'b', 'c', 'a', 'b', 'c',
-  'a', 'b', 'c', 'a', 'b', 'c',
-  'a', 'b', 'c', 'a', 'b', 'c',
- };
-
-static const char vmac_string4[17] = {'b', 'c', 'e', 'f',
- 'i', 'j', 'l', 'm',
- 'o', 'p', 'r', 's',
- 't', 'u', 'w', 'x', 'z'};
-
-static const char vmac_string5[127] = {'r', 'm', 'b', 't', 'c',
-  'o', 'l', 'k', ']', '%',
-  '9', '2', '7', '!', 'A'};
-
-static const char vmac_string6[129] = {'p', 't', '*', '7', 'l',
-  'i', '!', '#', 'w', '0',
-  'z', '/', '4', 'A', 'n'};
-
-static const struct hash_testvec aes_vmac128_tv_template[] = {
-   {
-   .key= "\x00\x01\x02\x03\x04\x05\x06\x07"
- "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
-   .plaintext = NULL,
-   .digest = "\x07\x58\x80\x35\x77\xa4\x7b\x54",
-   .psize  = 0,
-   .ksize  = 16,
-   }, {
-   .key= "\x00\x01\x02\x03\x04\x05\x06\x07"
- "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
-   .plaintext = vmac_string1,
-   .digest = "\xce\xf5\x3c\xd3\xae\x68\x8c\xa1",
-   .psize  = 128,
-   .ksize  = 16,
-   }, {
-   .key= "\x00\x01\x02\x03\x04\x05\x06\x07"
- "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
-   .plaintext = vmac_string2,
-   .digest = "\xc9\x27\xb0\x73\x81\xbd\x14\x2d",
-   .psize  = 128,
-   .ksize  = 16,
-   }, {
-   .key= "\x00\x01\x02\x03\x04\x05\x06\x07"
- "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
-   .plaintext = vmac_string3,
-   .di

[PATCH 0/4] crypto: vmac - various fixes

2018-06-18 Thread Eric Biggers
From: Eric Biggers 

Hi, this series fixes various bugs in the VMAC template (crypto/vmac.c).
First, the per-request context was being stored in the transform
context, which made VMAC not thread-safe, and the kernel could be
crashed by using the same VMAC transform in multiple threads using
AF_ALG (found by syzkaller).  Also the keys were incorrectly being wiped
after each message.  Patch 2 fixes these bugs, Cc'ed to stable.

But there are also bugs that require breaking changes: the nonce is
hardcoded to 0, and the endianness of the final digest is wrong.  So
patch 3 introduces a fixed version of the VMAC template that takes the
nonce as the first 16 bytes of data, and fixes the digest endianness.

Patch 4 then removes the current version of the VMAC template.  I'm not
100% sure whether we can really do that or not as it may have users
(there are no explicit users in the kernel, though), but given that the
old version was insecure unless a unique key was set for each message, I
think we should try and see if anyone complains.

Eric Biggers (4):
  crypto: vmac - require a block cipher with 128-bit block size
  crypto: vmac - separate tfm and request context
  crypto: vmac - add nonced version with big endian digest
  crypto: vmac - remove insecure version with hardcoded nonce

 crypto/tcrypt.c   |   2 +-
 crypto/testmgr.c  |   4 +-
 crypto/testmgr.h  | 217 +
 crypto/vmac.c | 444 --
 include/crypto/vmac.h |  63 --
 5 files changed, 351 insertions(+), 379 deletions(-)
 delete mode 100644 include/crypto/vmac.h

-- 
2.18.0.rc1.244.gcf134e6275-goog



[PATCH 1/4] crypto: vmac - require a block cipher with 128-bit block size

2018-06-18 Thread Eric Biggers
From: Eric Biggers 

The VMAC template assumes the block cipher has a 128-bit block size, but
it failed to check for that.  Thus it was possible to instantiate it
using a 64-bit block size cipher, e.g. "vmac(cast5)", causing
uninitialized memory to be used.

Add the needed check when instantiating the template.

Fixes: f1939f7c5645 ("crypto: vmac - New hash algorithm for intel_txt support")
Cc:  # v2.6.32+
Signed-off-by: Eric Biggers 
---
 crypto/vmac.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/crypto/vmac.c b/crypto/vmac.c
index df76a816cfb2..3034454a3713 100644
--- a/crypto/vmac.c
+++ b/crypto/vmac.c
@@ -655,6 +655,10 @@ static int vmac_create(struct crypto_template *tmpl, 
struct rtattr **tb)
if (IS_ERR(alg))
return PTR_ERR(alg);
 
+   err = -EINVAL;
+   if (alg->cra_blocksize != 16)
+   goto out_put_alg;
+
inst = shash_alloc_instance("vmac", alg);
err = PTR_ERR(inst);
if (IS_ERR(inst))
-- 
2.18.0.rc1.244.gcf134e6275-goog



[PATCH 3/4] crypto: vmac - add nonced version with big endian digest

2018-06-18 Thread Eric Biggers
From: Eric Biggers 

Currently the VMAC template uses a "nonce" hardcoded to 0, which makes
it insecure unless a unique key is set for every message.  Also, the
endianness of the final digest is wrong: the implementation uses little
endian, but the VMAC specification has it as big endian, as do other
VMAC implementations such as the one in Crypto++.

Add a new VMAC template where the nonce is passed as the first 16 bytes
of data (similar to what is done for Poly1305's nonce), and the digest
is big endian.  Call it "vmac64", since the old name of simply "vmac"
didn't clarify whether the implementation is of VMAC-64 or of VMAC-128
(which produce 64-bit and 128-bit digests respectively); so we fix the
naming ambiguity too.
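
To make the new calling convention concrete, below is a rough userspace sketch
of computing a "vmac64(aes)" digest through AF_ALG, with the 16-byte nonce
simply prepended to the message and an 8-byte (VMAC-64) digest read back.  The
helper name is made up and error handling is minimal; it only illustrates the
data layout this patch defines.

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

static int vmac64_aes_digest(const unsigned char key[16],
			     const unsigned char nonce[16],
			     const void *msg, size_t msglen,
			     unsigned char digest[8])
{
	struct sockaddr_alg addr = {
		.salg_family = AF_ALG,
		.salg_type = "hash",
		.salg_name = "vmac64(aes)",
	};
	unsigned char buf[16 + 256];
	int tfmfd, reqfd, ret = -1;

	if (msglen > sizeof(buf) - 16)
		return -1;

	/* The nonce is simply the first 16 bytes of the data stream. */
	memcpy(buf, nonce, 16);
	memcpy(buf + 16, msg, msglen);

	tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
	if (tfmfd < 0)
		return -1;
	if (bind(tfmfd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
	    setsockopt(tfmfd, SOL_ALG, ALG_SET_KEY, key, 16) == 0) {
		reqfd = accept(tfmfd, NULL, NULL);
		if (reqfd >= 0) {
			if (write(reqfd, buf, 16 + msglen) ==
				(ssize_t)(16 + msglen) &&
			    read(reqfd, digest, 8) == 8)
				ret = 0;
			close(reqfd);
		}
	}
	close(tfmfd);
	return ret;
}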

Signed-off-by: Eric Biggers 
---
 crypto/testmgr.c |   6 ++
 crypto/testmgr.h | 155 +++
 crypto/vmac.c| 130 +--
 3 files changed, 273 insertions(+), 18 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 11e45352fd0b..60a557b0f8d3 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3483,6 +3483,12 @@ static const struct alg_test_desc alg_test_descs[] = {
.suite = {
.hash = __VECS(aes_vmac128_tv_template)
}
+   }, {
+   .alg = "vmac64(aes)",
+   .test = alg_test_hash,
+   .suite = {
+   .hash = __VECS(vmac64_aes_tv_template)
+   }
}, {
.alg = "wp256",
.test = alg_test_hash,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index b950aa234e43..7b022c47a623 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -4705,6 +4705,161 @@ static const struct hash_testvec 
aes_vmac128_tv_template[] = {
},
 };
 
+static const char vmac64_string1[144] = {
+   '\0', '\0',   '\0',   '\0',   '\0',   '\0',   '\0',   '\0',
+   '\0', '\0',   '\0',   '\0',   '\0',   '\0',   '\0',   '\0',
+   '\x01', '\x01', '\x01', '\x01', '\x02', '\x03', '\x02', '\x02',
+   '\x02', '\x04', '\x01', '\x07', '\x04', '\x01', '\x04', '\x03',
+};
+
+static const char vmac64_string2[144] = {
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+'a',  'b',  'c',
+};
+
+static const char vmac64_string3[144] = {
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+'a',  'b',  'c',  'a',  'b',  'c',  'a',  'b',
+'c',  'a',  'b',  'c',  'a',  'b',  'c',  'a',
+'b',  'c',  'a',  'b',  'c',  'a',  'b',  'c',
+'a',  'b',  'c',  'a',  'b',  'c',  'a',  'b',
+'c',  'a',  'b',  'c',  'a',  'b',  'c',  'a',
+'b',  'c',  'a',  'b',  'c',  'a',  'b',  'c',
+};
+
+static const char vmac64_string4[33] = {
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+   'b',   'c',  'e',  'f',  'i',  'j',  'l',  'm',
+   'o',   'p',  'r',  's',  't',  'u',  'w',  'x',
+   'z',
+};
+
+static const char vmac64_string5[143] = {
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+'r',  'm',  'b',  't',  'c',  'o',  'l',  'k',
+']',  '%',  '9',  '2',  '7',  '!',  'A',
+};
+
+static const char vmac64_string6[145] = {
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+   '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
+'p',  't',  '*',  '7',  'l',  'i',  '!',  '#',
+'w',  '0',  'z',  '/',  '4',  'A',  'n',
+};
+
+static const struct hash_testvec vmac64_aes_tv_template[] = {
+   { /* draft-krovetz-vmac-01 test vector 1 */
+   .key= "abcdefghijklmnop",
+   .ksize  = 16,
+   .plaintext = "\0\0\0\0\0\0\0\0bcdefghi",
+   .psize  = 16,
+   .digest = "\x25\x76\xbe\x1c\x56\xd8\xb8\x1b",
+   }, { /* draft-krovetz-vmac-01 test vector 2 */
+   .key= "abcdefghijklmnop",
+   .ksize  = 16,
+   .plaintext = "\0\0\0\0\0\0\0\0bcdefghiabc",
+   .psize  = 19,
+   .digest = "\x2d\x37\x6c\xf5\xb1\x81\x3c\xe5",
+   }, { /* draft-krovetz-vmac-01 test vector 3 */
+   .key= "abcdefghijklmnop",
+   .ksize  = 16,
+   .plaintext = "\0\0\0\0\0\0\0\0bcdefghi"
+ "abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc",
+   .psize  = 64,
+   .digest = "\xe8\x42\x1f\x61\xd5\x73\xd2\x98",
+   }, { /* draft-krovetz-vmac-01 test vector 4 */
+   .key= "abcdefghijklmnop",
+   .ksize  = 16,
+   .plaintext = "\0\0\0\0\0\0\0\0bcd

[PATCH 2/4] crypto: vmac - separate tfm and request context

2018-06-18 Thread Eric Biggers
From: Eric Biggers 

syzbot reported a crash in vmac_final() when multiple threads
concurrently use the same "vmac(aes)" transform through AF_ALG.  The bug
is pretty fundamental: the VMAC template doesn't separate per-request
state from per-tfm (per-key) state like the other hash algorithms do,
but rather stores it all in the tfm context.  That's wrong.

Also, vmac_final() incorrectly zeroes most of the state including the
derived keys and cached pseudorandom pad.  Therefore, only the first
VMAC invocation with a given key calculates the correct digest.

Fix these bugs by splitting the per-tfm state from the per-request state
and using the proper init/update/final sequencing for requests.

Reproducer for the crash:

#include <sys/socket.h>
#include <unistd.h>
#include <linux/if_alg.h>

int main()
{
int fd;
struct sockaddr_alg addr = {
.salg_type = "hash",
.salg_name = "vmac(aes)",
};
char buf[256] = { 0 };

fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(fd, (void *)&addr, sizeof(addr));
setsockopt(fd, SOL_ALG, ALG_SET_KEY, buf, 16);
fork();
fd = accept(fd, NULL, NULL);
for (;;)
write(fd, buf, 256);
}

The immediate cause of the crash is that vmac_ctx_t.partial_size exceeds
VMAC_NHBYTES, causing vmac_final() to memset() a negative length.
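
In other words the padding length underflows: memset() takes a size_t, so a
"negative" count wraps around to a huge unsigned value.  Roughly (a paraphrase
of the failing pattern, not the exact code):

	unsigned int partial = ctx->partial_size;  /* can exceed VMAC_NHBYTES */

	/* If partial > VMAC_NHBYTES, the length wraps to a gigantic size_t. */
	memset(ctx->partial + partial, 0, VMAC_NHBYTES - partial);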

Reported-by: syzbot+264bca3a6e8d64555...@syzkaller.appspotmail.com
Fixes: f1939f7c5645 ("crypto: vmac - New hash algorithm for intel_txt support")
Cc:  # v2.6.32+
Signed-off-by: Eric Biggers 
---
 crypto/vmac.c | 408 +++---
 include/crypto/vmac.h |  63 ---
 2 files changed, 181 insertions(+), 290 deletions(-)
 delete mode 100644 include/crypto/vmac.h

diff --git a/crypto/vmac.c b/crypto/vmac.c
index 3034454a3713..bb2fc787d615 100644
--- a/crypto/vmac.c
+++ b/crypto/vmac.c
@@ -1,6 +1,10 @@
 /*
- * Modified to interface to the Linux kernel
+ * VMAC: Message Authentication Code using Universal Hashing
+ *
+ * Reference: https://tools.ietf.org/html/draft-krovetz-vmac-01
+ *
  * Copyright (c) 2009, Intel Corporation.
+ * Copyright (c) 2018, Google Inc.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -16,14 +20,15 @@
  * Place - Suite 330, Boston, MA 02111-1307 USA.
  */
 
-/* --
- * VMAC and VHASH Implementation by Ted Krovetz (t...@acm.org) and Wei Dai.
- * This implementation is herby placed in the public domain.
- * The authors offers no warranty. Use at your own risk.
- * Please send bug reports to the authors.
- * Last modified: 17 APR 08, 1700 PDT
- * --- */
+/*
+ * Derived from:
+ * VMAC and VHASH Implementation by Ted Krovetz (t...@acm.org) and Wei Dai.
+ * This implementation is herby placed in the public domain.
+ * The authors offers no warranty. Use at your own risk.
+ * Last modified: 17 APR 08, 1700 PDT
+ */
 
+#include 
 #include 
 #include 
 #include 
@@ -31,9 +36,35 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
+/*
+ * User definable settings.
+ */
+#define VMAC_TAG_LEN   64
+#define VMAC_KEY_SIZE  128/* Must be 128, 192 or 256   */
+#define VMAC_KEY_LEN   (VMAC_KEY_SIZE/8)
+#define VMAC_NHBYTES   128/* Must 2^i for any 3 < i < 13 Standard = 128*/
+
+/* per-transform (per-key) context */
+struct vmac_tfm_ctx {
+   struct crypto_cipher *cipher;
+   u64 nhkey[(VMAC_NHBYTES/8)+2*(VMAC_TAG_LEN/64-1)];
+   u64 polykey[2*VMAC_TAG_LEN/64];
+   u64 l3key[2*VMAC_TAG_LEN/64];
+};
+
+/* per-request context */
+struct vmac_desc_ctx {
+   union {
+   u8 partial[VMAC_NHBYTES];   /* partial block */
+   __le64 partial_words[VMAC_NHBYTES / 8];
+   };
+   unsigned int partial_size;  /* size of the partial block */
+   bool first_block_processed;
+   u64 polytmp[2*VMAC_TAG_LEN/64]; /* running total of L2-hash */
+};
+
 /*
  * Constants and masks
  */
@@ -318,13 +349,6 @@ static void poly_step_func(u64 *ahi, u64 *alo,
} while (0)
 #endif
 
-static void vhash_abort(struct vmac_ctx *ctx)
-{
-   ctx->polytmp[0] = ctx->polykey[0] ;
-   ctx->polytmp[1] = ctx->polykey[1] ;
-   ctx->first_block_processed = 0;
-}
-
 static u64 l3hash(u64 p1, u64 p2, u64 k1, u64 k2, u64 len)
 {
u64 rh, rl, t, z = 0;
@@ -364,280 +388,209 @@ static u64 l3hash(u64 p1, u64 p2, u64 k1, u64 k2, u64 
len)
return rl;
 }
 
-static void vhash_update(const unsigned char *m,
-   unsigned int mbytes, /* Pos multiple of VMAC_NHBYTES */
-   struct vmac_ctx *ctx)
+/* L1 and L2-hash one or more VMAC_NHBYTES-byte blocks

Re: WARNING: kernel stack regs has bad 'bp' value (3)

2018-05-26 Thread Eric Biggers
On Sat, May 12, 2018 at 10:43:08AM +0200, Dmitry Vyukov wrote:
> On Fri, Feb 2, 2018 at 11:18 PM, Eric Biggers <ebigge...@gmail.com> wrote:
> > On Fri, Feb 02, 2018 at 02:57:32PM +0100, Dmitry Vyukov wrote:
> >> On Fri, Feb 2, 2018 at 2:48 PM, syzbot
> >> <syzbot+ffa3a158337bbc01f...@syzkaller.appspotmail.com> wrote:
> >> > Hello,
> >> >
> >> > syzbot hit the following crash on upstream commit
> >> > 7109a04eae81c41ed529da9f3c48c3655ccea741 (Thu Feb 1 17:37:30 2018 +)
> >> > Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide
> >> >
> >> > So far this crash happened 4 times on net-next, upstream.
> >> > C reproducer is attached.
> >> > syzkaller reproducer is attached.
> >> > Raw console output is attached.
> >> > compiler: gcc (GCC) 7.1.1 20170620
> >> > .config is attached.
> >>
> >>
> >> From suspicious frames I see salsa20_asm_crypt there, so +crypto 
> >> maintainers.
> >>
> >
> > Looks like the x86 implementations of Salsa20 (both i586 and x86_64) need 
> > to be
> > updated to not use %ebp/%rbp.
> 
> Ard,
> 
> This was bisected as introduced by:
> 
> commit 83dee2ce1ae791c3dc0c9d4d3a8d42cb109613f6
> Author: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Date:   Fri Jan 19 12:04:34 2018 +
> 
> crypto: sha3-generic - rewrite KECCAK transform to help the
> compiler optimize
> 
> https://gist.githubusercontent.com/dvyukov/47f93f5a0679170dddf93bc019b42f6d/raw/65beac8ddd30003bbd4e9729236dc8572094abf7/gistfile1.txt

Note that syzbot's original C reproducer (from Feb 1) for this actually
triggered the warning through salsa20-asm, which I've just proposed to "fix" by
https://patchwork.kernel.org/patch/10428863/.  sha3-generic is apparently
another instance of the same bug, where the %rbp register is used for data.

Eric


[PATCH 1/2] crypto: x86/salsa20 - remove x86 salsa20 implementations

2018-05-26 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

The x86 assembly implementations of Salsa20 use the frame base pointer
register (%ebp or %rbp), which breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.
Recent (v4.10+) kernels will warn about this, e.g.

WARNING: kernel stack regs at a8291e69 in syzkaller047086:4677 has bad 
'bp' value 1077994c
[...]

But after looking into it, I believe there's very little reason to still
retain the x86 Salsa20 code.  First, these are *not* vectorized
(SSE2/SSSE3/AVX2) implementations, which would be needed to get anywhere
close to the best Salsa20 performance on any remotely modern x86
processor; they're just regular x86 assembly.  Second, it's still
unclear that anyone is actually using the kernel's Salsa20 at all,
especially given that now ChaCha20 is supported too, and with much more
efficient SSSE3 and AVX2 implementations.  Finally, in benchmarks I did
on both Intel and AMD processors with both gcc 8.1.0 and gcc 4.9.4, the
x86_64 salsa20-asm is actually slightly *slower* than salsa20-generic
(~3% slower on Skylake, ~10% slower on Zen), while the i686 salsa20-asm
is only slightly faster than salsa20-generic (~15% faster on Skylake,
~20% faster on Zen).  The gcc version made little difference.

So, the x86_64 salsa20-asm is pretty clearly useless.  That leaves just
the i686 salsa20-asm, which based on my tests provides a 15-20% speed
boost.  But that's without updating the code to not use %ebp.  And given
the maintenance cost, the small speed difference vs. salsa20-generic,
the fact that few people still use i686 kernels, the doubt that anyone
is even using the kernel's Salsa20 at all, and the fact that a SSE2
implementation would almost certainly be much faster on any remotely
modern x86 processor yet no one has cared enough to add one yet, I don't
think it's worthwhile to keep.

Thus, just remove both the x86_64 and i686 salsa20-asm implementations.

Reported-by: syzbot+ffa3a158337bbc01f...@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 arch/x86/crypto/Makefile|   4 -
 arch/x86/crypto/salsa20-i586-asm_32.S   | 938 
 arch/x86/crypto/salsa20-x86_64-asm_64.S | 805 
 arch/x86/crypto/salsa20_glue.c  |  91 ---
 crypto/Kconfig  |  28 -
 5 files changed, 1866 deletions(-)
 delete mode 100644 arch/x86/crypto/salsa20-i586-asm_32.S
 delete mode 100644 arch/x86/crypto/salsa20-x86_64-asm_64.S
 delete mode 100644 arch/x86/crypto/salsa20_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 3813e7cdaada..2e07a0e66314 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -15,7 +15,6 @@ obj-$(CONFIG_CRYPTO_GLUE_HELPER_X86) += glue_helper.o
 
 obj-$(CONFIG_CRYPTO_AES_586) += aes-i586.o
 obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o
-obj-$(CONFIG_CRYPTO_SALSA20_586) += salsa20-i586.o
 obj-$(CONFIG_CRYPTO_SERPENT_SSE2_586) += serpent-sse2-i586.o
 
 obj-$(CONFIG_CRYPTO_AES_X86_64) += aes-x86_64.o
@@ -24,7 +23,6 @@ obj-$(CONFIG_CRYPTO_CAMELLIA_X86_64) += camellia-x86_64.o
 obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o
 obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
 obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o
-obj-$(CONFIG_CRYPTO_SALSA20_X86_64) += salsa20-x86_64.o
 obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha20-x86_64.o
 obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o
 obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
@@ -68,7 +66,6 @@ endif
 
 aes-i586-y := aes-i586-asm_32.o aes_glue.o
 twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
-salsa20-i586-y := salsa20-i586-asm_32.o salsa20_glue.o
 serpent-sse2-i586-y := serpent-sse2-i586-asm_32.o serpent_sse2_glue.o
 
 aes-x86_64-y := aes-x86_64-asm_64.o aes_glue.o
@@ -77,7 +74,6 @@ camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o
 blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o
 twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o
 twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o
-salsa20-x86_64-y := salsa20-x86_64-asm_64.o salsa20_glue.o
 chacha20-x86_64-y := chacha20-ssse3-x86_64.o chacha20_glue.o
 serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o
 
diff --git a/arch/x86/crypto/salsa20-i586-asm_32.S 
b/arch/x86/crypto/salsa20-i586-asm_32.S
deleted file mode 100644
index 6014b7b9e52a..
--- a/arch/x86/crypto/salsa20-i586-asm_32.S
+++ /dev/null
@@ -1,938 +0,0 @@
-# Derived from:
-#  salsa20_pm.s version 20051229
-#  D. J. Bernstein
-#  Public domain.
-
-#include 
-
-.text
-
-# enter salsa20_encrypt_bytes
-ENTRY(salsa20_encrypt_bytes)
-   mov %esp,%eax
-   and $31,%eax
-   add $256,%eax
-   sub %eax,%esp
-   # eax_stack = eax
-   movl%eax,80(%esp)
-   # ebx_stack = ebx

[PATCH 2/2] crypto: salsa20 - Revert "crypto: salsa20 - export generic helpers"

2018-05-26 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

This reverts commit eb772f37ae8163a89e28a435f6a18742ae06653b, as now the
x86 Salsa20 implementation has been removed and the generic helpers are
no longer needed outside of salsa20_generic.c.

We could keep this just in case someone else wants to add a new
optimized Salsa20 implementation.  But given that we have ChaCha20 now
too, I think it's unlikely.  And this can always be reverted back.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/salsa20_generic.c | 20 +---
 include/crypto/salsa20.h | 27 ---
 2 files changed, 13 insertions(+), 34 deletions(-)
 delete mode 100644 include/crypto/salsa20.h

diff --git a/crypto/salsa20_generic.c b/crypto/salsa20_generic.c
index 5074006a56c3..8c77bc78a09f 100644
--- a/crypto/salsa20_generic.c
+++ b/crypto/salsa20_generic.c
@@ -21,9 +21,17 @@
 
 #include 
 #include 
-#include 
 #include 
 
+#define SALSA20_IV_SIZE8
+#define SALSA20_MIN_KEY_SIZE  16
+#define SALSA20_MAX_KEY_SIZE  32
+#define SALSA20_BLOCK_SIZE64
+
+struct salsa20_ctx {
+   u32 initial_state[16];
+};
+
 static void salsa20_block(u32 *state, __le32 *stream)
 {
u32 x[16];
@@ -93,16 +101,15 @@ static void salsa20_docrypt(u32 *state, u8 *dst, const u8 
*src,
}
 }
 
-void crypto_salsa20_init(u32 *state, const struct salsa20_ctx *ctx,
+static void salsa20_init(u32 *state, const struct salsa20_ctx *ctx,
 const u8 *iv)
 {
memcpy(state, ctx->initial_state, sizeof(ctx->initial_state));
state[6] = get_unaligned_le32(iv + 0);
state[7] = get_unaligned_le32(iv + 4);
 }
-EXPORT_SYMBOL_GPL(crypto_salsa20_init);
 
-int crypto_salsa20_setkey(struct crypto_skcipher *tfm, const u8 *key,
+static int salsa20_setkey(struct crypto_skcipher *tfm, const u8 *key,
  unsigned int keysize)
 {
static const char sigma[16] = "expand 32-byte k";
@@ -143,7 +150,6 @@ int crypto_salsa20_setkey(struct crypto_skcipher *tfm, 
const u8 *key,
 
return 0;
 }
-EXPORT_SYMBOL_GPL(crypto_salsa20_setkey);
 
 static int salsa20_crypt(struct skcipher_request *req)
 {
@@ -155,7 +161,7 @@ static int salsa20_crypt(struct skcipher_request *req)
 
err = skcipher_walk_virt(&walk, req, true);
 
-   crypto_salsa20_init(state, ctx, walk.iv);
+   salsa20_init(state, ctx, walk.iv);
 
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
@@ -183,7 +189,7 @@ static struct skcipher_alg alg = {
.max_keysize= SALSA20_MAX_KEY_SIZE,
.ivsize = SALSA20_IV_SIZE,
.chunksize  = SALSA20_BLOCK_SIZE,
-   .setkey = crypto_salsa20_setkey,
+   .setkey = salsa20_setkey,
.encrypt= salsa20_crypt,
.decrypt= salsa20_crypt,
 };
diff --git a/include/crypto/salsa20.h b/include/crypto/salsa20.h
deleted file mode 100644
index 19ed48aefc86..
--- a/include/crypto/salsa20.h
+++ /dev/null
@@ -1,27 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Common values for the Salsa20 algorithm
- */
-
-#ifndef _CRYPTO_SALSA20_H
-#define _CRYPTO_SALSA20_H
-
-#include 
-
-#define SALSA20_IV_SIZE8
-#define SALSA20_MIN_KEY_SIZE   16
-#define SALSA20_MAX_KEY_SIZE   32
-#define SALSA20_BLOCK_SIZE 64
-
-struct crypto_skcipher;
-
-struct salsa20_ctx {
-   u32 initial_state[16];
-};
-
-void crypto_salsa20_init(u32 *state, const struct salsa20_ctx *ctx,
-const u8 *iv);
-int crypto_salsa20_setkey(struct crypto_skcipher *tfm, const u8 *key,
- unsigned int keysize);
-
-#endif /* _CRYPTO_SALSA20_H */
-- 
2.17.0



[PATCH 0/2] crypto: remove x86 salsa20 implementations

2018-05-26 Thread Eric Biggers
Hello,

The x86 asm implementations of Salsa20 have been missed so far in the
fixes to stop abusing %ebp/%rbp in asm code to get correct stack traces.
This has been causing the unwinder warnings reported by syzkaller to
continue.

This series "fixes" it by just removing the offending salsa20-asm
implementations, which as far as I can tell are basically useless these
days; the x86_64 asm version in particular isn't actually any faster
than the C version anymore.  (And possibly no one even uses these
anyway.)  See the patch for the full explanation.

Eric Biggers (2):
  crypto: x86/salsa20 - remove x86 salsa20 implementations
  crypto: salsa20 - Revert "crypto: salsa20 - export generic helpers"

 arch/x86/crypto/Makefile|   4 -
 arch/x86/crypto/salsa20-i586-asm_32.S   | 938 
 arch/x86/crypto/salsa20-x86_64-asm_64.S | 805 
 arch/x86/crypto/salsa20_glue.c  |  91 ---
 crypto/Kconfig  |  28 -
 crypto/salsa20_generic.c|  20 +-
 include/crypto/salsa20.h|  27 -
 7 files changed, 13 insertions(+), 1900 deletions(-)
 delete mode 100644 arch/x86/crypto/salsa20-i586-asm_32.S
 delete mode 100644 arch/x86/crypto/salsa20-x86_64-asm_64.S
 delete mode 100644 arch/x86/crypto/salsa20_glue.c
 delete mode 100644 include/crypto/salsa20.h

-- 
2.17.0



Re: PBKDF2 support in the linux kernel

2018-05-25 Thread Eric Biggers
Hi Denis,

On Fri, May 25, 2018 at 09:48:36AM -0500, Denis Kenzior wrote:
> Hi Eric,
> 
> > The solution to the "too many system calls" problem is trivial: just do 
> > SHA-512
> > in userspace.  It's just math; you don't need a system call, any more than 
> > you
> > would call sys_add(1, 1) to compute 1 + 1.  The CPU instructions that can
> > accelerate SHA-512, such as AVX and ARM CE, are available in userspace too; 
> > and
> > there are tons of libraries already available that implement it for you.  
> > Your
> > argument isn't fundamentally different from saying that sys_leftpad() (if 
> > we had
> > the extraordinary misfortune of it actually existing) is too slow, so we 
> > should
> > add a Javascript interpreter to the kernel.
> 
> So lets recap.  The Kernel crypto framework is something that:
> a) (some, many?) people are totally happy with, it does everything that they
> want
> b) is peer reviewed by the best programmers in the world
> c) responds / fixes vulnerabilities almost instantly
> d) automatically picks the best software optimized version of a given crypto
> algorithm for us
> e) automagically uses hardware optimization if the system supports it
> f) API compatibility is essentially guaranteed forever
> g) Maybe not the most performant in the world, but to many users this
> doesn't matter.
> 
> So your response to those users is to please stop using what works well and
> start adding random crypto code from the internet into their project?
> Something that likely won't do a, b, c, d, e or f above just because *oh
> gosh* we might find and have to fix some bugs in the kernel?  Have you
> actually thought through how that sounds?
> 
> What you call laziness I call 'common sense' and 'good security practice.'
> Does using the kernel make sense for everyone? No.  But for some it does.
> So if there's a legitimate way to make things better, can we not discuss
> them civilly?

I've explained this already -- it is exactly the *opposite* of good security
practice to increase the attack surface of ring 0 code.  Have *you* actually
thought about how ridiculous it is to elevate privileges just to do math?  We
need to be reducing the kernel attack surface, not increasing it.

It's great that you're confident in me, Herbert, Stephen, and other people who
contribute to the Linux crypto API.  The reality though is that there are still
known denial of service vulnerabilities that no one has found time to fix yet,
like deadlocking the kernel through recursive pcrypt, or exhausting kernel
memory by instantiating an unbounded number of crypto algorithms.  These types
of bugs aren't relevant for userspace crypto libraries as it would just be an
application shooting itself in the foot, but in the kernel they are a problem
for everyone, at least without e.g. an SELinux policy in place to lock down the
attack surface.

Again, all your arguments minus the crypto-specific parts also apply to
sys_leftpad(), which to be very clear was an April Fool's joke, and not an
actual proposal.  I trust that this wasn't meant to be a very late April Fool's
joke too :-)

> 
> > 
> > Also note that in the rare cases where the kernel actually does do very long
> > running calculations for whatever reason, kernel developers pretty regularly
> > screw it up by forgetting to insert explicit preemption points 
> > (cond_resched()),
> > or (slightly less bad) making it noninterruptible.  I had to "fix" one of 
> > these,
> > where someone for whatever reason added a keyctl() operation that does
> > Diffie-Hellman key exchange in software.  In !CONFIG_PREEMPT kernels any
> > unprivileged user could call it to lock up all CPUs for 20+ seconds, meaning
> > that no other processes can be scheduled on them.  This isn't a problem at 
> > all
> > in userspace.
> 
> And this is exactly why people should _want_ to use the kernel crypto
> framework.  Because people like you exist and fix such issues.  So again,
> kudos :)
> 

No, it's exactly why people should *not* want to do crypto in the kernel,
because that class of bug cannot exist in userspace code.

Eric


Re: PBKDF2 support in the linux kernel

2018-05-24 Thread Eric Biggers
Hi Denis,

On Thu, May 24, 2018 at 07:56:50PM -0500, Denis Kenzior wrote:
> Hi Ted,
> 
> > > I'm not really here to criticize or judge the past.  AF_ALG exists now. It
> > > is being used.  Can we just make it better?  Or are we going to whinge at
> > > every user that tries to use (and improve) kernel features that (some)
> > > people disagree with because it can 'compromise' kernel security?
> > 
> > Another point of view is that it was arguably a mistake, and we
> > shouldn't make it worse.
> 
> Fair enough.  I'm just voicing the opposite point of view.  Namely that you
> have created something nice, and useful.  Even if it turned out not quite
> like you thought it would be in hindsight.
> 
> > 
> > > > Also, if speed isn't a worry, why not just a single software-only
> > > > implementation of SHA1, and be done with it?  It's what I did in
> > > > e2fsprogs for e4crypt.
> > > 
> > > If things were that simple, we would definitely not be having this 
> > > exchange.
> > > Lets just say we use just about every feature that crypto subsystem 
> > > provides
> > > in some way.
> > 
> > What I'm saying here is if you need to code PBKDF2 in user-space, it
> > _really_ isn't hard.  I've done it.  It's less than 7k of source code
> > to implement SHA512:
> > 
> > https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/tree/lib/ext2fs/sha256.c
> > 
> > and then less than 50 lines of code to implement PBKDF2 (I didn't even
> > bother putting in a library such as libext2fs):
> > 
> > https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/tree/misc/e4crypt.c#n405
> > 
> > This is all you would need to do if we don't put PBKDF2 in the kernel.
> > Is it really that onerous?
> 
> No.  And in fact if you read upthread you will notice that I provided a link
> to our implementation of both PBKDF1 & 2 and they're as small as you say.
> They do everything we need.  So I'm right there with you.
> 
> But, PBKDF uses like 4K iterations (for WiFi passphrase -> key conversion
> for example) to arrive at its solution.  So you have implementations
> hammering the kernel with system calls.
> 
> So we can whinge at these implementations for 'being lazy', wring our hands,
> say how everything was just a big mistake.  Or maybe we can do something so
> that the kernel isn't hammered needlessly...
> 

The solution to the "too many system calls" problem is trivial: just do SHA-512
in userspace.  It's just math; you don't need a system call, any more than you
would call sys_add(1, 1) to compute 1 + 1.  The CPU instructions that can
accelerate SHA-512, such as AVX and ARM CE, are available in userspace too; and
there are tons of libraries already available that implement it for you.  Your
argument isn't fundamentally different from saying that sys_leftpad() (if we had
the extraordinary misfortune of it actually existing) is too slow, so we should
add a Javascript interpreter to the kernel.
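
To put a number on "trivial": a complete PBKDF2 is a few dozen lines on top of
any userspace HMAC, and most libraries already ship a ready-made helper (e.g.
OpenSSL's PKCS5_PBKDF2_HMAC()).  A rough sketch assuming OpenSSL's HMAC API,
for illustration only:

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

/* Minimal PBKDF2-HMAC-SHA256 (RFC 8018, section 5.2); not hardened. */
static void pbkdf2_sha256(const uint8_t *pw, size_t pwlen,
			  const uint8_t *salt, size_t saltlen,
			  unsigned int iters, uint8_t *dk, size_t dklen)
{
	const unsigned int hlen = 32;		/* SHA-256 output size */
	uint8_t U[EVP_MAX_MD_SIZE], T[32];
	uint32_t i;

	for (i = 1; dklen > 0; i++) {
		size_t take = dklen < hlen ? dklen : hlen;
		uint8_t be_i[4] = { i >> 24, i >> 16, i >> 8, i };
		unsigned int len, j, k;
		HMAC_CTX *ctx = HMAC_CTX_new();

		/* U_1 = PRF(P, S || INT(i)) */
		HMAC_Init_ex(ctx, pw, pwlen, EVP_sha256(), NULL);
		HMAC_Update(ctx, salt, saltlen);
		HMAC_Update(ctx, be_i, sizeof(be_i));
		HMAC_Final(ctx, U, &len);
		HMAC_CTX_free(ctx);
		memcpy(T, U, hlen);

		/* U_j = PRF(P, U_{j-1});  T_i = U_1 ^ U_2 ^ ... ^ U_c */
		for (j = 2; j <= iters; j++) {
			HMAC(EVP_sha256(), pw, pwlen, U, hlen, U, &len);
			for (k = 0; k < hlen; k++)
				T[k] ^= U[k];
		}

		memcpy(dk, T, take);
		dk += take;
		dklen -= take;
	}
}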

Also note that in the rare cases where the kernel actually does do very long
running calculations for whatever reason, kernel developers pretty regularly
screw it up by forgetting to insert explicit preemption points (cond_resched()),
or (slightly less bad) making it noninterruptible.  I had to "fix" one of these,
where someone for whatever reason added a keyctl() operation that does
Diffie-Hellman key exchange in software.  In !CONFIG_PREEMPT kernels any
unprivileged user could call it to lock up all CPUs for 20+ seconds, meaning
that no other processes can be scheduled on them.  This isn't a problem at all
in userspace.

Eric


Re: PBKDF2 support in the linux kernel

2018-05-24 Thread Eric Biggers
On Thu, May 24, 2018 at 09:36:15AM -0500, Denis Kenzior wrote:
> Hi Stephan,
> 
> On 05/24/2018 12:57 AM, Stephan Mueller wrote:
> > Am Donnerstag, 24. Mai 2018, 04:45:00 CEST schrieb Eric Biggers:
> > 
> > Hi Eric,
> > 
> > > 
> > > "Not having to rely on any third-party library" is not an excuse to add
> > > random code to the kernel, which runs in a privileged context.  Please do
> > > PBKDF2 in userspace instead.
> > 
> > I second that. Besides, if you really need to rely on the kernel crypto API 
> > to
> > do that because you do not want to add yet another crypto lib, libkcapi has 
> > a
> > PBKDF2 implementation that uses the kernel crypto API via AF_ALG. I.e. the
> > kernel crypto API is used and yet the PBKDF logic is in user space.
> > 
> > http://www.chronox.de/libkcapi.html
> > 
> 
> I actually don't see why we _shouldn't_ have PBKDF in the kernel.  We
> already have at least 2 user space libraries that implement it via AF_ALG.
> ell does this as well:
> https://git.kernel.org/pub/scm/libs/ell/ell.git/tree/ell/pkcs5.c
> 
> One can argue whether this is a good or bad idea, but the cat is out of the
> bag.
> 
> So from a practical perspective, would it not be better to make this an
> explicit kernel API and not have userspace hammer AF_ALG socket a few
> thousand times to do what it wants?
> 

No, we don't add random code to the kernel just because people are lazy.  IMO it
was a mistake that AF_ALG allows access to software crypto implementations by
default (as opposed to just hardware crypto devices), but it's not an excuse to
add random other stuff to the kernel.  The kernel runs in a privileged context
under special constraints, e.g. non-preemptible in some configurations, and any
bug can crash or lock up the system, leak data, or even allow elevation of
privilege.  We're already dealing with hundreds of bugs in the kernel found by
fuzzing [1], many of which no one feels very responsible for fixing.  In fact
about 20 bugs were reported in AF_ALG as soon as definitions for AF_ALG were
added to syzkaller; at least a couple were very likely exploitable to gain
arbitrary kernel code execution.  The last thing we need is adding even more
code to the kernel just because people are too lazy to write userspace code.  Do
we need sys_leftpad() [2] next?

[1] https://syzkaller.appspot.com/
[2] https://lkml.org/lkml/2016/3/31/1108

- Eric


Re: PBKDF2 support in the linux kernel

2018-05-23 Thread Eric Biggers
Hi Yu,

On Thu, May 24, 2018 at 10:26:12AM +0800, Yu Chen wrote:
> Hi Stephan,
> thanks for your reply,
> On Wed, May 23, 2018 at 1:43 AM Stephan Mueller  wrote:
> 
> > Am Dienstag, 22. Mai 2018, 05:00:40 CEST schrieb Yu Chen:
> 
> > Hi Yu,
> 
> > > Hi all,
> > > The request is that, we'd like to generate a symmetric key derived from
> > > user provided passphase(not rely on any third-party library). May I
> know if
> > > there is a PBKDF2(Password-Based Key Derivation Function 2) support in
> the
> > > kernel? (https://tools.ietf.org/html/rfc2898#5.2)
> > > We have hmac sha1 in the kernel, do we have plan to port/implement
> > > corresponding PBKDF2 in the kernel too?
> 
> > There is no PBKDF2 support in the kernel.
> 
> I saw that there's already a kdf implementation using SP800-56A
> in security/keys/dh.c, I think I can learn from that and  implement PDKDF2
> using similar code.
> > Ciao
> > Stephan
> Best,
> Yu

"Not having to rely on any third-party library" is not an excuse to add random
code to the kernel, which runs in a privileged context.  Please do PBKDF2 in
userspace instead.

- Eric
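
For reference, doing PBKDF2 entirely in userspace is only a few lines; a
minimal sketch using OpenSSL's PKCS5_PBKDF2_HMAC() (RFC 2898), where the
passphrase, salt, and iteration count are arbitrary example values:

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

int main(void)
{
	/* Example inputs only -- real callers take these from the user. */
	const char *pass = "example passphrase";
	static const unsigned char salt[] = "example salt";
	unsigned char key[32];		/* 256-bit derived key */
	size_t i;

	/* PBKDF2-HMAC-SHA1 to match the thread; other digests work the same. */
	if (!PKCS5_PBKDF2_HMAC(pass, strlen(pass), salt, sizeof(salt) - 1,
			       100000 /* iterations */, EVP_sha1(),
			       sizeof(key), key))
		return 1;

	for (i = 0; i < sizeof(key); i++)
		printf("%02x", key[i]);
	printf("\n");
	return 0;
}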


[PATCH 4/5] crypto: testmgr - add extra kw(aes) encryption test vector

2018-05-20 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

One "kw(aes)" decryption test vector doesn't exactly match an encryption
test vector with input and result swapped.  In preparation for removing
the decryption test vectors, add this test vector to the encryption test
vectors, so we don't lose any test coverage.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/testmgr.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 745a0ed0a73a..75ddbf790a99 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -33695,6 +33695,19 @@ static const struct cipher_testvec 
aes_kw_enc_tv_template[] = {
  "\xf5\x6f\xab\xea\x25\x48\xf5\xfb",
.rlen   = 16,
.iv_out = "\x03\x1f\x6b\xd7\xe6\x1e\x64\x3d",
+   }, {
+   .key= "\x80\xaa\x99\x73\x27\xa4\x80\x6b"
+ "\x6a\x7a\x41\xa5\x2b\x86\xc3\x71"
+ "\x03\x86\xf9\x32\x78\x6e\xf7\x96"
+ "\x76\xfa\xfb\x90\xb8\x26\x3c\x5f",
+   .klen   = 32,
+   .input  = "\x0a\x25\x6b\xa7\x5c\xfa\x03\xaa"
+ "\xa0\x2b\xa9\x42\x03\xf1\x5b\xaa",
+   .ilen   = 16,
+   .result = "\xd3\x3d\x3d\x97\x7b\xf0\xa9\x15"
+ "\x59\xf9\x9c\x8a\xcd\x29\x3d\x43",
+   .rlen   = 16,
+   .iv_out = "\x42\x3c\x96\x0d\x8a\x2a\xc4\xc1",
},
 };
 
-- 
2.17.0



[PATCH 3/5] crypto: testmgr - add extra ecb(tnepres) encryption test vectors

2018-05-20 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

None of the four "ecb(tnepres)" decryption test vectors exactly match an
encryption test vector with input and result swapped.  In preparation
for removing the decryption test vectors, add these to the encryption
test vectors, so we don't lose any test coverage.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/testmgr.h | 40 +++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index bdc67c058d5c..745a0ed0a73a 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -12047,6 +12047,14 @@ static const struct cipher_testvec 
serpent_enc_tv_template[] = {
 };
 
 static const struct cipher_testvec tnepres_enc_tv_template[] = {
+   { /* KeySize=0 */
+   .input  = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+   .ilen   = 16,
+   .result = "\x41\xcc\x6b\x31\x59\x31\x45\x97"
+ "\x6d\x6f\xbb\x38\x4b\x37\x21\x28",
+   .rlen   = 16,
+   },
{ /* KeySize=128, PT=0, I=1 */
.input  = "\x00\x00\x00\x00\x00\x00\x00\x00"
  "\x00\x00\x00\x00\x00\x00\x00\x00",
@@ -12057,6 +12065,24 @@ static const struct cipher_testvec 
tnepres_enc_tv_template[] = {
.result = "\x49\xaf\xbf\xad\x9d\x5a\x34\x05"
  "\x2c\xd8\xff\xa5\x98\x6b\xd2\xdd",
.rlen   = 16,
+   }, { /* KeySize=128 */
+   .key= "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+   .klen   = 16,
+   .input  = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+   .ilen   = 16,
+   .result = "\xea\xf4\xd7\xfc\xd8\x01\x34\x47"
+ "\x81\x45\x0b\xfa\x0c\xd6\xad\x6e",
+   .rlen   = 16,
+   }, { /* KeySize=128, I=121 */
+   .key= 
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80",
+   .klen   = 16,
+   .input  = zeroed_string,
+   .ilen   = 16,
+   .result = "\x3d\xda\xbf\xc0\x06\xda\xab\x06"
+ "\x46\x2a\xf4\xef\x81\x54\x4e\x26",
+   .rlen   = 16,
}, { /* KeySize=192, PT=0, I=1 */
.key= "\x80\x00\x00\x00\x00\x00\x00\x00"
  "\x00\x00\x00\x00\x00\x00\x00\x00"
@@ -12092,7 +12118,19 @@ static const struct cipher_testvec 
tnepres_enc_tv_template[] = {
.result = "\x5c\xe7\x1c\x70\xd2\x88\x2e\x5b"
  "\xb8\x32\xe4\x33\xf8\x9f\x26\xde",
.rlen   = 16,
-   },
+   }, { /* KeySize=256 */
+   .key= "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f"
+ "\x10\x11\x12\x13\x14\x15\x16\x17"
+ "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f",
+   .klen   = 32,
+   .input  = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+   .ilen   = 16,
+   .result = "\x64\xa9\x1a\x37\xed\x9f\xe7\x49"
+ "\xa8\x4e\x76\xd6\xf5\x0d\x78\xee",
+   .rlen   = 16,
+   }
 };
 
 
-- 
2.17.0



[PATCH 1/5] crypto: testmgr - add extra ecb(des) encryption test vectors

2018-05-20 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

Two "ecb(des)" decryption test vectors don't exactly match any of the
encryption test vectors with input and result swapped.  In preparation
for removing the decryption test vectors, add these to the encryption
test vectors, so we don't lose any test coverage.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/testmgr.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 552d8f00d85b..5ab36fb8dd31 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -5595,6 +5595,28 @@ static const struct cipher_testvec des_enc_tv_template[] 
= {
.rlen   = 16,
.np = 2,
.tap= { 8, 8 }
+   }, {
+   .key= "\x01\x23\x45\x67\x89\xab\xcd\xef",
+   .klen   = 8,
+   .input  = "\x01\x23\x45\x67\x89\xab\xcd\xe7"
+ "\xa3\x99\x7b\xca\xaf\x69\xa0\xf5",
+   .ilen   = 16,
+   .result = "\xc9\x57\x44\x25\x6a\x5e\xd3\x1d"
+ "\x69\x0f\x5b\x0d\x9a\x26\x93\x9b",
+   .rlen   = 16,
+   .np = 2,
+   .tap= { 8, 8 }
+   }, {
+   .key= "\x01\x23\x45\x67\x89\xab\xcd\xef",
+   .klen   = 8,
+   .input  = "\x01\x23\x45\x67\x89\xab\xcd\xe7"
+ "\xa3\x99\x7b\xca\xaf\x69\xa0\xf5",
+   .ilen   = 16,
+   .result = "\xc9\x57\x44\x25\x6a\x5e\xd3\x1d"
+ "\x69\x0f\x5b\x0d\x9a\x26\x93\x9b",
+   .rlen   = 16,
+   .np = 3,
+   .tap= { 3, 12, 1 }
}, { /* Four blocks -- for testing encryption with chunking */
.key= "\x01\x23\x45\x67\x89\xab\xcd\xef",
.klen   = 8,
-- 
2.17.0



[PATCH 2/5] crypto: testmgr - make a cbc(des) encryption test vector chunked

2018-05-20 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

One "cbc(des)" decryption test vector doesn't exactly match an
encryption test vector with input and result swapped.  It's *almost* the
same as one, but the decryption version is "chunked" while the
encryption version is "unchunked".  In preparation for removing the
decryption test vectors, make the encryption one both chunked and
unchunked, so we don't lose any test coverage.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/testmgr.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 5ab36fb8dd31..bdc67c058d5c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -5885,6 +5885,9 @@ static const struct cipher_testvec 
des_cbc_enc_tv_template[] = {
.ilen   = 8,
.result = "\x68\x37\x88\x49\x9a\x7c\x05\xf6",
.rlen   = 8,
+   .np = 2,
+   .tap= { 4, 4 },
+   .also_non_np = 1,
}, { /* Copy of openssl vector for chunk testing */
 /* From OpenSSL */
.key= "\x01\x23\x45\x67\x89\xab\xcd\xef",
-- 
2.17.0



[PATCH 0/5] crypto: eliminate redundant decryption test vectors

2018-05-20 Thread Eric Biggers
Hello,

When adding the Speck cipher support I was annoyed by having to add both
encryption and decryption test vectors, since they are redundant: the
decryption ones are just the encryption ones with the input and result
flipped.

It turns out that's nearly always the case for all the other
ciphers/skciphers too.  A few have slight differences, but they seem to
be accidental except for "kw(aes)", and we can still handle "kw(aes)"
nearly as easily with just one copy of the test vectors.

Therefore, this series removes all the decryption cipher_testvecs and
updates testmgr to test both encryption and decryption using what used
to be the encryption test vectors.  I did not change any of the AEAD
test vectors, though a similar change could be made for them too.

Patches 1-4 add some encryption test vectors, just so no test coverage
is lost.  Patch 5 is the real patch.  Due to the 10,000+ lines deleted
from testmgr.h, the patch file is 615 KB so it may be too large for the
mailing list.  You can also grab the series from git:
https://github.com/ebiggers/linux, branch "test_vector_redundancy_v1"
(HEAD is a09e48518f957bb61bb278227917eaad64cf13be).  Most of the patch
is scripted, but there are also some manual changes, mostly to
testmgr.c.  For review purposes, in case the full 615 KB patch doesn't
reach the mailing list, I'm also pasting an abbreviated version of the
patch below that excludes the scripted changes to testmgr.h, i.e. it
only includes my manual changes on top of the scripted changes.

Eric Biggers (5):
  crypto: testmgr - add extra ecb(des) encryption test vectors
  crypto: testmgr - make a cbc(des) encryption test vector chunked
  crypto: testmgr - add extra ecb(tnepres) encryption test vectors
  crypto: testmgr - add extra kw(aes) encryption test vector
  crypto: testmgr - eliminate redundant decryption test vectors

 crypto/testmgr.c |   409 +-
 crypto/testmgr.h | 12227 -
 2 files changed, 954 insertions(+), 11682 deletions(-)

(Abbreviated patch for review purposes only begins here, in case full
 patch is too large for the list; also see git link above)

[PATCH 5/5] crypto: testmgr - eliminate redundant decryption test vectors

Currently testmgr has separate encryption and decryption test vectors
for symmetric ciphers.  That's massively redundant, since with few
exceptions (mostly mistakes, apparently), all decryption tests are
identical to the encryption tests, just with the input/result flipped.

Therefore, eliminate the redundancy by removing the decryption test
vectors and updating testmgr to test both encryption and decryption
using what used to be the encryption test vectors.  Naming is adjusted
accordingly: each cipher_testvec now has a 'ptext' (plaintext), 'ctext'
(ciphertext), and 'len' instead of an 'input', 'result', 'ilen', and
'rlen'.  Note that it was always the case that 'ilen == rlen'.

AES keywrap ("kw(aes)") is special because its IV is generated by the
encryption.  Previously this was handled by specifying 'iv_out' for
encryption and 'iv' for decryption.  To make it work cleanly with only
one set of test vectors, put the IV in 'iv', remove 'iv_out', and add a
boolean that indicates that the IV is generated by the encryption.
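
For illustration, a converted entry is shaped roughly like the sketch below;
the data values are placeholders and 'generates_iv' is only an assumed name
for the new boolean, not necessarily the field name used in the patch:

#include <stdbool.h>

struct cipher_testvec_sketch {		/* illustrative, not the real struct */
	const char *key;
	const char *iv;		/* for "kw(aes)": the IV produced by encryption */
	const char *ptext;	/* plaintext  (was .input) */
	const char *ctext;	/* ciphertext (was .result) */
	unsigned short klen;
	unsigned int len;	/* single length (was .ilen == .rlen) */
	bool generates_iv;	/* assumed name: IV is an output of encryption */
};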

In total, this removes over 10,000 lines from testmgr.h, with no
reduction in test coverage since prior patches already copied the few
unique decryption test vectors into the encryption test vectors.

This covers all algorithms that used 'struct cipher_testvec', e.g. any
block cipher in the ECB, CBC, CTR, XTS, LRW, CTS-CBC, PCBC, OFB, or
keywrap modes, and Salsa20 and ChaCha20.  No change is made to AEAD
tests, though we probably can eliminate a similar redundancy there too.

The testmgr.h portion of this patch was automatically generated using
the following awk script, with some slight manual fixups on top (updated
'struct cipher_testvec' definition, updated a few comments, and fixed up
the AES keywrap test vectors):

BEGIN { OTHER = 0; ENCVEC = 1; DECVEC = 2; DECVEC_TAIL = 3; mode = OTHER }

/^static const struct cipher_testvec.*_enc_/ { sub("_enc", ""); mode = ENCVEC }
/^static const struct cipher_testvec.*_dec_/ { mode = DECVEC }
mode == ENCVEC && !/\.ilen[[:space:]]*=/ {
sub(/\.input[[:space:]]*=$/,".ptext =")
sub(/\.input[[:space:]]*=/, ".ptext\t=")
sub(/\.result[[:space:]]*=$/,   ".ctext =")
sub(/\.result[[:space:]]*=/,".ctext\t=")
sub(/\.rlen[[:space:]]*=/,  ".len\t=")
print
}
mode == DECVEC_TAIL && /[^[:space:]]/ { mode = OTHER }
mode == OTHER { print }
mode == ENCVEC && /^};/   { mode = OTHER }
mode == DECVEC && /^};/   { mode = DECVEC_TAIL }

Note that git's default diff algorithm gets confused by the testm

[PATCH 3/6] crypto: crc32-generic - remove __crc32_le()

2018-05-19 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

The __crc32_le() wrapper function is pointless.  Just call crc32_le()
directly instead.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/crc32_generic.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/crypto/crc32_generic.c b/crypto/crc32_generic.c
index 20b879881a2d..00facd27bcc2 100644
--- a/crypto/crc32_generic.c
+++ b/crypto/crc32_generic.c
@@ -40,11 +40,6 @@
 #define CHKSUM_BLOCK_SIZE  1
 #define CHKSUM_DIGEST_SIZE 4
 
-static u32 __crc32_le(u32 crc, unsigned char const *p, size_t len)
-{
-   return crc32_le(crc, p, len);
-}
-
 /** No default init with ~0 */
 static int crc32_cra_init(struct crypto_tfm *tfm)
 {
@@ -55,7 +50,6 @@ static int crc32_cra_init(struct crypto_tfm *tfm)
return 0;
 }
 
-
 /*
  * Setting the seed allows arbitrary accumulators and flexible XOR policy
  * If your algorithm starts with ~0, then XOR with ~0 before you set
@@ -89,7 +83,7 @@ static int crc32_update(struct shash_desc *desc, const u8 
*data,
 {
u32 *crcp = shash_desc_ctx(desc);
 
-   *crcp = __crc32_le(*crcp, data, len);
+   *crcp = crc32_le(*crcp, data, len);
return 0;
 }
 
@@ -97,7 +91,7 @@ static int crc32_update(struct shash_desc *desc, const u8 
*data,
 static int __crc32_finup(u32 *crcp, const u8 *data, unsigned int len,
 u8 *out)
 {
-   put_unaligned_le32(__crc32_le(*crcp, data, len), out);
+   put_unaligned_le32(crc32_le(*crcp, data, len), out);
return 0;
 }
 
-- 
2.17.0



[PATCH 2/6] crypto: crc32c-generic - remove cra_alignmask

2018-05-19 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

crc32c-generic sets an alignmask, but actually its ->update() works with
any alignment; only its ->setkey() and outputting the final digest
assume an alignment.  To prevent the buffer from having to be aligned by
the crypto API for just these cases, switch these cases over to the
unaligned access macros and remove the cra_alignmask.  Note that this
also makes crc32c-generic more consistent with crc32-generic.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/crc32c_generic.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/crypto/crc32c_generic.c b/crypto/crc32c_generic.c
index 372320399622..7283066ecc98 100644
--- a/crypto/crc32c_generic.c
+++ b/crypto/crc32c_generic.c
@@ -35,6 +35,7 @@
  *
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -82,7 +83,7 @@ static int chksum_setkey(struct crypto_shash *tfm, const u8 
*key,
crypto_shash_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
return -EINVAL;
}
-   mctx->key = le32_to_cpu(*(__le32 *)key);
+   mctx->key = get_unaligned_le32(key);
return 0;
 }
 
@@ -99,13 +100,13 @@ static int chksum_final(struct shash_desc *desc, u8 *out)
 {
struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
 
-   *(__le32 *)out = ~cpu_to_le32p(&ctx->crc);
+   put_unaligned_le32(~ctx->crc, out);
return 0;
 }
 
 static int __chksum_finup(u32 *crcp, const u8 *data, unsigned int len, u8 *out)
 {
-   *(__le32 *)out = ~cpu_to_le32(__crc32c_le(*crcp, data, len));
+   put_unaligned_le32(~__crc32c_le(*crcp, data, len), out);
return 0;
 }
 
@@ -148,7 +149,6 @@ static struct shash_alg alg = {
.cra_priority   =   100,
.cra_flags  =   CRYPTO_ALG_OPTIONAL_KEY,
.cra_blocksize  =   CHKSUM_BLOCK_SIZE,
-   .cra_alignmask  =   3,
.cra_ctxsize=   sizeof(struct chksum_ctx),
.cra_module =   THIS_MODULE,
.cra_init   =   crc32c_cra_init,
-- 
2.17.0



[PATCH 6/6] crypto: testmgr - add more unkeyed crc32 and crc32c test vectors

2018-05-19 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

crc32c has an unkeyed test vector but crc32 did not.  Add the crc32c one
(which uses an empty input) to crc32 too, and also add a new one to both
that uses a nonempty input.  These test vectors verify that crc32 and
crc32c implementations use the correct default initial state.
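
As a quick userspace spot check of the new unkeyed crc32c vector, the kernel
implementation can be queried through AF_ALG.  This is a sketch only: it
assumes the software crc32c is reachable through AF_ALG on the running
kernel, and error handling is omitted:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

int main(void)
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type   = "hash",
		.salg_name   = "crc32c",
	};
	/* Expected digest from the test vector above. */
	static const unsigned char expected[4] = { 0x41, 0xf4, 0x27, 0xe6 };
	unsigned char digest[4];
	int tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
	int opfd;

	bind(tfmfd, (struct sockaddr *)&sa, sizeof(sa));
	opfd = accept(tfmfd, NULL, 0);	/* no key => default initial state */

	write(opfd, "abcdefg", 7);
	read(opfd, digest, sizeof(digest));

	printf("%s\n", memcmp(digest, expected, 4) ? "MISMATCH" : "ok");
	close(opfd);
	close(tfmfd);
	return 0;
}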

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/testmgr.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 816e3eb197b2..9350f9846451 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -42292,6 +42292,15 @@ static const struct hash_testvec 
michael_mic_tv_template[] = {
  * CRC32 test vectors
  */
 static const struct hash_testvec crc32_tv_template[] = {
+   {
+   .psize = 0,
+   .digest = "\x00\x00\x00\x00",
+   },
+   {
+   .plaintext = "abcdefg",
+   .psize = 7,
+   .digest = "\xd8\xb5\x46\xac",
+   },
{
.key = "\x87\xa9\xcb\xed",
.ksize = 4,
@@ -42728,6 +42737,11 @@ static const struct hash_testvec crc32c_tv_template[] 
= {
.psize = 0,
.digest = "\x00\x00\x00\x00",
},
+   {
+   .plaintext = "abcdefg",
+   .psize = 7,
+   .digest = "\x41\xf4\x27\xe6",
+   },
{
.key = "\x87\xa9\xcb\xed",
.ksize = 4,
-- 
2.17.0



[PATCH 1/6] crypto: crc32-generic - use unaligned access macros when needed

2018-05-19 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

crc32-generic doesn't have a cra_alignmask set, which is desired as its
->update() works with any alignment.  However, it incorrectly assumes
4-byte alignment in ->setkey() and when outputting the final digest.

Fix this by using the unaligned access macros in those cases.
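
For reference, the unaligned little-endian helpers behave like the portable
sketch below; the kernel's get_unaligned_le32()/put_unaligned_le32() may be
implemented differently per architecture, and the names here are illustrative:

/* Byte-wise loads/stores, so no alignment requirement on 'p'. */
static inline unsigned int sketch_get_unaligned_le32(const unsigned char *p)
{
	return (unsigned int)p[0] | ((unsigned int)p[1] << 8) |
	       ((unsigned int)p[2] << 16) | ((unsigned int)p[3] << 24);
}

static inline void sketch_put_unaligned_le32(unsigned int v, unsigned char *p)
{
	p[0] = v & 0xff;
	p[1] = (v >> 8) & 0xff;
	p[2] = (v >> 16) & 0xff;
	p[3] = (v >> 24) & 0xff;
}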

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/crc32_generic.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/crypto/crc32_generic.c b/crypto/crc32_generic.c
index 718cbce8d169..20b879881a2d 100644
--- a/crypto/crc32_generic.c
+++ b/crypto/crc32_generic.c
@@ -29,6 +29,7 @@
  * This is crypto api shash wrappers to crc32_le.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -69,7 +70,7 @@ static int crc32_setkey(struct crypto_shash *hash, const u8 
*key,
crypto_shash_set_flags(hash, CRYPTO_TFM_RES_BAD_KEY_LEN);
return -EINVAL;
}
-   *mctx = le32_to_cpup((__le32 *)key);
+   *mctx = get_unaligned_le32(key);
return 0;
 }
 
@@ -96,7 +97,7 @@ static int crc32_update(struct shash_desc *desc, const u8 
*data,
 static int __crc32_finup(u32 *crcp, const u8 *data, unsigned int len,
 u8 *out)
 {
-   *(__le32 *)out = cpu_to_le32(__crc32_le(*crcp, data, len));
+   put_unaligned_le32(__crc32_le(*crcp, data, len), out);
return 0;
 }
 
@@ -110,7 +111,7 @@ static int crc32_final(struct shash_desc *desc, u8 *out)
 {
u32 *crcp = shash_desc_ctx(desc);
 
-   *(__le32 *)out = cpu_to_le32p(crcp);
+   put_unaligned_le32(*crcp, out);
return 0;
 }
 
-- 
2.17.0



[PATCH 0/6] crypto: crc32 cleanups and unkeyed tests

2018-05-19 Thread Eric Biggers
This series fixes up alignment for crc32-generic and crc32c-generic,
removes test vectors for bfin_crc that are no longer needed, and adds
unkeyed test vectors for crc32 and an extra unkeyed test vector for
crc32c.  Adding the unkeyed test vectors also required a testmgr change
to allow a single hash algorithm to have both unkeyed and keyed tests,
without relying on having it work by accident.

The new test vectors pass with the generic and x86 CRC implementations.
I haven't tested others yet; if any happen to be broken, they'll need to
be fixed.

Eric Biggers (6):
  crypto: crc32-generic - use unaligned access macros when needed
  crypto: crc32c-generic - remove cra_alignmask
  crypto: crc32-generic - remove __crc32_le()
  crypto: testmgr - remove bfin_crc "hmac(crc32)" test vectors
  crypto: testmgr - fix testing OPTIONAL_KEY hash algorithms
  crypto: testmgr - add more unkeyed crc32 and crc32c test vectors

 crypto/crc32_generic.c  |  15 ++
 crypto/crc32c_generic.c |   8 ++--
 crypto/tcrypt.c |   4 --
 crypto/testmgr.c|  56 +-
 crypto/testmgr.h| 102 ++--
 5 files changed, 66 insertions(+), 119 deletions(-)

-- 
2.17.0



[PATCH 4/6] crypto: testmgr - remove bfin_crc "hmac(crc32)" test vectors

2018-05-19 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

The Blackfin CRC driver was removed by commit 9678a8dc53c1 ("crypto:
bfin_crc - remove blackfin CRC driver"), but it was forgotten to remove
the corresponding "hmac(crc32)" test vectors.  I see no point in keeping
them since nothing else appears to implement or use "hmac(crc32)", which
isn't an algorithm that makes sense anyway because HMAC is meant to be
used with a cryptographically secure hash function, which CRCs are not.

Thus, remove the unneeded test vectors.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/tcrypt.c  |  4 ---
 crypto/testmgr.c |  6 
 crypto/testmgr.h | 88 
 3 files changed, 98 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index e721faab6fc8..d5bcdd905007 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1942,10 +1942,6 @@ static int do_test(const char *alg, u32 type, u32 mask, 
int m, u32 num_mb)
ret += tcrypt_test("vmac(aes)");
break;
 
-   case 110:
-   ret += tcrypt_test("hmac(crc32)");
-   break;
-
case 111:
ret += tcrypt_test("hmac(sha3-224)");
break;
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 41a5f42d4104..7e57530ecd52 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3168,12 +3168,6 @@ static const struct alg_test_desc alg_test_descs[] = {
.suite = {
.hash = __VECS(ghash_tv_template)
}
-   }, {
-   .alg = "hmac(crc32)",
-   .test = alg_test_hash,
-   .suite = {
-   .hash = __VECS(bfin_crc_tv_template)
-   }
}, {
.alg = "hmac(md5)",
.test = alg_test_hash,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 552d8f00d85b..816e3eb197b2 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -43156,94 +43156,6 @@ static const struct hash_testvec crc32c_tv_template[] 
= {
}
 };
 
-/*
- * Blakcifn CRC test vectors
- */
-static const struct hash_testvec bfin_crc_tv_template[] = {
-   {
-   .psize = 0,
-   .digest = "\x00\x00\x00\x00",
-   },
-   {
-   .key = "\x87\xa9\xcb\xed",
-   .ksize = 4,
-   .psize = 0,
-   .digest = "\x87\xa9\xcb\xed",
-   },
-   {
-   .key = "\xff\xff\xff\xff",
-   .ksize = 4,
-   .plaintext = "\x01\x02\x03\x04\x05\x06\x07\x08"
-"\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10"
-"\x11\x12\x13\x14\x15\x16\x17\x18"
-"\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20"
-"\x21\x22\x23\x24\x25\x26\x27\x28",
-   .psize = 40,
-   .digest = "\x84\x0c\x8d\xa2",
-   },
-   {
-   .key = "\xff\xff\xff\xff",
-   .ksize = 4,
-   .plaintext = "\x01\x02\x03\x04\x05\x06\x07\x08"
-"\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10"
-"\x11\x12\x13\x14\x15\x16\x17\x18"
-"\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20"
-"\x21\x22\x23\x24\x25\x26",
-   .psize = 38,
-   .digest = "\x8c\x58\xec\xb7",
-   },
-   {
-   .key = "\xff\xff\xff\xff",
-   .ksize = 4,
-   .plaintext = "\x01\x02\x03\x04\x05\x06\x07\x08"
-"\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10"
-"\x11\x12\x13\x14\x15\x16\x17\x18"
-"\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20"
-"\x21\x22\x23\x24\x25\x26\x27",
-   .psize = 39,
-   .digest = "\xdc\x50\x28\x7b",
-   },
-   {
-   .key = "\xff\xff\xff\xff",
-   .ksize = 4,
-   .plaintext = "\x01\x02\x03\x04\x05\x06\x07\x08"
-"\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10"
-"\x11\x12\x13\x14\x15\x16\x17\x18"
-"\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20"
-"\x21\x22\x23\x24\x25\x26\x27\x28"
-"\x29\x2a\x2b\x2c\x2d\x2e\x2f\x30"
-"\x31\x32\x33\x34\x35\x36\x37\x38"
-"\x39\x3a\x3b\x3c\x3d\x3e\x3f\x40"
-"\x41\x42\x43\x44\x45\x46\x47\x48"
- 

[PATCH 5/6] crypto: testmgr - fix testing OPTIONAL_KEY hash algorithms

2018-05-19 Thread Eric Biggers
From: Eric Biggers <ebigg...@google.com>

Since testmgr uses a single tfm for all tests of each hash algorithm,
once a key is set the tfm won't be unkeyed anymore.  But with crc32 and
crc32c, the key is really the "default initial state" and is optional;
those algorithms should have both keyed and unkeyed test vectors, to
verify that implementations use the correct default key.

Simply listing the unkeyed test vectors first isn't guaranteed to work
yet because testmgr makes multiple passes through the test vectors.
crc32c does have an unkeyed test vector listed first currently, but it
only works by chance because the last crc32c test vector happens to use
a key that is the same as the default key.

Therefore, teach testmgr to split hash test vectors into unkeyed and
keyed sections, and do all the unkeyed ones before the keyed ones.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---
 crypto/testmgr.c | 50 +---
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 7e57530ecd52..d3335d347e10 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1798,8 +1798,9 @@ static int alg_test_comp(const struct alg_test_desc 
*desc, const char *driver,
return err;
 }
 
-static int alg_test_hash(const struct alg_test_desc *desc, const char *driver,
-u32 type, u32 mask)
+static int __alg_test_hash(const struct hash_testvec *template,
+  unsigned int tcount, const char *driver,
+  u32 type, u32 mask)
 {
struct crypto_ahash *tfm;
int err;
@@ -1811,16 +1812,51 @@ static int alg_test_hash(const struct alg_test_desc 
*desc, const char *driver,
return PTR_ERR(tfm);
}
 
-   err = test_hash(tfm, desc->suite.hash.vecs,
-   desc->suite.hash.count, true);
+   err = test_hash(tfm, template, tcount, true);
if (!err)
-   err = test_hash(tfm, desc->suite.hash.vecs,
-   desc->suite.hash.count, false);
-
+   err = test_hash(tfm, template, tcount, false);
crypto_free_ahash(tfm);
return err;
 }
 
+static int alg_test_hash(const struct alg_test_desc *desc, const char *driver,
+u32 type, u32 mask)
+{
+   const struct hash_testvec *template = desc->suite.hash.vecs;
+   unsigned int tcount = desc->suite.hash.count;
+   unsigned int nr_unkeyed, nr_keyed;
+   int err;
+
+   /*
+* For OPTIONAL_KEY algorithms, we have to do all the unkeyed tests
+* first, before setting a key on the tfm.  To make this easier, we
+* require that the unkeyed test vectors (if any) are listed first.
+*/
+
+   for (nr_unkeyed = 0; nr_unkeyed < tcount; nr_unkeyed++) {
+   if (template[nr_unkeyed].ksize)
+   break;
+   }
+   for (nr_keyed = 0; nr_unkeyed + nr_keyed < tcount; nr_keyed++) {
+   if (!template[nr_unkeyed + nr_keyed].ksize) {
+   pr_err("alg: hash: test vectors for %s out of order, "
+  "unkeyed ones must come first\n", desc->alg);
+   return -EINVAL;
+   }
+   }
+
+   err = 0;
+   if (nr_unkeyed) {
+   err = __alg_test_hash(template, nr_unkeyed, driver, type, mask);
+   template += nr_unkeyed;
+   }
+
+   if (!err && nr_keyed)
+   err = __alg_test_hash(template, nr_keyed, driver, type, mask);
+
+   return err;
+}
+
 static int alg_test_crc32c(const struct alg_test_desc *desc,
   const char *driver, u32 type, u32 mask)
 {
-- 
2.17.0



Re: [PATCH 3/3] crypto: x86 - Add optimized AEGIS implementations

2018-05-19 Thread Eric Biggers
Hi Ondrej,

On Fri, May 11, 2018 at 02:12:51PM +0200, Ondrej Mosnáček wrote:
> From: Ondrej Mosnacek 
> 
> This patch adds optimized implementations of AEGIS-128, AEGIS-128L,
> and AEGIS-256, utilizing the AES-NI and SSE2 x86 extensions.
> 
> Signed-off-by: Ondrej Mosnacek 
[...]
> +static int crypto_aegis256_aesni_setkey(struct crypto_aead *aead, const u8 
> *key,
> + unsigned int keylen)
> +{
> + struct aegis_ctx *ctx = crypto_aegis256_aesni_ctx(aead);
> +
> + if (keylen != AEGIS256_KEY_SIZE) {
> + crypto_aead_set_flags(aead, CRYPTO_TFM_RES_BAD_KEY_LEN);
> + return -EINVAL;
> + }
> +
> + memcpy(ctx->key.bytes, key, AEGIS256_KEY_SIZE);
> +
> + return 0;
> +}

This code is copying 32 bytes into a 16-byte buffer.

==
BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:345 [inline]
BUG: KASAN: slab-out-of-bounds in crypto_aegis256_aesni_setkey+0x23/0x60 
arch/x86/crypto/aegis256-aesni-glue.c:167
Write of size 32 at addr 88006c16b650 by task cryptomgr_test/120
CPU: 2 PID: 120 Comm: cryptomgr_test Not tainted 4.17.0-rc1-00069-g6ecc9d9ff91f 
#31
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.11.0-20171110_100015-anatol 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x86/0xca lib/dump_stack.c:113
 print_address_description+0x65/0x204 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.6+0x242/0x304 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13c/0x1b0 mm/kasan/kasan.c:267
 memcpy+0x37/0x50 mm/kasan/kasan.c:303
 memcpy include/linux/string.h:345 [inline]
 crypto_aegis256_aesni_setkey+0x23/0x60 
arch/x86/crypto/aegis256-aesni-glue.c:167
 crypto_aead_setkey+0xa4/0x1e0 crypto/aead.c:62
 cryptd_aead_setkey+0x30/0x50 crypto/cryptd.c:938
 crypto_aead_setkey+0xa4/0x1e0 crypto/aead.c:62
 cryptd_aegis256_aesni_setkey+0x30/0x50 
arch/x86/crypto/aegis256-aesni-glue.c:259
 crypto_aead_setkey+0xa4/0x1e0 crypto/aead.c:62
 __test_aead+0x8bf/0x3770 crypto/testmgr.c:675
 test_aead+0x28/0x110 crypto/testmgr.c:957
 alg_test_aead+0x8b/0x140 crypto/testmgr.c:1690
 alg_test.part.5+0x1bb/0x4d0 crypto/testmgr.c:3845
 alg_test+0x23/0x25 crypto/testmgr.c:3865
 cryptomgr_test+0x56/0x80 crypto/algboss.c:223
 kthread+0x329/0x3f0 kernel/kthread.c:238
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:412
Allocated by task 120:
 save_stack mm/kasan/kasan.c:448 [inline]
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc.part.1+0x5f/0xf0 mm/kasan/kasan.c:553
 kasan_kmalloc+0xaf/0xc0 mm/kasan/kasan.c:538
 __do_kmalloc mm/slab.c:3718 [inline]
 __kmalloc+0x114/0x1d0 mm/slab.c:3727
 kmalloc include/linux/slab.h:517 [inline]
 kzalloc include/linux/slab.h:701 [inline]
 crypto_create_tfm+0x80/0x2c0 crypto/api.c:464
 crypto_spawn_tfm2+0x57/0x90 crypto/algapi.c:717
 crypto_spawn_aead include/crypto/internal/aead.h:112 [inline]
 cryptd_aead_init_tfm+0x3d/0x110 crypto/cryptd.c:1033
 crypto_aead_init_tfm+0x130/0x190 crypto/aead.c:111
 crypto_create_tfm+0xda/0x2c0 crypto/api.c:471
 crypto_alloc_tfm+0xcf/0x1d0 crypto/api.c:543
 crypto_alloc_aead+0x14/0x20 crypto/aead.c:351
 cryptd_alloc_aead+0xeb/0x1c0 crypto/cryptd.c:1334
 cryptd_aegis256_aesni_init_tfm+0x24/0xf0 
arch/x86/crypto/aegis256-aesni-glue.c:308
 crypto_aead_init_tfm+0x130/0x190 crypto/aead.c:111
 crypto_create_tfm+0xda/0x2c0 crypto/api.c:471
 crypto_alloc_tfm+0xcf/0x1d0 crypto/api.c:543
 crypto_alloc_aead+0x14/0x20 crypto/aead.c:351
 alg_test_aead+0x1f/0x140 crypto/testmgr.c:1682
 alg_test.part.5+0x1bb/0x4d0 crypto/testmgr.c:3845
 alg_test+0x23/0x25 crypto/testmgr.c:3865
 cryptomgr_test+0x56/0x80 crypto/algboss.c:223
 kthread+0x329/0x3f0 kernel/kthread.c:238
 ret_from_[   16.453502] serial8250: too much work for irq4
Freed by task 0:
(stack is not available)
The buggy address belongs to the object at 88006c16b600
The buggy address is located 80 bytes inside of
The buggy address belongs to the page:
page:ea00017a4f68 count:1 mapcount:0 mapping:88006c16b000 index:0x0
flags: 0x1000100(slab)
raw: 01000100 88006c16b000  00010015
raw: ea00017a2470 88006d401548 88006d400400
page dumped because: kasan: bad access detected
Memory state around the buggy address:
 88006c16b500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 88006c16b580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>88006c16b600: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
  ^
 88006c16b680: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
 88006c16b700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==
Disabling lock debugging due to kernel taint
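
To show the mismatch in isolation, a standalone sketch (not the driver's real
structures or the eventual fix): the context must reserve the full 32-byte
AEGIS-256 key, not a single 16-byte AEGIS block.

#include <string.h>

#define AEGIS256_KEY_SIZE	32

struct aegis256_ctx_sketch {
	/* Needs all 32 bytes; a 16-byte field here reproduces the splat above. */
	unsigned char key[AEGIS256_KEY_SIZE];
};

static void aegis256_setkey_sketch(struct aegis256_ctx_sketch *ctx,
				   const unsigned char *key,
				   unsigned int keylen)
{
	if (keylen != AEGIS256_KEY_SIZE)
		return;
	memcpy(ctx->key, key, AEGIS256_KEY_SIZE);	/* in bounds with a 32-byte field */
}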


[PATCH v2] fscrypt: log the crypto algorithm implementations

2018-05-18 Thread Eric Biggers
Log the crypto algorithm driver name for each fscrypt encryption mode on
its first use, also showing a friendly name for the mode.

This will help people determine whether the expected implementations are
being used.  In some cases we've seen people do benchmarks and reject
using encryption for performance reasons, when in fact they used a much
slower implementation of AES-XTS than was possible on the hardware.  It
can make an enormous difference; e.g., AES-XTS on ARM is about 10x
faster with the crypto extensions (AES instructions) than without.

This also makes it more obvious which modes are being used, now that
fscrypt supports multiple combinations of modes.

Example messages (with default modes, on x86_64):

[   35.492057] fscrypt: AES-256-CTS-CBC using implementation 
"cts(cbc-aes-aesni)"
[   35.492171] fscrypt: AES-256-XTS using implementation "xts-aes-aesni"

Note: algorithms can be dynamically added to the crypto API, which can
result in different implementations being used at different times.  But
this is rare; for most users, showing the first will be good enough.
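
The logging itself can be a small helper along these lines (a sketch only;
the helper name and exact call used in the patch may differ), using the
struct fscrypt_mode fields introduced in the diff below:

/* Illustrative helper -- ignores the harmless race on first use. */
static void fscrypt_log_mode_impl(struct fscrypt_mode *mode,
				  struct crypto_skcipher *ctfm)
{
	if (!mode->logged_impl_name) {
		mode->logged_impl_name = true;
		pr_info("fscrypt: %s using implementation \"%s\"\n",
			mode->friendly_name,
			crypto_tfm_alg_driver_name(crypto_skcipher_tfm(ctfm)));
	}
}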

Signed-off-by: Eric Biggers <ebigg...@google.com>
---

Changed since v1:
- Added missing "\n" (oops)

Note: this patch is on top of the other fscrypt patches I've sent out for 4.18.

 fs/crypto/keyinfo.c | 102 +---
 1 file changed, 68 insertions(+), 34 deletions(-)

diff --git a/fs/crypto/keyinfo.c b/fs/crypto/keyinfo.c
index 41f6025d5d7a..e997ca51192f 100644
--- a/fs/crypto/keyinfo.c
+++ b/fs/crypto/keyinfo.c
@@ -148,44 +148,64 @@ static int find_and_derive_key(const struct inode *inode,
return err;
 }
 
-static const struct {
+static struct fscrypt_mode {
+   const char *friendly_name;
const char *cipher_str;
int keysize;
+   bool logged_impl_name;
 } available_modes[] = {
-   [FS_ENCRYPTION_MODE_AES_256_XTS]  = { "xts(aes)",   64 },
-   [FS_ENCRYPTION_MODE_AES_256_CTS]  = { "cts(cbc(aes))",  32 },
-   [FS_ENCRYPTION_MODE_AES_128_CBC]  = { "cbc(aes)",   16 },
-   [FS_ENCRYPTION_MODE_AES_128_CTS]  = { "cts(cbc(aes))",  16 },
-   [FS_ENCRYPTION_MODE_SPECK128_256_XTS] = { "xts(speck128)",  64 },
-   [FS_ENCRYPTION_MODE_SPECK128_256_CTS] = { "cts(cbc(speck128))", 32 },
+   [FS_ENCRYPTION_MODE_AES_256_XTS] = {
+   .friendly_name = "AES-256-XTS",
+   .cipher_str = "xts(aes)",
+   .keysize = 64,
+   },
+   [FS_ENCRYPTION_MODE_AES_256_CTS] = {
+   .friendly_name = "AES-256-CTS-CBC",
+   .cipher_str = "cts(cbc(aes))",
+   .keysize = 32,
+   },
+   [FS_ENCRYPTION_MODE_AES_128_CBC] = {
+   .friendly_name = "AES-128-CBC",
+   .cipher_str = "cbc(aes)",
+   .keysize = 16,
+   },
+   [FS_ENCRYPTION_MODE_AES_128_CTS] = {
+   .friendly_name = "AES-128-CTS-CBC",
+   .cipher_str = "cts(cbc(aes))",
+   .keysize = 16,
+   },
+   [FS_ENCRYPTION_MODE_SPECK128_256_XTS] = {
+   .friendly_name = "Speck128/256-XTS",
+   .cipher_str = "xts(speck128)",
+   .keysize = 64,
+   },
+   [FS_ENCRYPTION_MODE_SPECK128_256_CTS] = {
+   .friendly_name = "Speck128/256-CTS-CBC",
+   .cipher_str = "cts(cbc(speck128))",
+   .keysize = 32,
+   },
 };
 
-static int determine_cipher_type(struct fscrypt_info *ci, struct inode *inode,
-const char **cipher_str_ret, int *keysize_ret)
+static struct fscrypt_mode *
+select_encryption_mode(const struct fscrypt_info *ci, const struct inode 
*inode)
 {
-   u32 mode;
-
if (!fscrypt_valid_enc_modes(ci->ci_data_mode, ci->ci_filename_mode)) {
fscrypt_warn(inode->i_sb,
 "inode %lu uses unsupported encryption modes 
(contents mode %d, filenames mode %d)",
 inode->i_ino, ci->ci_data_mode,
 ci->ci_filename_mode);
-   return -EINVAL;
+   return ERR_PTR(-EINVAL);
}
 
-   if (S_ISREG(inode->i_mode)) {
-   mode = ci->ci_data_mode;
-   } else if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode)) {
-   mode = ci->ci_filename_mode;
-   } else {
-   WARN_ONCE(1, "fscrypt: filesystem tried to load encryption info 
for inode %lu, which is not encryptable (file type %d)\n",
- inode->i_ino, (inode->i_mode & S_IFMT));
-   return -EINVAL;
-   }
+   if (S_ISREG(inode->i_mode))
+   return _modes[ci->

[PATCH] fscrypt: log the crypto algorithm implementations

2018-05-17 Thread Eric Biggers
Log the crypto algorithm driver name for each fscrypt encryption mode on
its first use, also showing a friendly name for the mode.

This will help people determine whether the expected implementations are
being used.  In some cases we've seen people do benchmarks and reject
using encryption for performance reasons, when in fact they used a much
slower implementation of AES-XTS than was possible on the hardware.  It
can make an enormous difference; e.g., AES-XTS on ARM is about 10x
faster with the crypto extensions (AES instructions) than without.

This also makes it more obvious which modes are being used, now that
fscrypt supports multiple combinations of modes.

Example messages (with default modes, on x86_64):

[   35.492057] fscrypt: AES-256-CTS-CBC using implementation 
"cts(cbc-aes-aesni)"
[   35.492171] fscrypt: AES-256-XTS using implementation "xts-aes-aesni"

Note: algorithms can be dynamically added to the crypto API, which can
result in different implementations being used at different times.  But
this is rare; for most users, showing the first will be good enough.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---

Note: this patch is on top of the other fscrypt patches I've sent out for 4.18.

 fs/crypto/keyinfo.c | 102 +---
 1 file changed, 68 insertions(+), 34 deletions(-)

diff --git a/fs/crypto/keyinfo.c b/fs/crypto/keyinfo.c
index 41f6025d5d7a..68b5baef5960 100644
--- a/fs/crypto/keyinfo.c
+++ b/fs/crypto/keyinfo.c
@@ -148,44 +148,64 @@ static int find_and_derive_key(const struct inode *inode,
return err;
 }
 
-static const struct {
+static struct fscrypt_mode {
+   const char *friendly_name;
const char *cipher_str;
int keysize;
+   bool logged_impl_name;
 } available_modes[] = {
-   [FS_ENCRYPTION_MODE_AES_256_XTS]  = { "xts(aes)",   64 },
-   [FS_ENCRYPTION_MODE_AES_256_CTS]  = { "cts(cbc(aes))",  32 },
-   [FS_ENCRYPTION_MODE_AES_128_CBC]  = { "cbc(aes)",   16 },
-   [FS_ENCRYPTION_MODE_AES_128_CTS]  = { "cts(cbc(aes))",  16 },
-   [FS_ENCRYPTION_MODE_SPECK128_256_XTS] = { "xts(speck128)",  64 },
-   [FS_ENCRYPTION_MODE_SPECK128_256_CTS] = { "cts(cbc(speck128))", 32 },
+   [FS_ENCRYPTION_MODE_AES_256_XTS] = {
+   .friendly_name = "AES-256-XTS",
+   .cipher_str = "xts(aes)",
+   .keysize = 64,
+   },
+   [FS_ENCRYPTION_MODE_AES_256_CTS] = {
+   .friendly_name = "AES-256-CTS-CBC",
+   .cipher_str = "cts(cbc(aes))",
+   .keysize = 32,
+   },
+   [FS_ENCRYPTION_MODE_AES_128_CBC] = {
+   .friendly_name = "AES-128-CBC",
+   .cipher_str = "cbc(aes)",
+   .keysize = 16,
+   },
+   [FS_ENCRYPTION_MODE_AES_128_CTS] = {
+   .friendly_name = "AES-128-CTS-CBC",
+   .cipher_str = "cts(cbc(aes))",
+   .keysize = 16,
+   },
+   [FS_ENCRYPTION_MODE_SPECK128_256_XTS] = {
+   .friendly_name = "Speck128/256-XTS",
+   .cipher_str = "xts(speck128)",
+   .keysize = 64,
+   },
+   [FS_ENCRYPTION_MODE_SPECK128_256_CTS] = {
+   .friendly_name = "Speck128/256-CTS-CBC",
+   .cipher_str = "cts(cbc(speck128))",
+   .keysize = 32,
+   },
 };
 
-static int determine_cipher_type(struct fscrypt_info *ci, struct inode *inode,
-const char **cipher_str_ret, int *keysize_ret)
+static struct fscrypt_mode *
+select_encryption_mode(const struct fscrypt_info *ci, const struct inode 
*inode)
 {
-   u32 mode;
-
if (!fscrypt_valid_enc_modes(ci->ci_data_mode, ci->ci_filename_mode)) {
fscrypt_warn(inode->i_sb,
 "inode %lu uses unsupported encryption modes 
(contents mode %d, filenames mode %d)",
 inode->i_ino, ci->ci_data_mode,
 ci->ci_filename_mode);
-   return -EINVAL;
+   return ERR_PTR(-EINVAL);
}
 
-   if (S_ISREG(inode->i_mode)) {
-   mode = ci->ci_data_mode;
-   } else if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode)) {
-   mode = ci->ci_filename_mode;
-   } else {
-   WARN_ONCE(1, "fscrypt: filesystem tried to load encryption info 
for inode %lu, which is not encryptable (file type %d)\n",
- inode->i_ino, (inode->i_mode & S_IFMT));
-   return -EINVAL;
-   }
+   if (S_ISREG(inode->i_mode))
+   return &available_modes[ci->ci_data_mode];
+
+   if (S_ISDIR(inode->i_mode) || S_IS

[PATCH v2] fscrypt: add Speck128/256 support

2018-05-07 Thread Eric Biggers
fscrypt currently only supports AES encryption.  However, many low-end
mobile devices have older CPUs that don't have AES instructions, e.g.
the ARMv8 Cryptography Extensions.  Currently, user data on such devices
is not encrypted at rest because AES is too slow, even when the NEON
bit-sliced implementation of AES is used.  Unfortunately, it is
infeasible to encrypt these devices at all when AES is the only option.

Therefore, this patch updates fscrypt to support the Speck block cipher,
which was recently added to the crypto API.  The C implementation of
Speck is not especially fast, but Speck can be implemented very
efficiently with general-purpose vector instructions, e.g. ARM NEON.
For example, on an ARMv7 processor, we measured the NEON-accelerated
Speck128/256-XTS at 69 MB/s for both encryption and decryption, while
AES-256-XTS with the NEON bit-sliced implementation was only 22 MB/s
encryption and 19 MB/s decryption.

There are multiple variants of Speck.  This patch only adds support for
Speck128/256, which is the variant with a 128-bit block size and 256-bit
key size -- the same as AES-256.  This is believed to be the most secure
variant of Speck, and it's only about 6% slower than Speck128/128.
Speck64/128 would be at least 20% faster because it has 20% fewer rounds, and
it can be even faster on CPUs that can't efficiently do the 64-bit
operations needed for Speck128.  However, Speck64's 64-bit block size is
not preferred security-wise.  ARM NEON also supports the needed 64-bit
operations even on 32-bit CPUs, resulting in Speck128 being fast enough
for our targeted use cases so far.

The chosen modes of operation are XTS for contents and CTS-CBC for
filenames.  These are the same modes of operation that fscrypt defaults
to for AES.  Note that as with the other fscrypt modes, Speck will not
be used unless userspace chooses to use it.  Nor are any of the existing
modes (which are all AES-based) being removed, of course.
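
For example, userspace opting a directory into these modes would use the
existing policy ioctl roughly as follows.  This is a sketch only: it requires
a kernel with this patch (for the new mode constants), a master key already
added to the keyring, and error handling is omitted.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int set_speck_policy(const char *dir,
		     const unsigned char desc[FS_KEY_DESCRIPTOR_SIZE])
{
	struct fscrypt_policy policy = {
		.version = 0,
		.contents_encryption_mode  = FS_ENCRYPTION_MODE_SPECK128_256_XTS,
		.filenames_encryption_mode = FS_ENCRYPTION_MODE_SPECK128_256_CTS,
		.flags = 0,
	};
	int fd = open(dir, O_RDONLY);

	memcpy(policy.master_key_descriptor, desc, FS_KEY_DESCRIPTOR_SIZE);
	return ioctl(fd, FS_IOC_SET_ENCRYPTION_POLICY, &policy);
}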

We intentionally don't make CONFIG_FS_ENCRYPTION select
CONFIG_CRYPTO_SPECK, so people will have to enable Speck support
themselves if they need it.  This is because we shouldn't bloat the
FS_ENCRYPTION dependencies with every new cipher, especially ones that
aren't recommended for most users.  Moreover, CRYPTO_SPECK is just the
generic implementation, which won't be fast enough for many users; in
practice, they'll need to enable CRYPTO_SPECK_NEON to get acceptable
performance.

More details about our choice of Speck can be found in our patches that
added Speck to the crypto API, and the follow-on discussion threads.
We're planning a publication that explains the choice in more detail.
But briefly, we can't use ChaCha20 as we previously proposed, since it
would be insecure to use a stream cipher in this context, with potential
IV reuse during writes on f2fs and/or on wear-leveling flash storage.

We also evaluated many other lightweight and/or ARX-based block ciphers
such as Chaskey-LTS, RC5, LEA, CHAM, Threefish, RC6, NOEKEON, SPARX, and
XTEA.  However, all had disadvantages vs. Speck, such as insufficient
performance with NEON, much less published cryptanalysis, or an
insufficient security level.  Various design choices in Speck make it
perform better with NEON than competing ciphers while still having a
security margin similar to AES, and in the case of Speck128 also the
same available security levels.  Unfortunately, Speck does have some
political baggage attached -- it's an NSA designed cipher, and was
rejected from an ISO standard (though for context, as far as I know none
of the above-mentioned alternatives are ISO standards either).
Nevertheless, we believe it is a good solution to the problem from a
technical perspective.

Certain algorithms constructed from ChaCha or the ChaCha permutation,
such as MEM (Masked Even-Mansour) or HPolyC, may also meet our
performance requirements.  However, these are new constructions that
need more time to receive the cryptographic review and acceptance needed
to be confident in their security.  HPolyC hasn't been published yet,
and we are concerned that MEM makes stronger assumptions about the
underlying permutation than the ChaCha stream cipher does.  In contrast,
the XTS mode of operation is relatively well accepted, and Speck has
over 70 cryptanalysis papers.  Of course, these ChaCha-based algorithms
can still be added later if they become ready.

The best known attack on Speck128/256 is a differential cryptanalysis
attack on 25 of 34 rounds with 2^253 time complexity and 2^125 chosen
plaintexts, i.e. only marginally faster than brute force.  There is no
known attack on the full 34 rounds.

Signed-off-by: Eric Biggers <ebigg...@google.com>
---

Changed since v1:
- Improved commit message and documentation.

 Documentation/filesystems/fscrypt.rst | 10 ++
 fs/crypto/fscrypt_private.h   |  4 
 fs/crypto/keyinfo.c   |  2 ++
 include/uapi/linux/fs.h   |  2 ++
 4 files changed, 18 inse

Re: [PATCH v2 0/5] crypto: Speck support

2018-05-07 Thread Eric Biggers
Hi Samuel,

On Thu, Apr 26, 2018 at 03:05:44AM +0100, Samuel Neves wrote:
> On Wed, Apr 25, 2018 at 8:49 PM, Eric Biggers <ebigg...@google.com> wrote:
> > I agree that my explanation should have been better, and should have 
> > considered
> > more crypto algorithms.  The main difficulty is that we have extreme 
> > performance
> > requirements -- it needs to be 50 MB/s at the very least on even low-end ARM
> > devices like smartwatches.  And even with the NEON-accelerated Speck128-XTS
> > performance exceeding that after much optimization, we've been getting a 
> > lot of
> > pushback as people want closer to 100 MB/s.
> >
> 
> I couldn't find any NEON-capable ARMv7 chip below 800 MHz, so this
> would put the performance upper bound around 15 cycles per byte, with
> the comfortable number being ~7. That's indeed tough, though not
> impossible.
> 
> >
> > That's why I also included Speck64-XTS in the patches, since it was
> > straightforward to include, and some devices may really need that last 
> > 20-30% of
> > performance for encryption to be feasible at all.  (And when the choice is
> > between unencrypted and a 64-bit block cipher, used in a context where the
> > weakest points in the cryptosystem are actually elsewhere such as the user's
> > low-entropy PIN and the flash storage doing wear-leveling, I'd certainly 
> > take
> > the 64-bit block cipher.)  So far we haven't had to use Speck64 though, and 
> > if
> > that continues to be the case I'd be fine with Speck64 being removed, 
> > leaving
> > just Speck128.
> >
> 
> I would very much prefer that to be the case. As many of us know,
> "it's better than nothing" has been often used to justify other bad
> choices, like RC4, that end up preventing better ones from being
> adopted. At a time where we're trying to get rid of 64-bit ciphers in
> TLS, where data volumes per session are comparatively low, it would be
> unfortunate if the opposite starts happening on encryption at rest.
> 
> >
> > Note that in practice, to have any chance at meeting the performance 
> > requirement
> > the cipher needed to be NEON accelerated.  That made benchmarking really 
> > hard
> > and time-consuming, since to definitely know how an algorithm performs it 
> > can
> > take upwards of a week to implement a NEON version.  It needs to be very 
> > well
> > optimized too, to compare the algorithms fairly -- e.g. with Speck I got a 
> > 20%
> > performance improvement on some CPUs just by changing the NEON instructions 
> > used
> > to implement the 8-bit rotates, an optimization that is not possible with
> > ciphers that don't use rotate amounts that are multiples of 8.  (This was an
> > intentional design choice by the Speck designers; they do know what they're
> > doing, actually.)
> >
> > Thus, we had to be pretty aggressive about dropping algorithms from
> > consideration if there were preliminary indications that they wouldn't 
> > perform
> > well, or had too little cryptanalysis, or had other issues such as an 
> > unclear
> > patent situation.  Threefish for example I did test the C implementation at
> > https://github.com/wernerd/Skein3Fish, but on ARM32 it was over 4 times 
> > slower
> > than my NEON implementation of Speck128/256-XTS.  And I did not see a clear 
> > way
> > that it could be improved over 4x with NEON, if at all, so I did not take 
> > the
> > long time it would have taken to write an optimized NEON implementation to
> > benchmark it properly.  Perhaps that was a mistake.  But, time is not 
> > unlimited.
> >
> 
> In my limited experience with NEON and 64-bit ARX, there's usually a
> ~2x speedup solely from NEON's native 64-bit operations on ARMv7-A.
> The extra speedup from encrypting 2 block in parallel is then
> somewhere between 1x and 2x, depending on various details. Getting
> near 4x might be feasible, but it is indeed time-consuming to get
> there.
> 
> >
> > As for the wide-block mode using ChaCha20 and Poly1305, you'd have to ask 
> > Paul
> > Crowley to explain it properly, but briefly it's actually a pseudorandom
> > permutation over an arbitrarily-sized message.  So with dm-crypt for 
> > example, it
> > would operate on a whole 512-byte sector, and if any bit of the 512-byte
> > plaintext is changed, then every bit in the 512-byte ciphertext would change
> > with 50% probability.  To make this possible, the construction uses a 
> > polynomial
> > evalution in GF(2^130-5) as a universal hash function, similar to the 
> &

Re: [PATCH v2 0/5] crypto: Speck support

2018-04-25 Thread Eric Biggers
Hi Samuel,

On Wed, Apr 25, 2018 at 03:33:16PM +0100, Samuel Neves wrote:
> Let's put the provenance of Speck aside for a moment, and suppose that
> it is an ideal block cipher. There are still some issues with this
> patch as it stands.
> 
>  - The rationale seems off. Consider this bit from the commit message:
> 
> > Other AES alternatives such as Twofish, Threefish, Camellia, CAST6, and 
> > Serpent aren't
> > fast enough either; it seems that only a modern ARX cipher can provide 
> > sufficient performance
> > on these devices.
> 
> One of these things is very much not like the others. Threefish _is_ a
> modern ARX cipher---a tweakable block cipher in fact, precluding the
> need for XEX-style masking. Is it too slow? Does it not have the
> correct block size?
> 
> > We've also considered a novel length-preserving encryption mode based on
> > ChaCha20 and Poly1305.
> 
> I'm very curious about this, namely as to what the role of Poly1305
> would be here. ChaCha20's underlying permutation could, of course, be
> transformed into a 512-bit tweakable block cipher relatively
> painlessly, retaining the performance of regular ChaCha20 with
> marginal additional overhead. This would not be a standard
> construction, but clearly that is not an issue.
> 
> But the biggest problem here, in my mind, is that for all the talk of
> using 128-bit block Speck, this patch tacks on the 64-bit block
> variant of Speck into the kernel, and speck64-xts as well! As far as I
> can tell, this is the _only_ instance of a 64-bit XTS instance in the
> entire codebase. Now, if you wanted a fast 64-bit ARX block cipher,
> the kernel already had XTEA. Instead, this is adding yet another
> 64-bit block cipher into the crypto API, in a disk-encryption mode no
> less, so that it can be misused later. In the disk encryption setting,
> it's particularly concerning to be using such a small block size, as
> data volumes can quickly add up to the birthday bound.
> 
> > It's easy to say that, but do you have an actual suggestion?
> 
> I don't know how seriously you are actually asking this, but some
> 128-bit software-friendly block ciphers could be SPARX, LEA, RC5, or
> RC6. SPARX, in particular, has similarities to Speck but has some
> further AES-like design guarantees that other prior ARX block ciphers
> did not. Some other bitsliced designs, such as Noekeon or SKINNY, may
> also work well with NEON, but I don't know much about their
> performance there.
> 

I agree that my explanation should have been better, and should have considered
more crypto algorithms.  The main difficulty is that we have extreme performance
requirements -- it needs to be 50 MB/s at the very least on even low-end ARM
devices like smartwatches.  And even with the NEON-accelerated Speck128-XTS
performance exceeding that after much optimization, we've been getting a lot of
pushback as people want closer to 100 MB/s.

That's why I also included Speck64-XTS in the patches, since it was
straightforward to include, and some devices may really need that last 20-30% of
performance for encryption to be feasible at all.  (And when the choice is
between unencrypted and a 64-bit block cipher, used in a context where the
weakest points in the cryptosystem are actually elsewhere such as the user's
low-entropy PIN and the flash storage doing wear-leveling, I'd certainly take
the 64-bit block cipher.)  So far we haven't had to use Speck64 though, and if
that continues to be the case I'd be fine with Speck64 being removed, leaving
just Speck128.

Note that in practice, to have any chance at meeting the performance requirement
the cipher needed to be NEON accelerated.  That made benchmarking really hard
and time-consuming, since to definitely know how an algorithm performs it can
take upwards of a week to implement a NEON version.  It needs to be very well
optimized too, to compare the algorithms fairly -- e.g. with Speck I got a 20%
performance improvement on some CPUs just by changing the NEON instructions used
to implement the 8-bit rotates, an optimization that is not possible with
ciphers that don't use rotate amounts that are multiples of 8.  (This was an
intentional design choice by the Speck designers; they do know what they're
doing, actually.)

Thus, we had to be pretty aggressive about dropping algorithms from
consideration if there were preliminary indications that they wouldn't perform
well, or had too little cryptanalysis, or had other issues such as an unclear
patent situation.  Threefish for example I did test the C implementation at
https://github.com/wernerd/Skein3Fish, but on ARM32 it was over 4 times slower
than my NEON implementation of Speck128/256-XTS.  And I did not see a clear way
that it could be improved over 4x with NEON, if at all, so I did not take the
long time it would have taken to write an optimized NEON implementation to
benchmark it properly.  Perhaps that was a mistake.  But, time is not unlimited.

RC5 and RC6 use data-dependent rotates which 

Re: [PATCH v2 0/5] crypto: Speck support

2018-04-24 Thread Eric Biggers
Hi Jason,

On Tue, Apr 24, 2018 at 10:58:35PM +0200, Jason A. Donenfeld wrote:
> Hi Eric,
> 
> On Tue, Apr 24, 2018 at 8:16 PM, Eric Biggers <ebigg...@google.com> wrote:
> > So, what do you propose replacing it with?
> 
> Something more cryptographically justifiable.
> 

It's easy to say that, but do you have an actual suggestion?  As I mentioned,
for disk encryption without AES instructions the main alternatives we've
considered are ChaCha20 with reused nonces, an unpublished wide-block mode based
on ChaCha20 and Poly1305 (with no external cryptanalysis yet, and probably
actually using ChaCha8 or ChaCha12 to meet performance requirements), or the
status quo of no encryption at all.

It *might* be possible to add per-block metadata support to f2fs, in which it
could be used with ChaCha20 in fscrypt.  But if feasible at all it would be
quite difficult (requiring some significant filesystem surgery, and disabling
conflicting filesystem features that allow data to be updated in-place) and
would not cover dm-crypt, nor ext4.

Note also that many other lightweight block ciphers are designed for hardware
and perform poorly in software, e.g. PRESENT is even slower than AES.  Thus
there really weren't many options.

Any concrete suggestions are greatly appreciated!

> > outside crypto review, vs. the many cryptanalysis papers on Speck.  (In that
> > respect the controversy about Speck has actually become an advantage, as it 
> > has
> > received much more cryptanalysis than other lightweight block ciphers.)
> 
> That's the thing that worries me, actually. Many of the design
> decisions behind Speck haven't been justified.
> 

Originally that was true, but later there were significant clarifications
released, e.g. the paper "Notes on the design and analysis of Simon and Speck"
(https://eprint.iacr.org/2017/560.pdf).  In fact, from what I can see, many
competing lightweight block ciphers don't have as much design justification
available as Speck.  Daniel Bernstein's papers are excellent, but unfortunately
he has only designed a stream cipher, not a block cipher or another algorithm
that is applicable for disk encryption.

> > The reason we chose Speck had nothing to do with the proposed ISO standard 
> > or
> > any sociopolitical factors, but rather because it was the only algorithm we
> > could find that met the performance and security requirements.
> 
> > Note that Linux
> > doesn't bow down to any particular standards organization, and it offers
> > algorithms that were specified in various places, even some with no more 
> > than a
> > publication by the author.  In fact, support for SM4 was just added too, 
> > which
> > is a Chinese government standard.  Are you going to send a patch to remove 
> > that
> > too, or is it just NSA designed algorithms that are not okay?
> 
> No need to be belittling; I have much less tinfoil strapped around my
> head than perhaps you think. I'm not blindly opposed to
> government-designed algorithms. Take SHA2, for example -- built by the
> NSA.
> 
> But I do care quite a bit about using ciphers that have acceptance of
> the academic community and a large body of literature documenting its
> design decisions and analyzing it. Some of the best symmetric
> cryptographers in academia have expressed reservations about it, and
> it was just rejected from a major standard's body. Linux, of course,
> is free to disagree -- or "bow down" as you oddly put it -- but I'd
> make sure you've got a pretty large bucket of justifications for that
> disagreement.
> 

There have actually been many papers analyzing Speck.  As with other ciphers,
reduced-round variants have been successfully attacked, while the full variants
have held up.  This is expected.  It's true that some other ciphers such as
ChaCha20 have a higher security margin, which has resulted in some criticism of
Speck.  But the correct security margin is always debatable, and in a
performance-oriented cipher it's desirable to not have an excessive number of
rounds.  In fact it was even looking like ChaCha20 was not going to be fast
enough on some CPUs, so if we went the ChaCha route we may have actually have
had to use ChaCha12 or ChaCha8 instead.

Also, some papers present results for just the weakest variants of Speck
(Speck32 and Speck48) while omitting the strongest (Speck128, the one that's
planned to be offered for Android), presumably because the authors weren't able
to attack it as successfully.  I think that's causing some confusion.

I don't see how the ISO standarization process means much for crypto algorithms.
It seems very political, and actually it seems that people involved were pretty
clear that Speck was rejected primarily for political reasons.  Interestingly,
ChaCha20 is not an ISO standard either.  D
