[PATCH 2/2] xfrm: add rfc4494 AES-CMAC-96 support
Now that CryptoAPI has support for CMAC, we can add support for
AES-CMAC-96 (rfc4494).

Cc: Tom St Denis <tstde...@elliptictech.com>
Signed-off-by: Jussi Kivilinna <jussi.kivili...@iki.fi>
---
 net/xfrm/xfrm_algo.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/xfrm/xfrm_algo.c b/net/xfrm/xfrm_algo.c
index 6fb9d00..ab4ef72 100644
--- a/net/xfrm/xfrm_algo.c
+++ b/net/xfrm/xfrm_algo.c
@@ -311,6 +311,19 @@ static struct xfrm_algo_desc aalg_list[] = {
 		.sadb_alg_maxbits = 128
 	}
 },
+{
+	/* rfc4494 */
+	.name = "cmac(aes)",
+
+	.uinfo = {
+		.auth = {
+			.icv_truncbits = 96,
+			.icv_fullbits = 128,
+		}
+	},
+
+	.pfkey_supported = 0,
+},
 };
 
 static struct xfrm_algo_desc ealg_list[] = {
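For reference, a minimal sketch of how a kernel-side user could exercise the
new algorithm through the shash API once the CMAC template from patch 1/2 is
loaded. The helper below is purely illustrative (IPsec itself reaches
"cmac(aes)" through the xfrm/ahash paths), and error handling is kept minimal:

#include <crypto/hash.h>
#include <linux/err.h>
#include <linux/slab.h>

/* Illustrative helper: compute a full 128-bit AES-CMAC over 'data'.
 * xfrm truncates the result to 96 bits per the descriptor above. */
static int cmac_aes_digest(const u8 *key, unsigned int keylen,
			   const u8 *data, unsigned int len, u8 *out)
{
	struct crypto_shash *tfm;
	struct shash_desc *desc;
	int err;

	tfm = crypto_alloc_shash("cmac(aes)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_shash_setkey(tfm, key, keylen);
	if (err)
		goto out_free_tfm;

	desc = kzalloc(sizeof(*desc) + crypto_shash_descsize(tfm), GFP_KERNEL);
	if (!desc) {
		err = -ENOMEM;
		goto out_free_tfm;
	}
	desc->tfm = tfm;

	err = crypto_shash_digest(desc, data, len, out);

	kfree(desc);
out_free_tfm:
	crypto_free_shash(tfm);
	return err;
}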
Re: [PATCH 1/2] crypto: add CMAC support to CryptoAPI
On Mon, Apr 08, 2013 at 10:48:44AM +0300, Jussi Kivilinna wrote:
> Patch adds support for NIST recommended block cipher mode CMAC to
> CryptoAPI.
>
> This work is based on Tom St Denis' earlier patch,
> http://marc.info/?l=linux-crypto-vgerm=135877306305466w=2
>
> Cc: Tom St Denis <tstde...@elliptictech.com>
> Signed-off-by: Jussi Kivilinna <jussi.kivili...@iki.fi>

This patch does not apply cleanly to the ipsec-next tree because of some
crypto changes I don't have in ipsec-next. The IPsec part should apply to
the cryptodev tree, so it's probably best if we route this patchset
through the cryptodev tree.

Herbert, are you going to take these patches?
Re: [PATCH 1/2] crypto: add CMAC support to CryptoAPI
On 08.04.2013 11:24, Steffen Klassert wrote:
> On Mon, Apr 08, 2013 at 10:48:44AM +0300, Jussi Kivilinna wrote:
>> Patch adds support for NIST recommended block cipher mode CMAC to
>> CryptoAPI.
>>
>> This work is based on Tom St Denis' earlier patch,
>> http://marc.info/?l=linux-crypto-vgerm=135877306305466w=2
>>
>> Cc: Tom St Denis <tstde...@elliptictech.com>
>> Signed-off-by: Jussi Kivilinna <jussi.kivili...@iki.fi>
>
> This patch does not apply cleanly to the ipsec-next tree because of some
> crypto changes I don't have in ipsec-next. The IPsec part should apply to
> the cryptodev tree, so it's probably best if we route this patchset
> through the cryptodev tree.

I should have mentioned that the patchset is on top of the cryptodev tree
and the previous crypto patches that I sent yesterday, which are likely to
cause conflicts at least in tcrypt.c:

http://marc.info/?l=linux-crypto-vgerm=136534223503368w=2

-Jussi

> Herbert, are you going to take these patches?
Re: [PATCH 1/2] crypto: add CMAC support to CryptoAPI
On Mon, Apr 08, 2013 at 10:24:16AM +0200, Steffen Klassert wrote:
> On Mon, Apr 08, 2013 at 10:48:44AM +0300, Jussi Kivilinna wrote:
>> Patch adds support for NIST recommended block cipher mode CMAC to
>> CryptoAPI.
>>
>> This work is based on Tom St Denis' earlier patch,
>> http://marc.info/?l=linux-crypto-vgerm=135877306305466w=2
>>
>> Cc: Tom St Denis <tstde...@elliptictech.com>
>> Signed-off-by: Jussi Kivilinna <jussi.kivili...@iki.fi>
>
> This patch does not apply cleanly to the ipsec-next tree because of some
> crypto changes I don't have in ipsec-next. The IPsec part should apply to
> the cryptodev tree, so it's probably best if we route this patchset
> through the cryptodev tree.
>
> Herbert, are you going to take these patches?

Sure, I can do that.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: questions of crypto async api
Thanks for the answer. I have further questions on the same subject.

With regard to the commit in talitos.c (attached at the end of this mail),
the driver submits requests of the same tfm to the same channel to ensure
ordering. Is it because the tfm context needs to be maintained from one
operation to the next? E.g., aead_givencrypt() generates an IV based on the
previous IV result stored in the tfm.

If requests are instead sent to different channels dynamically, and the
driver reorders the completion callbacks as requests finish in hardware,
what would happen?

Thanks in advance.

Chemin

commit 5228f0f79e983c2b39c202c75af901ceb0003fc1
Author: Kim Phillips kim.phill...@freescale.com
Date:   Fri Jul 15 11:21:38 2011 +0800

    crypto: talitos - ensure request ordering within a single tfm

    Assign a single target channel per tfm in talitos_cra_init instead of
    performing channel scheduling dynamically during the encryption
    request. This changes the talitos_submit interface to accept a new
    channel number argument.

    Without this, rapid bursts of misc. sized requests could make it
    possible for IPsec packets to be encrypted out-of-order, which would
    result in packet drops due to sequence numbers falling outside the
    anti-replay window on a peer gateway.

    Signed-off-by: Kim Phillips kim.phill...@freescale.com
    Signed-off-by: Herbert Xu herb...@gondor.apana.org.au

-----Original Message-----
From: Kim Phillips [mailto:kim.phill...@freescale.com]
Sent: Friday, April 05, 2013 6:33 PM
To: Hsieh, Che-Min
Cc: linux-crypto@vger.kernel.org
Subject: Re: questions of crypto async api

On Thu, 4 Apr 2013 14:38:41 + Hsieh, Che-Min chem...@qti.qualcomm.com wrote:

> If a driver supports multiple instances of HW crypto engines, the order of
> request completion from the HW can differ from the order in which requests
> were submitted to the different engines. The 2nd request, sent to the 2nd
> HW instance, may take less time to complete than the first request on a
> different HW instance. Is the driver responsible for re-ordering the
> completion callbacks, or are the agents (such as the IP protocol stack)
> responsible for reordering? How does pcrypt do it?
>
> Does it make sense for a transform to have multiple requests outstanding
> to the async crypto API?

see: http://comments.gmane.org/gmane.linux.kernel.cryptoapi/5350

> Is scatterwalk_sg_next() the preferred method over sg_next()? Why?

scatterwalk_* is the crypto subsystem's version of the function, so yes.

> sg_copy_to_buffer() and sg_copy_from_buffer() -> sg_copy_buffer() ->
> sg_miter_next() -> sg_next()
>
> Sometimes sg_copy_to_buffer() and sg_copy_from_buffer() in our driver do
> not copy the whole list. We have to rewrite those functions by using
> scatterwalk_sg_next() to walk down the list. Is this the correct behavior?

sounds like you're on the right track, although buffers shouldn't be being
copied that often, if at all.

Kim
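As an aside on the sg_copy_to_buffer()/sg_copy_from_buffer() issue raised
above, a minimal, illustrative sketch (not the driver code under discussion)
of linearizing scatterlist data with the crypto layer's own walker, which
follows chained entries the same way the rest of the crypto subsystem does:

#include <crypto/scatterwalk.h>
#include <linux/scatterlist.h>

/* Copy 'nbytes' starting at 'offset' out of the scatterlist into 'buf'.
 * The final argument selects direction: 0 = sg -> buf, 1 = buf -> sg. */
static void copy_from_sg(void *buf, struct scatterlist *sg,
			 unsigned int offset, unsigned int nbytes)
{
	scatterwalk_map_and_copy(buf, sg, offset, nbytes, 0);
}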
[PATCH 1/5] crypto: x86 - add more optimized XTS-mode for serpent-avx
This patch adds AVX optimized XTS-mode helper functions/macros and converts serpent-avx to use the new facilities. Benefits are slightly improved speed and reduced stack usage as use of temporary IV-array is avoided. tcrypt results, with Intel i5-2450M: enc dec 16B 1.00x 1.00x 64B 1.00x 1.00x 256B1.04x 1.06x 1024B 1.09x 1.09x 8192B 1.10x 1.09x Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi --- arch/x86/crypto/glue_helper-asm-avx.S | 61 + arch/x86/crypto/glue_helper.c | 97 +++ arch/x86/crypto/serpent-avx-x86_64-asm_64.S | 45 - arch/x86/crypto/serpent_avx_glue.c | 87 +--- arch/x86/include/asm/crypto/glue_helper.h | 24 +++ arch/x86/include/asm/crypto/serpent-avx.h |5 + 6 files changed, 273 insertions(+), 46 deletions(-) diff --git a/arch/x86/crypto/glue_helper-asm-avx.S b/arch/x86/crypto/glue_helper-asm-avx.S index f7b6ea2..02ee230 100644 --- a/arch/x86/crypto/glue_helper-asm-avx.S +++ b/arch/x86/crypto/glue_helper-asm-avx.S @@ -1,7 +1,7 @@ /* * Shared glue code for 128bit block ciphers, AVX assembler macros * - * Copyright (c) 2012 Jussi Kivilinna jussi.kivili...@mbnet.fi + * Copyright © 2012-2013 Jussi Kivilinna jussi.kivili...@iki.fi * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -89,3 +89,62 @@ vpxor (6*16)(src), x6, x6; \ vpxor (7*16)(src), x7, x7; \ store_8way(dst, x0, x1, x2, x3, x4, x5, x6, x7); + +#define gf128mul_x_ble(iv, mask, tmp) \ + vpsrad $31, iv, tmp; \ + vpaddq iv, iv, iv; \ + vpshufd $0x13, tmp, tmp; \ + vpand mask, tmp, tmp; \ + vpxor tmp, iv, iv; + +#define load_xts_8way(iv, src, dst, x0, x1, x2, x3, x4, x5, x6, x7, tiv, t0, \ + t1, xts_gf128mul_and_shl1_mask) \ + vmovdqa xts_gf128mul_and_shl1_mask, t0; \ + \ + /* load IV */ \ + vmovdqu (iv), tiv; \ + vpxor (0*16)(src), tiv, x0; \ + vmovdqu tiv, (0*16)(dst); \ + \ + /* construct and store IVs, also xor with source */ \ + gf128mul_x_ble(tiv, t0, t1); \ + vpxor (1*16)(src), tiv, x1; \ + vmovdqu tiv, (1*16)(dst); \ + \ + gf128mul_x_ble(tiv, t0, t1); \ + vpxor (2*16)(src), tiv, x2; \ + vmovdqu tiv, (2*16)(dst); \ + \ + gf128mul_x_ble(tiv, t0, t1); \ + vpxor (3*16)(src), tiv, x3; \ + vmovdqu tiv, (3*16)(dst); \ + \ + gf128mul_x_ble(tiv, t0, t1); \ + vpxor (4*16)(src), tiv, x4; \ + vmovdqu tiv, (4*16)(dst); \ + \ + gf128mul_x_ble(tiv, t0, t1); \ + vpxor (5*16)(src), tiv, x5; \ + vmovdqu tiv, (5*16)(dst); \ + \ + gf128mul_x_ble(tiv, t0, t1); \ + vpxor (6*16)(src), tiv, x6; \ + vmovdqu tiv, (6*16)(dst); \ + \ + gf128mul_x_ble(tiv, t0, t1); \ + vpxor (7*16)(src), tiv, x7; \ + vmovdqu tiv, (7*16)(dst); \ + \ + gf128mul_x_ble(tiv, t0, t1); \ + vmovdqu tiv, (iv); + +#define store_xts_8way(dst, x0, x1, x2, x3, x4, x5, x6, x7) \ + vpxor (0*16)(dst), x0, x0; \ + vpxor (1*16)(dst), x1, x1; \ + vpxor (2*16)(dst), x2, x2; \ + vpxor (3*16)(dst), x3, x3; \ + vpxor (4*16)(dst), x4, x4; \ + vpxor (5*16)(dst), x5, x5; \ + vpxor (6*16)(dst), x6, x6; \ + vpxor (7*16)(dst), x7, x7; \ + store_8way(dst, x0, x1, x2, x3, x4, x5, x6, x7); diff --git a/arch/x86/crypto/glue_helper.c b/arch/x86/crypto/glue_helper.c index 22ce4f6..432f1d76 100644 --- a/arch/x86/crypto/glue_helper.c +++ b/arch/x86/crypto/glue_helper.c @@ -1,7 +1,7 @@ /* * Shared glue code for 128bit block ciphers * - * Copyright (c) 2012 Jussi Kivilinna jussi.kivili...@mbnet.fi + * Copyright © 2012-2013 Jussi Kivilinna jussi.kivili...@iki.fi * * CBC ECB parts based on code (crypto/cbc.c,ecb.c) by: * Copyright (c) 2006 Herbert Xu herb...@gondor.apana.org.au @@ -304,4 +304,99 @@ int 
glue_ctr_crypt_128bit(const struct common_glue_ctx *gctx, } EXPORT_SYMBOL_GPL(glue_ctr_crypt_128bit); +static unsigned int __glue_xts_crypt_128bit(const struct common_glue_ctx *gctx, + void *ctx, + struct blkcipher_desc *desc, + struct blkcipher_walk *walk) +{ + const unsigned int bsize = 128 / 8; + unsigned int nbytes = walk-nbytes; + u128 *src = (u128 *)walk-src.virt.addr; + u128 *dst = (u128 *)walk-dst.virt.addr; + unsigned int num_blocks, func_bytes; + unsigned int i; + + /* Process multi-block batch */ + for (i = 0; i gctx-num_funcs; i++) { + num_blocks = gctx-funcs[i].num_blocks; + func_bytes = bsize * num_blocks; + + if (nbytes = func_bytes) { +
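For readers of the assembly above: gf128mul_x_ble computes the next XTS
tweak. A plain-C sketch of the same operation, with the 128-bit tweak held
as two little-endian 64-bit words (illustrative only, not part of the patch):

#include <stdint.h>

/* Double a 128-bit XTS tweak in GF(2^128): shift the little-endian value
 * left by one bit and, if bit 127 was set, XOR the reduction constant 0x87
 * into the low byte. t[0] holds the low 64 bits, t[1] the high 64 bits. */
static void xts_tweak_double(uint64_t t[2])
{
	uint64_t carry = t[1] >> 63;		/* bit 127 before the shift */

	t[1] = (t[1] << 1) | (t[0] >> 63);
	t[0] = (t[0] << 1) ^ (carry * 0x87);	/* conditional reduction */
}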
[PATCH 3/5] crypto: cast6-avx: use new optimized XTS code
Change cast6-avx to use the new XTS code, for smaller stack usage and small boost to performance. tcrypt results, with Intel i5-2450M: enc dec 16B 1.01x 1.01x 64B 1.01x 1.00x 256B1.09x 1.02x 1024B 1.08x 1.06x 8192B 1.08x 1.07x Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi --- arch/x86/crypto/cast6-avx-x86_64-asm_64.S | 48 +++ arch/x86/crypto/cast6_avx_glue.c | 91 - 2 files changed, 98 insertions(+), 41 deletions(-) diff --git a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S index f93b610..e3531f8 100644 --- a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S +++ b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S @@ -4,7 +4,7 @@ * Copyright (C) 2012 Johannes Goetzfried * johannes.goetzfr...@informatik.stud.uni-erlangen.de * - * Copyright © 2012 Jussi Kivilinna jussi.kivili...@mbnet.fi + * Copyright © 2012-2013 Jussi Kivilinna jussi.kivili...@iki.fi * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -227,6 +227,8 @@ .data .align 16 +.Lxts_gf128mul_and_shl1_mask: + .byte 0x87, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 .Lbswap_mask: .byte 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 .Lbswap128_mask: @@ -424,3 +426,47 @@ ENTRY(cast6_ctr_8way) ret; ENDPROC(cast6_ctr_8way) + +ENTRY(cast6_xts_enc_8way) + /* input: +* %rdi: ctx, CTX +* %rsi: dst +* %rdx: src +* %rcx: iv (t ⊕ αⁿ ∈ GF(2¹²⁸)) +*/ + + movq %rsi, %r11; + + /* regs = src, dst = IVs, regs = regs xor IVs */ + load_xts_8way(%rcx, %rdx, %rsi, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2, + RX, RKR, RKM, .Lxts_gf128mul_and_shl1_mask); + + call __cast6_enc_blk8; + + /* dst = regs xor IVs(in dst) */ + store_xts_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2); + + ret; +ENDPROC(cast6_xts_enc_8way) + +ENTRY(cast6_xts_dec_8way) + /* input: +* %rdi: ctx, CTX +* %rsi: dst +* %rdx: src +* %rcx: iv (t ⊕ αⁿ ∈ GF(2¹²⁸)) +*/ + + movq %rsi, %r11; + + /* regs = src, dst = IVs, regs = regs xor IVs */ + load_xts_8way(%rcx, %rdx, %rsi, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2, + RX, RKR, RKM, .Lxts_gf128mul_and_shl1_mask); + + call __cast6_dec_blk8; + + /* dst = regs xor IVs(in dst) */ + store_xts_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2); + + ret; +ENDPROC(cast6_xts_dec_8way) diff --git a/arch/x86/crypto/cast6_avx_glue.c b/arch/x86/crypto/cast6_avx_glue.c index 92f7ca2..8d0dfb8 100644 --- a/arch/x86/crypto/cast6_avx_glue.c +++ b/arch/x86/crypto/cast6_avx_glue.c @@ -4,6 +4,8 @@ * Copyright (C) 2012 Johannes Goetzfried * johannes.goetzfr...@informatik.stud.uni-erlangen.de * + * Copyright © 2013 Jussi Kivilinna jussi.kivili...@iki.fi + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -50,6 +52,23 @@ asmlinkage void cast6_cbc_dec_8way(struct cast6_ctx *ctx, u8 *dst, asmlinkage void cast6_ctr_8way(struct cast6_ctx *ctx, u8 *dst, const u8 *src, le128 *iv); +asmlinkage void cast6_xts_enc_8way(struct cast6_ctx *ctx, u8 *dst, + const u8 *src, le128 *iv); +asmlinkage void cast6_xts_dec_8way(struct cast6_ctx *ctx, u8 *dst, + const u8 *src, le128 *iv); + +static void cast6_xts_enc(void *ctx, u128 *dst, const u128 *src, le128 *iv) +{ + glue_xts_crypt_128bit_one(ctx, dst, src, iv, + GLUE_FUNC_CAST(__cast6_encrypt)); +} + +static void cast6_xts_dec(void *ctx, u128 *dst, const u128 *src, le128 *iv) +{ + glue_xts_crypt_128bit_one(ctx, dst, src, iv, + 
GLUE_FUNC_CAST(__cast6_decrypt)); +} + static void cast6_crypt_ctr(void *ctx, u128 *dst, const u128 *src, le128 *iv) { be128 ctrblk; @@ -87,6 +106,19 @@ static const struct common_glue_ctx cast6_ctr = { } } }; +static const struct common_glue_ctx cast6_enc_xts = { + .num_funcs = 2, + .fpu_blocks_limit = CAST6_PARALLEL_BLOCKS, + + .funcs = { { + .num_blocks = CAST6_PARALLEL_BLOCKS, + .fn_u = { .xts = GLUE_XTS_FUNC_CAST(cast6_xts_enc_8way) } + }, { + .num_blocks = 1, + .fn_u = { .xts = GLUE_XTS_FUNC_CAST(cast6_xts_enc) } + } } +}; + static const struct common_glue_ctx cast6_dec = { .num_funcs = 2, .fpu_blocks_limit = CAST6_PARALLEL_BLOCKS, @@ -113,6 +145,19 @@ static const struct common_glue_ctx cast6_dec_cbc = { } } }; +static
[PATCH 2/5] crypto: x86/twofish-avx - use optimized XTS code
Change twofish-avx to use the new XTS code, for smaller stack usage and small boost to performance. tcrypt results, with Intel i5-2450M: enc dec 16B 1.03x 1.02x 64B 0.91x 0.91x 256B1.10x 1.09x 1024B 1.12x 1.11x 8192B 1.12x 1.11x Since XTS is practically always used with data blocks of size 512 bytes or more, I chose to not make use of twofish-3way for block sized smaller than 128 bytes. This causes slower result in tcrypt for 64 bytes. Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi --- arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 48 ++ arch/x86/crypto/twofish_avx_glue.c | 91 +++ 2 files changed, 98 insertions(+), 41 deletions(-) diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S index 8d3e113..0505813 100644 --- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S +++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S @@ -4,7 +4,7 @@ * Copyright (C) 2012 Johannes Goetzfried * johannes.goetzfr...@informatik.stud.uni-erlangen.de * - * Copyright © 2012 Jussi Kivilinna jussi.kivili...@mbnet.fi + * Copyright © 2012-2013 Jussi Kivilinna jussi.kivili...@iki.fi * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -33,6 +33,8 @@ .Lbswap128_mask: .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 +.Lxts_gf128mul_and_shl1_mask: + .byte 0x87, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 .text @@ -408,3 +410,47 @@ ENTRY(twofish_ctr_8way) ret; ENDPROC(twofish_ctr_8way) + +ENTRY(twofish_xts_enc_8way) + /* input: +* %rdi: ctx, CTX +* %rsi: dst +* %rdx: src +* %rcx: iv (t ⊕ αⁿ ∈ GF(2¹²⁸)) +*/ + + movq %rsi, %r11; + + /* regs = src, dst = IVs, regs = regs xor IVs */ + load_xts_8way(%rcx, %rdx, %rsi, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2, + RX0, RX1, RY0, .Lxts_gf128mul_and_shl1_mask); + + call __twofish_enc_blk8; + + /* dst = regs xor IVs(in dst) */ + store_xts_8way(%r11, RC1, RD1, RA1, RB1, RC2, RD2, RA2, RB2); + + ret; +ENDPROC(twofish_xts_enc_8way) + +ENTRY(twofish_xts_dec_8way) + /* input: +* %rdi: ctx, CTX +* %rsi: dst +* %rdx: src +* %rcx: iv (t ⊕ αⁿ ∈ GF(2¹²⁸)) +*/ + + movq %rsi, %r11; + + /* regs = src, dst = IVs, regs = regs xor IVs */ + load_xts_8way(%rcx, %rdx, %rsi, RC1, RD1, RA1, RB1, RC2, RD2, RA2, RB2, + RX0, RX1, RY0, .Lxts_gf128mul_and_shl1_mask); + + call __twofish_dec_blk8; + + /* dst = regs xor IVs(in dst) */ + store_xts_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2); + + ret; +ENDPROC(twofish_xts_dec_8way) diff --git a/arch/x86/crypto/twofish_avx_glue.c b/arch/x86/crypto/twofish_avx_glue.c index 94ac91d..a62ba54 100644 --- a/arch/x86/crypto/twofish_avx_glue.c +++ b/arch/x86/crypto/twofish_avx_glue.c @@ -4,6 +4,8 @@ * Copyright (C) 2012 Johannes Goetzfried * johannes.goetzfr...@informatik.stud.uni-erlangen.de * + * Copyright © 2013 Jussi Kivilinna jussi.kivili...@iki.fi + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -56,12 +58,29 @@ asmlinkage void twofish_cbc_dec_8way(struct twofish_ctx *ctx, u8 *dst, asmlinkage void twofish_ctr_8way(struct twofish_ctx *ctx, u8 *dst, const u8 *src, le128 *iv); +asmlinkage void twofish_xts_enc_8way(struct twofish_ctx *ctx, u8 *dst, +const u8 *src, le128 *iv); +asmlinkage void twofish_xts_dec_8way(struct twofish_ctx *ctx, u8 *dst, +const u8 *src, le128 *iv); + static inline void twofish_enc_blk_3way(struct twofish_ctx *ctx, u8 *dst, 
const u8 *src) { __twofish_enc_blk_3way(ctx, dst, src, false); } +static void twofish_xts_enc(void *ctx, u128 *dst, const u128 *src, le128 *iv) +{ + glue_xts_crypt_128bit_one(ctx, dst, src, iv, + GLUE_FUNC_CAST(twofish_enc_blk)); +} + +static void twofish_xts_dec(void *ctx, u128 *dst, const u128 *src, le128 *iv) +{ + glue_xts_crypt_128bit_one(ctx, dst, src, iv, + GLUE_FUNC_CAST(twofish_dec_blk)); +} + static const struct common_glue_ctx twofish_enc = { .num_funcs = 3, @@ -95,6 +114,19 @@ static const struct common_glue_ctx twofish_ctr = { } } }; +static const struct common_glue_ctx twofish_enc_xts = { + .num_funcs = 2, + .fpu_blocks_limit = TWOFISH_PARALLEL_BLOCKS, + + .funcs = { { + .num_blocks = TWOFISH_PARALLEL_BLOCKS, +
[PATCH 4/5] crypto: x86/camellia-aesni-avx - add more optimized XTS code
Add more optimized XTS code for camellia-aesni-avx, for smaller stack usage and small boost for speed. tcrypt results, with Intel i5-2450M: enc dec 16B 1.10x 1.01x 64B 0.82x 0.77x 256B1.14x 1.10x 1024B 1.17x 1.16x 8192B 1.10x 1.11x Since XTS is practically always used with data blocks of size 512 bytes or more, I chose to not make use of camellia-2way for block sized smaller than 256 bytes. This causes slower result in tcrypt for 64 bytes. Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi --- arch/x86/crypto/camellia-aesni-avx-asm_64.S | 180 +++ arch/x86/crypto/camellia_aesni_avx_glue.c | 91 -- 2 files changed, 229 insertions(+), 42 deletions(-) diff --git a/arch/x86/crypto/camellia-aesni-avx-asm_64.S b/arch/x86/crypto/camellia-aesni-avx-asm_64.S index cfc1634..ce71f92 100644 --- a/arch/x86/crypto/camellia-aesni-avx-asm_64.S +++ b/arch/x86/crypto/camellia-aesni-avx-asm_64.S @@ -1,7 +1,7 @@ /* * x86_64/AVX/AES-NI assembler implementation of Camellia * - * Copyright © 2012 Jussi Kivilinna jussi.kivili...@mbnet.fi + * Copyright © 2012-2013 Jussi Kivilinna jussi.kivili...@iki.fi * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -589,6 +589,10 @@ ENDPROC(roundsm16_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab) .Lbswap128_mask: .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 +/* For XTS mode IV generation */ +.Lxts_gf128mul_and_shl1_mask: + .byte 0x87, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 + /* * pre-SubByte transform * @@ -1090,3 +1094,177 @@ ENTRY(camellia_ctr_16way) ret; ENDPROC(camellia_ctr_16way) + +#define gf128mul_x_ble(iv, mask, tmp) \ + vpsrad $31, iv, tmp; \ + vpaddq iv, iv, iv; \ + vpshufd $0x13, tmp, tmp; \ + vpand mask, tmp, tmp; \ + vpxor tmp, iv, iv; + +.align 8 +camellia_xts_crypt_16way: + /* input: +* %rdi: ctx, CTX +* %rsi: dst (16 blocks) +* %rdx: src (16 blocks) +* %rcx: iv (t ⊕ αⁿ ∈ GF(2¹²⁸)) +* %r8: index for input whitening key +* %r9: pointer to __camellia_enc_blk16 or __camellia_dec_blk16 +*/ + + subq $(16 * 16), %rsp; + movq %rsp, %rax; + + vmovdqa .Lxts_gf128mul_and_shl1_mask, %xmm14; + + /* load IV */ + vmovdqu (%rcx), %xmm0; + vpxor 0 * 16(%rdx), %xmm0, %xmm15; + vmovdqu %xmm15, 15 * 16(%rax); + vmovdqu %xmm0, 0 * 16(%rsi); + + /* construct IVs */ + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 1 * 16(%rdx), %xmm0, %xmm15; + vmovdqu %xmm15, 14 * 16(%rax); + vmovdqu %xmm0, 1 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 2 * 16(%rdx), %xmm0, %xmm13; + vmovdqu %xmm0, 2 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 3 * 16(%rdx), %xmm0, %xmm12; + vmovdqu %xmm0, 3 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 4 * 16(%rdx), %xmm0, %xmm11; + vmovdqu %xmm0, 4 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 5 * 16(%rdx), %xmm0, %xmm10; + vmovdqu %xmm0, 5 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 6 * 16(%rdx), %xmm0, %xmm9; + vmovdqu %xmm0, 6 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 7 * 16(%rdx), %xmm0, %xmm8; + vmovdqu %xmm0, 7 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 8 * 16(%rdx), %xmm0, %xmm7; + vmovdqu %xmm0, 8 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 9 * 16(%rdx), %xmm0, %xmm6; + vmovdqu %xmm0, 9 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 10 * 16(%rdx), %xmm0, %xmm5; + vmovdqu %xmm0, 10 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 11 * 16(%rdx), %xmm0, %xmm4; + 
vmovdqu %xmm0, 11 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 12 * 16(%rdx), %xmm0, %xmm3; + vmovdqu %xmm0, 12 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 13 * 16(%rdx), %xmm0, %xmm2; + vmovdqu %xmm0, 13 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 14 * 16(%rdx), %xmm0, %xmm1; + vmovdqu %xmm0, 14 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vpxor 15 * 16(%rdx), %xmm0, %xmm15; + vmovdqu %xmm15, 0 * 16(%rax); + vmovdqu %xmm0, 15 * 16(%rsi); + + gf128mul_x_ble(%xmm0, %xmm14, %xmm15); + vmovdqu %xmm0, (%rcx); + + /* inpack16_pre: */ + vmovq (key_table)(CTX, %r8, 8), %xmm15; + vpshufb .Lpack_bswap, %xmm15, %xmm15; + vpxor 0 * 16(%rax), %xmm15, %xmm0; + vpxor %xmm1, %xmm15, %xmm1; + vpxor %xmm2, %xmm15, %xmm2; + vpxor %xmm3,
[PATCH 5/5] crypto: aesni_intel - add more optimized XTS mode for x86-64
Add more optimized XTS code for aesni_intel in 64-bit mode, for smaller stack usage and boost for speed. tcrypt results, with Intel i5-2450M: 256-bit key enc dec 16B 0.98x 0.99x 64B 0.64x 0.63x 256B1.29x 1.32x 1024B 1.54x 1.58x 8192B 1.57x 1.60x 512-bit key enc dec 16B 0.98x 0.99x 64B 0.60x 0.59x 256B1.24x 1.25x 1024B 1.39x 1.42x 8192B 1.38x 1.42x I chose not to optimize smaller than block size of 256 bytes, since XTS is practically always used with data blocks of size 512 bytes. This is why performance is reduced in tcrypt for 64 byte long blocks. Cc: Huang Ying ying.hu...@intel.com Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi --- arch/x86/crypto/aesni-intel_asm.S | 117 arch/x86/crypto/aesni-intel_glue.c | 80 + crypto/Kconfig |1 3 files changed, 198 insertions(+) diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S index 04b7977..62fe22c 100644 --- a/arch/x86/crypto/aesni-intel_asm.S +++ b/arch/x86/crypto/aesni-intel_asm.S @@ -34,6 +34,10 @@ #ifdef __x86_64__ .data +.align 16 +.Lgf128mul_x_ble_mask: + .octa 0x00010087 + POLY: .octa 0xC201 TWOONE: .octa 0x00010001 @@ -105,6 +109,8 @@ enc:.octa 0x2 #define CTR%xmm11 #define INC%xmm12 +#define GF128MUL_MASK %xmm10 + #ifdef __x86_64__ #define AREG %rax #define KEYP %rdi @@ -2636,4 +2642,115 @@ ENTRY(aesni_ctr_enc) .Lctr_enc_just_ret: ret ENDPROC(aesni_ctr_enc) + +/* + * _aesni_gf128mul_x_ble: internal ABI + * Multiply in GF(2^128) for XTS IVs + * input: + * IV: current IV + * GF128MUL_MASK == mask with 0x87 and 0x01 + * output: + * IV: next IV + * changed: + * CTR:== temporary value + */ +#define _aesni_gf128mul_x_ble() \ + pshufd $0x13, IV, CTR; \ + paddq IV, IV; \ + psrad $31, CTR; \ + pand GF128MUL_MASK, CTR; \ + pxor CTR, IV; + +/* + * void aesni_xts_crypt8(struct crypto_aes_ctx *ctx, const u8 *dst, u8 *src, + * bool enc, u8 *iv) + */ +ENTRY(aesni_xts_crypt8) + cmpb $0, %cl + movl $0, %ecx + movl $240, %r10d + leaq _aesni_enc4, %r11 + leaq _aesni_dec4, %rax + cmovel %r10d, %ecx + cmoveq %rax, %r11 + + movdqa .Lgf128mul_x_ble_mask, GF128MUL_MASK + movups (IVP), IV + + mov 480(KEYP), KLEN + addq %rcx, KEYP + + movdqa IV, STATE1 + pxor 0x00(INP), STATE1 + movdqu IV, 0x00(OUTP) + + _aesni_gf128mul_x_ble() + movdqa IV, STATE2 + pxor 0x10(INP), STATE2 + movdqu IV, 0x10(OUTP) + + _aesni_gf128mul_x_ble() + movdqa IV, STATE3 + pxor 0x20(INP), STATE3 + movdqu IV, 0x20(OUTP) + + _aesni_gf128mul_x_ble() + movdqa IV, STATE4 + pxor 0x30(INP), STATE4 + movdqu IV, 0x30(OUTP) + + call *%r11 + + pxor 0x00(OUTP), STATE1 + movdqu STATE1, 0x00(OUTP) + + _aesni_gf128mul_x_ble() + movdqa IV, STATE1 + pxor 0x40(INP), STATE1 + movdqu IV, 0x40(OUTP) + + pxor 0x10(OUTP), STATE2 + movdqu STATE2, 0x10(OUTP) + + _aesni_gf128mul_x_ble() + movdqa IV, STATE2 + pxor 0x50(INP), STATE2 + movdqu IV, 0x50(OUTP) + + pxor 0x20(OUTP), STATE3 + movdqu STATE3, 0x20(OUTP) + + _aesni_gf128mul_x_ble() + movdqa IV, STATE3 + pxor 0x60(INP), STATE3 + movdqu IV, 0x60(OUTP) + + pxor 0x30(OUTP), STATE4 + movdqu STATE4, 0x30(OUTP) + + _aesni_gf128mul_x_ble() + movdqa IV, STATE4 + pxor 0x70(INP), STATE4 + movdqu IV, 0x70(OUTP) + + _aesni_gf128mul_x_ble() + movups IV, (IVP) + + call *%r11 + + pxor 0x40(OUTP), STATE1 + movdqu STATE1, 0x40(OUTP) + + pxor 0x50(OUTP), STATE2 + movdqu STATE2, 0x50(OUTP) + + pxor 0x60(OUTP), STATE3 + movdqu STATE3, 0x60(OUTP) + + pxor 0x70(OUTP), STATE4 + movdqu STATE4, 0x70(OUTP) + + ret +ENDPROC(aesni_xts_crypt8) + #endif diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index 
a0795da..f80e668 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -39,6 +39,9 @@ #include crypto/internal/aead.h #include linux/workqueue.h #include linux/spinlock.h +#ifdef CONFIG_X86_64 +#include asm/crypto/glue_helper.h +#endif #if defined(CONFIG_CRYPTO_PCBC) || defined(CONFIG_CRYPTO_PCBC_MODULE) #define HAS_PCBC @@ -102,6 +105,9 @@ void crypto_fpu_exit(void); asmlinkage void aesni_ctr_enc(struct crypto_aes_ctx *ctx, u8 *out, const u8 *in, unsigned int len, u8 *iv); +asmlinkage void aesni_xts_crypt8(struct crypto_aes_ctx *ctx, u8 *out, +const u8 *in,
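For completeness, a hedged single-block sketch of the XTS data path that the
8-way assembly batches. aes_encrypt_block() is a hypothetical stand-in for
the AES core, and xts_tweak_double() is the GF(2^128) doubling sketched
earlier in this series:

#include <stdint.h>

/* hypothetical primitive: encrypt one 16-byte block of 'in' with 'key' */
void aes_encrypt_block(const void *key, uint8_t out[16], const uint8_t in[16]);
void xts_tweak_double(uint64_t t[2]);

/* One XTS block: C = E_K1(P ^ T) ^ T, then advance T for the next block. */
static void xts_encrypt_one(const void *key, uint8_t dst[16],
			    const uint8_t src[16], uint64_t tweak[2])
{
	const uint8_t *t = (const uint8_t *)tweak;
	uint8_t buf[16];
	int i;

	for (i = 0; i < 16; i++)		/* P ^ T */
		buf[i] = src[i] ^ t[i];

	aes_encrypt_block(key, buf, buf);	/* E_K1(P ^ T) */

	for (i = 0; i < 16; i++)		/* ... ^ T */
		dst[i] = buf[i] ^ t[i];

	xts_tweak_double(tweak);		/* T := T * alpha */
}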
Re: [PATCH 1/2] crypto: add CMAC support to CryptoAPI
From: Herbert Xu herb...@gondor.apana.org.au
Date: Mon, 8 Apr 2013 17:33:40 +0800

> On Mon, Apr 08, 2013 at 10:24:16AM +0200, Steffen Klassert wrote:
>> On Mon, Apr 08, 2013 at 10:48:44AM +0300, Jussi Kivilinna wrote:
>>> Patch adds support for NIST recommended block cipher mode CMAC to
>>> CryptoAPI.
>>>
>>> This work is based on Tom St Denis' earlier patch,
>>> http://marc.info/?l=linux-crypto-vgerm=135877306305466w=2
>>>
>>> Cc: Tom St Denis <tstde...@elliptictech.com>
>>> Signed-off-by: Jussi Kivilinna <jussi.kivili...@iki.fi>
>>
>> This patch does not apply cleanly to the ipsec-next tree because of some
>> crypto changes I don't have in ipsec-next. The IPsec part should apply to
>> the cryptodev tree, so it's probably best if we route this patchset
>> through the cryptodev tree.
>>
>> Herbert, are you going to take these patches?
>
> Sure, I can do that.

I'm fine with this:

Acked-by: David S. Miller <da...@davemloft.net>
Re: questions of crypto async api
On Mon, 8 Apr 2013 13:49:58 + Hsieh, Che-Min chem...@qti.qualcomm.com wrote:

> Thanks for the answer. I have further questions on the same subject.
>
> With regard to the commit in talitos.c (attached at the end of this mail),
> the driver submits requests of the same tfm to the same channel to ensure
> ordering. Is it because the tfm context needs to be maintained from one
> operation to the next? E.g., aead_givencrypt() generates an IV based on
> the previous IV result stored in the tfm.

is that what the commit text says?

> If requests are instead sent to different channels dynamically, and the
> driver reorders the completion callbacks as requests finish in hardware,
> what would happen?

about the same thing as wrapping the driver with pcrypt?  why not use the
h/w to maintain ordering?

Kim

> Thanks in advance.
>
> Chemin
>
> commit 5228f0f79e983c2b39c202c75af901ceb0003fc1
> Author: Kim Phillips kim.phill...@freescale.com
> Date:   Fri Jul 15 11:21:38 2011 +0800
>
>     crypto: talitos - ensure request ordering within a single tfm
>
>     Assign a single target channel per tfm in talitos_cra_init instead of
>     performing channel scheduling dynamically during the encryption
>     request. This changes the talitos_submit interface to accept a new
>     channel number argument.
>
>     Without this, rapid bursts of misc. sized requests could make it
>     possible for IPsec packets to be encrypted out-of-order, which would
>     result in packet drops due to sequence numbers falling outside the
>     anti-replay window on a peer gateway.
>
>     Signed-off-by: Kim Phillips kim.phill...@freescale.com
>     Signed-off-by: Herbert Xu herb...@gondor.apana.org.au
>
> -----Original Message-----
> From: Kim Phillips [mailto:kim.phill...@freescale.com]
> Sent: Friday, April 05, 2013 6:33 PM
> To: Hsieh, Che-Min
> Cc: linux-crypto@vger.kernel.org
> Subject: Re: questions of crypto async api
>
> On Thu, 4 Apr 2013 14:38:41 + Hsieh, Che-Min chem...@qti.qualcomm.com wrote:
>
> > If a driver supports multiple instances of HW crypto engines, the order
> > of request completion from the HW can differ from the order in which
> > requests were submitted to the different engines. The 2nd request, sent
> > to the 2nd HW instance, may take less time to complete than the first
> > request on a different HW instance. Is the driver responsible for
> > re-ordering the completion callbacks, or are the agents (such as the IP
> > protocol stack) responsible for reordering? How does pcrypt do it?
> >
> > Does it make sense for a transform to have multiple requests outstanding
> > to the async crypto API?
>
> see: http://comments.gmane.org/gmane.linux.kernel.cryptoapi/5350
>
> > Is scatterwalk_sg_next() the preferred method over sg_next()? Why?
>
> scatterwalk_* is the crypto subsystem's version of the function, so yes.
>
> > sg_copy_to_buffer() and sg_copy_from_buffer() -> sg_copy_buffer() ->
> > sg_miter_next() -> sg_next()
> >
> > Sometimes sg_copy_to_buffer() and sg_copy_from_buffer() in our driver do
> > not copy the whole list. We have to rewrite those functions by using
> > scatterwalk_sg_next() to walk down the list. Is this the correct
> > behavior?
>
> sounds like you're on the right track, although buffers shouldn't be being
> copied that often, if at all.
>
> Kim
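On the pcrypt question raised in this thread: pcrypt parallelizes requests
across CPUs via padata and uses padata's serialization to deliver completions
in submission order, which is the ordering guarantee being discussed. A
minimal, illustrative sketch of wrapping an AEAD in the pcrypt template (the
inner algorithm string is just an example):

#include <crypto/aead.h>
#include <linux/err.h>

static struct crypto_aead *alloc_parallel_aead(void)
{
	/* pcrypt(...) farms requests out in parallel but completes in order */
	return crypto_alloc_aead("pcrypt(authenc(hmac(sha1),cbc(aes)))", 0, 0);
}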