Re: LXC broken with 5.10-stable?, ok with 5.9-stable (Re: Linux 5.10.3)
On 27.12.2020 21.05, Linus Torvalds wrote: On Sun, Dec 27, 2020 at 10:39 AM Jussi Kivilinna wrote: 5.10.3 with patch compiles fine, but does not solve the issue. Duh. Adding the read_iter only fixes kernel_read(). For splice, it also needs a .splice_read = generic_file_splice_read, in the file operations, something like this... Does that get things working? Yes, LXC works for me now. Thanks. -Jussi
Re: LXC broken with 5.10-stable?, ok with 5.9-stable (Re: Linux 5.10.3)
On 27.12.2020 19.20, Linus Torvalds wrote: On Sun, Dec 27, 2020 at 8:32 AM Jussi Kivilinna wrote: Has this been fixed in 5.11-rc? Is there any patch that I could backport and test with 5.10? Here's a patch to test. Entirely untested by me. I'm surprised at how people use sendfile() on random files. Oh well.. 5.10.3 with patch compiles fine, but does not solve the issue. The test case from bugzilla still fails and the LXC container won't start. -Jussi
LXC broken with 5.10-stable?, ok with 5.9-stable (Re: Linux 5.10.3)
Hello, Now that the 5.9 series is EOL, I tried to move to 5.10.3. I ran into a regression where LXC containers do not start with the newer kernel. I found that the issue had already been reported (bisected, with a reduced test case) in bugzilla at: https://bugzilla.kernel.org/show_bug.cgi?id=209971 Has this been fixed in 5.11-rc? Is there any patch that I could backport and test with 5.10? -Jussi On 26.12.2020 17.20, Greg Kroah-Hartman wrote: I'm announcing the release of the 5.10.3 kernel. All users of the 5.10 kernel series must upgrade. The updated 5.10.y git tree can be found at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.10.y and can be browsed at the normal kernel.org git web browser: https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary thanks, greg k-h
Re: [PATCH 3/5] lib/mpi: Fix for building for MIPS32 with Clang
Hello, On 12.8.2019 20.14, Nathan Chancellor wrote: > On Mon, Aug 12, 2019 at 10:35:53AM +0300, Jussi Kivilinna wrote: >> Hello, >> >> On 12.8.2019 6.31, Nathan Chancellor wrote: >>> From: Vladimir Serbinenko >>> >>> clang doesn't recognise =l / =h assembly operand specifiers but apparently >>> handles C version well. >>> >>> lib/mpi/generic_mpih-mul1.c:37:24: error: invalid use of a cast in a >>> inline asm context requiring an l-value: remove the cast or build with >>> -fheinous-gnu-extensions >>> umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb); >>> ~^ >>> lib/mpi/longlong.h:652:20: note: expanded from macro 'umul_ppmm' >>> : "=l" ((USItype)(w0)), \ >>> ~~^~~ >>> lib/mpi/generic_mpih-mul1.c:37:3: error: invalid output constraint '=h' >>> in asm >>> umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb); >>> ^ >>> lib/mpi/longlong.h:653:7: note: expanded from macro 'umul_ppmm' >>> "=h" ((USItype)(w1)) \ >>> ^ >>> 2 errors generated. >>> >>> Fixes: 5ce3e312ec5c ("crypto: GnuPG based MPI lib - header files (part 2)") >>> Link: https://github.com/ClangBuiltLinux/linux/issues/605 >>> Link: >>> https://github.com/gpg/libgcrypt/commit/1ecbd0bca31d462719a2a6590c1d03244e76ef89 >>> Signed-off-by: Vladimir Serbinenko >>> [jk: add changelog, rebase on libgcrypt repository, reformat changed >>> line so it does not go over 80 characters] >>> Signed-off-by: Jussi Kivilinna >> >> This is my signed-off-by for libgcrypt project, not kernel. I do not think >> signed-offs can be passed from other projects in this way. 
>> >> -Jussi > > Hi Jussi, > > I am no signoff expert but if I am reading the developer certificate of > origin in the libgcrypt repo correctly [1], your signoff on this commit > falls under: > > (d) I understand and agree that this project and the contribution > are public and that a record of the contribution (including all > personal information I submit with it, including my sign-off) is > maintained indefinitely and may be redistributed consistent with > this project or the open source license(s) involved. There is nothing wrong with the commit in the libgcrypt repo and/or my libgcrypt-DCO-sign-off. > > This file is maintained under the LGPL because it was taken straight > from the libgcrypt repo and per (b), I can submit this commit here > with everything intact. But you do not have my kernel-DCO-sign-off for this patch. I have not been involved with this kernel patch in any way: I have not integrated it into the kernel, nor tested it on the kernel. I do not own it. However, with this Signed-off-by line you have involved me in a kernel patch process in which, for this patch, I'm not interested. So to be clear, I retract my kernel-DCO-sign-off for this kernel patch: NOT-Signed-off-by: Jussi Kivilinna Of course you can copy the original libgcrypt commit message into this patch, but I think it needs to be clearly quoted so that my libgcrypt-DCO-sign-off line won't be mixed with kernel-DCO-sign-off lines. > > However, I don't want to upset you in any way though so if you are not > comfortable with that, I suppose I can remove it as if Vladimir > submitted this fix to me directly (as I got permission for his signoff). > I need to resubmit this fix to an appropriate maintainer so let me know > what you think. That's quite a complicated approach. A faster and easier process would be for you to just own the patch yourself. Libgcrypt (and the target file in libgcrypt) is LGPL v2.1+, so the license is compatible with the kernel, and you are good to go with just your own (kernel DCO) Signed-off-by.
-Jussi > > [1]: > https://github.com/gpg/libgcrypt/blob/3bb858551cd5d84e43b800edfa2b07d1529718a9/doc/DCO > > Cheers, > Nathan >
Re: [PATCH 3/5] lib/mpi: Fix for building for MIPS32 with Clang
Hello, On 12.8.2019 6.31, Nathan Chancellor wrote: > From: Vladimir Serbinenko > > clang doesn't recognise =l / =h assembly operand specifiers but apparently > handles C version well. > > lib/mpi/generic_mpih-mul1.c:37:24: error: invalid use of a cast in a > inline asm context requiring an l-value: remove the cast or build with > -fheinous-gnu-extensions > umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb); > ~^ > lib/mpi/longlong.h:652:20: note: expanded from macro 'umul_ppmm' > : "=l" ((USItype)(w0)), \ > ~~^~~ > lib/mpi/generic_mpih-mul1.c:37:3: error: invalid output constraint '=h' > in asm > umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb); > ^ > lib/mpi/longlong.h:653:7: note: expanded from macro 'umul_ppmm' > "=h" ((USItype)(w1)) \ > ^ > 2 errors generated. > > Fixes: 5ce3e312ec5c ("crypto: GnuPG based MPI lib - header files (part 2)") > Link: https://github.com/ClangBuiltLinux/linux/issues/605 > Link: > https://github.com/gpg/libgcrypt/commit/1ecbd0bca31d462719a2a6590c1d03244e76ef89 > Signed-off-by: Vladimir Serbinenko > [jk: add changelog, rebase on libgcrypt repository, reformat changed > line so it does not go over 80 characters] > Signed-off-by: Jussi Kivilinna This is my signed-off-by for libgcrypt project, not kernel. I do not think signed-offs can be passed from other projects in this way. 
-Jussi

> [nc: Added build error and tags to commit message
>      Added Vladimir's signoff with his permission
>      Adjusted Jussi's comment to wrap at 73 characters
>      Modified commit subject to mirror MIPS64 commit
>      Removed space between defined and (__clang__)]
> Signed-off-by: Nathan Chancellor
> ---
>  lib/mpi/longlong.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/lib/mpi/longlong.h b/lib/mpi/longlong.h
> index 3bb6260d8f42..8a1507fc94dd 100644
> --- a/lib/mpi/longlong.h
> +++ b/lib/mpi/longlong.h
> @@ -639,7 +639,8 @@ do { \
>  **************  MIPS  *****************
>  ***************************************/
>  #if defined(__mips__) && W_TYPE_SIZE == 32
> -#if (__GNUC__ >= 5) || (__GNUC__ >= 4 && __GNUC_MINOR__ >= 4)
> +#if defined(__clang__) || (__GNUC__ >= 5) || (__GNUC__ == 4 && \
> +			   __GNUC_MINOR__ >= 4)
>  #define umul_ppmm(w1, w0, u, v) \
>  do { \
>  	UDItype __ll = (UDItype)(u) * (v); \
>
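For reference, the contract that both the inline-asm and plain-C variants of umul_ppmm must satisfy is easy to model. A minimal Python sketch of that contract (the function name mirrors the macro; this is illustrative, not kernel code):

```python
def umul_ppmm(u, v):
    """Return (w1, w0): the high and low 32-bit halves of the 64-bit
    product of two 32-bit limbs -- what the umul_ppmm macro computes."""
    prod = (u & 0xFFFFFFFF) * (v & 0xFFFFFFFF)  # UDItype __ll = (UDItype)(u) * (v)
    return (prod >> 32) & 0xFFFFFFFF, prod & 0xFFFFFFFF

print(umul_ppmm(0xFFFFFFFF, 0xFFFFFFFF))  # -> (4294967294, 1)
```

The clang-incompatible MIPS asm version pins w1/w0 to the HI/LO registers via the `=h`/`=l` constraints; the C fallback simply lets the compiler split the UDItype product, which is exactly what the sketch above does.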
Re: Question about ctr mode 3des-ede IV len
Hello, On 07.12.2016, 14:43, Longpeng (Mike) wrote: > Hi Jussi and Herbert, > > I saw several des3-ede testcases (in crypto/testmgr.h) that have a 16-byte IV, and > libgcrypt/nettle/RFC1851 say the IV length is 8 bytes. > > Would you please tell me why these testcases have a 16-byte IV? Because I used the same tool to create the test vectors which I had previously used to create the AES/Camellia/Serpent/Twofish test vectors. So I must have forgotten to change the 16-byte IV generation to 8 bytes, and thus those testcases in crypto/testmgr.h have the wrong length. The extra trailing 8 bytes are not used and can be removed. -Jussi > > Thank you. :)
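The reason the trailing 8 bytes can never matter: for a 64-bit block cipher such as des3-ede, the CTR counter block is exactly one cipher block, so an 8-byte IV fully specifies it. A hedged Python sketch of generic CTR over an 8-byte block size (the keyed function here is a hash-based stand-in, NOT real 3DES; CTR only ever uses the forward direction, so any keyed PRF illustrates the mode):

```python
import hashlib

BLOCK = 8  # des3-ede block size in bytes -> the CTR IV/counter block is 8 bytes

def toy_block_encrypt(key, block):
    # Stand-in for a real 64-bit block cipher (assumption: NOT des3-ede).
    return hashlib.sha256(key + block).digest()[:BLOCK]

def ctr_xcrypt(key, iv8, data):
    # Encryption and decryption are the same operation in CTR mode.
    assert len(iv8) == BLOCK  # an 8-byte IV fully defines the counter block
    out = bytearray()
    ctr = int.from_bytes(iv8, "big")
    for i in range(0, len(data), BLOCK):
        ks = toy_block_encrypt(key, ctr.to_bytes(BLOCK, "big"))
        chunk = data[i:i + BLOCK]
        out += bytes(a ^ b for a, b in zip(chunk, ks))
        ctr = (ctr + 1) % (1 << 64)
    return bytes(out)
```

A 16-byte IV in a test vector for this mode simply leaves the second 8 bytes unread, which is why the extra bytes in crypto/testmgr.h were harmless but wrong.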
Re: Kernel crypto API: cryptoperf performance measurement
On 2014-08-20 21:14, Milan Broz wrote: > On 08/20/2014 03:25 PM, Jussi Kivilinna wrote: >>> One to four GB per second for XTS? 12 GB per second for AES CBC? Somehow >>> that >>> does not sound right. >> >> Agreed, those do not look correct... I wonder what happened there. On >> new run, I got more sane results: > > Which cryptsetup version are you using? > > There was a bug in that test on fast machines (fixed in 1.6.3, I hope :) I had version 1.6.1 at hand. > > But anyway, it is not intended as a rigorous speed test; > it was intended for comparison of cipher speeds on a particular machine. > True, but it's a nice, easy test compared to parsing results from tcrypt speed tests. -Jussi > Test basically tries to encrypt a 1MB block (or a multiple of this > if the machine is too fast). It all runs through the kernel userspace crypto API > interface. > (Real FDE is always slower because it runs over 512-byte blocks.) > > Milan > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
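The figures being argued about are plain throughput numbers, and a benchmark bug "on fast machines" is typically a timing undercount. A small sketch of the arithmetic (the exact 1.6.x bug is not reproduced here; this only shows how an undersized elapsed time inflates the reported rate):

```python
def mib_per_s(nbytes, seconds):
    # Throughput figure of the kind printed by "cryptsetup benchmark"
    return nbytes / seconds / (1024 * 1024)

one_mib = 1024 * 1024
print(mib_per_s(100 * one_mib, 0.2))  # 100 MiB encrypted in 0.2 s
# If the timer records only half the real elapsed time, the reported
# rate doubles -- implausible figures like 12 GB/s AES-CBC are this
# kind of measurement artifact, not real cipher speed.
print(mib_per_s(100 * one_mib, 0.1))
```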
Re: Kernel crypto API: cryptoperf performance measurement
Hello,

On 2014-08-19 21:23, Stephan Mueller wrote:
> On Tuesday, 19 August 2014, 10:17:36, Jussi Kivilinna wrote:
>
> Hi Jussi,
>
>> Hello,
>>
>> On 2014-08-17 18:55, Stephan Mueller wrote:
>>> Hi,
>>>
>>> during playing around with the kernel crypto API, I implemented a
>>> performance measurement tool kit for the various kernel crypto API cipher
>>> types. The cryptoperf tool kit is provided in [1].
>>>
>>> Comments are welcome.
>>
>> Your results are quite slow compared to, for example, "cryptsetup
>> benchmark", which uses kernel crypto from userspace.
>>
>> With Intel i5-2450M (turbo enabled), I get:
>>
>> # Algorithm |  Key |  Encryption |  Decryption
>> aes-cbc     128b    524,0 MiB/s  11909,1 MiB/s
>> serpent-cbc 128b     60,9 MiB/s    219,4 MiB/s
>> twofish-cbc 128b    143,4 MiB/s    240,3 MiB/s
>> aes-cbc     256b    330,4 MiB/s   1242,8 MiB/s
>> serpent-cbc 256b     66,1 MiB/s    220,3 MiB/s
>> twofish-cbc 256b    143,5 MiB/s    221,8 MiB/s
>> aes-xts     256b   1268,7 MiB/s   4193,0 MiB/s
>> serpent-xts 256b    234,8 MiB/s    224,6 MiB/s
>> twofish-xts 256b    253,5 MiB/s    254,7 MiB/s
>> aes-xts     512b   2535,0 MiB/s   2945,0 MiB/s
>> serpent-xts 512b    274,2 MiB/s    242,3 MiB/s
>> twofish-xts 512b    250,0 MiB/s    245,8 MiB/s
>
> One to four GB per second for XTS? 12 GB per second for AES CBC? Somehow
> that does not sound right.

Agreed, those do not look correct... I wonder what happened there. On a new run, I got more sane results:

# Algorithm |  Key |  Encryption |  Decryption
aes-cbc     128b    139,1 MiB/s   1713,6 MiB/s
serpent-cbc 128b     62,2 MiB/s    232,9 MiB/s
twofish-cbc 128b    116,3 MiB/s    243,7 MiB/s
aes-cbc     256b    375,1 MiB/s   1159,4 MiB/s
serpent-cbc 256b     62,1 MiB/s    214,9 MiB/s
twofish-cbc 256b    139,3 MiB/s    217,5 MiB/s
aes-xts     256b   1296,4 MiB/s   1272,5 MiB/s
serpent-xts 256b    283,3 MiB/s    275,6 MiB/s
twofish-xts 256b    294,8 MiB/s    299,3 MiB/s
aes-xts     512b    984,3 MiB/s    991,1 MiB/s
serpent-xts 512b    227,7 MiB/s    220,6 MiB/s
twofish-xts 512b    220,6 MiB/s    220,2 MiB/s

-Jussi
Re: Kernel crypto API: cryptoperf performance measurement
Hello,

On 2014-08-17 18:55, Stephan Mueller wrote:
> Hi,
>
> during playing around with the kernel crypto API, I implemented a performance
> measurement tool kit for the various kernel crypto API cipher types. The
> cryptoperf tool kit is provided in [1].
>
> Comments are welcome.

Your results are quite slow compared to, for example, "cryptsetup benchmark", which uses kernel crypto from userspace. With Intel i5-2450M (turbo enabled), I get:

# Algorithm |  Key |  Encryption |  Decryption
aes-cbc     128b    524,0 MiB/s  11909,1 MiB/s
serpent-cbc 128b     60,9 MiB/s    219,4 MiB/s
twofish-cbc 128b    143,4 MiB/s    240,3 MiB/s
aes-cbc     256b    330,4 MiB/s   1242,8 MiB/s
serpent-cbc 256b     66,1 MiB/s    220,3 MiB/s
twofish-cbc 256b    143,5 MiB/s    221,8 MiB/s
aes-xts     256b   1268,7 MiB/s   4193,0 MiB/s
serpent-xts 256b    234,8 MiB/s    224,6 MiB/s
twofish-xts 256b    253,5 MiB/s    254,7 MiB/s
aes-xts     512b   2535,0 MiB/s   2945,0 MiB/s
serpent-xts 512b    274,2 MiB/s    242,3 MiB/s
twofish-xts 512b    250,0 MiB/s    245,8 MiB/s

> In general, the results are as expected, i.e. the assembler implementations
> are faster than the pure C implementations. However, there are curious results
> which probably should be checked by the maintainers of the respective ciphers
> (hoping that my tool works correctly ;-) ):
>
> ablkcipher
> ----------
>
> - cryptd is slower by factor 10 across the board
>
> blkcipher
> ---------
>
> - Blowfish x86_64 assembler together with the generic C block chaining modes
>   is significantly slower than Blowfish implemented in generic C
>
> - Blowfish x86_64 assembler in ECB is significantly slower than generic C
>   Blowfish ECB
>
> - Serpent assembler implementations are not significantly faster than generic
>   C implementations
>
> - AES-NI ECB, LRW, CTR is significantly slower than AES i586 assembler.
>
> - AES-NI ECB, LRW, CTR is not significantly faster than AES generic C

Quite many assembly implementations get their speed-up from processing multiple block cipher blocks in parallel, in the modes of operation that allow it (CTR, XTS, LRW, CBC(dec)). For small buffer sizes, these implementations fall back to the non-parallel implementation of the cipher.

-Jussi

> rng
> ---
>
> - The ANSI X9.31 RNG seems to work massively faster than the underlying AES
>   cipher (by about a factor of 5). I am unsure about the cause of this.
>
> Caveat
> ------
>
> Please note that there is one small error which I am unsure how to fix, as
> documented in the TODO file.
>
> [1] http://www.chronox.de/cryptoperf.html
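The parallel speed-up exists because in CBC decryption plaintext block i depends only on ciphertext blocks i and i-1, while CBC encryption is an inherently serial chain. A Python sketch with a toy 8-byte "cipher" (purely illustrative, not a real block cipher):

```python
def toy_encrypt(key, b):
    # Toy invertible byte map (NOT a real cipher); 17 is odd, hence invertible mod 256
    return bytes(((x + key) * 17) % 256 for x in b)

def toy_decrypt(key, b):
    # 241 is the multiplicative inverse of 17 modulo 256
    return bytes((x * 241 % 256 - key) % 256 for x in b)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(key, iv, blocks):
    out, prev = [], iv
    for p in blocks:  # inherently serial: each block needs the previous ciphertext
        c = toy_encrypt(key, xor(p, prev))
        out.append(c)
        prev = c
    return out

def cbc_decrypt(key, iv, blocks):
    # Parallel-friendly: plaintext i depends only on C[i] and C[i-1], which is
    # why assembly implementations can run many block decryptions at once.
    prevs = [iv] + blocks[:-1]
    return [xor(toy_decrypt(key, c), prev) for c, prev in zip(blocks, prevs)]
```

CTR, XTS, and LRW parallelize in both directions for the same reason: each block's cipher invocation is independent of the others' outputs.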
Re: [PATCH] Documentation: "kerneli" typo in description for "Serpent cipher algorithm" Bug #60848
On 02.10.2013 21:12, Rob Landley wrote: > On 10/02/2013 11:10:37 AM, Kevin Mulvey wrote: >> change kerneli to kernel as well as kerneli.org to kernel.org >> >> Signed-off-by: Kevin Mulvey > > There's a bug number for this? > > Acked, queued. (Although I'm not sure the value of pointing to www.kernel.org for this.) I think kerneli.org is correct; see the old website at http://web.archive.org/web/20010201085500/http://www.kerneli.org/ -Jussi > > Thanks, > > Rob
Re: [cryptomgr_test] BUG: unable to handle kernel NULL pointer dereference at (null)
Hello, Appears to be caused by some memory corruption. Changing from SLOB allocator to SLUB made this crash disappear. Some different crashes with same config: [0.246152] cryptomgr_test (23) used greatest stack depth: 6400 bytes left [0.246929] cryptomgr_test (24) used greatest stack depth: 5384 bytes left [0.248851] modprobe (33) used greatest stack depth: 5376 bytes left [0.250351] alg: No test for crc32 (crc32-pclmul) [0.251669] BUG: unable to handle kernel paging request at 882006646e18 [0.252007] IP: [81085f07] task_active_pid_ns+0x17/0x30 [0.252007] PGD 2af8067 PUD 0 [0.252007] Oops: [#1] SMP [0.252007] Modules linked in: [0.252007] CPU: 0 PID: 43 Comm: kworker/u2:1 Not tainted 3.10.0-rc1-crash1-00048-gf9a31a2 #24 [0.252007] task: 880006694000 ti: 880006698000 task.ti: 880006698000 [0.252007] RIP: 0010:[81085f07] [81085f07] task_active_pid_ns+0x17/0x30 [0.252007] RSP: 0018:880006699dd8 EFLAGS: 00010002 [0.252007] RAX: 880006646e00 RBX: 880006694000 RCX: 0001 [0.252007] RDX: 001fffe0 RSI: 00098000 RDI: 880006655000 [0.252007] RBP: 880006699dd8 R08: 000d R09: 8800066945d0 [0.252007] R10: R11: R12: 0011 [0.252007] R13: R14: 88000704c000 R15: 880006693ff0 [0.252007] FS: () GS:880007c0() knlGS: [0.252007] CS: 0010 DS: ES: CR0: 80050033 [0.252007] CR2: 882006646e18 CR3: 02015000 CR4: 001407f0 [0.252007] DR0: DR1: DR2: [0.252007] DR3: DR6: 0ff0 DR7: 0400 [0.252007] Stack: [0.252007] 880006699e98 81078054 81077fc2 0011 [0.252007] 880006699e78 [0.252007] 0046 8106c5ef [0.252007] Call Trace: [0.252007] [81078054] do_notify_parent+0x114/0x580 [0.252007] [81077fc2] ? do_notify_parent+0x82/0x580 [0.252007] [8106c5ef] ? do_exit+0x80f/0xa20 [0.252007] [8106c6be] do_exit+0x8de/0xa20 [0.252007] [8107f628] wait_for_helper+0x98/0xa0 [0.252007] [8107f590] ? call_helper+0x20/0x20 [0.252007] [8196207c] ret_from_fork+0x7c/0xb0 [0.252007] [8107f590] ? 
call_helper+0x20/0x20 [0.252007] Code: 1f 44 00 00 5d c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 87 48 02 00 00 55 48 89 e5 48 85 c0 74 10 8b 50 04 48 c1 e2 05 48 8b 44 10 38 eb 0a 66 90 31 c0 66 0f 1f 44 00 00 5d c3 66 0f [0.252007] RIP [81085f07] task_active_pid_ns+0x17/0x30 [0.252007] RSP 880006699dd8 [0.252007] CR2: 882006646e18 [0.252007] ---[ end trace 7caca246688ed8b9 ]--- [0.252007] Kernel panic - not syncing: Fatal exception ... [0.328072] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [0.328683] BUG: unable to handle kernel paging request at 88000644cd98 [0.329227] IP: [88000644cd98] 0x88000644cd97 [0.329690] PGD 2af8067 PUD 2af9067 PMD 864001e3 [0.330182] Oops: 0011 [#1] SMP [0.330449] Modules linked in: [0.330694] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.10.0-rc1-crash1-00048-gf9a31a2 #24 [0.331314] task: 88000644d000 ti: 88000644e000 task.ti: 88000644e000 [0.331899] RIP: 0010:[88000644cd98] [88000644cd98] 0x88000644cd97 [0.332004] RSP: 0018:880007d03eb8 EFLAGS: 00010296 [0.332004] RAX: 88000644cd98 RBX: 880007d0e880 RCX: 0002 [0.332004] RDX: 880006a82560 RSI: 88000644d5d0 RDI: 880006a82560 [0.332004] RBP: 880007d03f20 R08: 0002 R09: [0.332004] R10: R11: R12: 8203d000 [0.332004] R13: 880006a6dd40 R14: 000a R15: 0008 [0.332004] FS: () GS:880007d0() knlGS: [0.332004] CS: 0010 DS: ES: CR0: 80050033 [0.332004] CR2: 88000644cd98 CR3: 02015000 CR4: 001407e0 [0.332004] DR0: DR1: DR2: [0.332004] DR3: DR6: 0ff0 DR7: 0400 [0.332004] Stack: [0.332004] 810db172 880005252e80 88000644d000 8800070bfe80 [0.332004] 88000644d000 88000644ffd8 880007d0e8a8 [0.332004] 88000644ffd8 0009 0101 82007048 [0.332004] Call Trace: [0.332004] IRQ [0.332004] [810db172] ? rcu_process_callbacks+0x322/0x5a0 [0.332004]
[PATCH] crypto: aesni_intel - fix accessing of unaligned memory
The new XTS code for aesni_intel uses input buffers directly as memory operands for pxor instructions, which causes a crash if those buffers are not aligned to 16 bytes. This patch changes the XTS code to handle unaligned memory correctly, by loading memory with movdqu instead.

Reported-by: Dave Jones
Tested-by: Dave Jones
Signed-off-by: Jussi Kivilinna
---
 arch/x86/crypto/aesni-intel_asm.S | 48 ++++++++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 62fe22c..477e9d7 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -2681,56 +2681,68 @@ ENTRY(aesni_xts_crypt8)
 	addq %rcx, KEYP

 	movdqa IV, STATE1
-	pxor 0x00(INP), STATE1
+	movdqu 0x00(INP), INC
+	pxor INC, STATE1
 	movdqu IV, 0x00(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE2
-	pxor 0x10(INP), STATE2
+	movdqu 0x10(INP), INC
+	pxor INC, STATE2
 	movdqu IV, 0x10(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE3
-	pxor 0x20(INP), STATE3
+	movdqu 0x20(INP), INC
+	pxor INC, STATE3
 	movdqu IV, 0x20(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE4
-	pxor 0x30(INP), STATE4
+	movdqu 0x30(INP), INC
+	pxor INC, STATE4
 	movdqu IV, 0x30(OUTP)

 	call *%r11

-	pxor 0x00(OUTP), STATE1
+	movdqu 0x00(OUTP), INC
+	pxor INC, STATE1
 	movdqu STATE1, 0x00(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE1
-	pxor 0x40(INP), STATE1
+	movdqu 0x40(INP), INC
+	pxor INC, STATE1
 	movdqu IV, 0x40(OUTP)

-	pxor 0x10(OUTP), STATE2
+	movdqu 0x10(OUTP), INC
+	pxor INC, STATE2
 	movdqu STATE2, 0x10(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE2
-	pxor 0x50(INP), STATE2
+	movdqu 0x50(INP), INC
+	pxor INC, STATE2
 	movdqu IV, 0x50(OUTP)

-	pxor 0x20(OUTP), STATE3
+	movdqu 0x20(OUTP), INC
+	pxor INC, STATE3
 	movdqu STATE3, 0x20(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE3
-	pxor 0x60(INP), STATE3
+	movdqu 0x60(INP), INC
+	pxor INC, STATE3
 	movdqu IV, 0x60(OUTP)

-	pxor 0x30(OUTP), STATE4
+	movdqu 0x30(OUTP), INC
+	pxor INC, STATE4
 	movdqu STATE4, 0x30(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE4
-	pxor 0x70(INP), STATE4
+	movdqu 0x70(INP), INC
+	pxor INC, STATE4
 	movdqu IV, 0x70(OUTP)

 	_aesni_gf128mul_x_ble()
@@ -2738,16 +2750,20 @@ ENTRY(aesni_xts_crypt8)

 	call *%r11

-	pxor 0x40(OUTP), STATE1
+	movdqu 0x40(OUTP), INC
+	pxor INC, STATE1
 	movdqu STATE1, 0x40(OUTP)

-	pxor 0x50(OUTP), STATE2
+	movdqu 0x50(OUTP), INC
+	pxor INC, STATE2
 	movdqu STATE2, 0x50(OUTP)

-	pxor 0x60(OUTP), STATE3
+	movdqu 0x60(OUTP), INC
+	pxor INC, STATE3
 	movdqu STATE3, 0x60(OUTP)

-	pxor 0x70(OUTP), STATE4
+	movdqu 0x70(OUTP), INC
+	pxor INC, STATE4
 	movdqu STATE4, 0x70(OUTP)

 	ret
Re: GPF in aesni_xts_crypt8 (3.10-rc5)
Hello,

Does attached patch help?

-Jussi

On 11.06.2013 20:26, Dave Jones wrote:
> Just found that 3.10-rc doesn't boot on my laptop with encrypted disk.
>
> general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC
> Modules linked in: xfs libcrc32c dm_crypt crc32c_intel ghash_clmulni_intel
> aesni_intel glue_helper ablk_helper i915 i2c_algo_bit drm_kms_helper drm
> i2c_core video
> CPU: 1 PID: 53 Comm: kworker/1:1 Not tainted 3.10.0-rc5+ #5
> Hardware name: LENOVO 2356JK8/2356JK8, BIOS G7ET94WW (2.54 ) 04/30/2013
> Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> task: 880135c58000 ti: 880135c54000 task.ti: 880135c54000
> RIP: 0010:[] [] aesni_xts_crypt8+0x42/0x1e0 [aesni_intel]
> RSP: 0018:880135c55b68 EFLAGS: 00010282
> RAX: a0142eb8 RBX: 0080 RCX: 00f0
> RDX: 8801316eeaa8 RSI: 8801316eeaa8 RDI: 88012fd84440
> RBP: 880135c55b70 R08: 8801304fe118 R09: 0020
> R10: 00f0 R11: a0142eb8 R12: 8801316eeb28
> R13: 0080 R14: 8801316eeb28 R15: 0180
> FS: () GS:88013940() knlGS:
> CS: 0010 DS: ES: CR0: 80050033
> CR2: 0039e88bc720 CR3: 01c0b000 CR4: 001407e0
> Stack:
>  a0143683 880135c55c40 a00602fb 880135c55c70
>  a0146060 01ad0190 a0146060 ea0004c5bb80
>  8801316eeaa8 ea0004c5bb80 8801316eeaa8 8801304fe0c0
> Call Trace:
>  [] ? aesni_xts_dec8+0x13/0x20 [aesni_intel]
>  [] glue_xts_crypt_128bit+0x10b/0x1c0 [glue_helper]
>  [] xts_decrypt+0x4b/0x50 [aesni_intel]
>  [] ablk_decrypt+0x4f/0xd0 [ablk_helper]
>  [] crypt_convert+0x352/0x3b0 [dm_crypt]
>  [] kcryptd_crypt+0x355/0x4e0 [dm_crypt]
>  [] ? process_one_work+0x1a5/0x700
>  [] process_one_work+0x211/0x700
>  [] ? process_one_work+0x1a5/0x700
>  [] worker_thread+0x11b/0x3a0
>  [] ? process_one_work+0x700/0x700
>  [] kthread+0xed/0x100
>  [] ? insert_kthread_work+0x80/0x80
>  [] ret_from_fork+0x7c/0xb0
>  [] ? insert_kthread_work+0x80/0x80
> Code: 8d 04 25 b8 2e 14 a0 41 0f 44 ca 4c 0f 44 d8 66 44 0f 6f 14 25 00 70 14 a0 41 0f 10 18 44 8b 8f e0 01 00 00 48 01 cf 66 0f 6f c3 <66> 0f ef 02 f3 0f 7f 1e 66 44 0f 70 db 13 66 0f d4 db 66 41 0f
> RIP [] aesni_xts_crypt8+0x42/0x1e0 [aesni_intel]
>  RSP
>
>    0:	8d 04 25 b8 2e 14 a0	lea    0xa0142eb8,%eax
>    7:	41 0f 44 ca        	cmove  %r10d,%ecx
>    b:	4c 0f 44 d8        	cmove  %rax,%r11
>    f:	66 44 0f 6f 14 25 00	movdqa 0xa0147000,%xmm10
>   16:	70 14 a0
>   19:	41 0f 10 18        	movups (%r8),%xmm3
>   1d:	44 8b 8f e0 01 00 00	mov    0x1e0(%rdi),%r9d
>   24:	48 01 cf           	add    %rcx,%rdi
>   27:	66 0f 6f c3        	movdqa %xmm3,%xmm0
>   2b:*	66 0f ef 02        	pxor   (%rdx),%xmm0	<-- trapping instruction
>   2f:	f3 0f 7f 1e        	movdqu %xmm3,(%rsi)
>   33:	66 44 0f 70 db 13  	pshufd $0x13,%xmm3,%xmm11
>   39:	66 0f d4 db        	paddq  %xmm3,%xmm3
>   3d:	66                 	data16
>   3e:	41                 	rex.B
>   3f:

crypto: aesni_intel - fix accessing of unaligned memory

From: Jussi Kivilinna

The new XTS code for aesni_intel uses input buffers directly as memory operands for pxor instructions, which causes a crash if those buffers are not aligned to 16 bytes. This patch changes the XTS code to handle unaligned memory correctly, by loading memory with movdqu instead.

Reported-by: Dave Jones
Signed-off-by: Jussi Kivilinna
---
 arch/x86/crypto/aesni-intel_asm.S | 48 +
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 62fe22c..477e9d7 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -2681,56 +2681,68 @@ ENTRY(aesni_xts_crypt8)
 	addq %rcx, KEYP

 	movdqa IV, STATE1
-	pxor 0x00(INP), STATE1
+	movdqu 0x00(INP), INC
+	pxor INC, STATE1
 	movdqu IV, 0x00(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE2
-	pxor 0x10(INP), STATE2
+	movdqu 0x10(INP), INC
+	pxor INC, STATE2
 	movdqu IV, 0x10(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE3
-	pxor 0x20(INP), STATE3
+	movdqu 0x20(INP), INC
+	pxor INC, STATE3
 	movdqu IV, 0x20(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE4
-	pxor 0x30(INP), STATE4
+	movdqu 0x30(INP), INC
+	pxor INC, STATE4
 	movdqu IV, 0x30(OUTP)

 	call *%r11

-	pxor 0x00(OUTP), STATE1
+	movdqu 0x00(OUTP), INC
+	pxor INC, STATE1
 	movdqu STATE1, 0x00(OUTP)

 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE1
-	pxor 0x40(INP), STATE1
+	movdqu 0x40(INP), INC
+	pxor INC, STATE1
 	movdqu IV, 0x40(OUTP)

-	p
Re: [PATCH 4/4] Simple correctness and speed test for CRCT10DIF hash
On 16.04.2013 19:20, Tim Chen wrote:
> These are simple tests to do sanity check of CRC T10 DIF hash. The
> correctness of the transform can be checked with the command
> 	modprobe tcrypt mode=47
> The speed of the transform can be evaluated with the command
> 	modprobe tcrypt mode=320
>
> Set the cpu frequency to constant and turn turbo off when running the
> speed test so the frequency governor will not tweak the frequency and
> affects the measurements.
>
> Signed-off-by: Tim Chen
> Tested-by: Keith Busch
>
> +#define CRCT10DIF_TEST_VECTORS	2
> +static struct hash_testvec crct10dif_tv_template[] = {
> +	{
> +		.plaintext = "abc",
> +		.psize  = 3,
> +#ifdef __LITTLE_ENDIAN
> +		.digest = "\x3b\x44",
> +#else
> +		.digest = "\x44\x3b",
> +#endif
> +	}, {
> +		.plaintext = "abcd",
> +		.psize  = 56,
> +#ifdef __LITTLE_ENDIAN
> +		.digest = "\xe3\x9c",
> +#else
> +		.digest = "\x9c\xe3",
> +#endif
> +		.np = 2,
> +		.tap = { 28, 28 }
> +	}
> +};
> +

Are these large enough to test all code paths in the PCLMULQDQ implementation?

-Jussi
Re: [PATCH 2/4] Accelerated CRC T10 DIF computation with PCLMULQDQ instruction
On 16.04.2013 19:20, Tim Chen wrote:
> This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ
> instructions. Details discussing the implementation can be found in the
> paper:
>
> "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
> URL: http://download.intel.com/design/intarch/papers/323102.pdf

URL does not work.

> Signed-off-by: Tim Chen
> Tested-by: Keith Busch
> ---
>  arch/x86/crypto/crct10dif-pcl-asm_64.S | 659 +
>  1 file changed, 659 insertions(+)
>  create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S

> +	# Allocate Stack Space
> +	mov	%rsp, %rcx
> +	sub	$16*10, %rsp
> +	and	$~(0x20 - 1), %rsp
> +
> +	# push the xmm registers into the stack to maintain
> +	movdqa %xmm10, 16*2(%rsp)
> +	movdqa %xmm11, 16*3(%rsp)
> +	movdqa %xmm8 , 16*4(%rsp)
> +	movdqa %xmm12, 16*5(%rsp)
> +	movdqa %xmm13, 16*6(%rsp)
> +	movdqa %xmm6, 16*7(%rsp)
> +	movdqa %xmm7, 16*8(%rsp)
> +	movdqa %xmm9, 16*9(%rsp)

You don't need to store (and restore) these, as 'crc_t10dif_pcl' is called between kernel_fpu_begin/_end.

> +
> +	# check if smaller than 256
> +	cmp	$256, arg3
> +
> +_cleanup:
> +	# scale the result back to 16 bits
> +	shr	$16, %eax
> +	movdqa 16*2(%rsp), %xmm10
> +	movdqa 16*3(%rsp), %xmm11
> +	movdqa 16*4(%rsp), %xmm8
> +	movdqa 16*5(%rsp), %xmm12
> +	movdqa 16*6(%rsp), %xmm13
> +	movdqa 16*7(%rsp), %xmm6
> +	movdqa 16*8(%rsp), %xmm7
> +	movdqa 16*9(%rsp), %xmm9

Registers are overwritten by kernel_fpu_end.

> +	mov	%rcx, %rsp
> +	ret
> +ENDPROC(crc_t10dif_pcl)
> +

You should move ENDPROC at end of the full function.

> +
> +
> +.align 16
> +_less_than_128:
> +
> +	# check if there is enough buffer to be able to fold 16B at a time
> +	cmp	$32, arg3
> +	movdqa (%rsp), %xmm7
> +	pshufb %xmm11, %xmm7
> +	pxor	%xmm0 , %xmm7	# xor the initial crc value
> +
> +	psrldq $7, %xmm7
> +
> +	jmp	_barrett

Move ENDPROC here.

-Jussi
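The reasoning behind the first two review comments: in-kernel SIMD code may only run between kernel_fpu_begin() and kernel_fpu_end(), which save and restore the entire FPU/vector state around the call. A sketch of the assumed calling pattern (illustrative pseudocode, not the actual glue code of this patch set):

```
/* Illustrative only: the caller brackets the PCLMULQDQ routine with
 * kernel_fpu_begin()/kernel_fpu_end(), so crc_t10dif_pcl itself need
 * not save or restore xmm6-xmm13. */
kernel_fpu_begin();                      /* saves FPU/SIMD state, disables preemption */
crc = crc_t10dif_pcl(crc, buffer, len);  /* free to clobber xmm registers */
kernel_fpu_end();                        /* reloads the saved state */
```

This is also why restoring the registers before `ret` is pointless: whatever values the function writes back are discarded when kernel_fpu_end() reloads the state saved at kernel_fpu_begin().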
[RFC PATCH 2/6] crypto: tcrypt - add async cipher speed tests for blowfish
Signed-off-by: Jussi Kivilinna
---
 crypto/tcrypt.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 24ea7df..66d254c 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1768,6 +1768,21 @@ static int do_test(int m)
 			   speed_template_32_64);
 		break;

+	case 509:
+		test_acipher_speed("ecb(blowfish)", ENCRYPT, sec, NULL, 0,
+				   speed_template_8_32);
+		test_acipher_speed("ecb(blowfish)", DECRYPT, sec, NULL, 0,
+				   speed_template_8_32);
+		test_acipher_speed("cbc(blowfish)", ENCRYPT, sec, NULL, 0,
+				   speed_template_8_32);
+		test_acipher_speed("cbc(blowfish)", DECRYPT, sec, NULL, 0,
+				   speed_template_8_32);
+		test_acipher_speed("ctr(blowfish)", ENCRYPT, sec, NULL, 0,
+				   speed_template_8_32);
+		test_acipher_speed("ctr(blowfish)", DECRYPT, sec, NULL, 0,
+				   speed_template_8_32);
+		break;
+
 	case 1000:
 		test_available();
 		break;
[RFC PATCH 5/6] crypto: serpent - add AVX2/x86_64 assembler implementation of serpent cipher
Patch adds AVX2/x86-64 implementation of Serpent cipher, requiring 16 parallel blocks for input (256 bytes). Implementation is based on the AVX implementation and extends to use the 256-bit wide YMM registers. Since serpent does not use table look-ups, this implementation should be close to two times faster than the AVX implementation.

Signed-off-by: Jussi Kivilinna
---
 arch/x86/crypto/Makefile                  |    2 
 arch/x86/crypto/serpent-avx2-asm_64.S     |  800 +
 arch/x86/crypto/serpent_avx2_glue.c       |  562 
 arch/x86/crypto/serpent_avx_glue.c        |   62 ++
 arch/x86/include/asm/crypto/serpent-avx.h |   24 +
 crypto/Kconfig                            |   23 +
 crypto/testmgr.c                          |   15 +
 7 files changed, 1468 insertions(+), 20 deletions(-)
 create mode 100644 arch/x86/crypto/serpent-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/serpent_avx2_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 1f6e0c2..a21af59 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -43,6 +43,7 @@ endif

 # These modules require assembler to support AVX2.
ifeq ($(avx2_supported),yes) obj-$(CONFIG_CRYPTO_BLOWFISH_AVX2_X86_64) += blowfish-avx2.o + obj-$(CONFIG_CRYPTO_SERPENT_AVX2_X86_64) += serpent-avx2.o obj-$(CONFIG_CRYPTO_TWOFISH_AVX2_X86_64) += twofish-avx2.o endif @@ -72,6 +73,7 @@ endif ifeq ($(avx2_supported),yes) blowfish-avx2-y := blowfish-avx2-asm_64.o blowfish_avx2_glue.o + serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o twofish-avx2-y := twofish-avx2-asm_64.o twofish_avx2_glue.o endif diff --git a/arch/x86/crypto/serpent-avx2-asm_64.S b/arch/x86/crypto/serpent-avx2-asm_64.S new file mode 100644 index 000..b222085 --- /dev/null +++ b/arch/x86/crypto/serpent-avx2-asm_64.S @@ -0,0 +1,800 @@ +/* + * x86_64/AVX2 assembler optimized version of Serpent + * + * Copyright © 2012-2013 Jussi Kivilinna + * + * Based on AVX assembler implementation of Serpent by: + * Copyright © 2012 Johannes Goetzfried + * + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + */ + +#include +#include "glue_helper-asm-avx2.S" + +.file "serpent-avx2-asm_64.S" + +.data +.align 16 + +.Lbswap128_mask: + .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 +.Lxts_gf128mul_and_shl1_mask_0: + .byte 0x87, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 +.Lxts_gf128mul_and_shl1_mask_1: + .byte 0x0e, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0 + +.text + +#define CTX %rdi + +#define RNOT %ymm0 +#define tp %ymm1 + +#define RA1 %ymm2 +#define RA2 %ymm3 +#define RB1 %ymm4 +#define RB2 %ymm5 +#define RC1 %ymm6 +#define RC2 %ymm7 +#define RD1 %ymm8 +#define RD2 %ymm9 +#define RE1 %ymm10 +#define RE2 %ymm11 + +#define RK0 %ymm12 +#define RK1 %ymm13 +#define RK2 %ymm14 +#define RK3 %ymm15 + +#define RK0x %xmm12 +#define RK1x %xmm13 +#define RK2x %xmm14 +#define RK3x %xmm15 + +#define S0_1(x0, x1, x2, x3, x4) \ + vporx0, x3, tp; \ + vpxor x3, x0, x0; \ + vpxor x2, x3, x4; \ + vpxor RNOT, x4, x4; \ + vpxor x1, tp, x3; \ + vpand x0, x1, x1; \ + vpxor x4, x1, x1; \ + vpxor x0, x2, x2; +#define S0_2(x0, x1, x2, x3, x4) \ + vpxor x3, x0, x0; \ + vporx0, x4, x4; \ + vpxor x2, x0, x0; \ + vpand x1, x2, x2; \ + vpxor x2, x3, x3; \ + vpxor RNOT, x1, x1; \ + vpxor x4, x2, x2; \ + vpxor x2, x1, x1; + +#define S1_1(x0, x1, x2, x3, x4) \ + vpxor x0, x1, tp; \ + vpxor x3, x0, x0; \ + vpxor RNOT, x3, x3; \ + vpand tp, x1, x4; \ + vportp, x0, x0; \ + vpxor x2, x3, x3; \ + vpxor x3, x0, x0; \ + vpxor x3, tp, x1; +#define S1_2(x0, x1, x2, x3, x4) \ + vpxor x4, x3, x3; \ + vporx4, x1, x1; \ + vpxor x2, x4, x4; \ + vpand x0, x2, x2; \ + vpxor x1, x2, x2; \ + vporx0, x1, x1; \ + vpxor RNOT, x0, x0; \ + vpxor x2, x0, x0; \ + vpxor x1, x4, x4; + +#define S2_1(x0, x1, x2, x3, x4) \ + vpxor RNOT, x3, x3; \ + vpxor x0, x1, x1; \ + vpand x2, x0, tp; \ + vpxor x3, tp, tp; \ + vporx0, x3, x3; \ + vpxor x1, x2,
[RFC PATCH 4/6] crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher
Patch adds AVX2/x86-64 implementation of Twofish cipher, requiring 16 parallel blocks for input (256 bytes). Table look-ups are performed using vpgatherdd instruction directly from vector registers and thus should be faster than earlier implementations. Implementation also uses 256-bit wide YMM registers, which should give additional speed up compared to the AVX implementation.

Signed-off-by: Jussi Kivilinna
---
 arch/x86/crypto/Makefile               |    2 
 arch/x86/crypto/glue_helper-asm-avx2.S |  180 ++
 arch/x86/crypto/twofish-avx2-asm_64.S  |  600 
 arch/x86/crypto/twofish_avx2_glue.c    |  584 +++
 arch/x86/crypto/twofish_avx_glue.c     |   14 +
 arch/x86/include/asm/crypto/twofish.h  |   18 +
 crypto/Kconfig                         |   24 +
 crypto/testmgr.c                       |   12 +
 8 files changed, 1432 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/crypto/glue_helper-asm-avx2.S
 create mode 100644 arch/x86/crypto/twofish-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/twofish_avx2_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 28464ef..1f6e0c2 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -43,6 +43,7 @@ endif

 # These modules require assembler to support AVX2.
ifeq ($(avx2_supported),yes) obj-$(CONFIG_CRYPTO_BLOWFISH_AVX2_X86_64) += blowfish-avx2.o + obj-$(CONFIG_CRYPTO_TWOFISH_AVX2_X86_64) += twofish-avx2.o endif aes-i586-y := aes-i586-asm_32.o aes_glue.o @@ -71,6 +72,7 @@ endif ifeq ($(avx2_supported),yes) blowfish-avx2-y := blowfish-avx2-asm_64.o blowfish_avx2_glue.o + twofish-avx2-y := twofish-avx2-asm_64.o twofish_avx2_glue.o endif aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o diff --git a/arch/x86/crypto/glue_helper-asm-avx2.S b/arch/x86/crypto/glue_helper-asm-avx2.S new file mode 100644 index 000..a53ac11 --- /dev/null +++ b/arch/x86/crypto/glue_helper-asm-avx2.S @@ -0,0 +1,180 @@ +/* + * Shared glue code for 128bit block ciphers, AVX2 assembler macros + * + * Copyright © 2012-2013 Jussi Kivilinna + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + */ + +#define load_16way(src, x0, x1, x2, x3, x4, x5, x6, x7) \ + vmovdqu (0*32)(src), x0; \ + vmovdqu (1*32)(src), x1; \ + vmovdqu (2*32)(src), x2; \ + vmovdqu (3*32)(src), x3; \ + vmovdqu (4*32)(src), x4; \ + vmovdqu (5*32)(src), x5; \ + vmovdqu (6*32)(src), x6; \ + vmovdqu (7*32)(src), x7; + +#define store_16way(dst, x0, x1, x2, x3, x4, x5, x6, x7) \ + vmovdqu x0, (0*32)(dst); \ + vmovdqu x1, (1*32)(dst); \ + vmovdqu x2, (2*32)(dst); \ + vmovdqu x3, (3*32)(dst); \ + vmovdqu x4, (4*32)(dst); \ + vmovdqu x5, (5*32)(dst); \ + vmovdqu x6, (6*32)(dst); \ + vmovdqu x7, (7*32)(dst); + +#define store_cbc_16way(src, dst, x0, x1, x2, x3, x4, x5, x6, x7, t0) \ + vpxor t0, t0, t0; \ + vinserti128 $1, (src), t0, t0; \ + vpxor t0, x0, x0; \ + vpxor (0*32+16)(src), x1, x1; \ + vpxor (1*32+16)(src), x2, x2; \ + vpxor (2*32+16)(src), x3, x3; \ + vpxor (3*32+16)(src), x4, x4; \ + vpxor (4*32+16)(src), x5, x5; \ + vpxor (5*32+16)(src), x6, x6; \ + vpxor (6*32+16)(src), x7, x7; \ + store_16way(dst, x0, x1, x2, x3, x4, x5, x6, x7); + +#define inc_le128(x, minus_one, tmp) \ + vpcmpeqq minus_one, x, tmp; \ + vpsubq minus_one, x, x; \ + vpslldq $8, tmp, tmp; \ + vpsubq tmp, x, x; + +#define add2_le128(x, minus_one, minus_two, tmp1, tmp2) \ + vpcmpeqq minus_one, x, tmp1; \ + vpcmpeqq minus_two, x, tmp2; \ + vpsubq minus_two, x, x; \ + vpor tmp2, tmp1, tmp1; \ + vpslldq $8, tmp1, tmp1; \ + vpsubq tmp1, x, x; + +#define load_ctr_16way(iv, bswap, x0, x1, x2, x3, x4, x5, x6, x7, t0, t0x, t1, \ + t1x, t2, t2x, t3, t3x, t4, t5) \ + vpcmpeqd t0, t0, t0; \ + vpsrldq $8, t0, t0; /* ab: -1:0 ; cd: -1:0 */ \ + vpaddq t0, t0, t4; /* ab: -2:0 ; cd: -2:0 */\ + \ + /* load IV and byteswap */ \ + vmovdqu (iv), t2x; \ + vmovdqa t2x, t3x; \ + inc_le128(t2x, t0x, t1x); \ + vbroadcasti128 bswap, t1; \ + vinserti128 $1, t2x, t3, t2; /* ab: le0 ; cd: le1 */ \ + vpshufb t1, t2, x0; \ + \ + /* construct IVs */ \ + add2_le128(t2, t0, t4, t3, t5); /* ab: le2 ; cd: le3 */ \ + vpshufb t1, t2, x1; \ + 
add2_le128(t2, t0, t4, t3, t5); \ + vpshufb t1, t2, x2; \ + add2_le128(t2, t0, t4, t3, t5); \ + vpshufb t1, t2, x3; \ + add2_le128(t2, t0, t4, t3, t5); \ + vpshufb t1, t2, x4; \ + add2_le128(t2, t0, t4, t3, t5
[RFC PATCH 1/6] crypto: testmgr - extend camellia test-vectors for camellia-aesni/avx2
Signed-off-by: Jussi Kivilinna --- crypto/testmgr.h | 1100 -- 1 file changed, 1062 insertions(+), 38 deletions(-) diff --git a/crypto/testmgr.h b/crypto/testmgr.h index d503660..dc2c054 100644 --- a/crypto/testmgr.h +++ b/crypto/testmgr.h @@ -20997,8 +20997,72 @@ static struct cipher_testvec camellia_enc_tv_template[] = { "\x86\x1D\xB4\x28\xBF\x56\xED\x61" "\xF8\x8F\x03\x9A\x31\xC8\x3C\xD3" "\x6A\x01\x75\x0C\xA3\x17\xAE\x45" - "\xDC\x50\xE7\x7E\x15\x89\x20\xB7", - .ilen = 496, + "\xDC\x50\xE7\x7E\x15\x89\x20\xB7" + "\x2B\xC2\x59\xF0\x64\xFB\x92\x06" + "\x9D\x34\xCB\x3F\xD6\x6D\x04\x78" + "\x0F\xA6\x1A\xB1\x48\xDF\x53\xEA" + "\x81\x18\x8C\x23\xBA\x2E\xC5\x5C" + "\xF3\x67\xFE\x95\x09\xA0\x37\xCE" + "\x42\xD9\x70\x07\x7B\x12\xA9\x1D" + "\xB4\x4B\xE2\x56\xED\x84\x1B\x8F" + "\x26\xBD\x31\xC8\x5F\xF6\x6A\x01" + "\x98\x0C\xA3\x3A\xD1\x45\xDC\x73" + "\x0A\x7E\x15\xAC\x20\xB7\x4E\xE5" + "\x59\xF0\x87\x1E\x92\x29\xC0\x34" + "\xCB\x62\xF9\x6D\x04\x9B\x0F\xA6" + "\x3D\xD4\x48\xDF\x76\x0D\x81\x18" + "\xAF\x23\xBA\x51\xE8\x5C\xF3\x8A" + "\x21\x95\x2C\xC3\x37\xCE\x65\xFC" + "\x70\x07\x9E\x12\xA9\x40\xD7\x4B" + "\xE2\x79\x10\x84\x1B\xB2\x26\xBD" + "\x54\xEB\x5F\xF6\x8D\x01\x98\x2F" + "\xC6\x3A\xD1\x68\xFF\x73\x0A\xA1" + "\x15\xAC\x43\xDA\x4E\xE5\x7C\x13" + "\x87\x1E\xB5\x29\xC0\x57\xEE\x62" + "\xF9\x90\x04\x9B\x32\xC9\x3D\xD4" + "\x6B\x02\x76\x0D\xA4\x18\xAF\x46" + "\xDD\x51\xE8\x7F\x16\x8A\x21\xB8" + "\x2C\xC3\x5A\xF1\x65\xFC\x93\x07" + "\x9E\x35\xCC\x40\xD7\x6E\x05\x79" + "\x10\xA7\x1B\xB2\x49\xE0\x54\xEB" + "\x82\x19\x8D\x24\xBB\x2F\xC6\x5D" + "\xF4\x68\xFF\x96\x0A\xA1\x38\xCF" + "\x43\xDA\x71\x08\x7C\x13\xAA\x1E" + "\xB5\x4C\xE3\x57\xEE\x85\x1C\x90" + "\x27\xBE\x32\xC9\x60\xF7\x6B\x02" + "\x99\x0D\xA4\x3B\xD2\x46\xDD\x74" + "\x0B\x7F\x16\xAD\x21\xB8\x4F\xE6" + "\x5A\xF1\x88\x1F\x93\x2A\xC1\x35" + "\xCC\x63\xFA\x6E\x05\x9C\x10\xA7" + "\x3E\xD5\x49\xE0\x77\x0E\x82\x19" + "\xB0\x24\xBB\x52\xE9\x5D\xF4\x8B" + "\x22\x96\x2D\xC4\x38\xCF\x66\xFD" + "\x71\x08\x9F\x13\xAA\x41\xD8\x4C" 
+ "\xE3\x7A\x11\x85\x1C\xB3\x27\xBE" + "\x55\xEC\x60\xF7\x8E\x02\x99\x30" + "\xC7\x3B\xD2\x69\x00\x74\x0B\xA2" + "\x16\xAD\x44\xDB\x4F\xE6\x7D\x14" + "\x88\x1F\xB6\x2A\xC1\x58\xEF\x63" + "\xFA\x91\x05\x9C\x33\xCA\x3E\xD5" + "\x6C\x03\x77\x0E\xA5\x19\xB0\x47" + "\xDE\x52\xE9\x80\x17\x8B\x22\xB9" + "\x2D\xC4\x5B\xF2\x66\xFD\x94\x08" + "\x9F\x36\xCD\x41\xD8\x6F\x06\x7A" + "\x11\xA8\x1C\xB3\x4A\xE1\x55\xEC" + "\x83\x1A\x8E\x25\xBC\x30\xC7\x5E" + "\xF5\x69\x00\x97\x0B\xA2\x39\xD0" + "\x44\xDB\x72\x09\x7D\x14\xAB\x1F" + "\xB6\x4D\xE4\x58\xEF\x86\x1D\x91" + "\x28\xBF\x33\xCA\x61\xF8\x6C\x03" + "\x9A\x0E\xA5\x3C\xD3\x47\xDE\x75" + "\x0C\x80\x17\xAE\x22\xB9\x50\xE7" + "\x5B\xF2\x89\x20\x94\x2B\xC2\x36" + &
[RFC PATCH 3/6] crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher
Patch adds AVX2/x86-64 implementation of Blowfish cipher, requiring 32 parallel blocks for input (256 bytes). Table look-ups are performed using vpgatherdd instruction directly from vector registers and thus should be faster than earlier implementations.

Signed-off-by: Jussi Kivilinna
---
 arch/x86/crypto/Makefile               |   11 +
 arch/x86/crypto/blowfish-avx2-asm_64.S |  449 +
 arch/x86/crypto/blowfish_avx2_glue.c   |  585 
 arch/x86/crypto/blowfish_glue.c        |   32 --
 arch/x86/include/asm/cpufeature.h      |    1 
 arch/x86/include/asm/crypto/blowfish.h |   43 ++
 crypto/Kconfig                         |   18 +
 crypto/testmgr.c                       |   12 +
 8 files changed, 1127 insertions(+), 24 deletions(-)
 create mode 100644 arch/x86/crypto/blowfish-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/blowfish_avx2_glue.c
 create mode 100644 arch/x86/include/asm/crypto/blowfish.h

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 03cd731..28464ef 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -3,6 +3,8 @@
 #

 avx_supported := $(call as-instr,vpxor %xmm0$(comma)%xmm0$(comma)%xmm0,yes,no)
+avx2_supported := $(call as-instr,vpgatherdd %ymm0$(comma)(%eax$(comma)%ymm1\
+			$(comma)4)$(comma)%ymm2,yes,no)

 obj-$(CONFIG_CRYPTO_ABLK_HELPER_X86) += ablk_helper.o
 obj-$(CONFIG_CRYPTO_GLUE_HELPER_X86) += glue_helper.o
@@ -38,6 +40,11 @@ ifeq ($(avx_supported),yes)
 	obj-$(CONFIG_CRYPTO_SERPENT_AVX_X86_64) += serpent-avx-x86_64.o
 endif

+# These modules require assembler to support AVX2.
+ifeq ($(avx2_supported),yes) + obj-$(CONFIG_CRYPTO_BLOWFISH_AVX2_X86_64) += blowfish-avx2.o +endif + aes-i586-y := aes-i586-asm_32.o aes_glue.o twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o salsa20-i586-y := salsa20-i586-asm_32.o salsa20_glue.o @@ -62,6 +69,10 @@ ifeq ($(avx_supported),yes) serpent_avx_glue.o endif +ifeq ($(avx2_supported),yes) + blowfish-avx2-y := blowfish-avx2-asm_64.o blowfish_avx2_glue.o +endif + aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o diff --git a/arch/x86/crypto/blowfish-avx2-asm_64.S b/arch/x86/crypto/blowfish-avx2-asm_64.S new file mode 100644 index 000..784452e --- /dev/null +++ b/arch/x86/crypto/blowfish-avx2-asm_64.S @@ -0,0 +1,449 @@ +/* + * x86_64/AVX2 assembler optimized version of Blowfish + * + * Copyright © 2012-2013 Jussi Kivilinna + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + */ + +#include + +.file "blowfish-avx2-asm_64.S" + +.data +.align 32 + +.Lprefetch_mask: +.long 0*64 +.long 1*64 +.long 2*64 +.long 3*64 +.long 4*64 +.long 5*64 +.long 6*64 +.long 7*64 + +.Lbswap32_mask: +.long 0x00010203 +.long 0x04050607 +.long 0x08090a0b +.long 0x0c0d0e0f + +.Lbswap128_mask: + .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 +.Lbswap_iv_mask: + .byte 7, 6, 5, 4, 3, 2, 1, 0, 7, 6, 5, 4, 3, 2, 1, 0 + +.text +/* structure of crypto context */ +#define p 0 +#define s0 ((16 + 2) * 4) +#define s1 ((16 + 2 + (1 * 256)) * 4) +#define s2 ((16 + 2 + (2 * 256)) * 4) +#define s3 ((16 + 2 + (3 * 256)) * 4) + +/* register macros */ +#define CTX%rdi +#define RIO %rdx + +#define RS0%rax +#define RS1%r8 +#define RS2%r9 +#define RS3%r10 + +#define RLOOP %r11 +#define RLOOPd %r11d + +#define RXr0 %ymm8 +#define RXr1 %ymm9 +#define RXr2 %ymm10 +#define RXr3 %ymm11 +#define RXl0 %ymm12 +#define RXl1 %ymm13 +#define RXl2 %ymm14 +#define RXl3 %ymm15 + +/* temp regs */ +#define RT0%ymm0 +#define RT0x %xmm0 +#define RT1%ymm1 +#define RT1x %xmm1 +#define RIDX0 %ymm2 +#define RIDX1 %ymm3 +#define RIDX1x %xmm3 +#define RIDX2 %ymm4 +#define RIDX3 %ymm5 + +/* vpgatherdd mask and '-1' */ +#define RNOT %ymm6 + +/* byte mask, (-1 >> 24) */ +#define RBYTE %ymm7 + +/*** + * 32-way AVX2 blowfish + ***/ +#define F(xl, xr) \ + vpsrld $24, xl, RIDX0; \ + vpsrld $16, xl, RIDX1; \ + vpsrld $8, xl, RIDX2; \ + vpand RBYTE, RIDX1, RIDX1; \ + vpand RBYTE, RIDX2, RIDX2; \ + vpand RBYTE, xl, RIDX3; \ + \ + vpgatherdd RNOT, (RS0, RIDX0, 4), RT0; \ + vpcmpeqd RNOT, RNOT, RNOT; \ + vpcmpeqd RIDX0, RIDX0, RIDX0; \ + \ + vpgatherdd RNOT, (RS1, RIDX1, 4), RT1; \ + vpcmpeqd RIDX1, RIDX1, RIDX1; \ + vpad
[RFC PATCH 0/6] Add AVX2 accelerated implementations for Blowfish, Twofish, Serpent and Camellia
The following series implements four block ciphers - Blowfish, Twofish, Serpent and Camellia - using the AVX2 instruction set. Work on these AVX2 implementations started over a year ago, and the code has been available at https://github.com/jkivilin/crypto-avx2

The Serpent and Camellia implementations are directly based on the word-sliced and byte-sliced AVX implementations and have been extended to use the 256-bit YMM registers. As such, performance should be better than with the 128-bit wide AVX implementations. (The Camellia implementation needs some extra handling for AES-NI, as the AES instructions have remained only 128-bit wide.)

The Blowfish and Twofish implementations utilize the new vpgatherdd instruction to perform eight vectorized 8x32-bit table look-ups at once. This differs from the previous word-sliced AVX implementations, where table look-ups have to be performed through general-purpose registers. The AVX2 implementations thus avoid additional moving of data between the SIMD and general-purpose registers and therefore should be faster.

For obvious reasons, I have not tested these implementations on real hardware. Kernel tcrypt tests have been run under Bochs, which should contain a somewhat working AVX2 implementation. But I cannot be sure; even the Intel SDE emulator that I used for testing these implementations did not quite follow the specs (a past version of SDE that I initially used allowed the vector registers passed to vgather to be the same, whereas the specs say that such a case should raise an exception). Because of this, the first versions of the patchset in the above repository are broken.

So, since I'm unable to verify that these implementations work on real hardware and am unable to conduct a real performance evaluation, I'm sending this patchset as RFC. Maybe someone can actually test these on real hardware and maybe give an acked-by in case these look OK(?). If that is not possible, I'll do the testing myself when Haswell processors become available where I live. 
-Jussi --- Jussi Kivilinna (6): crypto: testmgr - extend camellia test-vectors for camellia-aesni/avx2 crypto: tcrypt - add async cipher speed tests for blowfish crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher crypto: serpent - add AVX2/x86_64 assembler implementation of serpent cipher crypto: camellia - add AVX2/AES-NI/x86_64 assembler implementation of camellia cipher arch/x86/crypto/Makefile | 17 arch/x86/crypto/blowfish-avx2-asm_64.S | 449 + arch/x86/crypto/blowfish_avx2_glue.c | 585 +++ arch/x86/crypto/blowfish_glue.c | 32 - arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 1368 ++ arch/x86/crypto/camellia_aesni_avx2_glue.c | 586 +++ arch/x86/crypto/camellia_aesni_avx_glue.c| 17 arch/x86/crypto/glue_helper-asm-avx2.S | 180 +++ arch/x86/crypto/serpent-avx2-asm_64.S| 800 +++ arch/x86/crypto/serpent_avx2_glue.c | 562 +++ arch/x86/crypto/serpent_avx_glue.c | 62 + arch/x86/crypto/twofish-avx2-asm_64.S| 600 +++ arch/x86/crypto/twofish_avx2_glue.c | 584 +++ arch/x86/crypto/twofish_avx_glue.c | 14 arch/x86/include/asm/cpufeature.h|1 arch/x86/include/asm/crypto/blowfish.h | 43 + arch/x86/include/asm/crypto/camellia.h | 19 arch/x86/include/asm/crypto/serpent-avx.h| 24 arch/x86/include/asm/crypto/twofish.h| 18 crypto/Kconfig | 88 ++ crypto/tcrypt.c | 15 crypto/testmgr.c | 51 + crypto/testmgr.h | 1100 - 23 files changed, 7128 insertions(+), 87 deletions(-) create mode 100644 arch/x86/crypto/blowfish-avx2-asm_64.S create mode 100644 arch/x86/crypto/blowfish_avx2_glue.c create mode 100644 arch/x86/crypto/camellia-aesni-avx2-asm_64.S create mode 100644 arch/x86/crypto/camellia_aesni_avx2_glue.c create mode 100644 arch/x86/crypto/glue_helper-asm-avx2.S create mode 100644 arch/x86/crypto/serpent-avx2-asm_64.S create mode 100644 arch/x86/crypto/serpent_avx2_glue.c create mode 100644 arch/x86/crypto/twofish-avx2-asm_64.S create mode 100644 arch/x86/crypto/twofish_avx2_glue.c create 
mode 100644 arch/x86/include/asm/crypto/blowfish.h -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 1/6] crypto: testmgr - extend camellia test-vectors for camellia-aesni/avx2
Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi --- crypto/testmgr.h | 1100 -- 1 file changed, 1062 insertions(+), 38 deletions(-) diff --git a/crypto/testmgr.h b/crypto/testmgr.h index d503660..dc2c054 100644 --- a/crypto/testmgr.h +++ b/crypto/testmgr.h @@ -20997,8 +20997,72 @@ static struct cipher_testvec camellia_enc_tv_template[] = { \x86\x1D\xB4\x28\xBF\x56\xED\x61 \xF8\x8F\x03\x9A\x31\xC8\x3C\xD3 \x6A\x01\x75\x0C\xA3\x17\xAE\x45 - \xDC\x50\xE7\x7E\x15\x89\x20\xB7, - .ilen = 496, + \xDC\x50\xE7\x7E\x15\x89\x20\xB7 + \x2B\xC2\x59\xF0\x64\xFB\x92\x06 + \x9D\x34\xCB\x3F\xD6\x6D\x04\x78 + \x0F\xA6\x1A\xB1\x48\xDF\x53\xEA + \x81\x18\x8C\x23\xBA\x2E\xC5\x5C + \xF3\x67\xFE\x95\x09\xA0\x37\xCE + \x42\xD9\x70\x07\x7B\x12\xA9\x1D + \xB4\x4B\xE2\x56\xED\x84\x1B\x8F + \x26\xBD\x31\xC8\x5F\xF6\x6A\x01 + \x98\x0C\xA3\x3A\xD1\x45\xDC\x73 + \x0A\x7E\x15\xAC\x20\xB7\x4E\xE5 + \x59\xF0\x87\x1E\x92\x29\xC0\x34 + \xCB\x62\xF9\x6D\x04\x9B\x0F\xA6 + \x3D\xD4\x48\xDF\x76\x0D\x81\x18 + \xAF\x23\xBA\x51\xE8\x5C\xF3\x8A + \x21\x95\x2C\xC3\x37\xCE\x65\xFC + \x70\x07\x9E\x12\xA9\x40\xD7\x4B + \xE2\x79\x10\x84\x1B\xB2\x26\xBD + \x54\xEB\x5F\xF6\x8D\x01\x98\x2F + \xC6\x3A\xD1\x68\xFF\x73\x0A\xA1 + \x15\xAC\x43\xDA\x4E\xE5\x7C\x13 + \x87\x1E\xB5\x29\xC0\x57\xEE\x62 + \xF9\x90\x04\x9B\x32\xC9\x3D\xD4 + \x6B\x02\x76\x0D\xA4\x18\xAF\x46 + \xDD\x51\xE8\x7F\x16\x8A\x21\xB8 + \x2C\xC3\x5A\xF1\x65\xFC\x93\x07 + \x9E\x35\xCC\x40\xD7\x6E\x05\x79 + \x10\xA7\x1B\xB2\x49\xE0\x54\xEB + \x82\x19\x8D\x24\xBB\x2F\xC6\x5D + \xF4\x68\xFF\x96\x0A\xA1\x38\xCF + \x43\xDA\x71\x08\x7C\x13\xAA\x1E + \xB5\x4C\xE3\x57\xEE\x85\x1C\x90 + \x27\xBE\x32\xC9\x60\xF7\x6B\x02 + \x99\x0D\xA4\x3B\xD2\x46\xDD\x74 + \x0B\x7F\x16\xAD\x21\xB8\x4F\xE6 + \x5A\xF1\x88\x1F\x93\x2A\xC1\x35 + \xCC\x63\xFA\x6E\x05\x9C\x10\xA7 + \x3E\xD5\x49\xE0\x77\x0E\x82\x19 + \xB0\x24\xBB\x52\xE9\x5D\xF4\x8B + \x22\x96\x2D\xC4\x38\xCF\x66\xFD + \x71\x08\x9F\x13\xAA\x41\xD8\x4C + \xE3\x7A\x11\x85\x1C\xB3\x27\xBE + 
\x55\xEC\x60\xF7\x8E\x02\x99\x30 + \xC7\x3B\xD2\x69\x00\x74\x0B\xA2 + \x16\xAD\x44\xDB\x4F\xE6\x7D\x14 + \x88\x1F\xB6\x2A\xC1\x58\xEF\x63 + \xFA\x91\x05\x9C\x33\xCA\x3E\xD5 + \x6C\x03\x77\x0E\xA5\x19\xB0\x47 + \xDE\x52\xE9\x80\x17\x8B\x22\xB9 + \x2D\xC4\x5B\xF2\x66\xFD\x94\x08 + \x9F\x36\xCD\x41\xD8\x6F\x06\x7A + \x11\xA8\x1C\xB3\x4A\xE1\x55\xEC + \x83\x1A\x8E\x25\xBC\x30\xC7\x5E + \xF5\x69\x00\x97\x0B\xA2\x39\xD0 + \x44\xDB\x72\x09\x7D\x14\xAB\x1F + \xB6\x4D\xE4\x58\xEF\x86\x1D\x91 + \x28\xBF\x33\xCA\x61\xF8\x6C\x03 + \x9A\x0E\xA5\x3C\xD3\x47\xDE\x75 + \x0C\x80\x17\xAE\x22\xB9\x50\xE7 + \x5B\xF2\x89\x20\x94\x2B\xC2\x36 + \xCD\x64\xFB\x6F\x06\x9D\x11\xA8 + \x3F\xD6\x4A\xE1\x78\x0F\x83\x1A + \xB1\x25\xBC\x53\xEA\x5E\xF5\x8C + \x00\x97\x2E\xC5\x39\xD0\x67\xFE + \x72\x09\xA0\x14\xAB\x42\xD9\x4D, + .ilen = 1008, .result = \xED\xCD\xDB\xB8\x68\xCE\xBD\xEA \x9D\x9D\xCD\x9F\x4F\xFC\x4D\xB7 \xA5\xFF\x6F\x43\x0F\xBA\x32\x04 @@ -21060,11 +21124,75 @@ static struct cipher_testvec camellia_enc_tv_template[] = { \x2C\x35\x1B\x38\x85\x7D\xE8\xF3 \x87\x4F\xDA\xD8\x5F\xFC\xB6\x44 \xD0\xE3\x9B\x8B\xBF\xD6\xB8\xC4
[RFC PATCH 4/6] crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher
Patch adds AVX2/x86-64 implementation of Twofish cipher, requiring 16 parallel blocks for input (256 bytes). Table look-ups are performed using vpgatherdd instruction directly from vector registers and thus should be faster than earlier implementations. Implementation also uses 256-bit wide YMM registers, which should give additional speed up compared to the AVX implementation. Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi --- arch/x86/crypto/Makefile |2 arch/x86/crypto/glue_helper-asm-avx2.S | 180 ++ arch/x86/crypto/twofish-avx2-asm_64.S | 600 arch/x86/crypto/twofish_avx2_glue.c| 584 +++ arch/x86/crypto/twofish_avx_glue.c | 14 + arch/x86/include/asm/crypto/twofish.h | 18 + crypto/Kconfig | 24 + crypto/testmgr.c | 12 + 8 files changed, 1432 insertions(+), 2 deletions(-) create mode 100644 arch/x86/crypto/glue_helper-asm-avx2.S create mode 100644 arch/x86/crypto/twofish-avx2-asm_64.S create mode 100644 arch/x86/crypto/twofish_avx2_glue.c diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index 28464ef..1f6e0c2 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -43,6 +43,7 @@ endif # These modules require assembler to support AVX2. 
ifeq ($(avx2_supported),yes) obj-$(CONFIG_CRYPTO_BLOWFISH_AVX2_X86_64) += blowfish-avx2.o + obj-$(CONFIG_CRYPTO_TWOFISH_AVX2_X86_64) += twofish-avx2.o endif aes-i586-y := aes-i586-asm_32.o aes_glue.o @@ -71,6 +72,7 @@ endif ifeq ($(avx2_supported),yes) blowfish-avx2-y := blowfish-avx2-asm_64.o blowfish_avx2_glue.o + twofish-avx2-y := twofish-avx2-asm_64.o twofish_avx2_glue.o endif aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o diff --git a/arch/x86/crypto/glue_helper-asm-avx2.S b/arch/x86/crypto/glue_helper-asm-avx2.S new file mode 100644 index 000..a53ac11 --- /dev/null +++ b/arch/x86/crypto/glue_helper-asm-avx2.S @@ -0,0 +1,180 @@ +/* + * Shared glue code for 128bit block ciphers, AVX2 assembler macros + * + * Copyright © 2012-2013 Jussi Kivilinna jussi.kivili...@mbnet.fi + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + */ + +#define load_16way(src, x0, x1, x2, x3, x4, x5, x6, x7) \ + vmovdqu (0*32)(src), x0; \ + vmovdqu (1*32)(src), x1; \ + vmovdqu (2*32)(src), x2; \ + vmovdqu (3*32)(src), x3; \ + vmovdqu (4*32)(src), x4; \ + vmovdqu (5*32)(src), x5; \ + vmovdqu (6*32)(src), x6; \ + vmovdqu (7*32)(src), x7; + +#define store_16way(dst, x0, x1, x2, x3, x4, x5, x6, x7) \ + vmovdqu x0, (0*32)(dst); \ + vmovdqu x1, (1*32)(dst); \ + vmovdqu x2, (2*32)(dst); \ + vmovdqu x3, (3*32)(dst); \ + vmovdqu x4, (4*32)(dst); \ + vmovdqu x5, (5*32)(dst); \ + vmovdqu x6, (6*32)(dst); \ + vmovdqu x7, (7*32)(dst); + +#define store_cbc_16way(src, dst, x0, x1, x2, x3, x4, x5, x6, x7, t0) \ + vpxor t0, t0, t0; \ + vinserti128 $1, (src), t0, t0; \ + vpxor t0, x0, x0; \ + vpxor (0*32+16)(src), x1, x1; \ + vpxor (1*32+16)(src), x2, x2; \ + vpxor (2*32+16)(src), x3, x3; \ + vpxor (3*32+16)(src), x4, x4; \ + vpxor (4*32+16)(src), x5, x5; \ + vpxor (5*32+16)(src), x6, x6; \ + vpxor (6*32+16)(src), x7, x7; \ + store_16way(dst, x0, x1, x2, x3, x4, x5, x6, x7); + +#define inc_le128(x, minus_one, tmp) \ + vpcmpeqq minus_one, x, tmp; \ + vpsubq minus_one, x, x; \ + vpslldq $8, tmp, tmp; \ + vpsubq tmp, x, x; + +#define add2_le128(x, minus_one, minus_two, tmp1, tmp2) \ + vpcmpeqq minus_one, x, tmp1; \ + vpcmpeqq minus_two, x, tmp2; \ + vpsubq minus_two, x, x; \ + vpor tmp2, tmp1, tmp1; \ + vpslldq $8, tmp1, tmp1; \ + vpsubq tmp1, x, x; + +#define load_ctr_16way(iv, bswap, x0, x1, x2, x3, x4, x5, x6, x7, t0, t0x, t1, \ + t1x, t2, t2x, t3, t3x, t4, t5) \ + vpcmpeqd t0, t0, t0; \ + vpsrldq $8, t0, t0; /* ab: -1:0 ; cd: -1:0 */ \ + vpaddq t0, t0, t4; /* ab: -2:0 ; cd: -2:0 */\ + \ + /* load IV and byteswap */ \ + vmovdqu (iv), t2x; \ + vmovdqa t2x, t3x; \ + inc_le128(t2x, t0x, t1x); \ + vbroadcasti128 bswap, t1; \ + vinserti128 $1, t2x, t3, t2; /* ab: le0 ; cd: le1 */ \ + vpshufb t1, t2, x0; \ + \ + /* construct IVs */ \ + add2_le128(t2, t0, t4, t3, t5); /* ab: le2 ; cd: le3 */ \ + vpshufb t1, t2, x1; \ + 
add2_le128(t2, t0, t4, t3, t5); \ + vpshufb t1, t2, x2; \ + add2_le128(t2, t0, t4, t3, t5); \ + vpshufb t1, t2, x3; \ + add2_le128(t2, t0, t4, t3, t5); \ + vpshufb t1, t2
[RFC PATCH 5/6] crypto: serpent - add AVX2/x86_64 assembler implementation of serpent cipher
Patch adds AVX2/x86-64 implementation of Serpent cipher, requiring 16 parallel blocks for input (256 bytes). Implementation is based on the AVX implementation and extends to use the 256-bit wide YMM registers. Since serpent does not use table look-ups, this implementation should be close to two times faster than the AVX implementation. Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi --- arch/x86/crypto/Makefile |2 arch/x86/crypto/serpent-avx2-asm_64.S | 800 + arch/x86/crypto/serpent_avx2_glue.c | 562 arch/x86/crypto/serpent_avx_glue.c| 62 ++ arch/x86/include/asm/crypto/serpent-avx.h | 24 + crypto/Kconfig| 23 + crypto/testmgr.c | 15 + 7 files changed, 1468 insertions(+), 20 deletions(-) create mode 100644 arch/x86/crypto/serpent-avx2-asm_64.S create mode 100644 arch/x86/crypto/serpent_avx2_glue.c diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index 1f6e0c2..a21af59 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -43,6 +43,7 @@ endif # These modules require assembler to support AVX2. 
ifeq ($(avx2_supported),yes) obj-$(CONFIG_CRYPTO_BLOWFISH_AVX2_X86_64) += blowfish-avx2.o + obj-$(CONFIG_CRYPTO_SERPENT_AVX2_X86_64) += serpent-avx2.o obj-$(CONFIG_CRYPTO_TWOFISH_AVX2_X86_64) += twofish-avx2.o endif @@ -72,6 +73,7 @@ endif ifeq ($(avx2_supported),yes) blowfish-avx2-y := blowfish-avx2-asm_64.o blowfish_avx2_glue.o + serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o twofish-avx2-y := twofish-avx2-asm_64.o twofish_avx2_glue.o endif diff --git a/arch/x86/crypto/serpent-avx2-asm_64.S b/arch/x86/crypto/serpent-avx2-asm_64.S new file mode 100644 index 000..b222085 --- /dev/null +++ b/arch/x86/crypto/serpent-avx2-asm_64.S @@ -0,0 +1,800 @@ +/* + * x86_64/AVX2 assembler optimized version of Serpent + * + * Copyright © 2012-2013 Jussi Kivilinna jussi.kivili...@mbnet.fi + * + * Based on AVX assembler implementation of Serpent by: + * Copyright © 2012 Johannes Goetzfried + * johannes.goetzfr...@informatik.stud.uni-erlangen.de + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + */ + +#include linux/linkage.h +#include glue_helper-asm-avx2.S + +.file serpent-avx2-asm_64.S + +.data +.align 16 + +.Lbswap128_mask: + .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 +.Lxts_gf128mul_and_shl1_mask_0: + .byte 0x87, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 +.Lxts_gf128mul_and_shl1_mask_1: + .byte 0x0e, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0 + +.text + +#define CTX %rdi + +#define RNOT %ymm0 +#define tp %ymm1 + +#define RA1 %ymm2 +#define RA2 %ymm3 +#define RB1 %ymm4 +#define RB2 %ymm5 +#define RC1 %ymm6 +#define RC2 %ymm7 +#define RD1 %ymm8 +#define RD2 %ymm9 +#define RE1 %ymm10 +#define RE2 %ymm11 + +#define RK0 %ymm12 +#define RK1 %ymm13 +#define RK2 %ymm14 +#define RK3 %ymm15 + +#define RK0x %xmm12 +#define RK1x %xmm13 +#define RK2x %xmm14 +#define RK3x %xmm15 + +#define S0_1(x0, x1, x2, x3, x4) \ + vporx0, x3, tp; \ + vpxor x3, x0, x0; \ + vpxor x2, x3, x4; \ + vpxor RNOT, x4, x4; \ + vpxor x1, tp, x3; \ + vpand x0, x1, x1; \ + vpxor x4, x1, x1; \ + vpxor x0, x2, x2; +#define S0_2(x0, x1, x2, x3, x4) \ + vpxor x3, x0, x0; \ + vporx0, x4, x4; \ + vpxor x2, x0, x0; \ + vpand x1, x2, x2; \ + vpxor x2, x3, x3; \ + vpxor RNOT, x1, x1; \ + vpxor x4, x2, x2; \ + vpxor x2, x1, x1; + +#define S1_1(x0, x1, x2, x3, x4) \ + vpxor x0, x1, tp; \ + vpxor x3, x0, x0; \ + vpxor RNOT, x3, x3; \ + vpand tp, x1, x4; \ + vportp, x0, x0; \ + vpxor x2, x3, x3; \ + vpxor x3, x0, x0; \ + vpxor x3, tp, x1; +#define S1_2(x0, x1, x2, x3, x4) \ + vpxor x4, x3, x3; \ + vporx4, x1, x1; \ + vpxor x2, x4, x4; \ + vpand x0, x2, x2; \ + vpxor x1, x2, x2; \ + vporx0, x1, x1; \ + vpxor RNOT, x0, x0; \ + vpxor x2, x0, x0; \ + vpxor x1, x4, x4; + +#define S2_1(x0, x1, x2, x3, x4) \ + vpxor RNOT, x3, x3; \ + vpxor x0, x1, x1; \ + vpand x2, x0, tp; \ + vpxor x3
[RFC PATCH 2/6] crypto: tcrypt - add async cipher speed tests for blowfish
Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi --- crypto/tcrypt.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c index 24ea7df..66d254c 100644 --- a/crypto/tcrypt.c +++ b/crypto/tcrypt.c @@ -1768,6 +1768,21 @@ static int do_test(int m) speed_template_32_64); break; + case 509: + test_acipher_speed("ecb(blowfish)", ENCRYPT, sec, NULL, 0, + speed_template_8_32); + test_acipher_speed("ecb(blowfish)", DECRYPT, sec, NULL, 0, + speed_template_8_32); + test_acipher_speed("cbc(blowfish)", ENCRYPT, sec, NULL, 0, + speed_template_8_32); + test_acipher_speed("cbc(blowfish)", DECRYPT, sec, NULL, 0, + speed_template_8_32); + test_acipher_speed("ctr(blowfish)", ENCRYPT, sec, NULL, 0, + speed_template_8_32); + test_acipher_speed("ctr(blowfish)", DECRYPT, sec, NULL, 0, + speed_template_8_32); + break; + case 1000: test_available(); break;
Re: [PATCH 06/11] Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions.
On 22.03.2013 23:29, Tim Chen wrote: > We added glue code and config options to create crypto > module that uses SSE/AVX/AVX2 optimized SHA256 x86_64 assembly routines. > > Signed-off-by: Tim Chen ..snip.. > diff --git a/arch/x86/crypto/sha256_ssse3_glue.c > b/arch/x86/crypto/sha256_ssse3_glue.c > new file mode 100644 > index 000..5876a19 > --- /dev/null > +++ b/arch/x86/crypto/sha256_ssse3_glue.c ..snip.. > +static int __init sha256_ssse3_mod_init(void) > +{ > + /* test for SSE3 first */ > + if (cpu_has_xmm3) > + sha256_transform_asm = sha256_transform_ssse3; > + This causes an OOPS on my computer. Maybe use 'cpu_has_ssse3' instead? -Jussi
Re: [PATCH 03/11] Optimized sha256 x86_64 assembly routine using Supplemental SSE3 instructions.
On 22.03.2013 23:29, Tim Chen wrote: > Provides SHA256 x86_64 assembly routine optimized with SSSE3 instructions. > Speedup of 40% or more has been measured over the generic implementation. > > Signed-off-by: Tim Chen > --- > arch/x86/crypto/sha256-ssse3-asm.S | 504 > + > 1 file changed, 504 insertions(+) > create mode 100644 arch/x86/crypto/sha256-ssse3-asm.S > > diff --git a/arch/x86/crypto/sha256-ssse3-asm.S > b/arch/x86/crypto/sha256-ssse3-asm.S ..snip.. > + > + > +## void sha256_transform_ssse3(void *input_data, UINT32 digest[8], UINT64 > num_blks) > +## arg 1 : pointer to input data > +## arg 2 : pointer to digest > +## arg 3 : Num blocks > + > +.text > +.global sha256_transform_ssse3 > +.align 32 > +sha256_transform_ssse3: Maybe use ENTRY/ENDPROC macros for exporting functions from assembly? -Jussi
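For reference, the suggested change would look roughly like this (a sketch only; ENTRY/ENDPROC from <linux/linkage.h> take care of the .globl directive, alignment, and the .type/.size annotations, so the routine shows up as a proper function symbol for tools like objdump and perf):

```asm
#include <linux/linkage.h>

.text
ENTRY(sha256_transform_ssse3)
	/* function body unchanged ... */
	ret
ENDPROC(sha256_transform_ssse3)
```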
Re: [PATCH 11/11] Create module providing optimized SHA512 routines using SSSE3, AVX or AVX2 instructions.
On 22.03.2013 23:29, Tim Chen wrote: > We added glue code and config options to create crypto > module that uses SSE/AVX/AVX2 optimized SHA512 x86_64 assembly routines. > > Signed-off-by: Tim Chen > --- > arch/x86/crypto/Makefile| 2 + > arch/x86/crypto/sha512_ssse3_glue.c | 276 > > crypto/Kconfig | 11 ++ > 3 files changed, 289 insertions(+) > create mode 100644 arch/x86/crypto/sha512_ssse3_glue.c > > diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile > index 02a664a..7d12625 100644 > --- a/arch/x86/crypto/Makefile > +++ b/arch/x86/crypto/Makefile > @@ -28,6 +28,7 @@ obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += > ghash-clmulni-intel.o > obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o > obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o > obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o > +obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o > > aes-i586-y := aes-i586-asm_32.o aes_glue.o > twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o > @@ -54,3 +55,4 @@ sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o > crc32c-intel-y := crc32c-intel_glue.o > crc32c-intel-$(CONFIG_CRYPTO_CRC32C_X86_64) += crc32c-pcl-intel-asm_64.o > sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o > sha256_ssse3_glue.o > +sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o > sha512_ssse3_glue.o > diff --git a/arch/x86/crypto/sha512_ssse3_glue.c > b/arch/x86/crypto/sha512_ssse3_glue.c > new file mode 100644 > index 000..25a2e07 > --- /dev/null > +++ b/arch/x86/crypto/sha512_ssse3_glue.c ...snip.. > +#include > + > +asmlinkage void sha512_transform_ssse3(const char *data, u64 *digest, > + u64 rounds); > +#ifdef CONFIG_AS_AVX > +asmlinkage void sha512_transform_avx(const char *data, u64 *digest, > + u64 rounds); > +asmlinkage void sha512_transform_rorx(const char *data, u64 *digest, > + u64 rounds); > +#endif > + Is CONFIG_AS_AVX enough to ensure that rorx is supported by assembler? 
You also have #ifdef CONFIG_AS_AVX / #endif missing in 'sha256-avx-asm.S', 'sha256-avx2-asm.S', 'sha512-avx-asm.S' and 'sha512-avx2-asm.S'. -Jussi -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 06/11] Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions.
On 22.03.2013 23:29, Tim Chen wrote: > We added glue code and config options to create crypto > module that uses SSE/AVX/AVX2 optimized SHA256 x86_64 assembly routines. > > Signed-off-by: Tim Chen > --- I could not apply this patch cleanly on top of cryptodev-2.6 tree: Applying: Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions. Using index info to reconstruct a base tree... Falling back to patching base and 3-way merge... Auto-merging crypto/Kconfig Auto-merging arch/x86/crypto/Makefile CONFLICT (content): Merge conflict in arch/x86/crypto/Makefile Failed to merge in the changes. Patch failed at 0006 Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions. -Jussi
Re: [PATCH 03/11] Optimized sha256 x86_64 assembly routine using Supplemental SSE3 instructions.
On 22.03.2013 23:29, Tim Chen wrote: Provides SHA256 x86_64 assembly routine optimized with SSSE3 instructions. Speedup of 40% or more has been measured over the generic implementation. Signed-off-by: Tim Chen tim.c.c...@linux.intel.com --- arch/x86/crypto/sha256-ssse3-asm.S | 504 + 1 file changed, 504 insertions(+) create mode 100644 arch/x86/crypto/sha256-ssse3-asm.S diff --git a/arch/x86/crypto/sha256-ssse3-asm.S b/arch/x86/crypto/sha256-ssse3-asm.S ..snip..

+## void sha256_transform_ssse3(void *input_data, UINT32 digest[8], UINT64 num_blks)
+## arg 1 : pointer to input data
+## arg 2 : pointer to digest
+## arg 3 : Num blocks
+
+.text
+.global sha256_transform_ssse3
+.align 32
+sha256_transform_ssse3:

Maybe use ENTRY/ENDPROC macros for exporting functions from assembly? -Jussi
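For reference, the change Jussi is suggesting would swap the open-coded .global/.align for the linkage macros from <linux/linkage.h>, roughly like this (a sketch, not the actual follow-up patch):

```
#include <linux/linkage.h>

.text
ENTRY(sha256_transform_ssse3)	/* expands to alignment plus .globl */
	/* ... function body ... */
	ret
ENDPROC(sha256_transform_ssse3)	/* emits .type/.size for the symbol */
```

ENDPROC in particular marks the symbol as a function, which symbol-table consumers such as objdump and kallsyms rely on.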
Re: [PATCH 06/11] Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions.
On 22.03.2013 23:29, Tim Chen wrote: We added glue code and config options to create crypto module that uses SSE/AVX/AVX2 optimized SHA256 x86_64 assembly routines. Signed-off-by: Tim Chen tim.c.c...@linux.intel.com ..snip.. diff --git a/arch/x86/crypto/sha256_ssse3_glue.c b/arch/x86/crypto/sha256_ssse3_glue.c new file mode 100644 index 000..5876a19 --- /dev/null +++ b/arch/x86/crypto/sha256_ssse3_glue.c ..snip..

+static int __init sha256_ssse3_mod_init(void)
+{
+	/* test for SSE3 first */
+	if (cpu_has_xmm3)
+		sha256_transform_asm = sha256_transform_ssse3;
+

This causes OOPS on my computer. Maybe use 'cpu_has_ssse3' instead? -Jussi
Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues
Quoting Steffen Klassert : On Thu, Jan 24, 2013 at 01:25:46PM +0200, Jussi Kivilinna wrote: Maybe it would be cleaner to not mess with pfkeyv2.h at all, but instead mark algorithms that do not support pfkey with flag. See patch below. As nobody seems to have another opinion, we could go either with your approach, or we can invert the logic and mark all existing algorithms as pfkey supported. Then we would not need to bother about pfkey again. I'd be fine with both. Do you want to submit a patch? Ok, I'll invert the logic and send new patch. -Jussi
Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues
Quoting Steffen Klassert : > On Wed, Jan 23, 2013 at 05:35:10PM +0200, Jussi Kivilinna wrote: >> >> Problem seems to be that PFKEYv2 does not quite work with IKEv2, and >> XFRM API should be used instead. There are new numbers assigned for >> IKEv2: >> https://www.iana.org/assignments/ikev2-parameters/ikev2-parameters.xml#ikev2-parameters-7 >> >> For new SADB_X_AALG_*, I'd think you should use value from "Reserved >> for private use" range. Maybe 250? > > This would be an option, but we have just a few slots for private > algorithms. > >> >> But maybe better solution might be to not make AES-CMAC (or other >> new algorithms) available through PFKEY API at all, just XFRM? >> > > It is probably the best to make new algorithms unavailable for pfkey > as long as they have no official ikev1 iana transform identifier. > > But how to do that? Perhaps we can assign SADB_X_AALG_NOPFKEY to > the private value 255 and return -EINVAL if pfkey tries to register > such an algorithm. The netlink interface does not use these > identifiers, everything should work as expected. So it should be > possible to use these algorithms with iproute2 and the most modern > ike daemons. Maybe it would be cleaner to not mess with pfkeyv2.h at all, but instead mark algorithms that do not support pfkey with a flag. See patch below. Then I started looking up if sadb_alg_id is being used somewhere outside pfkey. Seems that its value is just being copied around.. but at "http://lxr.linux.no/linux+v3.7/net/xfrm/xfrm_policy.c#L1991" it's used as bit-index. So do larger values than 31 break some stuff? Can multiple algorithms have same sadb_alg_id value? Also in af_key.c, sadb_alg_id is being used as bit-index. -Jussi --- ONLY COMPILE TESTED! 
---
 include/net/xfrm.h   |  5 +++--
 net/key/af_key.c     | 39 +++
 net/xfrm/xfrm_algo.c | 12 ++--
 3 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 421f764..5d5eec2 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1320,6 +1320,7 @@ struct xfrm_algo_desc {
 	char *name;
 	char *compat;
 	u8 available:1;
+	u8 sadb_disabled:1;
 	union {
 		struct xfrm_algo_aead_info aead;
 		struct xfrm_algo_auth_info auth;
@@ -1561,8 +1562,8 @@ extern void xfrm_input_init(void);
 extern int xfrm_parse_spi(struct sk_buff *skb, u8 nexthdr, __be32 *spi, __be32 *seq);
 extern void xfrm_probe_algs(void);
-extern int xfrm_count_auth_supported(void);
-extern int xfrm_count_enc_supported(void);
+extern int xfrm_count_sadb_auth_supported(void);
+extern int xfrm_count_sadb_enc_supported(void);
 extern struct xfrm_algo_desc *xfrm_aalg_get_byidx(unsigned int idx);
 extern struct xfrm_algo_desc *xfrm_ealg_get_byidx(unsigned int idx);
 extern struct xfrm_algo_desc *xfrm_aalg_get_byid(int alg_id);
diff --git a/net/key/af_key.c b/net/key/af_key.c
index 5b426a6..307cf1d 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -816,18 +816,21 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x,
 	sa->sadb_sa_auth = 0;
 	if (x->aalg) {
 		struct xfrm_algo_desc *a = xfrm_aalg_get_byname(x->aalg->alg_name, 0);
-		sa->sadb_sa_auth = a ? a->desc.sadb_alg_id : 0;
+		sa->sadb_sa_auth = (a && !a->sadb_disabled) ?
+			a->desc.sadb_alg_id : 0;
 	}
 	sa->sadb_sa_encrypt = 0;
 	BUG_ON(x->ealg && x->calg);
 	if (x->ealg) {
 		struct xfrm_algo_desc *a = xfrm_ealg_get_byname(x->ealg->alg_name, 0);
-		sa->sadb_sa_encrypt = a ? a->desc.sadb_alg_id : 0;
+		sa->sadb_sa_encrypt = (a && !a->sadb_disabled) ?
+			a->desc.sadb_alg_id : 0;
 	}
 	/* KAME compatible: sadb_sa_encrypt is overloaded with calg id */
 	if (x->calg) {
 		struct xfrm_algo_desc *a = xfrm_calg_get_byname(x->calg->alg_name, 0);
-		sa->sadb_sa_encrypt = a ? a->desc.sadb_alg_id : 0;
+		sa->sadb_sa_encrypt = (a && !a->sadb_disabled) ?
+			a->desc.sadb_alg_id : 0;
 	}
 	sa->sadb_sa_flags = 0;
@@ -1138,7 +1141,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
 	if (sa->sadb_sa_auth) {
 		int keysize = 0;
 		struct xfrm_algo_desc *a = xfrm_aalg_get_byid(sa->sadb_sa_auth);
-		if (!a) {
+		if (!a || a->sadb_disabled) {
 			err = -ENOSYS;
 			goto out;
 		}
@@ -1160,7 +1163,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
 	if (sa->sadb_sa_encrypt) {
 		if (hdr->sadb_msg_satype == SADB_X_SATYPE_IPCOMP) {
 			struct xfrm_algo_desc *a = xfrm_calg_get_byid(sa->sadb_sa_encrypt);
-			if (!a) {
+			if (!a || a->sadb_disabled
Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues
Quoting YOSHIFUJI Hideaki : YOSHIFUJI Hideaki wrote: Jussi Kivilinna wrote: diff --git a/include/uapi/linux/pfkeyv2.h b/include/uapi/linux/pfkeyv2.h index 0b80c80..d61898e 100644 --- a/include/uapi/linux/pfkeyv2.h +++ b/include/uapi/linux/pfkeyv2.h @@ -296,6 +296,7 @@ struct sadb_x_kmaddress { #define SADB_X_AALG_SHA2_512HMAC 7 #define SADB_X_AALG_RIPEMD160HMAC 8 #define SADB_X_AALG_AES_XCBC_MAC 9 +#define SADB_X_AALG_AES_CMAC_MAC 10 #define SADB_X_AALG_NULL 251 /* kame */ #define SADB_AALG_MAX 251 Should these values be based on IANA assigned IPSEC AH transform identifiers? https://www.iana.org/assignments/isakmp-registry/isakmp-registry.xml#isakmp-registry-6 There is no CMAC entry apparently ... despite the fact that CMAC is a proposed RFC standard for IPsec. It might be safer to move that to 14 since it's currently unassigned and then go through whatever channels are required to allocate it. Mostly this affects key setting. So this means my patch would break AH_RSA setkey calls (which the kernel doesn't support anyways). Problem seems to be that PFKEYv2 does not quite work with IKEv2, and XFRM API should be used instead. There are new numbers assigned for IKEv2: https://www.iana.org/assignments/ikev2-parameters/ikev2-parameters.xml#ikev2-parameters-7 For new SADB_X_AALG_*, I'd think you should use value from "Reserved for private use" range. Maybe 250? We can choose any value unless we do not break existing binaries. When IKE used, the daemon is responsible for translation. I meant, we can choose any values "if" we do not break ... Ok, so giving '10' to AES-CMAC is fine after all? And if I'd want to add Camellia-CTR and Camellia-CCM support, I can choose next free numbers from SADB_X_EALG_*? -Jussi
Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues
Quoting Tom St Denis : - Original Message - From: "Jussi Kivilinna" To: "Tom St Denis" Cc: linux-kernel@vger.kernel.org, "Herbert Xu" , "David Miller" , linux-cry...@vger.kernel.org, "Steffen Klassert" , net...@vger.kernel.org Sent: Wednesday, 23 January, 2013 9:36:44 AM Subject: Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues Quoting Tom St Denis : > Hey all, > > Here's an updated patch which addresses a couple of build issues > and > coding style complaints. > > I still can't get it to run via testmgr I get > > [ 162.407807] alg: No test for cmac(aes) (cmac(aes-generic)) > > Despite the fact I have an entry for cmac(aes) (much like > xcbc(aes)...). > > Here's the patch to bring 3.8-rc4 up with CMAC ... > > Signed-off-by: Tom St Denis > > diff --git a/include/uapi/linux/pfkeyv2.h > b/include/uapi/linux/pfkeyv2.h > index 0b80c80..d61898e 100644 > --- a/include/uapi/linux/pfkeyv2.h > +++ b/include/uapi/linux/pfkeyv2.h > @@ -296,6 +296,7 @@ struct sadb_x_kmaddress { > #define SADB_X_AALG_SHA2_512HMAC 7 > #define SADB_X_AALG_RIPEMD160HMAC 8 > #define SADB_X_AALG_AES_XCBC_MAC 9 > +#define SADB_X_AALG_AES_CMAC_MAC 10 > #define SADB_X_AALG_NULL 251 /* kame */ > #define SADB_AALG_MAX 251 Should these values be based on IANA assigned IPSEC AH transform identifiers? https://www.iana.org/assignments/isakmp-registry/isakmp-registry.xml#isakmp-registry-6 There is no CMAC entry apparently ... despite the fact that CMAC is a proposed RFC standard for IPsec. It might be safer to move that to 14 since it's currently unassigned and then go through whatever channels are required to allocate it. Mostly this affects key setting. So this means my patch would break AH_RSA setkey calls (which the kernel doesn't support anyways). Problem seems to be that PFKEYv2 does not quite work with IKEv2, and XFRM API should be used instead. 
There are new numbers assigned for IKEv2: https://www.iana.org/assignments/ikev2-parameters/ikev2-parameters.xml#ikev2-parameters-7 For new SADB_X_AALG_*, I'd think you should use a value from the "Reserved for private use" range. Maybe 250? But maybe a better solution might be to not make AES-CMAC (or other new algorithms) available through the PFKEY API at all, just XFRM? -Jussi
Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues
Quoting Tom St Denis : Hey all, Here's an updated patch which addresses a couple of build issues and coding style complaints. I still can't get it to run via testmgr I get [ 162.407807] alg: No test for cmac(aes) (cmac(aes-generic)) Despite the fact I have an entry for cmac(aes) (much like xcbc(aes)...). Here's the patch to bring 3.8-rc4 up with CMAC ... Signed-off-by: Tom St Denis diff --git a/include/uapi/linux/pfkeyv2.h b/include/uapi/linux/pfkeyv2.h index 0b80c80..d61898e 100644 --- a/include/uapi/linux/pfkeyv2.h +++ b/include/uapi/linux/pfkeyv2.h @@ -296,6 +296,7 @@ struct sadb_x_kmaddress { #define SADB_X_AALG_SHA2_512HMAC 7 #define SADB_X_AALG_RIPEMD160HMAC 8 #define SADB_X_AALG_AES_XCBC_MAC 9 +#define SADB_X_AALG_AES_CMAC_MAC 10 #define SADB_X_AALG_NULL 251 /* kame */ #define SADB_AALG_MAX 251 Should these values be based on IANA assigned IPSEC AH transform identifiers? https://www.iana.org/assignments/isakmp-registry/isakmp-registry.xml#isakmp-registry-6 -Jussi
Re: [PATCH] crypto: fix FTBFS with ARM SHA1-asm and THUMB2_KERNEL
Quoting Jussi Kivilinna : Quoting Matt Sealey : This question is to the implementor/committer (Dave McCullough), how exactly did you measure the benchmark and can we reproduce it on some other ARM box? If it's long and laborious and not so important to test the IPsec tunnel use-case, what would be the simplest possible benchmark to see if the C vs. assembly version is faster for a particular ARM device? I can get hold of pretty much any Cortex-A8 or Cortex-A9 that matters, I have access to a Chromebook for A15, and maybe an i.MX27 or i.MX35 and a couple Marvell boards (ARMv6) if I set my mind to it... that much testing implies we find a pretty concise benchmark though with a fairly common kernel version we can spread around (i.MX, OMAP and the Chromebook, I can handle, the rest I'm a little wary of bothering to spend too much time on). I think that could cover a good swath of not-ARMv5 use cases from lower speeds to quad core monsters.. but I might stick to i.MX to start with.. There is 'tcrypt' module in crypto/ for quick benchmarking. 'modprobe tcrypt mode=500 sec=1' tests AES in various cipher-modes, using different buffer sizes and outputs results to kernel log. Actually mode=200 might be better, as mode=500 is for asynchronous implementations and might use hardware crypto if such device/module is available. -Jussi
Re: [PATCH] crypto: fix FTBFS with ARM SHA1-asm and THUMB2_KERNEL
Quoting Matt Sealey : This question is to the implementor/committer (Dave McCullough), how exactly did you measure the benchmark and can we reproduce it on some other ARM box? If it's long and laborious and not so important to test the IPsec tunnel use-case, what would be the simplest possible benchmark to see if the C vs. assembly version is faster for a particular ARM device? I can get hold of pretty much any Cortex-A8 or Cortex-A9 that matters, I have access to a Chromebook for A15, and maybe an i.MX27 or i.MX35 and a couple Marvell boards (ARMv6) if I set my mind to it... that much testing implies we find a pretty concise benchmark though with a fairly common kernel version we can spread around (i.MX, OMAP and the Chromebook, I can handle, the rest I'm a little wary of bothering to spend too much time on). I think that could cover a good swath of not-ARMv5 use cases from lower speeds to quad core monsters.. but I might stick to i.MX to start with.. There is 'tcrypt' module in crypto/ for quick benchmarking. 'modprobe tcrypt mode=500 sec=1' tests AES in various cipher-modes, using different buffer sizes and outputs results to kernel log. -Jussi
Re: [PATCH 2/2] Remove VLAIS usage from crypto/testmgr.c
Quoting Behan Webster :
> From: Jan-Simon Möller
>
> The use of variable length arrays in structs (VLAIS) in the Linux Kernel code precludes the use of compilers which don't implement VLAIS (for instance the Clang compiler). This patch instead allocates the appropriate amount of memory using a char array.
>
> Patch from series at http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20120507/142707.html by PaX Team.
>
> Signed-off-by: Jan-Simon Möller
> Cc: pagee...@freemail.hu
> Signed-off-by: Behan Webster
> ---
>  crypto/testmgr.c | 23 +--
>  1 file changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/crypto/testmgr.c b/crypto/testmgr.c
> index 941d75c..5b7b3a6 100644
> --- a/crypto/testmgr.c
> +++ b/crypto/testmgr.c
> @@ -1578,16 +1578,19 @@ static int alg_test_crc32c(const struct alg_test_desc *desc,
>  	}
>
>  	do {
> -		struct {
> -			struct shash_desc shash;
> -			char ctx[crypto_shash_descsize(tfm)];
> -		} sdesc;
> -
> -		sdesc.shash.tfm = tfm;
> -		sdesc.shash.flags = 0;
> -
> -		*(u32 *)sdesc.ctx = le32_to_cpu(420553207);
> -		err = crypto_shash_final(&sdesc.shash, (u8 *)&val);
> +		char sdesc[sizeof(struct shash_desc) +
> +			   crypto_shash_descsize(tfm) +
> +			   CRYPTO_MINALIGN] CRYPTO_MINALIGN_ATTR;
> +		struct shash_desc *shash = (struct shash_desc *)sdesc;
> +		u32 *ctx = (u32 *)((unsigned long)(sdesc +
> +			sizeof(struct shash_desc) + CRYPTO_MINALIGN - 1)
> +			& ~(CRYPTO_MINALIGN - 1));

I think you should use '(u32 *)shash_desc_ctx(shash)' instead of getting the ctx pointer manually.

> +
> +		shash->tfm = tfm;
> +		shash->flags = 0;
> +
> +		*ctx = le32_to_cpu(420553207);
> +		err = crypto_shash_final(shash, (u8 *)&val);
>  		if (err) {
>  			printk(KERN_ERR "alg: crc32c: Operation failed for "
>  			       "%s: %d\n", driver, err);
Re: Linux 3.6-rc5
Quoting Herbert Xu :
> On Sun, Sep 09, 2012 at 08:35:56AM -0700, Linus Torvalds wrote:
>> On Sun, Sep 9, 2012 at 5:54 AM, Jussi Kivilinna wrote:
>>> Does reverting e46e9a46386bca8e80a6467b5c643dc494861896 help?
>>>
>>> That commit added crypto selftest for authenc(hmac(sha1),cbc(aes)) in 3.6, and probably made this bug visible (but not directly causing it).
>>
>> So Romain said it does - where do we go from here? Revert testing it, or fix the authenc() case? I'd prefer the fix..
>
> I'm working on this right now. If we don't get anywhere in a couple of days we can revert the test vector patch.

It seems that authenc is chaining an empty assoc scatterlist, which sets off the BUG_ON(!sg->length) in crypto/scatterwalk.c. The following fixes the bug and the self-test passes, but I'm not sure if it's correct (note: copy-pasted into a 'broken' email client, so it most likely does not apply cleanly):

diff --git a/crypto/authenc.c b/crypto/authenc.c
index 5ef7ba6..2373af5 100644
--- a/crypto/authenc.c
+++ b/crypto/authenc.c
@@ -336,7 +336,7 @@ static int crypto_authenc_genicv(struct aead_request *req, u8 *iv,
 		cryptlen += ivsize;
 	}

-	if (sg_is_last(assoc)) {
+	if (req->assoclen > 0 && sg_is_last(assoc)) {
 		authenc_ahash_fn = crypto_authenc_ahash;
 		sg_init_table(asg, 2);
 		sg_set_page(asg, sg_page(assoc), assoc->length, assoc->offset);

Also, does crypto_authenc_iverify() need the same fix?

-Jussi

> Cheers,
> --
> Email: Herbert Xu  Home Page: http://gondor.apana.org.au/~herbert/  PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Linux 3.6-rc5
Quoting Herbert Xu :
> Can you try blacklisting/not loading sha1_ssse3 and aesni_intel to see which one of them is causing this crash? Of course, if you can still reproduce this without loading either of them, that would also be interesting to know.

This triggers with aes-x86_64 and sha1_generic (and sha256 & sha512) too, with the following test added to tcrypt:

	case 46:
		ret += tcrypt_test("authenc(hmac(sha1),cbc(aes))");
		ret += tcrypt_test("authenc(hmac(sha256),cbc(aes))");
		ret += tcrypt_test("authenc(hmac(sha512),cbc(aes))");
		break;

-Jussi

> Thanks,
> --
> Email: Herbert Xu  Home Page: http://gondor.apana.org.au/~herbert/  PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Linux 3.6-rc5
Quoting Romain Francoise :
> Still seeing this BUG with -rc5, that I originally reported here:
> http://marc.info/?l=linux-crypto-vger&m=134653220530264&w=2

Does reverting e46e9a46386bca8e80a6467b5c643dc494861896 help? That commit added crypto selftest for authenc(hmac(sha1),cbc(aes)) in 3.6, and probably made this bug visible (but not directly causing it).

-Jussi

[ 26.362567] [ cut here ]
[ 26.362583] kernel BUG at crypto/scatterwalk.c:37!
[ 26.362606] invalid opcode: [#1] SMP
[ 26.362622] Modules linked in: authenc xfrm6_mode_tunnel xfrm4_mode_tunnel cpufreq_conservative cpufreq_userspace cpufreq_powersave cpufreq_stats xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 binfmt_misc deflate zlib_deflate ctr twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_x86_64 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic glue_helper lrw xts gf128mul blowfish_generic blowfish_x86_64 blowfish_common cast5 des_generic cbc xcbc rmd160 sha512_generic sha1_ssse3 sha1_generic hmac crypto_null af_key xfrm_algo ip6table_filter ip6_tables xt_recent xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables hwmon_vid msr vhost_net macvtap macvlan tun loop bridge stp llc firewire_sbp2 fuse rc_dib0700_rc5 snd_hda_codec_hdmi dvb_usb_dib0700 dib7000m dib0090 dib8000 dib0070 dib7000p dib3000mc dibx000_common dvb_usb snd_hda_codec_realtek dvb_core rc_core snd_hda_intel snd_hda_codec snd_seq_midi snd_seq_midi_event snd_hwdep snd_pcm_oss snd_rawmidi snd_mixer_oss snd_seq snd_pcm snd_seq_device radeon snd_timer snd soundcore acpi_cpufreq mperf ttm processor drm_kms_helper thermal_sys mxm_wmi drm snd_page_alloc i2c_algo_bit i2c_i801 i2c_core lpc_ich button psmouse evdev coretemp serio_raw wmi pcspkr mei kvm_intel kvm ext4 crc16 jbd2 mbcache sha256_generic usb_storage uas dm_crypt dm_mod raid10 raid1 md_mod sg ata_generic sd_mod hid_generic crc_t10dif pata_marvell usbhid hid
crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic ablk_helper cryptd firewire_ohci microcode ahci firewire_core libahci crc_itu_t libata scsi_mod xhci_hcd ehci_hcd usbcore usb_common e1000e [ 26.371138] CPU 5 [ 26.371146] Pid: 3704, comm: cryptomgr_test Not tainted 3.6.0-rc5-ore #1 /DP67BG [ 26.374095] RIP: 0010:[] [] scatterwalk_start+0x11/0x20 [ 26.375598] RSP: 0018:88040d62b9d8 EFLAGS: 00010246 [ 26.377067] RAX: RBX: 88040b6e3868 RCX: 0014 [ 26.378567] RDX: 0020 RSI: 88040b6e3868 RDI: 88040d62b9e0 [ 26.380028] RBP: 0020 R08: 0001 R09: 88040b6e39a8 [ 26.381494] R10: a06b4000 R11: 88040b6e39fc R12: 0014 [ 26.383023] R13: 0001 R14: 88040b6e38f8 R15: [ 26.384488] FS: () GS:88041f54() knlGS: [ 26.385973] CS: 0010 DS: ES: CR0: 8005003b [ 26.387581] CR2: 7f54c282fbd0 CR3: 0180b000 CR4: 000407e0 [ 26.389057] DR0: DR1: DR2: [ 26.390547] DR3: DR6: 0ff0 DR7: 0400 [ 26.392015] Process cryptomgr_test (pid: 3704, threadinfo 88040d62a000, task 88040ca3ab60) [ 26.393558] Stack: [ 26.395103] 811d72fb 88040b6e3868 0020 88040b6e3868 [ 26.396622] 88040b6e3800 88040b6e3868 0020 88040b6e3868 [ 26.398163] 88040919f440 88040d62bcc8 a07edaa0 88040b6e3930 [ 26.399643] Call Trace: [ 26.401112] [] ? scatterwalk_map_and_copy+0x5b/0xd0 [ 26.402714] [] ? crypto_authenc_genicv+0xa0/0x300 [authenc] [ 26.404274] [] ? test_aead+0x58b/0xcd0 [ 26.406082] [] ? crypto_mod_get+0x10/0x30 [ 26.407704] [] ? crypto_alloc_base+0x53/0xb0 [ 26.409267] [] ? cryptd_alloc_ablkcipher+0x80/0xc0 [cryptd] [ 26.410838] [] ? __kmalloc+0x20d/0x250 [ 26.412364] [] ? crypto_spawn_tfm2+0x31/0x70 [ 26.413938] [] ? ablk_init_common+0x10/0x30 [ablk_helper] [ 26.415448] [] ? __crypto_alloc_tfm+0xf9/0x170 [ 26.416963] [] ? crypto_spawn_tfm+0x43/0x90 [ 26.418505] [] ? skcipher_geniv_init+0x1e/0x40 [ 26.420046] [] ? __crypto_alloc_tfm+0xf9/0x170 [ 26.421599] [] ? crypto_spawn_tfm+0x43/0x90 [ 26.423228] [] ? __kmalloc+0x20d/0x250 [ 26.424788] [] ? crypto_authenc_init_tfm+0x49/0xc0 [authenc] [ 26.426374] [] ? 
__crypto_alloc_tfm+0xf9/0x170 [ 26.427999] [] ? alg_test_aead+0x48/0xb0 [ 26.429781] [] ? alg_test+0xfe/0x310 [ 26.431503] [] ? __schedule+0x2ba/0x700 [ 26.433235] [] ? cryptomgr_probe+0xb0/0xb0 [ 26.434918] [] ? cryptomgr_test+0x38/0x40 [ 26.436524] [] ? kthread+0x85/0x90 [ 26.436526] [] ? kernel_thread_helper+0x4/0x10 [
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Borislav Petkov :
> On Wed, Aug 22, 2012 at 10:20:03PM +0300, Jussi Kivilinna wrote:
>> Actually it does look better, at least for encryption. Decryption had a different ordering for the test, which appears to be as bad on bulldozer as it is on sandy-bridge. So, yet another patch then :)
>
> Here you go:

Thanks! With this patch twofish-avx is faster than twofish-3way for the 256, 1k and 8k tests.

size    old-vs-new      new-vs-3way     old-vs-3way
        ecb-enc ecb-dec ecb-enc ecb-dec ecb-enc ecb-dec
256     1.10x   1.11x   1.01x   1.01x   0.92x   0.91x
1k      1.11x   1.12x   1.08x   1.07x   0.97x   0.96x
8k      1.11x   1.13x   1.10x   1.08x   0.99x   0.97x

-Jussi

[ 153.736745]
[ 153.736745] testing speed of async ecb(twofish) encryption
[ 153.745806] test 0 (128 bit key, 16 byte blocks): 4832343 operations in 1 seconds (77317488 bytes)
[ 154.752525] test 1 (128 bit key, 64 byte blocks): 2049979 operations in 1 seconds (131198656 bytes)
[ 155.755195] test 2 (128 bit key, 256 byte blocks): 620439 operations in 1 seconds (158832384 bytes)
[ 156.761694] test 3 (128 bit key, 1024 byte blocks): 173900 operations in 1 seconds (178073600 bytes)
[ 157.768282] test 4 (128 bit key, 8192 byte blocks): 22366 operations in 1 seconds (183222272 bytes)
[ 158.774815] test 5 (192 bit key, 16 byte blocks): 4850741 operations in 1 seconds (77611856 bytes)
[ 159.781498] test 6 (192 bit key, 64 byte blocks): 2046772 operations in 1 seconds (130993408 bytes)
[ 160.788163] test 7 (192 bit key, 256 byte blocks): 619915 operations in 1 seconds (158698240 bytes)
[ 161.794636] test 8 (192 bit key, 1024 byte blocks): 173442 operations in 1 seconds (177604608 bytes)
[ 162.801242] test 9 (192 bit key, 8192 byte blocks): 22083 operations in 1 seconds (180903936 bytes)
[ 163.807793] test 10 (256 bit key, 16 byte blocks): 4862951 operations in 1 seconds (77807216 bytes)
[ 164.814449] test 11 (256 bit key, 64 byte blocks): 2050036 operations in 1 seconds (131202304 bytes)
[ 165.821121] test 12 (256 bit key, 256 byte blocks): 620349 operations in 1 seconds (158809344 bytes)
[ 166.827621] test 13 (256 bit key, 1024 byte blocks): 173917 operations in 1 seconds (178091008 bytes)
[ 167.834218] test 14 (256 bit key, 8192 byte blocks): 22362 operations in 1 seconds (183189504 bytes)
[ 168.840798]
[ 168.840798] testing speed of async ecb(twofish) decryption
[ 168.849968] test 0 (128 bit key, 16 byte blocks): 4889899 operations in 1 seconds (78238384 bytes)
[ 169.855439] test 1 (128 bit key, 64 byte blocks): 2052293 operations in 1 seconds (131346752 bytes)
[ 170.862113] test 2 (128 bit key, 256 byte blocks): 616979 operations in 1 seconds (157946624 bytes)
[ 171.868631] test 3 (128 bit key, 1024 byte blocks): 172773 operations in 1 seconds (176919552 bytes)
[ 172.875244] test 4 (128 bit key, 8192 byte blocks): 22224 operations in 1 seconds (182059008 bytes)
[ 173.881777] test 5 (192 bit key, 16 byte blocks): 4893653 operations in 1 seconds (78298448 bytes)
[ 174.888451] test 6 (192 bit key, 64 byte blocks): 2048078 operations in 1 seconds (131076992 bytes)
[ 175.895131] test 7 (192 bit key, 256 byte blocks): 619204 operations in 1 seconds (158516224 bytes)
[ 176.901651] test 8 (192 bit key, 1024 byte blocks): 172569 operations in 1 seconds (176710656 bytes)
[ 177.908253] test 9 (192 bit key, 8192 byte blocks): 21888 operations in 1 seconds (179306496 bytes)
[ 178.914781] test 10 (256 bit key, 16 byte blocks): 4921751 operations in 1 seconds (78748016 bytes)
[ 179.917481] test 11 (256 bit key, 64 byte blocks): 2051219 operations in 1 seconds (131278016 bytes)
[ 180.920147] test 12 (256 bit key, 256 byte blocks): 618536 operations in 1 seconds (158345216 bytes)
[ 181.926637] test 13 (256 bit key, 1024 byte blocks): 172886 operations in 1 seconds (177035264 bytes)
[ 182.933249] test 14 (256 bit key, 8192 byte blocks): 22222 operations in 1 seconds (182042624 bytes)
[ 183.939803]
[ 183.939803] testing speed of async cbc(twofish) encryption
[ 183.953902] test 0 (128 bit key, 16 byte blocks): 5195403 operations in 1 seconds (83126448 bytes)
[ 184.962487] test 1 (128 bit key, 64 byte blocks): 1912010 operations in 1 seconds (122368640 bytes)
[ 185.969150] test 2 (128 bit key, 256 byte blocks): 540125 operations in 1 seconds (138272000 bytes)
[ 186.975650] test 3 (128 bit key, 1024 byte blocks): 140631 operations in 1 seconds (144006144 bytes)
[ 187.982411] test 4 (128 bit key, 8192 byte blocks): 17737 operations in 1 seconds (145301504 bytes)
[ 188.988782] test 5 (192 bit key, 16 byte blocks): 5182287 operations in 1 seconds (82916592 bytes)
[ 189.995435] test 6 (192 bit key, 64 byte blocks): 1912356 operations in 1 seconds (122390784 bytes)
[ 191.002093] test 7 (192 bit key, 256 byte blocks): 540991 operations in 1 seconds (138493696 bytes)
[ 192.008600] test 8 (192 bit key, 1024 byte blocks): 140791 operations in 1
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Jason Garrett-Glaser : On Wed, Aug 22, 2012 at 12:20 PM, Jussi Kivilinna wrote: Quoting Borislav Petkov : On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote: Looks like encryption lost ~0.4% while decryption gained ~1.8%. For the 256 byte test, it's still slightly slower than twofish-3way (~3%). For the 1k and 8k tests, it's ~5% faster. Here's the very last test-patch, testing different ordering of fpu<->cpu reg instructions at a few places. Hehe, I don't mind testing patches, no worries there. Here are the results this time; doesn't look better than the last run, AFAICT. Actually it does look better, at least for encryption. Decryption had a different ordering for the test, which appears to be as bad on bulldozer as it is on sandy-bridge. So, yet another patch then :) Interleaving at some new places (reordered lookup_32bit()s in the G-macro) and doing one of the round rotations one round ahead. Also introduces some more parallelism inside lookup_32bit. Outsider looking in here, but avoiding the 256-way lookup tables entirely might be faster. Looking at the twofish code, one byte-wise calculation looks like this:

a0 = x >> 4; b0 = x & 15;
a1 = a0 ^ b0; b1 = ror4[b0] ^ ashx[a0];
a2 = qt0[n][a1]; b2 = qt1[n][b1];
a3 = a2 ^ b2; b3 = ror4[b2] ^ ashx[a2];
a4 = qt2[n][a3]; b4 = qt3[n][b3];
return (b4 << 4) | a4;

This means that you can do something like this pseudocode (Intel syntax). pshufb on ymm registers is AVX2, but splitting it into xmm operations would probably be fine (as would using this for just a pure SSE implementation!). On AVX2 you'd have to double the tables for both ways, naturally.

constants:
pb_0x0f = {0x0f, 0x0f, 0x0f ...}
ashx: lookup table
ror4: lookup table
qt0[n]: lookup table
qt1[n]: lookup table
qt2[n]: lookup table
qt3[n]: lookup table

vpand   b0, in, pb_0x0f
vpsrlw  a0, in, 4
vpand   a0, a0, pb_0x0f  ; effectively vpsrlb, but that doesn't exist
vpxor   a1, a0, b0
vpshufb a0, ashx, a0
vpshufb b0, ror4, b0
vpxor   b1, a0, b0
vpshufb a2, qt0[n], a1
vpshufb b2, qt1[n], b1
vpxor   a3, a2, b2
vpshufb a3, ashx, a2
vpshufb b3, ror4, b2
vpxor   b3, a2, b2
vpshufb a4, qt2[n], a3
vpshufb b4, qt3[n], b3
vpsllw  b4, b4, 4        ; effectively vpsllb, but that doesn't exist
vpor    out, a4, b4

That's 15 instructions (plus maybe a move or two) to do 16 lookups for SSE (~9 cycles by my guessing on a Nehalem). AVX would run into the problem of lots of extra vinsert/vextract (just going 16-byte might be better, might be not, depending on execution units). AVX2 would be super fast (15 for 32). If this works, this could be quite a bit faster than the table-based approach. The above would implement the twofish permutations q0 and q1? For a byte-sliced implementation you would need 8 parallel blocks (16-byte registers, two parallel h-functions per round, 16/2). In this setup, for the double h-function, you need 12 q0/1 operations (for a 128-bit key; for 192-bit: 16, for 256-bit: 20), plus 8 key material xors (for 192-bit: 12, 256-bit: 16) and the MDS matrix multiplication (a lot more than 15 instructions, I'd think). We do 16 rounds, so that gives us ((12*15+8+15)*16)/(8*16) ≈ 25.4 instructions/byte. Usually I get ~2.5 instructions/cycle for pure SSE2, so that's ~10 cycles/byte. After that we have the PHT phase. But now the problem is that the PHT uses 32-bit additions, so we either move between byte-sliced and dword-sliced modes here or carry addition over bytes. After the PHT there is a 32-bit addition with key material and 32-bit rotations. I don't think this is going to work. For AVX2, vpgatherdd is going to speed up 32-bit lookups anyway.
-Jussi Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Borislav Petkov : > On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote: >> Looks that encryption lost ~0.4% while decryption gained ~1.8%. >> >> For 256 byte test, it's still slightly slower than twofish-3way >> (~3%). For 1k >> and 8k tests, it's ~5% faster. >> >> Here's very last test-patch, testing different ordering of fpu<->cpu reg >> instructions at few places. > > Hehe, > > I don't mind testing patches, no worries there. Here are the results > this time, doesn't look better than the last run, AFAICT. > Actually it does look better, at least for encryption. Decryption had different ordering for test, which appears to be bad on bulldozer as it is on sandy-bridge. So, yet another patch then :) Interleaving at some new places (reordered lookup_32bit()s in G-macro) and doing one of the round rotations one round ahead. Also introduces some more paralellism inside lookup_32bit. --- arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 227 +-- 1 file changed, 142 insertions(+), 85 deletions(-) diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S index 35f4557..1585abb 100644 --- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S +++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S @@ -4,6 +4,8 @@ * Copyright (C) 2012 Johannes Goetzfried * * + * Copyright © 2012 Jussi Kivilinna + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -47,16 +49,22 @@ #define RC2 %xmm6 #define RD2 %xmm7 -#define RX %xmm8 -#define RY %xmm9 +#define RX0 %xmm8 +#define RY0 %xmm9 + +#define RX1 %xmm10 +#define RY1 %xmm11 -#define RK1 %xmm10 -#define RK2 %xmm11 +#define RK1 %xmm12 +#define RK2 %xmm13 -#define RID1 %rax -#define RID1b %al -#define RID2 %rbx -#define RID2b %bl +#define RT %xmm14 +#define RR %xmm15 + +#define RID1 %rbp +#define RID1d %ebp +#define RID2 %rsi 
+#define RID2d %esi #define RGI1 %rdx #define RGI1bl %dl @@ -65,6 +73,13 @@ #define RGI2bl %cl #define RGI2bh %ch +#define RGI3 %rax +#define RGI3bl %al +#define RGI3bh %ah +#define RGI4 %rbx +#define RGI4bl %bl +#define RGI4bh %bh + #define RGS1 %r8 #define RGS1d %r8d #define RGS2 %r9 @@ -73,89 +88,123 @@ #define RGS3d %r10d -#define lookup_32bit(t0, t1, t2, t3, src, dst) \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ - movlt0(CTX, RID1, 4), dst ## d; \ - xorlt1(CTX, RID2, 4), dst ## d; \ +#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \ + movzbl src ## bl,RID1d; \ + movzbl src ## bh,RID2d; \ shrq $16, src; \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ + movlt0(CTX, RID1, 4), dst ## d; \ + movlt1(CTX, RID2, 4), RID2d; \ + movzbl src ## bl,RID1d; \ + xorlRID2d,dst ## d; \ + movzbl src ## bh,RID2d; \ + interleave_op(il_reg); \ xorlt2(CTX, RID1, 4), dst ## d; \ xorlt3(CTX, RID2, 4), dst ## d; -#define G(a, x, t0, t1, t2, t3) \ - vmovq a,RGI1; \ - vpsrldq $8, a,x; \ - vmovq x,RGI2; \ +#define dummy(d) /* do nothing */ + +#define shr_next(reg) \ + shrq $16, reg; + +#define G(gi1, gi2, x, t0, t1, t2, t3) \ + lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1); \ + lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2); \ + \ + lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none); \ + shlq $32, RGS2;\ + orq RGS1, RGS2; \ + lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none); \ + shlq $32, RGS1;\ + orq RGS1, RGS3; + +#define round_head_2(a, b, x1, y1, x2, y2) \ + vmovq b ## 1, RGI3; \ + vpextrq $1, b ## 1, RGI4; \ \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \ - shrq $16, RGI1; \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \ - shlq $32, RGS2; \ - orq RGS1, RGS2; \ + G(RGI1, RGI2, x1, s0, s1, s2, s3); \ + vmovq a ## 2, RGI1; \ + vpextrq $1, a ## 2, RGI2; \ + vmo
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Borislav Petkov : > > Here you go: > > [ 52.282208] > [ 52.282208] testing speed of async ecb(twofish) encryption Thanks! Looks that encryption lost ~0.4% while decryption gained ~1.8%. For 256 byte test, it's still slightly slower than twofish-3way (~3%). For 1k and 8k tests, it's ~5% faster. Here's very last test-patch, testing different ordering of fpu<->cpu reg instructions at few places. --- arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 232 ++- 1 file changed, 154 insertions(+), 78 deletions(-) diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S index 35f4557..693963a 100644 --- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S +++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S @@ -4,6 +4,8 @@ * Copyright (C) 2012 Johannes Goetzfried * * + * Copyright © 2012 Jussi Kivilinna + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -47,16 +49,21 @@ #define RC2 %xmm6 #define RD2 %xmm7 -#define RX %xmm8 -#define RY %xmm9 +#define RX0 %xmm8 +#define RY0 %xmm9 + +#define RX1 %xmm10 +#define RY1 %xmm11 + +#define RK1 %xmm12 +#define RK2 %xmm13 -#define RK1 %xmm10 -#define RK2 %xmm11 +#define RT %xmm14 -#define RID1 %rax -#define RID1b %al -#define RID2 %rbx -#define RID2b %bl +#define RID1 %rbp +#define RID1d %ebp +#define RID2 %rsi +#define RID2d %esi #define RGI1 %rdx #define RGI1bl %dl @@ -65,6 +72,13 @@ #define RGI2bl %cl #define RGI2bh %ch +#define RGI3 %rax +#define RGI3bl %al +#define RGI3bh %ah +#define RGI4 %rbx +#define RGI4bl %bl +#define RGI4bh %bh + #define RGS1 %r8 #define RGS1d %r8d #define RGS2 %r9 @@ -73,40 +87,58 @@ #define RGS3d %r10d -#define lookup_32bit(t0, t1, t2, t3, src, dst) \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ +#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \ + movzbl src ## bl,RID1d; \ + 
movzbl src ## bh,RID2d; \ + shrq $16, src; \ movlt0(CTX, RID1, 4), dst ## d; \ xorlt1(CTX, RID2, 4), dst ## d; \ - shrq $16, src; \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ + movzbl src ## bl,RID1d; \ + movzbl src ## bh,RID2d; \ + interleave_op(il_reg); \ xorlt2(CTX, RID1, 4), dst ## d; \ xorlt3(CTX, RID2, 4), dst ## d; -#define G(a, x, t0, t1, t2, t3) \ - vmovq a,RGI1; \ - vpsrldq $8, a,x; \ - vmovq x,RGI2; \ - \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \ - shrq $16, RGI1; \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \ - shlq $32, RGS2; \ - orq RGS1, RGS2; \ - \ - lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \ - shrq $16, RGI2; \ - lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \ - shlq $32, RGS3; \ - orq RGS1, RGS3; \ - \ - vmovq RGS2, x; \ - vpinsrq $1, RGS3, x, x; +#define dummy(d) /* do nothing */ -#define encround(a, b, c, d, x, y) \ - G(a, x, s0, s1, s2, s3); \ - G(b, y, s1, s2, s3, s0); \ +#define shr_next(reg) \ + shrq $16, reg; + +#define G_enc(gi1, gi2, x, t0, t1, t2, t3) \ + lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1); \ + lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none); \ + shlq $32, RGS2;\ + orq RGS1, RGS2; \ + \ + lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2); \ + lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none); \ + shlq $32, RGS1;\ + orq RGS1, RGS3; + +#define encround_head_2(a, b, c, d, x1, y1, x2, y2) \ + vmovq b ## 1, RGI3; \ + vpextrq $1, b ## 1, RGI4; \ + G_enc(RGI1, RGI2, x1, s0, s1, s2, s3); \ + vmovq a ## 2, RGI1; \ + vpextrq $1, a ## 2, RGI2; \ + vmovq RGS2, x1; \ + vpinsrq $1, RGS3, x1, x1; \ + G_enc(RGI3, RGI4, y1, s1, s2, s3, s0); \ + vmovq
Re: on stack dynamic allocations
Quoting David Daney : On 08/16/2012 02:20 PM, Kasatkin, Dmitry wrote: Hello, Some places in the code use variable-size allocation on the stack. For example, from hmac_setkey():

struct {
	struct shash_desc shash;
	char ctx[crypto_shash_descsize(hash)];
} desc;

sparse complains:

  CHECK   crypto/hmac.c
crypto/hmac.c:57:47: error: bad constant expression

I like it better than kmalloc. But what is the kernel community's position on it? If you know that the range of crypto_shash_descsize(hash) is bounded, just use the upper bound. If the range of crypto_shash_descsize(hash) is unbounded, then the stack will overflow and ... BOOM! A quick look shows that the largest crypto_shash_descsize() would be with hmac+s390/sha512, 16 + 332 = 348. The crypto API also prevents registering a shash with descsize larger than (PAGE_SIZE / 8).

-Jussi
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Borislav Petkov : > > Yep, looks better than the previous run and also a bit better or on par > with the initial run I did. > I made few further changes, mainly moving/interleaving 'vmovq/vpextrq' ahead so they should be completed before those target registers are needed. This only gave 0.5% increase on Sandy-bridge, but might help more on Bulldozer. -Jussi --- arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 205 +-- 1 file changed, 130 insertions(+), 75 deletions(-) diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S index 35f4557..6638a87 100644 --- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S +++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S @@ -4,6 +4,8 @@ * Copyright (C) 2012 Johannes Goetzfried * * + * Copyright © 2012 Jussi Kivilinna + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -47,16 +49,21 @@ #define RC2 %xmm6 #define RD2 %xmm7 -#define RX %xmm8 -#define RY %xmm9 +#define RX0 %xmm8 +#define RY0 %xmm9 + +#define RX1 %xmm10 +#define RY1 %xmm11 -#define RK1 %xmm10 -#define RK2 %xmm11 +#define RK1 %xmm12 +#define RK2 %xmm13 -#define RID1 %rax -#define RID1b %al -#define RID2 %rbx -#define RID2b %bl +#define RT %xmm14 + +#define RID1 %rbp +#define RID1d %ebp +#define RID2 %rsi +#define RID2d %esi #define RGI1 %rdx #define RGI1bl %dl @@ -65,6 +72,13 @@ #define RGI2bl %cl #define RGI2bh %ch +#define RGI3 %rax +#define RGI3bl %al +#define RGI3bh %ah +#define RGI4 %rbx +#define RGI4bl %bl +#define RGI4bh %bh + #define RGS1 %r8 #define RGS1d %r8d #define RGS2 %r9 @@ -73,40 +87,53 @@ #define RGS3d %r10d -#define lookup_32bit(t0, t1, t2, t3, src, dst) \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ +#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \ + movzbl src ## bl,RID1d; \ + movzbl src ## bh,RID2d; \ + shrq $16, 
src; \ movlt0(CTX, RID1, 4), dst ## d; \ xorlt1(CTX, RID2, 4), dst ## d; \ - shrq $16, src; \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ + movzbl src ## bl,RID1d; \ + movzbl src ## bh,RID2d; \ + interleave_op(il_reg); \ xorlt2(CTX, RID1, 4), dst ## d; \ xorlt3(CTX, RID2, 4), dst ## d; -#define G(a, x, t0, t1, t2, t3) \ - vmovq a,RGI1; \ - vpsrldq $8, a,x; \ - vmovq x,RGI2; \ - \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \ - shrq $16, RGI1; \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \ - shlq $32, RGS2; \ - orq RGS1, RGS2; \ +#define dummy(d) /* do nothing */ + +#define shr_next(reg) \ + shrq $16, reg; + +#define G(gi1, gi2, x, t0, t1, t2, t3) \ + lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1); \ + lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none); \ + shlq $32, RGS2;\ + orq RGS1, RGS2; \ \ - lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \ - shrq $16, RGI2; \ - lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \ - shlq $32, RGS3; \ - orq RGS1, RGS3; \ + lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2); \ + lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none); \ + shlq $32, RGS1;\ + orq RGS1, RGS3; \ \ - vmovq RGS2, x; \ + vmovq RGS2, x; \ vpinsrq $1, RGS3, x, x; -#define encround(a, b, c, d, x, y) \ - G(a, x, s0, s1, s2, s3); \ - G(b, y, s1, s2, s3, s0); \ +#define encround_head_2(a, b, c, d, x1, y1, x2, y2) \ + vmovq b ## 1, RGI3; \ + vpextrq $1, b ## 1, RGI4; \ + G(RGI1, RGI2, x1, s0, s1, s2, s3); \ + vmovq a ## 2, RGI1; \ + vpextrq $1, a ## 2, RGI2; \ + G(RGI3, RGI4, y1, s1, s2, s3, s0); \ + vmovq b ## 2, RGI3; \ + vpextrq $1, b ## 2, RGI4; \ +
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Borislav Petkov : On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote: About ~5% slower, probably because I was tuning for sandy-bridge and introduced more FPU<=>CPU register moves. Here's a new version of the patch, with the FPU<=>CPU moves from the original implementation. (Note: it also changes the encryption function to inline all code into the main function; decryption still places common code in a separate function to reduce object size. This is to measure the difference.) Yep, looks better than the previous run and also a bit better or on par with the initial run I did. Thanks again. Speed gained with the patch is ~8%, which is enough to get twofish-avx past twofish-3way. The thing is, I'm not sure whether optimizing the thing for each uarch is a workable solution software-wise, or whether having a single version which performs sufficiently ok on all uarches is easier/better to maintain without causing code bloat. Hmmm... Agreed, testing on multiple CPUs to get a single well-working version is what I have done in the past. But purchasing all the latest CPUs on the market isn't an option for me, and for testing AVX I'm stuck with sandy-bridge :) -Jussi 4th: ran like 1st.
[ 1014.074150] [ 1014.074150] testing speed of async ecb(twofish) encryption [ 1014.083829] test 0 (128 bit key, 16 byte blocks): 4870055 operations in 1 seconds (77920880 bytes) [ 1015.092757] test 1 (128 bit key, 64 byte blocks): 2043828 operations in 1 seconds (130804992 bytes) [ 1016.099441] test 2 (128 bit key, 256 byte blocks): 606400 operations in 1 seconds (155238400 bytes) [ 1017.105939] test 3 (128 bit key, 1024 byte blocks): 168939 operations in 1 seconds (172993536 bytes) [ 1018.112517] test 4 (128 bit key, 8192 byte blocks): 21777 operations in 1 seconds (178397184 bytes) [ 1019.119035] test 5 (192 bit key, 16 byte blocks): 4882254 operations in 1 seconds (78116064 bytes) [ 1020.125716] test 6 (192 bit key, 64 byte blocks): 2043230 operations in 1 seconds (130766720 bytes) [ 1021.132391] test 7 (192 bit key, 256 byte blocks): 607477 operations in 1 seconds (155514112 bytes) [ 1022.138889] test 8 (192 bit key, 1024 byte blocks): 168743 operations in 1 seconds (172792832 bytes) [ 1023.145476] test 9 (192 bit key, 8192 byte blocks): 21442 operations in 1 seconds (175652864 bytes) [ 1024.152012] test 10 (256 bit key, 16 byte blocks): 4891863 operations in 1 seconds (78269808 bytes) [ 1025.158684] test 11 (256 bit key, 64 byte blocks): 2049390 operations in 1 seconds (131160960 bytes) [ 1026.165366] test 12 (256 bit key, 256 byte blocks): 606847 operations in 1 seconds (155352832 bytes) [ 1027.171841] test 13 (256 bit key, 1024 byte blocks): 169228 operations in 1 seconds (173289472 bytes) [ 1028.178436] test 14 (256 bit key, 8192 byte blocks): 21773 operations in 1 seconds (178364416 bytes) [ 1029.184981] [ 1029.184981] testing speed of async ecb(twofish) decryption [ 1029.194508] test 0 (128 bit key, 16 byte blocks): 4931065 operations in 1 seconds (78897040 bytes) [ 1030.199640] test 1 (128 bit key, 64 byte blocks): 2056931 operations in 1 seconds (131643584 bytes) [ 1031.206303] test 2 (128 bit key, 256 byte blocks): 589409 operations in 1 seconds 
(150888704 bytes) [ 1032.212832] test 3 (128 bit key, 1024 byte blocks): 163681 operations in 1 seconds (167609344 bytes) [ 1033.219443] test 4 (128 bit key, 8192 byte blocks): 21062 operations in 1 seconds (172539904 bytes) [ 1034.225979] test 5 (192 bit key, 16 byte blocks): 4931537 operations in 1 seconds (78904592 bytes) [ 1035.232608] test 6 (192 bit key, 64 byte blocks): 2053989 operations in 1 seconds (131455296 bytes) [ 1036.239289] test 7 (192 bit key, 256 byte blocks): 589591 operations in 1 seconds (150935296 bytes) [ 1037.241784] test 8 (192 bit key, 1024 byte blocks): 163565 operations in 1 seconds (167490560 bytes) [ 1038.244387] test 9 (192 bit key, 8192 byte blocks): 20899 operations in 1 seconds (171204608 bytes) [ 1039.250923] test 10 (256 bit key, 16 byte blocks): 4937343 operations in 1 seconds (78997488 bytes) [ 1040.257589] test 11 (256 bit key, 64 byte blocks): 2050678 operations in 1 seconds (131243392 bytes) [ 1041.264262] test 12 (256 bit key, 256 byte blocks): 586869 operations in 1 seconds (150238464 bytes) [ 1042.270753] test 13 (256 bit key, 1024 byte blocks): 163548 operations in 1 seconds (167473152 bytes) [ 1043.277365] test 14 (256 bit key, 8192 byte blocks): 21053 operations in 1 seconds (172466176 bytes) [ 1044.283892] [ 1044.283892] testing speed of async cbc(twofish) encryption [ 1044.293349] test 0 (128 bit key, 16 byte blocks): 5186240 operations in 1 seconds (82979840 bytes) [ 1045.298534] test 1 (128 bit key, 64 byte blocks): 1921034 operations in 1 seconds (122946176 bytes) [ 1046.305207] test 2 (128 bit key, 256 byte blocks): 542787 operations in 1 seconds (138953472 bytes) [ 1047.311699] test 3 (128 bit key, 1024 byte blocks): 141399 ope
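For reference, the byte figures tcrypt prints are simply operations times block size over the one-second window. A small sanity check (helper names are hypothetical, not part of tcrypt) against lines from the log above:

```c
#include <stdint.h>

/* tcrypt reports "N operations in 1 seconds (B bytes)"; B = N * blocksize */
uint64_t tcrypt_bytes(uint64_t ops, unsigned int blocksize)
{
	return ops * blocksize;
}

/* throughput in whole megabytes per second for a 1-second run */
uint64_t tcrypt_mb_per_sec(uint64_t ops, unsigned int blocksize)
{
	return tcrypt_bytes(ops, blocksize) / (1000 * 1000);
}
```

For example, test 4 above (21777 operations of 8192-byte blocks) works out to 178397184 bytes, matching the printed figure.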
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Borislav Petkov : > On Wed, Aug 15, 2012 at 05:22:03PM +0300, Jussi Kivilinna wrote: > >> Patch replaces 'movb' instructions with 'movzbl' to break false >> register dependencies and interleaves instructions better for >> out-of-order scheduling. >> >> Also move common round code to separate function to reduce object >> size. > > Ok, redid the first test > Thanks. > $ modprobe twofish-avx-x86_64 > $ modprobe tcrypt mode=504 sec=1 > > and from quickly juxtaposing the two results, I'd say the patch makes > things slightly worse but you'd need to run your scripts on it to get > the accurate results: > About ~5% slower, probably because I was tuning for sandy-bridge and introduced more FPU<=>CPU register moves. Here's new version of patch, with FPU<=>CPU moves from original implementation. (Note: also changes encryption function to inline all code in to main function, decryption still places common code to separate function to reduce object size. This is to measure the difference.) -Jussi --- arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 124 +-- 1 file changed, 77 insertions(+), 47 deletions(-) diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S index 35f4557..d331ab8 100644 --- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S +++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S @@ -47,15 +47,22 @@ #define RC2 %xmm6 #define RD2 %xmm7 -#define RX %xmm8 -#define RY %xmm9 +#define RX0 %xmm8 +#define RY0 %xmm9 -#define RK1 %xmm10 -#define RK2 %xmm11 +#define RX1 %xmm10 +#define RY1 %xmm11 + +#define RK1 %xmm12 +#define RK2 %xmm13 + +#define RT %xmm14 #define RID1 %rax +#define RID1d %eax #define RID1b %al #define RID2 %rbx +#define RID2d %ebx #define RID2b %bl #define RGI1 %rdx @@ -73,40 +80,48 @@ #define RGS3d %r10d -#define lookup_32bit(t0, t1, t2, t3, src, dst) \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ +#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \ + movzbl src ## bl,RID1d; \ + 
movzbl src ## bh,RID2d; \ + shrq $16, src; \ movlt0(CTX, RID1, 4), dst ## d; \ xorlt1(CTX, RID2, 4), dst ## d; \ - shrq $16, src; \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ + movzbl src ## bl,RID1d; \ + movzbl src ## bh,RID2d; \ + interleave_op(il_reg); \ xorlt2(CTX, RID1, 4), dst ## d; \ xorlt3(CTX, RID2, 4), dst ## d; +#define dummy(d) /* do nothing */ + +#define shr_next(reg) \ + shrq $16, reg; + #define G(a, x, t0, t1, t2, t3) \ - vmovq a,RGI1; \ - vpsrldq $8, a,x; \ - vmovq x,RGI2; \ + vmovq a, RGI1; \ + vpextrq $1, a, RGI2; \ \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \ - shrq $16, RGI1; \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \ + lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \ + lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \ shlq $32, RGS2; \ orq RGS1, RGS2; \ \ - lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \ - shrq $16, RGI2; \ - lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \ - shlq $32, RGS3; \ + lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, shr_next, RGI2); \ + lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, dummy, none); \ + shlq $32, RGS1; \ orq RGS1, RGS3; \ \ vmovq RGS2, x; \ vpinsrq $1, RGS3, x, x; -#define encround(a, b, c, d, x, y) \ - G(a, x, s0, s1, s2, s3); \ - G(b, y, s1, s2, s3, s0); \ +#define encround_g1g2(a, b, c, d, x, y) \ + G(a, x, s0, s1, s2, s3); \ + G(b, y, s1, s2, s3, s0); + +#define encround_end(a, b, c, d, x, y) \ + vpslld $1, d, RT; \ + vpsrld $(32 - 1), d, d; \ + vpord, RT, d; \ vpaddd x, y, x; \ vpaddd y, x, y; \ vpaddd x, RK1, x; \ @@ -115,14 +130,16 @@ vpsrld $1, c, x; \ vpslld $(32 - 1), c, c; \ vporc, x, c; \ - vpslld $1, d, x; \ -
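In C terms, the lookup_32bit macro computes one 32-bit word from four byte-indexed table lookups while shifting the source register along. A hedged sketch of the data flow (table contents and function names are illustrative; the real tables live in the Twofish context):

```c
#include <stdint.h>

/* C model of lookup_32bit: consume four low bytes of *src, combining
 * four 8->32-bit table lookups; *src is shifted as the asm's shr_next
 * interleaving does, so repeated calls walk the 64-bit register */
uint32_t lookup_32bit(const uint32_t t0[256], const uint32_t t1[256],
		      const uint32_t t2[256], const uint32_t t3[256],
		      uint64_t *src)
{
	uint32_t dst;

	dst  = t0[*src & 0xff];          /* movl t0(CTX, RID1, 4) */
	dst ^= t1[(*src >> 8) & 0xff];   /* xorl t1(CTX, RID2, 4) */
	*src >>= 16;                     /* shrq $16, src */
	dst ^= t2[*src & 0xff];
	dst ^= t3[(*src >> 8) & 0xff];
	*src >>= 16;                     /* shr_next, for the following call */
	return dst;
}

/* self-check with identity tables: the result reduces to XOR of the bytes */
uint32_t lookup_32bit_demo(void)
{
	static uint32_t id[256];
	uint64_t src = 0x04030201;
	int i;

	for (i = 0; i < 256; i++)
		id[i] = (uint32_t)i;
	return lookup_32bit(id, id, id, id, &src); /* 1 ^ 2 ^ 3 ^ 4 = 4 */
}
```

The movzbl change in the patch affects only how the byte extraction is encoded (zero-extending into a full 32-bit register instead of writing a partial one), not this data flow.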
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
> On Wed, Aug 15, 2012 at 04:48:54PM +0300, Jussi Kivilinna wrote: > > I posted patch that optimize twofish-avx few weeks ago: > > http://marc.info/?l=linux-crypto-vger=134364845024825=2 > > > > I'd be interested to know, if this is patch helps on Bulldozer. > > Sure, can you inline it here too please. The "Download message RAW" link > on marc.info gives me a diff but patch says: > > patching file arch/x86/crypto/twofish-avx-x86_64-asm_64.S > patch unexpectedly ends in middle of line > > Thanks. Here... Patch replaces 'movb' instructions with 'movzbl' to break false register dependencies and interleaves instructions better for out-of-order scheduling. Also move common round code to separate function to reduce object size. Tested on Core i5-2450M. --- arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 144 +-- 1 file changed, 92 insertions(+), 52 deletions(-) diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S index 35f4557..42b27b7 100644 --- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S +++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S @@ -47,15 +47,22 @@ #define RC2 %xmm6 #define RD2 %xmm7 -#define RX %xmm8 -#define RY %xmm9 +#define RX0 %xmm8 +#define RY0 %xmm9 -#define RK1 %xmm10 -#define RK2 %xmm11 +#define RX1 %xmm10 +#define RY1 %xmm11 + +#define RK1 %xmm12 +#define RK2 %xmm13 + +#define RT %xmm14 #define RID1 %rax +#define RID1d %eax #define RID1b %al #define RID2 %rbx +#define RID2d %ebx #define RID2b %bl #define RGI1 %rdx @@ -73,40 +80,45 @@ #define RGS3d %r10d -#define lookup_32bit(t0, t1, t2, t3, src, dst) \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ +#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \ + movzbl src ## bl,RID1d; \ + movzbl src ## bh,RID2d; \ + shrq $16, src; \ movlt0(CTX, RID1, 4), dst ## d; \ xorlt1(CTX, RID2, 4), dst ## d; \ - shrq $16, src; \ - movbsrc ## bl,RID1b; \ - movbsrc ## bh,RID2b; \ + movzbl src ## bl,RID1d; \ + movzbl src ## bh,RID2d; \ + 
interleave_op(il_reg); \ xorlt2(CTX, RID1, 4), dst ## d; \ xorlt3(CTX, RID2, 4), dst ## d; +#define dummy(d) /* do nothing */ + +#define shr_next(reg) \ + shrq $16, reg; + #define G(a, x, t0, t1, t2, t3) \ vmovq a,RGI1; \ - vpsrldq $8, a,x; \ - vmovq x,RGI2; \ + vpextrq $1, a,RGI2; \ \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \ - shrq $16, RGI1; \ - lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \ - shlq $32, RGS2; \ - orq RGS1, RGS2; \ + lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \ + vmovd RGS1d, x;\ + lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \ + vpinsrd $1, RGS2d, x, x; \ \ - lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \ - shrq $16, RGI2; \ - lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \ - shlq $32, RGS3; \ - orq RGS1, RGS3; \ - \ - vmovq RGS2, x; \ - vpinsrq $1, RGS3, x, x; + lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, shr_next, RGI2); \ + vpinsrd $2, RGS1d, x, x; \ + lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, dummy, none); \ + vpinsrd $3, RGS3d, x, x; + +#define encround_g1g2(a, b, c, d, x, y) \ + G(a, x, s0, s1, s2, s3); \ + G(b, y, s1, s2, s3, s0); -#define encround(a, b, c, d, x, y) \ - G(a, x, s0, s1, s2, s3); \ - G(b, y, s1, s2, s3, s0); \ +#define encround_end(a, b, c, d, x, y) \ + vpslld $1, d, RT; \ + vpsrld $(32 - 1), d, d; \ + vpord, RT, d; \ vpaddd x, y, x; \ vpaddd y, x, y; \ vpaddd x, RK1, x; \ @@ -115,14 +127,16 @@ vpsrld $1, c, x; \ vpslld $(32 - 1), c, c; \ vporc, x, c; \ - vpslld $1, d, x; \ - vpsrld $(32 - 1), d, d; \ - vpord, x, d; \ vpxor d, y, d; -#define decround(a, b, c, d, x, y) \ - G(a, x
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Borislav Petkov : Ok, here we go. Raw data below. Thanks a lot! Twofish-avx appears somewhat slower than 3way, ~9% slower with 256-byte blocks to ~3% slower with 8kB blocks. Let me know if you need more tests. I posted a patch that optimizes twofish-avx a few weeks ago: http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2 I'd be interested to know if this patch helps on Bulldozer. -Jussi HTH. -- Regards/Gruss, Boris.
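The ~9% and ~3% figures come from comparing the per-second byte counts of the two implementations at each block size; as a trivial hedged helper (the function name is made up for illustration):

```c
/* percentage by which 'slow' underperforms 'fast' (both in bytes/second) */
double percent_slower(double slow, double fast)
{
	return 100.0 * (fast - slow) / fast;
}
```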
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Borislav Petkov : On Wed, Aug 15, 2012 at 11:42:16AM +0300, Jussi Kivilinna wrote: I started thinking about the performance on AMD Bulldozer. vmovq/vmovd/vpextr*/vpinsr* between FPU and general purpose registers on AMD CPUs is a lot slower (latencies from 8 to 12 cycles) than on Intel sandy-bridge (where the instructions have a latency of 1 to 2). See: http://www.agner.org/optimize/instruction_tables.pdf It would be really good if the implementation could be tested on an AMD CPU to determine whether it causes a performance regression. However, I don't have access to a machine with such a CPU. But I do. :) And if you tell me exactly how to run the tests and on what kernel, I'll try to do so. Twofish-avx (CONFIG_TWOFISH_AVX_X86_64) is available in 3.6-rc1. For testing you need CRYPTO_TEST built as a module. You should turn off turbo-core, freq-scaling, etc. Testing twofish-avx ('async twofish' speed test): modprobe twofish-avx-x86_64 modprobe tcrypt mode=504 sec=1 Testing twofish-x86_64-3way ('sync twofish' speed test): modprobe twofish-x86_64-3way modprobe tcrypt mode=202 sec=1 Loading tcrypt will block until the tests are complete, after which modprobe will return with an error. This is expected. Results are in the kernel log. -Jussi HTH. -- Regards/Gruss, Boris.
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Johannes Goetzfried : This patch adds an x86_64/avx assembler implementation of the Twofish block cipher. The implementation processes eight blocks in parallel (two 4-block chunk AVX operations). The table-lookups are done in general-purpose registers. For small blocksizes the 3way-parallel functions from the twofish-x86_64-3way module are called. A good performance increase is provided for blocksizes greater than or equal to 128B. Patch has been tested with tcrypt and automated filesystem tests. Tcrypt benchmark results: Intel Core i5-2500 CPU (fam:6, model:42, step:7) I started thinking about the performance on AMD Bulldozer. vmovq/vmovd/vpextr*/vpinsr* between FPU and general purpose registers on AMD CPUs is a lot slower (latencies from 8 to 12 cycles) than on Intel sandy-bridge (where the instructions have a latency of 1 to 2). See: http://www.agner.org/optimize/instruction_tables.pdf It would be really good if the implementation could be tested on an AMD CPU to determine whether it causes a performance regression. However, I don't have access to a machine with such a CPU. -Jussi
Re: [PATCH] rndis_wlan: Fix potential memory leak in update_pmkid()
Quoting Alexey Khoroshilov : Do not leak memory by updating pointer with potentially NULL realloc return value. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov Thanks! Acked-by: Jussi Kivilinna --- drivers/net/wireless/rndis_wlan.c |6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/wireless/rndis_wlan.c b/drivers/net/wireless/rndis_wlan.c index 241162e..7a4ae9e 100644 --- a/drivers/net/wireless/rndis_wlan.c +++ b/drivers/net/wireless/rndis_wlan.c @@ -1803,6 +1803,7 @@ static struct ndis_80211_pmkid *update_pmkid(struct usbnet *usbdev, struct cfg80211_pmksa *pmksa, int max_pmkids) { + struct ndis_80211_pmkid *new_pmkids; int i, err, newlen; unsigned int count; @@ -1833,11 +1834,12 @@ static struct ndis_80211_pmkid *update_pmkid(struct usbnet *usbdev, /* add new pmkid */ newlen = sizeof(*pmkids) + (count + 1) * sizeof(pmkids->bssid_info[0]); - pmkids = krealloc(pmkids, newlen, GFP_KERNEL); - if (!pmkids) { + new_pmkids = krealloc(pmkids, newlen, GFP_KERNEL); + if (!new_pmkids) { err = -ENOMEM; goto error; } + pmkids = new_pmkids; pmkids->length = cpu_to_le32(newlen); pmkids->bssid_info_count = cpu_to_le32(count + 1); -- 1.7.9.5
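The fix follows the standard realloc idiom, shown here as a userspace C sketch (grow_buffer is a hypothetical name, not kernel code): assign the result to a temporary so that on failure the original pointer, and the memory it owns, stays reachable and can still be freed.

```c
#include <stdlib.h>

/* grow *buf to newlen; on failure *buf is untouched, so nothing leaks */
static int grow_buffer(char **buf, size_t *len, size_t newlen)
{
	char *tmp = realloc(*buf, newlen);

	if (!tmp)
		return -1;	/* caller still owns the old *buf */
	*buf = tmp;
	*len = newlen;
	return 0;
}

/* exercise the helper once; returns the grown length on success, -1 on error */
int grow_buffer_demo(void)
{
	char *buf = malloc(4);
	size_t len = 4;
	int err;

	if (!buf)
		return -1;
	err = grow_buffer(&buf, &len, 4096);
	if (!err)
		buf[4095] = 'x';	/* the grown region is writable */
	free(buf);
	return err ? -1 : (int)len;
}
```

Writing `pmkids = krealloc(pmkids, ...)` directly, as the old code did, loses the only pointer to the allocation when krealloc returns NULL.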
Re: linux-next: Tree for July 2 (crypto/hifn_795x)
Quoting Randy Dunlap <rdun...@xenotime.net>:

> On 07/02/2012 12:23 AM, Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20120629:
>
> on i386:
>
> ERROR: "__divdi3" [drivers/crypto/hifn_795x.ko] undefined!

This is caused by commit feb7b7ab928afa97a79a9c424e4e0691f49d63be. hifn_795x
has "DIV_ROUND_UP(NSEC_PER_SEC, dev->pk_clk_freq)", which should be changed
to DIV_ROUND_UP_ULL now that NSEC_PER_SEC is 64-bit on 32-bit archs.

Patch to fix hifn_795x is attached (only compile tested).

-Jussi

crypto: hifn_795x - fix 64bit division and undefined __divdi3 on 32bit archs

From: Jussi Kivilinna <jussi.kivili...@mbnet.fi>

Commit feb7b7ab928afa97a79a9c424e4e0691f49d63be changed NSEC_PER_SEC to a
64-bit constant, which causes "DIV_ROUND_UP(NSEC_PER_SEC, dev->pk_clk_freq)"
to generate a __divdi3 call on 32-bit archs. Fix this by changing
DIV_ROUND_UP to DIV_ROUND_UP_ULL.

Signed-off-by: Jussi Kivilinna <jussi.kivili...@mbnet.fi>
---
 drivers/crypto/hifn_795x.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/hifn_795x.c b/drivers/crypto/hifn_795x.c
index c9c4bef..df14358 100644
--- a/drivers/crypto/hifn_795x.c
+++ b/drivers/crypto/hifn_795x.c
@@ -821,8 +821,8 @@ static int hifn_register_rng(struct hifn_device *dev)
 	/*
 	 * We must wait at least 256 Pk_clk cycles between two reads of the rng.
 	 */
-	dev->rng_wait_time	= DIV_ROUND_UP(NSEC_PER_SEC, dev->pk_clk_freq) *
-				  256;
+	dev->rng_wait_time	= DIV_ROUND_UP_ULL(NSEC_PER_SEC,
+						   dev->pk_clk_freq) * 256;

 	dev->rng.name		= dev->name;
 	dev->rng.data_present	= hifn_rng_data_present,
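The underlying issue: DIV_ROUND_UP expands to a plain `/`, and once the dividend is a 64-bit constant, gcc on 32-bit targets lowers that division to a call to the libgcc helper `__divdi3`, which the kernel image does not provide, hence the link error. DIV_ROUND_UP_ULL instead routes the division through the kernel's own 64-bit division helpers. A userspace C sketch of the arithmetic involved (the macro is condensed from the kernel; `div_round_up_ull` here is an illustrative stand-in for the kernel's do_div-based path, not its implementation):

```c
#include <stdint.h>

/* Condensed form of the kernel's DIV_ROUND_UP: a plain '/', which for
 * a 64-bit dividend becomes a libgcc __divdi3 call on 32-bit targets. */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Illustrative stand-in for the DIV_ROUND_UP_ULL path: same rounding-up
 * result, but with the division done explicitly in unsigned long long
 * (in the kernel this goes through do_div()'s 64/32 division helper). */
static inline uint64_t div_round_up_ull(uint64_t n, uint32_t d)
{
	return (n + d - 1) / d;
}

#define NSEC_PER_SEC	1000000000ULL	/* 64-bit since feb7b7ab928a */

/* rng_wait_time as computed in hifn_register_rng(): at least 256
 * Pk_clk cycles, expressed in nanoseconds. */
static uint64_t rng_wait_time(uint32_t pk_clk_freq)
{
	return div_round_up_ull(NSEC_PER_SEC, pk_clk_freq) * 256;
}
```

Both forms compute the same value; the patch only changes which division routine the compiler ends up emitting, so behavior is unchanged on 64-bit and the link succeeds on 32-bit.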