Re: LXC broken with 5.10-stable?, ok with 5.9-stable (Re: Linux 5.10.3)

2020-12-27 Thread Jussi Kivilinna

On 27.12.2020 21.05, Linus Torvalds wrote:

On Sun, Dec 27, 2020 at 10:39 AM Jussi Kivilinna  wrote:


5.10.3 with the patch compiles fine, but does not solve the issue.


Duh. Adding the read_iter only fixes kernel_read(). For splice, it also needs a

 .splice_read = generic_file_splice_read,

in the file operations, something like this...
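
(The actual diff is not preserved in this archive; the gist of such a
change, sketched with an illustrative fops/read_iter name, is:)

    /* sketch, not the actual patch: .read_iter was already added by the
     * earlier fix, .splice_read is the new line */
    static const struct file_operations example_fops = {
            .read_iter   = example_read_iter,          /* illustrative */
            .splice_read = generic_file_splice_read,   /* the addition */
    };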

Does that get things working?



Yes, LXC works for me now. Thanks.

-Jussi


Re: LXC broken with 5.10-stable?, ok with 5.9-stable (Re: Linux 5.10.3)

2020-12-27 Thread Jussi Kivilinna

On 27.12.2020 19.20, Linus Torvalds wrote:

On Sun, Dec 27, 2020 at 8:32 AM Jussi Kivilinna  wrote:


Has this been fixed in 5.11-rc? Is there any patch that I could backport and 
test with 5.10?


Here's a patch to test. Entirely untested by me. I'm surprised at how
people use sendfile() on random files. Oh well..



5.10.3 with the patch compiles fine, but does not solve the issue.

The test case from bugzilla still fails and the LXC container won't start.

-Jussi


LXC broken with 5.10-stable?, ok with 5.9-stable (Re: Linux 5.10.3)

2020-12-27 Thread Jussi Kivilinna

Hello,

Now that the 5.9 series is EOL, I tried to move to 5.10.3. I ran into a
regression where LXC containers do not start with the newer kernel. I found
that the issue had already been reported (bisected + with a reduced test case)
in bugzilla at: 
https://bugzilla.kernel.org/show_bug.cgi?id=209971

Has this been fixed in 5.11-rc? Is there any patch that I could backport and 
test with 5.10?

-Jussi

On 26.12.2020 17.20, Greg Kroah-Hartman wrote:

I'm announcing the release of the 5.10.3 kernel.

All users of the 5.10 kernel series must upgrade.

The updated 5.10.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git 
linux-5.10.y
and can be browsed at the normal kernel.org git web browser:

https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary

thanks,

greg k-h




Re: [PATCH 3/5] lib/mpi: Fix for building for MIPS32 with Clang

2019-08-12 Thread Jussi Kivilinna
Hello,

On 12.8.2019 20.14, Nathan Chancellor wrote:
> On Mon, Aug 12, 2019 at 10:35:53AM +0300, Jussi Kivilinna wrote:
>> Hello,
>>
>> On 12.8.2019 6.31, Nathan Chancellor wrote:
>>> From: Vladimir Serbinenko 
>>>
>>> clang doesn't recognise =l / =h assembly operand specifiers but apparently
>>> handles C version well.
>>>
>>> lib/mpi/generic_mpih-mul1.c:37:24: error: invalid use of a cast in a
>>> inline asm context requiring an l-value: remove the cast or build with
>>> -fheinous-gnu-extensions
>>> umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb);
>>> ~^
>>> lib/mpi/longlong.h:652:20: note: expanded from macro 'umul_ppmm'
>>> : "=l" ((USItype)(w0)), \
>>> ~~^~~
>>> lib/mpi/generic_mpih-mul1.c:37:3: error: invalid output constraint '=h'
>>> in asm
>>> umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb);
>>> ^
>>> lib/mpi/longlong.h:653:7: note: expanded from macro 'umul_ppmm'
>>>  "=h" ((USItype)(w1)) \
>>>  ^
>>> 2 errors generated.
>>>
>>> Fixes: 5ce3e312ec5c ("crypto: GnuPG based MPI lib - header files (part 2)")
>>> Link: https://github.com/ClangBuiltLinux/linux/issues/605
>>> Link: 
>>> https://github.com/gpg/libgcrypt/commit/1ecbd0bca31d462719a2a6590c1d03244e76ef89
>>> Signed-off-by: Vladimir Serbinenko 
>>> [jk: add changelog, rebase on libgcrypt repository, reformat changed
>>>  line so it does not go over 80 characters]
>>> Signed-off-by: Jussi Kivilinna 
>>
>> This is my Signed-off-by for the libgcrypt project, not for the kernel. I do
>> not think sign-offs can be passed from one project to another in this way.
>>
>> -Jussi
> 
> Hi Jussi,
> 
> I am no signoff expert but if I am reading the developer certificate of
> origin in the libgcrypt repo correctly [1], your signoff on this commit
> falls under:
> 
> (d) I understand and agree that this project and the contribution
> are public and that a record of the contribution (including all
> personal information I submit with it, including my sign-off) is
> maintained indefinitely and may be redistributed consistent with
> this project or the open source license(s) involved.

There is nothing wrong with the commit in the libgcrypt repo and/or my 
libgcrypt DCO sign-off.

> 
> This file is maintained under the LGPL because it was taken straight
> from the libgcrypr repo and per (b), I can submit this commit here
> with everything intact.

But you do not have my kernel-DCO sign-off for this patch. I have not
been involved with this kernel patch in any way: I have not integrated
it into the kernel, and I have not tested it there. I do not own it.
However, with this Signed-off-by line you have pulled me into the kernel
patch process, in which I am not interested for this patch. So, to be
clear, I retract my kernel-DCO sign-off for this kernel patch:

  NOT-Signed-off-by: Jussi Kivilinna 

Of course you can copy the original libgcrypt commit message into this
patch, but I think it needs to be clearly quoted so that my
libgcrypt-DCO sign-off line won't be mixed with the kernel-DCO
sign-off lines. 

> 
> However, I don't want to upset you in any way though so if you are not
> comfortable with that, I suppose I can remove it as if Vladimir
> submitted this fix to me directly (as I got permission for his signoff).
> I need to resubmit this fix to an appropriate maintainer so let me know
> what you think.

That's quite a complicated approach. A faster and easier process would be
for you to just own the patch yourself. Libgcrypt (and the target file in
libgcrypt) is LGPL v2.1+, so the license is compatible with the kernel and
you are good to go with just your own (kernel DCO) Signed-off-by.

-Jussi

> 
> [1]: 
> https://github.com/gpg/libgcrypt/blob/3bb858551cd5d84e43b800edfa2b07d1529718a9/doc/DCO
> 
> Cheers,
> Nathan
> 



Re: [PATCH 3/5] lib/mpi: Fix for building for MIPS32 with Clang

2019-08-12 Thread Jussi Kivilinna
Hello,

On 12.8.2019 6.31, Nathan Chancellor wrote:
> From: Vladimir Serbinenko 
> 
> clang doesn't recognise =l / =h assembly operand specifiers but apparently
> handles C version well.
> 
> lib/mpi/generic_mpih-mul1.c:37:24: error: invalid use of a cast in a
> inline asm context requiring an l-value: remove the cast or build with
> -fheinous-gnu-extensions
> umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb);
> ~^
> lib/mpi/longlong.h:652:20: note: expanded from macro 'umul_ppmm'
> : "=l" ((USItype)(w0)), \
> ~~^~~
> lib/mpi/generic_mpih-mul1.c:37:3: error: invalid output constraint '=h'
> in asm
> umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb);
> ^
> lib/mpi/longlong.h:653:7: note: expanded from macro 'umul_ppmm'
>  "=h" ((USItype)(w1)) \
>  ^
> 2 errors generated.
> 
> Fixes: 5ce3e312ec5c ("crypto: GnuPG based MPI lib - header files (part 2)")
> Link: https://github.com/ClangBuiltLinux/linux/issues/605
> Link: 
> https://github.com/gpg/libgcrypt/commit/1ecbd0bca31d462719a2a6590c1d03244e76ef89
> Signed-off-by: Vladimir Serbinenko 
> [jk: add changelog, rebase on libgcrypt repository, reformat changed
>  line so it does not go over 80 characters]
> Signed-off-by: Jussi Kivilinna 

This is my Signed-off-by for the libgcrypt project, not for the kernel. I do
not think sign-offs can be passed from one project to another in this way.

-Jussi

> [nc: Added build error and tags to commit message
>  Added Vladimir's signoff with his permission
>  Adjusted Jussi's comment to wrap at 73 characters
>  Modified commit subject to mirror MIPS64 commit
>  Removed space between defined and (__clang__)]
> Signed-off-by: Nathan Chancellor 
> ---
>  lib/mpi/longlong.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/mpi/longlong.h b/lib/mpi/longlong.h
> index 3bb6260d8f42..8a1507fc94dd 100644
> --- a/lib/mpi/longlong.h
> +++ b/lib/mpi/longlong.h
> @@ -639,7 +639,8 @@ do { \
>   **  MIPS  *
>   ***/
>  #if defined(__mips__) && W_TYPE_SIZE == 32
> -#if (__GNUC__ >= 5) || (__GNUC__ >= 4 && __GNUC_MINOR__ >= 4)
> +#if defined(__clang__) || (__GNUC__ >= 5) || (__GNUC__ == 4 && \
> +   __GNUC_MINOR__ >= 4)
>  #define umul_ppmm(w1, w0, u, v)  \
>  do { \
>   UDItype __ll = (UDItype)(u) * (v);  \
> 


Re: Question about ctr mode 3des-ede IV len

2016-12-08 Thread Jussi Kivilinna
Hello,

On 07.12.2016 14:43, Longpeng (Mike) wrote:
> Hi Jussi and Herbert,
> 
> I saw serveral des3-ede testcases(in crypto/testmgr.h) has 16-bytes IV, and 
> the
> libgcrypt/nettle/RFC1851 said the IV-len is 8-bytes.
> 
> Would you please tell me why these testcases has 16-bytes IV ?

Because I used the same tool to create the test vectors that I had previously 
used to create the AES/Camellia/Serpent/Twofish test vectors. I must have 
forgotten to change the 16-byte IV generation to 8 bytes, so those test cases 
in crypto/testmgr.h have the wrong IV length. The extra trailing 8 bytes are 
not used and can be removed.
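
(The fix is mechanical: the affected entries keep their key/plaintext/result
data and only the IV is trimmed to the 8-byte DES block size. Schematically,
with the entry name and values below as placeholders only:)

    static struct cipher_testvec des3_ede_ctr_tv_sketch[] = {
            {
                    .klen = 24,  /* 3DES key; key/input/result data omitted */
                    .iv   = "\x00\x01\x02\x03\x04\x05\x06\x07", /* 8 bytes, not 16 */
            },
    };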

-Jussi

> 
> Thank you. :)
> 


Re: Kernel crypto API: cryptoperf performance measurement

2014-08-21 Thread Jussi Kivilinna

On 2014-08-20 21:14, Milan Broz wrote:
> On 08/20/2014 03:25 PM, Jussi Kivilinna wrote:
>>> One to four GB per second for XTS? 12 GB per second for AES CBC? Somehow 
>>> that 
>>> does not sound right.
>>
>> Agreed, those do not look correct... I wonder what happened there. On
>> new run, I got more sane results:
> 
> Which cryptsetup version are you using?
> 
> There was a bug in that test on fast machines (fixed in 1.6.3, I hope :)

I had version 1.6.1 at hand.

> 
> But anyway, it is not intended as rigorous speed test,
> it was intended for comparison of ciphers speed on particular machine.
>

True, but it's a nice, easy test compared to parsing the results from the
tcrypt speed tests.

-Jussi

> Test basically tries to encrypt 1MB block (or multiple of this
> if machine is too fast). All it runs through kernel userspace crypto API
> interface.
> (Real FDE is always slower because it runs over 512bytes blocks.)
> 
> Milan
> 


Re: Kernel crypto API: cryptoperf performance measurement

2014-08-20 Thread Jussi Kivilinna
Hello,

On 2014-08-19 21:23, Stephan Mueller wrote:
> Am Dienstag, 19. August 2014, 10:17:36 schrieb Jussi Kivilinna:
> 
> Hi Jussi,
> 
>> Hello,
>>
>> On 2014-08-17 18:55, Stephan Mueller wrote:
>>> Hi,
>>>
>>> during playing around with the kernel crypto API, I implemented a
>>> performance measurement tool kit for the various kernel crypto API cipher
>>> types. The cryptoperf tool kit is provided in [1].
>>>
>>> Comments are welcome.
>>
>> Your results are quite slow compared to, for example "cryptsetup
>> benchmark", which uses kernel crypto from userspace.
>>
>> With Intel i5-2450M (turbo enabled), I get:
>>
>> #  Algorithm | Key |  Encryption |  Decryption
>>  aes-cbc   128b   524,0 MiB/s  11909,1 MiB/s
>>  serpent-cbc   128b60,9 MiB/s   219,4 MiB/s
>>  twofish-cbc   128b   143,4 MiB/s   240,3 MiB/s
>>  aes-cbc   256b   330,4 MiB/s  1242,8 MiB/s
>>  serpent-cbc   256b66,1 MiB/s   220,3 MiB/s
>>  twofish-cbc   256b   143,5 MiB/s   221,8 MiB/s
>>  aes-xts   256b  1268,7 MiB/s  4193,0 MiB/s
>>  serpent-xts   256b   234,8 MiB/s   224,6 MiB/s
>>  twofish-xts   256b   253,5 MiB/s   254,7 MiB/s
>>  aes-xts   512b  2535,0 MiB/s  2945,0 MiB/s
>>  serpent-xts   512b   274,2 MiB/s   242,3 MiB/s
>>  twofish-xts   512b   250,0 MiB/s   245,8 MiB/s
> 
> One to four GB per second for XTS? 12 GB per second for AES CBC? Somehow that 
> does not sound right.

Agreed, those do not look correct... I wonder what happened there. On a
new run, I got more sensible results:

#  Algorithm | Key |  Encryption |  Decryption
 aes-cbc   128b   139,1 MiB/s  1713,6 MiB/s
 serpent-cbc   128b62,2 MiB/s   232,9 MiB/s
 twofish-cbc   128b   116,3 MiB/s   243,7 MiB/s
 aes-cbc   256b   375,1 MiB/s  1159,4 MiB/s
 serpent-cbc   256b62,1 MiB/s   214,9 MiB/s
 twofish-cbc   256b   139,3 MiB/s   217,5 MiB/s
 aes-xts   256b  1296,4 MiB/s  1272,5 MiB/s
 serpent-xts   256b   283,3 MiB/s   275,6 MiB/s
 twofish-xts   256b   294,8 MiB/s   299,3 MiB/s
 aes-xts   512b   984,3 MiB/s   991,1 MiB/s
 serpent-xts   512b   227,7 MiB/s   220,6 MiB/s
 twofish-xts   512b   220,6 MiB/s   220,2 MiB/s

-Jussi


Re: Kernel crypto API: cryptoperf performance measurement

2014-08-19 Thread Jussi Kivilinna
Hello,

On 2014-08-17 18:55, Stephan Mueller wrote:
> Hi,
> 
> during playing around with the kernel crypto API, I implemented a performance 
> measurement tool kit for the various kernel crypto API cipher types. The 
> cryptoperf tool kit is provided in [1].
> 
> Comments are welcome.

Your results are quite slow compared to, for example "cryptsetup
benchmark", which uses kernel crypto from userspace.

With Intel i5-2450M (turbo enabled), I get:

#  Algorithm | Key |  Encryption |  Decryption
 aes-cbc   128b   524,0 MiB/s  11909,1 MiB/s
 serpent-cbc   128b60,9 MiB/s   219,4 MiB/s
 twofish-cbc   128b   143,4 MiB/s   240,3 MiB/s
 aes-cbc   256b   330,4 MiB/s  1242,8 MiB/s
 serpent-cbc   256b66,1 MiB/s   220,3 MiB/s
 twofish-cbc   256b   143,5 MiB/s   221,8 MiB/s
 aes-xts   256b  1268,7 MiB/s  4193,0 MiB/s
 serpent-xts   256b   234,8 MiB/s   224,6 MiB/s
 twofish-xts   256b   253,5 MiB/s   254,7 MiB/s
 aes-xts   512b  2535,0 MiB/s  2945,0 MiB/s
 serpent-xts   512b   274,2 MiB/s   242,3 MiB/s
 twofish-xts   512b   250,0 MiB/s   245,8 MiB/s
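
(For reference, "kernel crypto from userspace" here means the AF_ALG socket
interface. A minimal single-request sketch -- arbitrary key/IV, no error
handling, not the cryptsetup code itself -- looks like this:)

    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <linux/if_alg.h>

    int main(void)
    {
            struct sockaddr_alg sa = {
                    .salg_family = AF_ALG,
                    .salg_type   = "skcipher",
                    .salg_name   = "cbc(aes)",
            };
            unsigned char key[16] = { 0 }, buf[4096] = { 0 };
            char cbuf[CMSG_SPACE(4) +
                      CMSG_SPACE(sizeof(struct af_alg_iv) + 16)] = { 0 };
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
            struct msghdr msg = {
                    .msg_control    = cbuf,
                    .msg_controllen = sizeof(cbuf),
                    .msg_iov        = &iov,
                    .msg_iovlen     = 1,
            };
            struct cmsghdr *cmsg;
            struct af_alg_iv *ivm;
            int tfmfd, opfd;

            tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
            bind(tfmfd, (struct sockaddr *)&sa, sizeof(sa));
            setsockopt(tfmfd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
            opfd = accept(tfmfd, NULL, 0);

            /* first cmsg: operation (encrypt) */
            cmsg = CMSG_FIRSTHDR(&msg);
            cmsg->cmsg_level = SOL_ALG;
            cmsg->cmsg_type  = ALG_SET_OP;
            cmsg->cmsg_len   = CMSG_LEN(4);
            *(unsigned int *)CMSG_DATA(cmsg) = ALG_OP_ENCRYPT;

            /* second cmsg: 16-byte IV for cbc(aes) */
            cmsg = CMSG_NXTHDR(&msg, cmsg);
            cmsg->cmsg_level = SOL_ALG;
            cmsg->cmsg_type  = ALG_SET_IV;
            cmsg->cmsg_len   = CMSG_LEN(sizeof(struct af_alg_iv) + 16);
            ivm = (void *)CMSG_DATA(cmsg);
            ivm->ivlen = 16;
            memset(ivm->iv, 0, 16);

            sendmsg(opfd, &msg, 0);       /* submit one 4 KiB encryption */
            read(opfd, buf, sizeof(buf)); /* read the ciphertext back */

            close(opfd);
            close(tfmfd);
            return 0;
    }

(A benchmark such as "cryptsetup benchmark" essentially loops over the
sendmsg/read pair and times it.)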

> 
> In general, the results are as expected, i.e. the assembler implementations 
> are faster than the pure C implementations. However, there are curious 
> results 
> which probably should be checked by the maintainers of the respective ciphers 
> (hoping that my tool works correctly ;-) ):
> 
> ablkcipher
> --
> 
> - cryptd is slower by factor 10 across the board
> 
> blkcipher
> -
> 
> - Blowfish x86_64 assembler together with the generic C block chaining modes 
> is significantly slower than Blowfish implemented in generic C
> 
> - Blowfish x86_64 assembler in ECB is significantly slower than generic C 
> Blowfish ECB
> 
> - Serpent assembler implementations are not significantly faster than generic 
> C implementations
> 
> - AES-NI ECB, LRW, CTR is significantly slower than AES i586 assembler.
> 
> - AES-NI ECB, LRW, CTR is not significantly faster than AES generic C
> 

Quite many assembly implementations get their speed-up from processing
multiple block cipher blocks in parallel, in the modes of operation that
allow it (CTR, XTS, LRW, CBC(dec)). For small buffer sizes, these
implementations fall back to the non-parallel implementation of the cipher.
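
(As an illustration of that pattern -- a sketch only, with stand-in function
names, not the actual kernel glue code:)

    typedef unsigned char u8;

    /* stand-ins for the per-implementation routines */
    void cipher_ctr_8way_asm(void *ctx, u8 *dst, const u8 *src, u8 *iv);
    void cipher_ctr_1way(void *ctx, u8 *dst, const u8 *src, u8 *iv);

    static void ctr_crypt_sketch(void *ctx, u8 *dst, const u8 *src,
                                 unsigned int nbytes, u8 *iv)
    {
            const unsigned int bsize = 16;  /* cipher block size */

            /* wide assembler path: used only while a full 8-block chunk remains */
            while (nbytes >= 8 * bsize) {
                    cipher_ctr_8way_asm(ctx, dst, src, iv);
                    src += 8 * bsize; dst += 8 * bsize; nbytes -= 8 * bsize;
            }

            /* small buffers and the tail go one block at a time */
            while (nbytes >= bsize) {
                    cipher_ctr_1way(ctx, dst, src, iv);
                    src += bsize; dst += bsize; nbytes -= bsize;
            }
    }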

-Jussi

> rng
> ---
> 
> - The ANSI X9.31 RNG seems to work massively faster than the underlying AES 
> cipher (by about a factor of 5). I am unsure about the cause of this.
> 
> 
> Caveat
> --
> 
> Please note that there is one small error which I am unsure how to fix it as 
> documented in the TODO file.
> 
> [1] http://www.chronox.de/cryptoperf.html
> 


Re: [PATCH] Documentation: "kerneli" typo in description for "Serpent cipher algorithm" Bug #60848

2013-10-02 Thread Jussi Kivilinna
On 02.10.2013 21:12, Rob Landley wrote:
> On 10/02/2013 11:10:37 AM, Kevin Mulvey wrote:
>> change kerneli to kernel as well as kerneli.org to kernel.org
>>
>> Signed-off-by: Kevin Mulvey 
> 
> There's a bug number for this?
> 
> Acked, queued. (Although I'm not sure the value of pointing to www.kernel.org 
> for this.)

I think kerneli.org is correct; see the old website at 
http://web.archive.org/web/20010201085500/http://www.kerneli.org/

-Jussi

> 
> Thanks,
> 
> Rob
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



Re: [cryptomgr_test] BUG: unable to handle kernel NULL pointer dereference at (null)

2013-06-18 Thread Jussi Kivilinna
Hello,

This appears to be caused by some memory corruption. Changing from the SLOB 
allocator to SLUB made the crash disappear.
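
In kernel configuration terms the workaround amounts to (a sketch of the
relevant fragment):

    # CONFIG_SLOB is not set
    CONFIG_SLUB=y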

Some different crashes with the same config:

[0.246152] cryptomgr_test (23) used greatest stack depth: 6400 bytes left
[0.246929] cryptomgr_test (24) used greatest stack depth: 5384 bytes left
[0.248851] modprobe (33) used greatest stack depth: 5376 bytes left
[0.250351] alg: No test for crc32 (crc32-pclmul)
[0.251669] BUG: unable to handle kernel paging request at 882006646e18
[0.252007] IP: [] task_active_pid_ns+0x17/0x30
[0.252007] PGD 2af8067 PUD 0 
[0.252007] Oops:  [#1] SMP 
[0.252007] Modules linked in:
[0.252007] CPU: 0 PID: 43 Comm: kworker/u2:1 Not tainted 
3.10.0-rc1-crash1-00048-gf9a31a2 #24
[0.252007] task: 880006694000 ti: 880006698000 task.ti: 
880006698000
[0.252007] RIP: 0010:[]  [] 
task_active_pid_ns+0x17/0x30
[0.252007] RSP: 0018:880006699dd8  EFLAGS: 00010002
[0.252007] RAX: 880006646e00 RBX: 880006694000 RCX: 0001
[0.252007] RDX: 001fffe0 RSI: 00098000 RDI: 880006655000
[0.252007] RBP: 880006699dd8 R08: 000d R09: 8800066945d0
[0.252007] R10:  R11:  R12: 0011
[0.252007] R13:  R14: 88000704c000 R15: 880006693ff0
[0.252007] FS:  () GS:880007c0() 
knlGS:
[0.252007] CS:  0010 DS:  ES:  CR0: 80050033
[0.252007] CR2: 882006646e18 CR3: 02015000 CR4: 001407f0
[0.252007] DR0:  DR1:  DR2: 
[0.252007] DR3:  DR6: 0ff0 DR7: 0400
[0.252007] Stack:
[0.252007]  880006699e98 81078054 81077fc2 
0011
[0.252007]     
880006699e78
[0.252007]  0046  8106c5ef 

[0.252007] Call Trace:
[0.252007]  [] do_notify_parent+0x114/0x580
[0.252007]  [] ? do_notify_parent+0x82/0x580
[0.252007]  [] ? do_exit+0x80f/0xa20
[0.252007]  [] do_exit+0x8de/0xa20
[0.252007]  [] wait_for_helper+0x98/0xa0
[0.252007]  [] ? call_helper+0x20/0x20
[0.252007]  [] ret_from_fork+0x7c/0xb0
[0.252007]  [] ? call_helper+0x20/0x20
[0.252007] Code: 1f 44 00 00 5d c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 
00 48 8b 87 48 02 00 00 55 48 89 e5 48 85 c0 74 10 8b 50 04 48 c1 e2 05 <48> 8b 
44 10 38 eb 0a 66 90 31 c0 66 0f 1f 44 00 00 5d c3 66 0f 
[0.252007] RIP  [] task_active_pid_ns+0x17/0x30
[0.252007]  RSP 
[0.252007] CR2: 882006646e18
[0.252007] ---[ end trace 7caca246688ed8b9 ]---
[0.252007] Kernel panic - not syncing: Fatal exception

...

[0.328072] kernel tried to execute NX-protected page - exploit attempt? 
(uid: 0)
[0.328683] BUG: unable to handle kernel paging request at 88000644cd98
[0.329227] IP: [] 0x88000644cd97
[0.329690] PGD 2af8067 PUD 2af9067 PMD 864001e3 
[0.330182] Oops: 0011 [#1] SMP 
[0.330449] Modules linked in:
[0.330694] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 
3.10.0-rc1-crash1-00048-gf9a31a2 #24
[0.331314] task: 88000644d000 ti: 88000644e000 task.ti: 
88000644e000
[0.331899] RIP: 0010:[]  [] 
0x88000644cd97
[0.332004] RSP: 0018:880007d03eb8  EFLAGS: 00010296
[0.332004] RAX: 88000644cd98 RBX: 880007d0e880 RCX: 0002
[0.332004] RDX: 880006a82560 RSI: 88000644d5d0 RDI: 880006a82560
[0.332004] RBP: 880007d03f20 R08: 0002 R09: 
[0.332004] R10:  R11:  R12: 8203d000
[0.332004] R13: 880006a6dd40 R14: 000a R15: 0008
[0.332004] FS:  () GS:880007d0() 
knlGS:
[0.332004] CS:  0010 DS:  ES:  CR0: 80050033
[0.332004] CR2: 88000644cd98 CR3: 02015000 CR4: 001407e0
[0.332004] DR0:  DR1:  DR2: 
[0.332004] DR3:  DR6: 0ff0 DR7: 0400
[0.332004] Stack:
[0.332004]  810db172 880005252e80 88000644d000 
8800070bfe80
[0.332004]  88000644d000 88000644ffd8 880007d0e8a8 

[0.332004]  88000644ffd8 0009 0101 
82007048
[0.332004] Call Trace:
[0.332004]   
[0.332004]  [] ? rcu_process_callbacks+0x322/0x5a0
[0.332004]  [] __do_softirq+0xd0/0x1a0
[0.332004]  [] irq_exit+0x59/0xb0
[0.332004]  [] smp_apic_timer_interrupt+0x8a/0xa0
[0.332004]  [] apic_timer_interrupt+0x6f/0x80
[0.332004]   
[0.332004]  [] ? __lock_acquire+0xaee/0xcc0
[0.332004]  [] ? 


[PATCH] crypto: aesni_intel - fix accessing of unaligned memory

2013-06-11 Thread Jussi Kivilinna
The new XTS code for aesni_intel uses input buffers directly as memory operands
for pxor instructions, which causes a crash if those buffers are not aligned to
16 bytes.

The patch changes the XTS code to handle unaligned memory correctly, by loading
memory with movdqu instead.

Reported-by: Dave Jones 
Tested-by: Dave Jones 
Signed-off-by: Jussi Kivilinna 
---
 arch/x86/crypto/aesni-intel_asm.S |   48 +
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 62fe22c..477e9d7 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -2681,56 +2681,68 @@ ENTRY(aesni_xts_crypt8)
addq %rcx, KEYP
 
movdqa IV, STATE1
-   pxor 0x00(INP), STATE1
+   movdqu 0x00(INP), INC
+   pxor INC, STATE1
movdqu IV, 0x00(OUTP)
 
_aesni_gf128mul_x_ble()
movdqa IV, STATE2
-   pxor 0x10(INP), STATE2
+   movdqu 0x10(INP), INC
+   pxor INC, STATE2
movdqu IV, 0x10(OUTP)
 
_aesni_gf128mul_x_ble()
movdqa IV, STATE3
-   pxor 0x20(INP), STATE3
+   movdqu 0x20(INP), INC
+   pxor INC, STATE3
movdqu IV, 0x20(OUTP)
 
_aesni_gf128mul_x_ble()
movdqa IV, STATE4
-   pxor 0x30(INP), STATE4
+   movdqu 0x30(INP), INC
+   pxor INC, STATE4
movdqu IV, 0x30(OUTP)
 
call *%r11
 
-   pxor 0x00(OUTP), STATE1
+   movdqu 0x00(OUTP), INC
+   pxor INC, STATE1
movdqu STATE1, 0x00(OUTP)
 
_aesni_gf128mul_x_ble()
movdqa IV, STATE1
-   pxor 0x40(INP), STATE1
+   movdqu 0x40(INP), INC
+   pxor INC, STATE1
movdqu IV, 0x40(OUTP)
 
-   pxor 0x10(OUTP), STATE2
+   movdqu 0x10(OUTP), INC
+   pxor INC, STATE2
movdqu STATE2, 0x10(OUTP)
 
_aesni_gf128mul_x_ble()
movdqa IV, STATE2
-   pxor 0x50(INP), STATE2
+   movdqu 0x50(INP), INC
+   pxor INC, STATE2
movdqu IV, 0x50(OUTP)
 
-   pxor 0x20(OUTP), STATE3
+   movdqu 0x20(OUTP), INC
+   pxor INC, STATE3
movdqu STATE3, 0x20(OUTP)
 
_aesni_gf128mul_x_ble()
movdqa IV, STATE3
-   pxor 0x60(INP), STATE3
+   movdqu 0x60(INP), INC
+   pxor INC, STATE3
movdqu IV, 0x60(OUTP)
 
-   pxor 0x30(OUTP), STATE4
+   movdqu 0x30(OUTP), INC
+   pxor INC, STATE4
movdqu STATE4, 0x30(OUTP)
 
_aesni_gf128mul_x_ble()
movdqa IV, STATE4
-   pxor 0x70(INP), STATE4
+   movdqu 0x70(INP), INC
+   pxor INC, STATE4
movdqu IV, 0x70(OUTP)
 
_aesni_gf128mul_x_ble()
@@ -2738,16 +2750,20 @@ ENTRY(aesni_xts_crypt8)
 
call *%r11
 
-   pxor 0x40(OUTP), STATE1
+   movdqu 0x40(OUTP), INC
+   pxor INC, STATE1
movdqu STATE1, 0x40(OUTP)
 
-   pxor 0x50(OUTP), STATE2
+   movdqu 0x50(OUTP), INC
+   pxor INC, STATE2
movdqu STATE2, 0x50(OUTP)
 
-   pxor 0x60(OUTP), STATE3
+   movdqu 0x60(OUTP), INC
+   pxor INC, STATE3
movdqu STATE3, 0x60(OUTP)
 
-   pxor 0x70(OUTP), STATE4
+   movdqu 0x70(OUTP), INC
+   pxor INC, STATE4
movdqu STATE4, 0x70(OUTP)
 
ret



Re: GPF in aesni_xts_crypt8 (3.10-rc5)

2013-06-11 Thread Jussi Kivilinna
Hello,

Does the attached patch help?

-Jussi

On 11.06.2013 20:26, Dave Jones wrote:
> Just found that 3.10-rc doesn't boot on my laptop with encrypted disk.
> 
> 
> general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> Modules linked in: xfs libcrc32c dm_crypt crc32c_intel ghash_clmulni_intel 
> aesni_intel glue_helper ablk_helper i915 i2c_algo_bit drm_kms_helper drm 
> i2c_core video
> CPU: 1 PID: 53 Comm: kworker/1:1 Not tainted 3.10.0-rc5+ #5 
> Hardware name: LENOVO 2356JK8/2356JK8, BIOS G7ET94WW (2.54 ) 04/30/2013
> Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> task: 880135c58000 ti: 880135c54000 task.ti: 880135c54000
> RIP: 0010:[]  [] 
> aesni_xts_crypt8+0x42/0x1e0 [aesni_intel]
> RSP: 0018:880135c55b68  EFLAGS: 00010282
> RAX: a0142eb8 RBX: 0080 RCX: 00f0
> RDX: 8801316eeaa8 RSI: 8801316eeaa8 RDI: 88012fd84440
> RBP: 880135c55b70 R08: 8801304fe118 R09: 0020
> R10: 00f0 R11: a0142eb8 R12: 8801316eeb28
> R13: 0080 R14: 8801316eeb28 R15: 0180
> FS:  () GS:88013940() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0039e88bc720 CR3: 01c0b000 CR4: 001407e0
> Stack:
>  a0143683 880135c55c40 a00602fb 880135c55c70
>  a0146060 01ad0190 a0146060 ea0004c5bb80
>  8801316eeaa8 ea0004c5bb80 8801316eeaa8 8801304fe0c0
> Call Trace:
>  [] ? aesni_xts_dec8+0x13/0x20 [aesni_intel]
>  [] glue_xts_crypt_128bit+0x10b/0x1c0 [glue_helper]
>  [] xts_decrypt+0x4b/0x50 [aesni_intel]
>  [] ablk_decrypt+0x4f/0xd0 [ablk_helper]
>  [] crypt_convert+0x352/0x3b0 [dm_crypt]
>  [] kcryptd_crypt+0x355/0x4e0 [dm_crypt]
>  [] ? process_one_work+0x1a5/0x700
>  [] process_one_work+0x211/0x700
>  [] ? process_one_work+0x1a5/0x700
>  [] worker_thread+0x11b/0x3a0
>  [] ? process_one_work+0x700/0x700
>  [] kthread+0xed/0x100
>  [] ? insert_kthread_work+0x80/0x80
>  [] ret_from_fork+0x7c/0xb0
>  [] ? insert_kthread_work+0x80/0x80
> Code: 8d 04 25 b8 2e 14 a0 41 0f 44 ca 4c 0f 44 d8 66 44 0f 6f 14 25 00 70 14 
> a0 41 0f 10 18 44 8b 8f e0 01 00 00 48 01 cf 66 0f 6f c3 <66> 0f ef 02 f3 0f 
> 7f 1e 66 44 0f 70 db 13 66 0f d4 db 66 41 0f 
> RIP  [] aesni_xts_crypt8+0x42/0x1e0 [aesni_intel]
>  RSP 
> 
>0: 8d 04 25 b8 2e 14 a0lea0xa0142eb8,%eax
>7: 41 0f 44 ca cmove  %r10d,%ecx
>b: 4c 0f 44 d8 cmove  %rax,%r11
>f: 66 44 0f 6f 14 25 00movdqa 0xa0147000,%xmm10
>   16: 70 14 a0 
>   19: 41 0f 10 18 movups (%r8),%xmm3
>   1d: 44 8b 8f e0 01 00 00mov0x1e0(%rdi),%r9d
>   24: 48 01 cfadd%rcx,%rdi
>   27: 66 0f 6f c3 movdqa %xmm3,%xmm0
>   2b:*66 0f ef 02 pxor   (%rdx),%xmm0 <-- trapping 
> instruction
>   2f: f3 0f 7f 1e movdqu %xmm3,(%rsi)
>   33: 66 44 0f 70 db 13   pshufd $0x13,%xmm3,%xmm11
>   39: 66 0f d4 db paddq  %xmm3,%xmm3
>   3d: 66  data16
>   3e: 41  rex.B
>   3f: 
> 
> 

crypto: aesni_intel - fix accessing of unaligned memory

From: Jussi Kivilinna 

The new XTS code for aesni_intel uses input buffers directly as memory operands
for pxor instructions, which causes a crash if those buffers are not aligned to
16 bytes.

The patch changes the XTS code to handle unaligned memory correctly, by loading
memory with movdqu instead.

Reported-by: Dave Jones 
Signed-off-by: Jussi Kivilinna 
---
 arch/x86/crypto/aesni-intel_asm.S |   48 +
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 62fe22c..477e9d7 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -2681,56 +2681,68 @@ ENTRY(aesni_xts_crypt8)
 	addq %rcx, KEYP
 
 	movdqa IV, STATE1
-	pxor 0x00(INP), STATE1
+	movdqu 0x00(INP), INC
+	pxor INC, STATE1
 	movdqu IV, 0x00(OUTP)
 
 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE2
-	pxor 0x10(INP), STATE2
+	movdqu 0x10(INP), INC
+	pxor INC, STATE2
 	movdqu IV, 0x10(OUTP)
 
 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE3
-	pxor 0x20(INP), STATE3
+	movdqu 0x20(INP), INC
+	pxor INC, STATE3
 	movdqu IV, 0x20(OUTP)
 
 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE4
-	pxor 0x30(INP), STATE4
+	movdqu 0x30(INP), INC
+	pxor INC, STATE4
 	movdqu IV, 0x30(OUTP)
 
 	call *%r11
 
-	pxor 0x00(OUTP), STATE1
+	movdqu 0x00(OUTP), INC
+	pxor INC, STATE1
 	movdqu STATE1, 0x00(OUTP)
 
 	_aesni_gf128mul_x_ble()
 	movdqa IV, STATE1
-	pxor 0x40(INP), STATE1
+	movdqu 0x40(INP), INC
+	pxor INC, STATE1
 	movdqu IV, 0x40(OUTP)
 
-	p


Re: [PATCH 4/4] Simple correctness and speed test for CRCT10DIF hash

2013-04-17 Thread Jussi Kivilinna
On 16.04.2013 19:20, Tim Chen wrote:
> These are simple tests to do sanity check of CRC T10 DIF hash.  The
> correctness of the transform can be checked with the command
>   modprobe tcrypt mode=47
> The speed of the transform can be evaluated with the command
>   modprobe tcrypt mode=320
> 
> Set the cpu frequency to constant and turn turbo off when running the
> speed test so the frequency governor will not tweak the frequency and
> affects the measurements.
> 
> Signed-off-by: Tim Chen 
> Tested-by: Keith Busch 

>  
> +#define CRCT10DIF_TEST_VECTORS   2
> +static struct hash_testvec crct10dif_tv_template[] = {
> + {
> + .plaintext = "abc",
> + .psize  = 3,
> +#ifdef __LITTLE_ENDIAN
> + .digest = "\x3b\x44",
> +#else
> + .digest = "\x44\x3b",
> +#endif
> + }, {
> + .plaintext =
> + "abcd",
> + .psize  = 56,
> +#ifdef __LITTLE_ENDIAN
> + .digest = "\xe3\x9c",
> +#else
> + .digest = "\x9c\xe3",
> +#endif
> + .np = 2,
> + .tap= { 28, 28 }
> + }
> +};
> +

Are these large enough to test all code paths in the PCLMULQDQ implementation?
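
(If not, longer vectors can be generated and cross-checked against a plain
bitwise CRC-T10DIF implementation, polynomial 0x8bb7 -- a reference sketch,
not the kernel code:)

    #include <stdint.h>
    #include <stddef.h>

    static uint16_t crc_t10dif_ref(const uint8_t *buf, size_t len)
    {
            uint16_t crc = 0;

            for (size_t i = 0; i < len; i++) {
                    crc ^= (uint16_t)buf[i] << 8;
                    for (int j = 0; j < 8; j++)
                            crc = (crc & 0x8000) ?
                                  (uint16_t)((crc << 1) ^ 0x8bb7) :
                                  (uint16_t)(crc << 1);
            }
            /* digests in the vectors above are this value stored in CPU
             * byte order (little-endian on x86) */
            return crc;
    }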

-Jussi



Re: [PATCH 2/4] Accelerated CRC T10 DIF computation with PCLMULQDQ instruction

2013-04-17 Thread Jussi Kivilinna
On 16.04.2013 19:20, Tim Chen wrote:
> This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ
> instructions.  Details discussing the implementation can be found in the
> paper:
> 
> "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
> URL: http://download.intel.com/design/intarch/papers/323102.pdf

URL does not work.

> 
> Signed-off-by: Tim Chen 
> Tested-by: Keith Busch 
> ---
>  arch/x86/crypto/crct10dif-pcl-asm_64.S | 659 
> +
>  1 file changed, 659 insertions(+)
>  create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S

> +
> + # Allocate Stack Space
> + mov %rsp, %rcx
> + sub $16*10, %rsp
> + and $~(0x20 - 1), %rsp
> +
> + # push the xmm registers into the stack to maintain
> + movdqa %xmm10, 16*2(%rsp)
> + movdqa %xmm11, 16*3(%rsp)
> + movdqa %xmm8 , 16*4(%rsp)
> + movdqa %xmm12, 16*5(%rsp)
> + movdqa %xmm13, 16*6(%rsp)
> + movdqa %xmm6,  16*7(%rsp)
> + movdqa %xmm7,  16*8(%rsp)
> + movdqa %xmm9,  16*9(%rsp)

You don't need to store (and restore) these, as 'crc_t10dif_pcl' is called 
between kernel_fpu_begin/_end.
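
(In other words, the C glue already brackets the call roughly like this --
a sketch with assumed declarations, not the actual patch:)

    #include <linux/types.h>
    #include <linux/linkage.h>
    #include <asm/i387.h>  /* kernel_fpu_begin()/_end(); <asm/fpu/api.h> in later kernels */

    asmlinkage __u16 crc_t10dif_pcl(__u16 init_crc, const u8 *buf, size_t len);

    static __u16 crc_t10dif_update_pcl(__u16 crc, const u8 *buf, size_t len)
    {
            kernel_fpu_begin();   /* XMM state is saved here ... */
            crc = crc_t10dif_pcl(crc, buf, len);
            kernel_fpu_end();     /* ... and restored here */
            return crc;
    }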

> +
> +
> + # check if smaller than 256
> + cmp $256, arg3
> +

> +_cleanup:
> + # scale the result back to 16 bits
> + shr $16, %eax
> + movdqa  16*2(%rsp), %xmm10
> + movdqa  16*3(%rsp), %xmm11
> + movdqa  16*4(%rsp), %xmm8
> + movdqa  16*5(%rsp), %xmm12
> + movdqa  16*6(%rsp), %xmm13
> + movdqa  16*7(%rsp), %xmm6
> + movdqa  16*8(%rsp), %xmm7
> + movdqa  16*9(%rsp), %xmm9

Registers are overwritten by kernel_fpu_end.

> + mov %rcx, %rsp
> + ret
> +ENDPROC(crc_t10dif_pcl)
> +

You should move ENDPROC to the end of the full function.

> +
> +
> +.align 16
> +_less_than_128:
> +
> + # check if there is enough buffer to be able to fold 16B at a time
> + cmp $32, arg3

> + movdqa  (%rsp), %xmm7
> + pshufb  %xmm11, %xmm7
> + pxor%xmm0 , %xmm7   # xor the initial crc value
> +
> + psrldq  $7, %xmm7
> +
> + jmp _barrett

Move ENDPROC here.


 -Jussi


[RFC PATCH 2/6] crypto: tcrypt - add async cipher speed tests for blowfish

2013-04-13 Thread Jussi Kivilinna
Signed-off-by: Jussi Kivilinna 
---
 crypto/tcrypt.c |   15 +++
 1 file changed, 15 insertions(+)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 24ea7df..66d254c 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1768,6 +1768,21 @@ static int do_test(int m)
   speed_template_32_64);
break;
 
+   case 509:
+   test_acipher_speed("ecb(blowfish)", ENCRYPT, sec, NULL, 0,
+  speed_template_8_32);
+   test_acipher_speed("ecb(blowfish)", DECRYPT, sec, NULL, 0,
+  speed_template_8_32);
+   test_acipher_speed("cbc(blowfish)", ENCRYPT, sec, NULL, 0,
+  speed_template_8_32);
+   test_acipher_speed("cbc(blowfish)", DECRYPT, sec, NULL, 0,
+  speed_template_8_32);
+   test_acipher_speed("ctr(blowfish)", ENCRYPT, sec, NULL, 0,
+  speed_template_8_32);
+   test_acipher_speed("ctr(blowfish)", DECRYPT, sec, NULL, 0,
+  speed_template_8_32);
+   break;
+
case 1000:
test_available();
break;
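
For reference, once applied the new speed test is run the same way as the
existing tcrypt modes, e.g.:

    modprobe tcrypt mode=509 sec=1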



[RFC PATCH 5/6] crypto: serpent - add AVX2/x86_64 assembler implementation of serpent cipher

2013-04-13 Thread Jussi Kivilinna
Patch adds an AVX2/x86-64 implementation of the Serpent cipher, requiring 16
parallel blocks of input (256 bytes). The implementation is based on the AVX
implementation and extends it to use the 256-bit wide YMM registers. Since
Serpent does not use table look-ups, this implementation should be close to
two times faster than the AVX implementation.

Signed-off-by: Jussi Kivilinna 
---
 arch/x86/crypto/Makefile  |2 
 arch/x86/crypto/serpent-avx2-asm_64.S |  800 +
 arch/x86/crypto/serpent_avx2_glue.c   |  562 
 arch/x86/crypto/serpent_avx_glue.c|   62 ++
 arch/x86/include/asm/crypto/serpent-avx.h |   24 +
 crypto/Kconfig|   23 +
 crypto/testmgr.c  |   15 +
 7 files changed, 1468 insertions(+), 20 deletions(-)
 create mode 100644 arch/x86/crypto/serpent-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/serpent_avx2_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 1f6e0c2..a21af59 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -43,6 +43,7 @@ endif
 # These modules require assembler to support AVX2.
 ifeq ($(avx2_supported),yes)
obj-$(CONFIG_CRYPTO_BLOWFISH_AVX2_X86_64) += blowfish-avx2.o
+   obj-$(CONFIG_CRYPTO_SERPENT_AVX2_X86_64) += serpent-avx2.o
obj-$(CONFIG_CRYPTO_TWOFISH_AVX2_X86_64) += twofish-avx2.o
 endif
 
@@ -72,6 +73,7 @@ endif
 
 ifeq ($(avx2_supported),yes)
blowfish-avx2-y := blowfish-avx2-asm_64.o blowfish_avx2_glue.o
+   serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o
twofish-avx2-y := twofish-avx2-asm_64.o twofish_avx2_glue.o
 endif
 
diff --git a/arch/x86/crypto/serpent-avx2-asm_64.S 
b/arch/x86/crypto/serpent-avx2-asm_64.S
new file mode 100644
index 000..b222085
--- /dev/null
+++ b/arch/x86/crypto/serpent-avx2-asm_64.S
@@ -0,0 +1,800 @@
+/*
+ * x86_64/AVX2 assembler optimized version of Serpent
+ *
+ * Copyright © 2012-2013 Jussi Kivilinna 
+ *
+ * Based on AVX assembler implementation of Serpent by:
+ *  Copyright © 2012 Johannes Goetzfried
+ *  
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ */
+
+#include 
+#include "glue_helper-asm-avx2.S"
+
+.file "serpent-avx2-asm_64.S"
+
+.data
+.align 16
+
+.Lbswap128_mask:
+   .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+.Lxts_gf128mul_and_shl1_mask_0:
+   .byte 0x87, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0
+.Lxts_gf128mul_and_shl1_mask_1:
+   .byte 0x0e, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0
+
+.text
+
+#define CTX %rdi
+
+#define RNOT %ymm0
+#define tp  %ymm1
+
+#define RA1 %ymm2
+#define RA2 %ymm3
+#define RB1 %ymm4
+#define RB2 %ymm5
+#define RC1 %ymm6
+#define RC2 %ymm7
+#define RD1 %ymm8
+#define RD2 %ymm9
+#define RE1 %ymm10
+#define RE2 %ymm11
+
+#define RK0 %ymm12
+#define RK1 %ymm13
+#define RK2 %ymm14
+#define RK3 %ymm15
+
+#define RK0x %xmm12
+#define RK1x %xmm13
+#define RK2x %xmm14
+#define RK3x %xmm15
+
+#define S0_1(x0, x1, x2, x3, x4)  \
+   vporx0,   x3, tp; \
+   vpxor   x3,   x0, x0; \
+   vpxor   x2,   x3, x4; \
+   vpxor   RNOT, x4, x4; \
+   vpxor   x1,   tp, x3; \
+   vpand   x0,   x1, x1; \
+   vpxor   x4,   x1, x1; \
+   vpxor   x0,   x2, x2;
+#define S0_2(x0, x1, x2, x3, x4)  \
+   vpxor   x3,   x0, x0; \
+   vporx0,   x4, x4; \
+   vpxor   x2,   x0, x0; \
+   vpand   x1,   x2, x2; \
+   vpxor   x2,   x3, x3; \
+   vpxor   RNOT, x1, x1; \
+   vpxor   x4,   x2, x2; \
+   vpxor   x2,   x1, x1;
+
+#define S1_1(x0, x1, x2, x3, x4)  \
+   vpxor   x0,   x1, tp; \
+   vpxor   x3,   x0, x0; \
+   vpxor   RNOT, x3, x3; \
+   vpand   tp,   x1, x4; \
+   vportp,   x0, x0; \
+   vpxor   x2,   x3, x3; \
+   vpxor   x3,   x0, x0; \
+   vpxor   x3,   tp, x1;
+#define S1_2(x0, x1, x2, x3, x4)  \
+   vpxor   x4,   x3, x3; \
+   vporx4,   x1, x1; \
+   vpxor   x2,   x4, x4; \
+   vpand   x0,   x2, x2; \
+   vpxor   x1,   x2, x2; \
+   vporx0,   x1, x1; \
+   vpxor   RNOT, x0, x0; \
+   vpxor   x2,   x0, x0; \
+   vpxor   x1,   x4, x4;
+
+#define S2_1(x0, x1, x2, x3, x4)  \
+   vpxor   RNOT, x3, x3; \
+   vpxor   x0,   x1, x1; \
+   vpand   x2,   x0, tp; \
+   vpxor   x3,   tp, tp; \
+   vporx0,   x3, x3; \
+   vpxor   x1,   x2,

[RFC PATCH 4/6] crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher

2013-04-13 Thread Jussi Kivilinna
Patch adds an AVX2/x86-64 implementation of the Twofish cipher, requiring 16 parallel
blocks for input (256 bytes). Table look-ups are performed using the vpgatherdd
instruction directly from vector registers and thus should be faster than in
earlier implementations. The implementation also uses 256-bit wide YMM registers,
which should give an additional speed-up compared to the AVX implementation.

Signed-off-by: Jussi Kivilinna 
---
 arch/x86/crypto/Makefile   |2 
 arch/x86/crypto/glue_helper-asm-avx2.S |  180 ++
 arch/x86/crypto/twofish-avx2-asm_64.S  |  600 
 arch/x86/crypto/twofish_avx2_glue.c|  584 +++
 arch/x86/crypto/twofish_avx_glue.c |   14 +
 arch/x86/include/asm/crypto/twofish.h  |   18 +
 crypto/Kconfig |   24 +
 crypto/testmgr.c   |   12 +
 8 files changed, 1432 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/crypto/glue_helper-asm-avx2.S
 create mode 100644 arch/x86/crypto/twofish-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/twofish_avx2_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 28464ef..1f6e0c2 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -43,6 +43,7 @@ endif
 # These modules require assembler to support AVX2.
 ifeq ($(avx2_supported),yes)
obj-$(CONFIG_CRYPTO_BLOWFISH_AVX2_X86_64) += blowfish-avx2.o
+   obj-$(CONFIG_CRYPTO_TWOFISH_AVX2_X86_64) += twofish-avx2.o
 endif
 
 aes-i586-y := aes-i586-asm_32.o aes_glue.o
@@ -71,6 +72,7 @@ endif
 
 ifeq ($(avx2_supported),yes)
blowfish-avx2-y := blowfish-avx2-asm_64.o blowfish_avx2_glue.o
+   twofish-avx2-y := twofish-avx2-asm_64.o twofish_avx2_glue.o
 endif
 
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
diff --git a/arch/x86/crypto/glue_helper-asm-avx2.S 
b/arch/x86/crypto/glue_helper-asm-avx2.S
new file mode 100644
index 000..a53ac11
--- /dev/null
+++ b/arch/x86/crypto/glue_helper-asm-avx2.S
@@ -0,0 +1,180 @@
+/*
+ * Shared glue code for 128bit block ciphers, AVX2 assembler macros
+ *
+ * Copyright © 2012-2013 Jussi Kivilinna 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ */
+
+#define load_16way(src, x0, x1, x2, x3, x4, x5, x6, x7) \
+   vmovdqu (0*32)(src), x0; \
+   vmovdqu (1*32)(src), x1; \
+   vmovdqu (2*32)(src), x2; \
+   vmovdqu (3*32)(src), x3; \
+   vmovdqu (4*32)(src), x4; \
+   vmovdqu (5*32)(src), x5; \
+   vmovdqu (6*32)(src), x6; \
+   vmovdqu (7*32)(src), x7;
+
+#define store_16way(dst, x0, x1, x2, x3, x4, x5, x6, x7) \
+   vmovdqu x0, (0*32)(dst); \
+   vmovdqu x1, (1*32)(dst); \
+   vmovdqu x2, (2*32)(dst); \
+   vmovdqu x3, (3*32)(dst); \
+   vmovdqu x4, (4*32)(dst); \
+   vmovdqu x5, (5*32)(dst); \
+   vmovdqu x6, (6*32)(dst); \
+   vmovdqu x7, (7*32)(dst);
+
+#define store_cbc_16way(src, dst, x0, x1, x2, x3, x4, x5, x6, x7, t0) \
+   vpxor t0, t0, t0; \
+   vinserti128 $1, (src), t0, t0; \
+   vpxor t0, x0, x0; \
+   vpxor (0*32+16)(src), x1, x1; \
+   vpxor (1*32+16)(src), x2, x2; \
+   vpxor (2*32+16)(src), x3, x3; \
+   vpxor (3*32+16)(src), x4, x4; \
+   vpxor (4*32+16)(src), x5, x5; \
+   vpxor (5*32+16)(src), x6, x6; \
+   vpxor (6*32+16)(src), x7, x7; \
+   store_16way(dst, x0, x1, x2, x3, x4, x5, x6, x7);
+
+#define inc_le128(x, minus_one, tmp) \
+   vpcmpeqq minus_one, x, tmp; \
+   vpsubq minus_one, x, x; \
+   vpslldq $8, tmp, tmp; \
+   vpsubq tmp, x, x;
+
+#define add2_le128(x, minus_one, minus_two, tmp1, tmp2) \
+   vpcmpeqq minus_one, x, tmp1; \
+   vpcmpeqq minus_two, x, tmp2; \
+   vpsubq minus_two, x, x; \
+   vpor tmp2, tmp1, tmp1; \
+   vpslldq $8, tmp1, tmp1; \
+   vpsubq tmp1, x, x;
+
+#define load_ctr_16way(iv, bswap, x0, x1, x2, x3, x4, x5, x6, x7, t0, t0x, t1, 
\
+  t1x, t2, t2x, t3, t3x, t4, t5) \
+   vpcmpeqd t0, t0, t0; \
+   vpsrldq $8, t0, t0; /* ab: -1:0 ; cd: -1:0 */ \
+   vpaddq t0, t0, t4; /* ab: -2:0 ; cd: -2:0 */\
+   \
+   /* load IV and byteswap */ \
+   vmovdqu (iv), t2x; \
+   vmovdqa t2x, t3x; \
+   inc_le128(t2x, t0x, t1x); \
+   vbroadcasti128 bswap, t1; \
+   vinserti128 $1, t2x, t3, t2; /* ab: le0 ; cd: le1 */ \
+   vpshufb t1, t2, x0; \
+   \
+   /* construct IVs */ \
+   add2_le128(t2, t0, t4, t3, t5); /* ab: le2 ; cd: le3 */ \
+   vpshufb t1, t2, x1; \
+   add2_le128(t2, t0, t4, t3, t5); \
+   vpshufb t1, t2, x2; \
+   add2_le128(t2, t0, t4, t3, t5); \
+   vpshufb t1, t2, x3; \
+   add2_le128(t2, t0, t4, t3, t5); \
+   vpshufb t1, t2, x4; \
+   add2_le128(t2, t0, t4, t3, t5

[RFC PATCH 1/6] crypto: testmgr - extend camellia test-vectors for camellia-aesni/avx2

2013-04-13 Thread Jussi Kivilinna
Signed-off-by: Jussi Kivilinna 
---
 crypto/testmgr.h | 1100 --
 1 file changed, 1062 insertions(+), 38 deletions(-)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index d503660..dc2c054 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -20997,8 +20997,72 @@ static struct cipher_testvec 
camellia_enc_tv_template[] = {
  "\x86\x1D\xB4\x28\xBF\x56\xED\x61"
  "\xF8\x8F\x03\x9A\x31\xC8\x3C\xD3"
  "\x6A\x01\x75\x0C\xA3\x17\xAE\x45"
- "\xDC\x50\xE7\x7E\x15\x89\x20\xB7",
-   .ilen   = 496,
+ "\xDC\x50\xE7\x7E\x15\x89\x20\xB7"
+ "\x2B\xC2\x59\xF0\x64\xFB\x92\x06"
+ "\x9D\x34\xCB\x3F\xD6\x6D\x04\x78"
+ "\x0F\xA6\x1A\xB1\x48\xDF\x53\xEA"
+ "\x81\x18\x8C\x23\xBA\x2E\xC5\x5C"
+ "\xF3\x67\xFE\x95\x09\xA0\x37\xCE"
+ "\x42\xD9\x70\x07\x7B\x12\xA9\x1D"
+ "\xB4\x4B\xE2\x56\xED\x84\x1B\x8F"
+ "\x26\xBD\x31\xC8\x5F\xF6\x6A\x01"
+ "\x98\x0C\xA3\x3A\xD1\x45\xDC\x73"
+ "\x0A\x7E\x15\xAC\x20\xB7\x4E\xE5"
+ "\x59\xF0\x87\x1E\x92\x29\xC0\x34"
+ "\xCB\x62\xF9\x6D\x04\x9B\x0F\xA6"
+ "\x3D\xD4\x48\xDF\x76\x0D\x81\x18"
+ "\xAF\x23\xBA\x51\xE8\x5C\xF3\x8A"
+ "\x21\x95\x2C\xC3\x37\xCE\x65\xFC"
+ "\x70\x07\x9E\x12\xA9\x40\xD7\x4B"
+ "\xE2\x79\x10\x84\x1B\xB2\x26\xBD"
+ "\x54\xEB\x5F\xF6\x8D\x01\x98\x2F"
+ "\xC6\x3A\xD1\x68\xFF\x73\x0A\xA1"
+ "\x15\xAC\x43\xDA\x4E\xE5\x7C\x13"
+ "\x87\x1E\xB5\x29\xC0\x57\xEE\x62"
+ "\xF9\x90\x04\x9B\x32\xC9\x3D\xD4"
+ "\x6B\x02\x76\x0D\xA4\x18\xAF\x46"
+ "\xDD\x51\xE8\x7F\x16\x8A\x21\xB8"
+ "\x2C\xC3\x5A\xF1\x65\xFC\x93\x07"
+ "\x9E\x35\xCC\x40\xD7\x6E\x05\x79"
+ "\x10\xA7\x1B\xB2\x49\xE0\x54\xEB"
+ "\x82\x19\x8D\x24\xBB\x2F\xC6\x5D"
+ "\xF4\x68\xFF\x96\x0A\xA1\x38\xCF"
+ "\x43\xDA\x71\x08\x7C\x13\xAA\x1E"
+ "\xB5\x4C\xE3\x57\xEE\x85\x1C\x90"
+ "\x27\xBE\x32\xC9\x60\xF7\x6B\x02"
+ "\x99\x0D\xA4\x3B\xD2\x46\xDD\x74"
+ "\x0B\x7F\x16\xAD\x21\xB8\x4F\xE6"
+ "\x5A\xF1\x88\x1F\x93\x2A\xC1\x35"
+ "\xCC\x63\xFA\x6E\x05\x9C\x10\xA7"
+ "\x3E\xD5\x49\xE0\x77\x0E\x82\x19"
+ "\xB0\x24\xBB\x52\xE9\x5D\xF4\x8B"
+ "\x22\x96\x2D\xC4\x38\xCF\x66\xFD"
+ "\x71\x08\x9F\x13\xAA\x41\xD8\x4C"
+ "\xE3\x7A\x11\x85\x1C\xB3\x27\xBE"
+ "\x55\xEC\x60\xF7\x8E\x02\x99\x30"
+ "\xC7\x3B\xD2\x69\x00\x74\x0B\xA2"
+ "\x16\xAD\x44\xDB\x4F\xE6\x7D\x14"
+ "\x88\x1F\xB6\x2A\xC1\x58\xEF\x63"
+ "\xFA\x91\x05\x9C\x33\xCA\x3E\xD5"
+ "\x6C\x03\x77\x0E\xA5\x19\xB0\x47"
+ "\xDE\x52\xE9\x80\x17\x8B\x22\xB9"
+ "\x2D\xC4\x5B\xF2\x66\xFD\x94\x08"
+ "\x9F\x36\xCD\x41\xD8\x6F\x06\x7A"
+ "\x11\xA8\x1C\xB3\x4A\xE1\x55\xEC"
+ "\x83\x1A\x8E\x25\xBC\x30\xC7\x5E"
+ "\xF5\x69\x00\x97\x0B\xA2\x39\xD0"
+ "\x44\xDB\x72\x09\x7D\x14\xAB\x1F"
+ "\xB6\x4D\xE4\x58\xEF\x86\x1D\x91"
+ "\x28\xBF\x33\xCA\x61\xF8\x6C\x03"
+ "\x9A\x0E\xA5\x3C\xD3\x47\xDE\x75"
+ "\x0C\x80\x17\xAE\x22\xB9\x50\xE7"
+ "\x5B\xF2\x89\x20\x94\x2B\xC2\x36"
+ &

[RFC PATCH 3/6] crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher

2013-04-13 Thread Jussi Kivilinna
Patch adds an AVX2/x86-64 implementation of the Blowfish cipher, requiring 32 parallel
blocks for input (256 bytes). Table look-ups are performed using the vpgatherdd
instruction directly from vector registers and thus should be faster than in
earlier implementations.

Signed-off-by: Jussi Kivilinna 
---
 arch/x86/crypto/Makefile   |   11 +
 arch/x86/crypto/blowfish-avx2-asm_64.S |  449 +
 arch/x86/crypto/blowfish_avx2_glue.c   |  585 
 arch/x86/crypto/blowfish_glue.c|   32 --
 arch/x86/include/asm/cpufeature.h  |1 
 arch/x86/include/asm/crypto/blowfish.h |   43 ++
 crypto/Kconfig |   18 +
 crypto/testmgr.c   |   12 +
 8 files changed, 1127 insertions(+), 24 deletions(-)
 create mode 100644 arch/x86/crypto/blowfish-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/blowfish_avx2_glue.c
 create mode 100644 arch/x86/include/asm/crypto/blowfish.h

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 03cd731..28464ef 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -3,6 +3,8 @@
 #
 
 avx_supported := $(call as-instr,vpxor %xmm0$(comma)%xmm0$(comma)%xmm0,yes,no)
+avx2_supported := $(call as-instr,vpgatherdd %ymm0$(comma)(%eax$(comma)%ymm1\
+   $(comma)4)$(comma)%ymm2,yes,no)
 
 obj-$(CONFIG_CRYPTO_ABLK_HELPER_X86) += ablk_helper.o
 obj-$(CONFIG_CRYPTO_GLUE_HELPER_X86) += glue_helper.o
@@ -38,6 +40,11 @@ ifeq ($(avx_supported),yes)
obj-$(CONFIG_CRYPTO_SERPENT_AVX_X86_64) += serpent-avx-x86_64.o
 endif
 
+# These modules require assembler to support AVX2.
+ifeq ($(avx2_supported),yes)
+   obj-$(CONFIG_CRYPTO_BLOWFISH_AVX2_X86_64) += blowfish-avx2.o
+endif
+
 aes-i586-y := aes-i586-asm_32.o aes_glue.o
 twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
 salsa20-i586-y := salsa20-i586-asm_32.o salsa20_glue.o
@@ -62,6 +69,10 @@ ifeq ($(avx_supported),yes)
serpent_avx_glue.o
 endif
 
+ifeq ($(avx2_supported),yes)
+   blowfish-avx2-y := blowfish-avx2-asm_64.o blowfish_avx2_glue.o
+endif
+
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
 sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
diff --git a/arch/x86/crypto/blowfish-avx2-asm_64.S 
b/arch/x86/crypto/blowfish-avx2-asm_64.S
new file mode 100644
index 000..784452e
--- /dev/null
+++ b/arch/x86/crypto/blowfish-avx2-asm_64.S
@@ -0,0 +1,449 @@
+/*
+ * x86_64/AVX2 assembler optimized version of Blowfish
+ *
+ * Copyright © 2012-2013 Jussi Kivilinna 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ */
+
+#include 
+
+.file "blowfish-avx2-asm_64.S"
+
+.data
+.align 32
+
+.Lprefetch_mask:
+.long 0*64
+.long 1*64
+.long 2*64
+.long 3*64
+.long 4*64
+.long 5*64
+.long 6*64
+.long 7*64
+
+.Lbswap32_mask:
+.long 0x00010203
+.long 0x04050607
+.long 0x08090a0b
+.long 0x0c0d0e0f
+
+.Lbswap128_mask:
+   .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+.Lbswap_iv_mask:
+   .byte 7, 6, 5, 4, 3, 2, 1, 0, 7, 6, 5, 4, 3, 2, 1, 0
+
+.text
+/* structure of crypto context */
+#define p  0
+#define s0 ((16 + 2) * 4)
+#define s1 ((16 + 2 + (1 * 256)) * 4)
+#define s2 ((16 + 2 + (2 * 256)) * 4)
+#define s3 ((16 + 2 + (3 * 256)) * 4)
+
+/* register macros */
+#define CTX%rdi
+#define RIO %rdx
+
+#define RS0%rax
+#define RS1%r8
+#define RS2%r9
+#define RS3%r10
+
+#define RLOOP  %r11
+#define RLOOPd %r11d
+
+#define RXr0   %ymm8
+#define RXr1   %ymm9
+#define RXr2   %ymm10
+#define RXr3   %ymm11
+#define RXl0   %ymm12
+#define RXl1   %ymm13
+#define RXl2   %ymm14
+#define RXl3   %ymm15
+
+/* temp regs */
+#define RT0%ymm0
+#define RT0x   %xmm0
+#define RT1%ymm1
+#define RT1x   %xmm1
+#define RIDX0  %ymm2
+#define RIDX1  %ymm3
+#define RIDX1x %xmm3
+#define RIDX2  %ymm4
+#define RIDX3  %ymm5
+
+/* vpgatherdd mask and '-1' */
+#define RNOT   %ymm6
+
+/* byte mask, (-1 >> 24) */
+#define RBYTE  %ymm7
+
+/***
+ * 32-way AVX2 blowfish
+ ***/
+#define F(xl, xr) \
+   vpsrld $24, xl, RIDX0; \
+   vpsrld $16, xl, RIDX1; \
+   vpsrld $8, xl, RIDX2; \
+   vpand RBYTE, RIDX1, RIDX1; \
+   vpand RBYTE, RIDX2, RIDX2; \
+   vpand RBYTE, xl, RIDX3; \
+   \
+   vpgatherdd RNOT, (RS0, RIDX0, 4), RT0; \
+   vpcmpeqd RNOT, RNOT, RNOT; \
+   vpcmpeqd RIDX0, RIDX0, RIDX0; \
+   \
+   vpgatherdd RNOT, (RS1, RIDX1, 4), RT1; \
+   vpcmpeqd RIDX1, RIDX1, RIDX1; \
+   vpad

[RFC PATCH 0/6] Add AVX2 accelerated implementations for Blowfish, Twofish, Serpent and Camellia

2013-04-13 Thread Jussi Kivilinna
The following series implements four block ciphers - Blowfish, Twofish, Serpent
and Camellia - using the AVX2 instruction set. This work on AVX2 implementations
started over a year ago and has been available at
https://github.com/jkivilin/crypto-avx2

The Serpent and Camellia implementations are directly based on the word-sliced
and byte-sliced AVX implementations and have been extended to use the 256-bit
YMM registers. As such, the performance should be better than with the 128-bit
wide AVX implementations. (The Camellia implementation needs some extra handling
for AES-NI, as the AES instructions are still only 128 bits wide.)

The Blowfish and Twofish implementations utilize the new vpgatherdd instruction to
perform eight vectorized 8x32-bit table look-ups at once. This is different
from the previous word-sliced AVX implementations, where table look-ups have
to be performed through general-purpose registers. The AVX2 implementations thus
avoid the extra movement of data between the SIMD and general-purpose registers
and should therefore be faster.
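
To make the table-look-up idea concrete, here is a rough C-intrinsics sketch
(illustration only, not code from the patches; the 256-entry table and the
function name are hypothetical, compile with -mavx2): a single vpgatherdd
performs eight independent 32-bit loads whose indices come straight from a
vector register.

#include <immintrin.h>
#include <stdint.h>

/* hypothetical 256-entry 8->32-bit S-box table */
extern const uint32_t sbox[256];

/* eight parallel sbox[] look-ups, indexed by the low byte of each lane */
static inline __m256i sbox_lookup_8x(__m256i x)
{
	__m256i idx = _mm256_and_si256(x, _mm256_set1_epi32(0xff));

	/* one vpgatherdd: eight 4-byte loads from sbox + 4*idx */
	return _mm256_i32gather_epi32((const int *)sbox, idx, 4);
}

The assembler versions do the same with vpgatherdd into %ymm registers, plus
the extra step of re-initializing the gather mask register (vpcmpeqd) after
each gather, since vpgatherdd clears the mask as it completes.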

For obvious reasons, I have not tested these implementations on real hardware.
The kernel tcrypt tests have been run under Bochs, which should contain a somewhat
working AVX2 implementation. But I cannot be sure; even the Intel SDE emulator
that I used for testing these implementations did not quite follow the specs
(a past version of SDE that I initially used allowed the vector register operands
of vgather to be the same, whereas the specs say an exception should be raised in
that case). Because of this, the first versions of the patchset in the above
repository are broken.

So since I'm unable to verify that these implementations work on real hardware
and am unable to conduct a real performance evaluation, I'm sending this
patchset as an RFC. Maybe someone can actually test these on real hardware and
perhaps give an Acked-by in case they look OK. If that is not possible, I'll
do the testing myself once Haswell processors become available where I live.

-Jussi

---

Jussi Kivilinna (6):
  crypto: testmgr - extend camellia test-vectors for camellia-aesni/avx2
  crypto: tcrypt - add async cipher speed tests for blowfish
  crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher
  crypto: twofish - add AVX2/x86_64 assembler implementation of twofish 
cipher
  crypto: serpent - add AVX2/x86_64 assembler implementation of serpent 
cipher
  crypto: camellia - add AVX2/AES-NI/x86_64 assembler implementation of 
camellia cipher


 arch/x86/crypto/Makefile |   17 
 arch/x86/crypto/blowfish-avx2-asm_64.S   |  449 +
 arch/x86/crypto/blowfish_avx2_glue.c |  585 +++
 arch/x86/crypto/blowfish_glue.c  |   32 -
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 1368 ++
 arch/x86/crypto/camellia_aesni_avx2_glue.c   |  586 +++
 arch/x86/crypto/camellia_aesni_avx_glue.c|   17 
 arch/x86/crypto/glue_helper-asm-avx2.S   |  180 +++
 arch/x86/crypto/serpent-avx2-asm_64.S|  800 +++
 arch/x86/crypto/serpent_avx2_glue.c  |  562 +++
 arch/x86/crypto/serpent_avx_glue.c   |   62 +
 arch/x86/crypto/twofish-avx2-asm_64.S|  600 +++
 arch/x86/crypto/twofish_avx2_glue.c  |  584 +++
 arch/x86/crypto/twofish_avx_glue.c   |   14 
 arch/x86/include/asm/cpufeature.h|1 
 arch/x86/include/asm/crypto/blowfish.h   |   43 +
 arch/x86/include/asm/crypto/camellia.h   |   19 
 arch/x86/include/asm/crypto/serpent-avx.h|   24 
 arch/x86/include/asm/crypto/twofish.h|   18 
 crypto/Kconfig   |   88 ++
 crypto/tcrypt.c  |   15 
 crypto/testmgr.c |   51 +
 crypto/testmgr.h | 1100 -
 23 files changed, 7128 insertions(+), 87 deletions(-)
 create mode 100644 arch/x86/crypto/blowfish-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/blowfish_avx2_glue.c
 create mode 100644 arch/x86/crypto/camellia-aesni-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/camellia_aesni_avx2_glue.c
 create mode 100644 arch/x86/crypto/glue_helper-asm-avx2.S
 create mode 100644 arch/x86/crypto/serpent-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/serpent_avx2_glue.c
 create mode 100644 arch/x86/crypto/twofish-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/twofish_avx2_glue.c
 create mode 100644 arch/x86/include/asm/crypto/blowfish.h

-- 










Re: [PATCH 06/11] Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions.

2013-03-24 Thread Jussi Kivilinna
On 22.03.2013 23:29, Tim Chen wrote:
> We added glue code and config options to create crypto
> module that uses SSE/AVX/AVX2 optimized SHA256 x86_64 assembly routines.
> 
> Signed-off-by: Tim Chen 

..snip..

> diff --git a/arch/x86/crypto/sha256_ssse3_glue.c 
> b/arch/x86/crypto/sha256_ssse3_glue.c
> new file mode 100644
> index 000..5876a19
> --- /dev/null
> +++ b/arch/x86/crypto/sha256_ssse3_glue.c

..snip..

> +static int __init sha256_ssse3_mod_init(void)
> +{
> + /* test for SSE3 first */
> + if (cpu_has_xmm3)
> + sha256_transform_asm = sha256_transform_ssse3;
> +

This causes OOPS on my computer. Maybe use 'cpu_has_ssse3' instead?
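
As a minimal sketch of the suggested fix (reusing names from the quoted glue
code; the elided part and the error path are only assumed to follow the patch):

static int __init sha256_ssse3_mod_init(void)
{
	/* test for SSSE3 first; cpu_has_xmm3 only checks SSE3, so the
	 * SSSE3-only assembly could be selected on CPUs that lack SSSE3 */
	if (cpu_has_ssse3)
		sha256_transform_asm = sha256_transform_ssse3;

	/* ... AVX/AVX2 selection and shash registration as in the patch ... */

	return sha256_transform_asm ? 0 : -ENODEV;
}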

-Jussi


Re: [PATCH 03/11] Optimized sha256 x86_64 assembly routine using Supplemental SSE3 instructions.

2013-03-24 Thread Jussi Kivilinna
On 22.03.2013 23:29, Tim Chen wrote:
> Provides SHA256 x86_64 assembly routine optimized with SSSE3 instructions.
> Speedup of 40% or more has been measured over the generic implementation.
> 
> Signed-off-by: Tim Chen 
> ---
>  arch/x86/crypto/sha256-ssse3-asm.S | 504 
> +
>  1 file changed, 504 insertions(+)
>  create mode 100644 arch/x86/crypto/sha256-ssse3-asm.S
> 
> diff --git a/arch/x86/crypto/sha256-ssse3-asm.S 
> b/arch/x86/crypto/sha256-ssse3-asm.S

..snip..

> +
> +
> +## void sha256_transform_ssse3(void *input_data, UINT32 digest[8], UINT64 
> num_blks)
> +## arg 1 : pointer to input data
> +## arg 2 : pointer to digest
> +## arg 3 : Num blocks
> +
> +.text
> +.global sha256_transform_ssse3
> +.align 32
> +sha256_transform_ssse3:

Maybe use ENTRY/ENDPROC macros for exporting functions from assembly?

-Jussi



Re: [PATCH 11/11] Create module providing optimized SHA512 routines using SSSE3, AVX or AVX2 instructions.

2013-03-24 Thread Jussi Kivilinna
On 22.03.2013 23:29, Tim Chen wrote:
> We added glue code and config options to create crypto
> module that uses SSE/AVX/AVX2 optimized SHA512 x86_64 assembly routines.
> 
> Signed-off-by: Tim Chen 
> ---
>  arch/x86/crypto/Makefile|   2 +
>  arch/x86/crypto/sha512_ssse3_glue.c | 276 
> 
>  crypto/Kconfig  |  11 ++
>  3 files changed, 289 insertions(+)
>  create mode 100644 arch/x86/crypto/sha512_ssse3_glue.c
> 
> diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
> index 02a664a..7d12625 100644
> --- a/arch/x86/crypto/Makefile
> +++ b/arch/x86/crypto/Makefile
> @@ -28,6 +28,7 @@ obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += 
> ghash-clmulni-intel.o
>  obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o
>  obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
>  obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
> +obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
>  
>  aes-i586-y := aes-i586-asm_32.o aes_glue.o
>  twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
> @@ -54,3 +55,4 @@ sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
>  crc32c-intel-y := crc32c-intel_glue.o
>  crc32c-intel-$(CONFIG_CRYPTO_CRC32C_X86_64) += crc32c-pcl-intel-asm_64.o
>  sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o 
> sha256_ssse3_glue.o
> +sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o 
> sha512_ssse3_glue.o
> diff --git a/arch/x86/crypto/sha512_ssse3_glue.c 
> b/arch/x86/crypto/sha512_ssse3_glue.c
> new file mode 100644
> index 000..25a2e07
> --- /dev/null
> +++ b/arch/x86/crypto/sha512_ssse3_glue.c

...snip..

> +#include 
> +
> +asmlinkage void sha512_transform_ssse3(const char *data, u64 *digest,
> +  u64 rounds);
> +#ifdef CONFIG_AS_AVX
> +asmlinkage void sha512_transform_avx(const char *data, u64 *digest,
> +  u64 rounds);
> +asmlinkage void sha512_transform_rorx(const char *data, u64 *digest,
> +  u64 rounds);
> +#endif
> +

Is CONFIG_AS_AVX enough to ensure that rorx is supported by the assembler?

You also have #ifdef CONFIG_AS_AVX / #endif missing in 'sha256-avx-asm.S', 
'sha256-avx2-asm.S', 'sha512-avx-asm.S' and 'sha512-avx2-asm.S'.
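
For illustration, the prototypes might be guarded like this (a sketch only;
CONFIG_AS_AVX2 is assumed to come from an as-instr test in the Makefile,
analogous to CONFIG_AS_AVX, since rorx is a BMI2 instruction that an
AVX-but-not-AVX2 assembler may still reject):

asmlinkage void sha512_transform_ssse3(const char *data, u64 *digest,
				       u64 rounds);
#ifdef CONFIG_AS_AVX
asmlinkage void sha512_transform_avx(const char *data, u64 *digest,
				     u64 rounds);
#endif
#ifdef CONFIG_AS_AVX2
asmlinkage void sha512_transform_rorx(const char *data, u64 *digest,
				      u64 rounds);
#endif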

-Jussi



Re: [PATCH 06/11] Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions.

2013-03-24 Thread Jussi Kivilinna
On 22.03.2013 23:29, Tim Chen wrote:
> We added glue code and config options to create crypto
> module that uses SSE/AVX/AVX2 optimized SHA256 x86_64 assembly routines.
> 
> Signed-off-by: Tim Chen 
> ---

I could not apply this patch cleanly on top of cryptodev-2.6 tree:

Applying: Create module providing optimized SHA256 routines using SSSE3, AVX or 
AVX2 instructions.
Using index info to reconstruct a base tree...
Falling back to patching base and 3-way merge...
Auto-merging crypto/Kconfig
Auto-merging arch/x86/crypto/Makefile
CONFLICT (content): Merge conflict in arch/x86/crypto/Makefile
Failed to merge in the changes.
Patch failed at 0006 Create module providing optimized SHA256 routines using 
SSSE3, AVX or AVX2 instructions.

-Jussi










Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues

2013-01-31 Thread Jussi Kivilinna

Quoting Steffen Klassert :


> On Thu, Jan 24, 2013 at 01:25:46PM +0200, Jussi Kivilinna wrote:
>>
>> Maybe it would be cleaner to not mess with pfkeyv2.h at all, but
>> instead mark algorithms that do not support pfkey with a flag. See
>> patch below.
>>
>
> As nobody seems to have another opinion, we could go either with your
> approach, or we can invert the logic and mark all existing algorithms
> as pfkey supported. Then we would not need to bother about pfkey again.
>
> I'd be fine with both. Do you want to submit a patch?
>

Ok, I'll invert the logic and send a new patch.

-Jussi







Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues

2013-01-24 Thread Jussi Kivilinna
Quoting Steffen Klassert :

> On Wed, Jan 23, 2013 at 05:35:10PM +0200, Jussi Kivilinna wrote:
>>
>> Problem seems to be that PFKEYv2 does not quite work with IKEv2, and
>> XFRM API should be used instead. There is new numbers assigned for
>> IKEv2: 
>> https://www.iana.org/assignments/ikev2-parameters/ikev2-parameters.xml#ikev2-parameters-7
>>
>> For new SADB_X_AALG_*, I'd think you should use value from "Reserved
>> for private use" range. Maybe 250?
>
> This would be an option, but we have just a few slots for private
> algorithms.
>
>>
>> But maybe better solution might be to not make AES-CMAC (or other
>> new algorithms) available throught PFKEY API at all, just XFRM?
>>
>
> It is probably the best to make new algorithms unavailable for pfkey
> as long as they have no official ikev1 iana transform identifier.
>
> But how to do that? Perhaps we can assign SADB_X_AALG_NOPFKEY to
> the private value 255 and return -EINVAL if pfkey tries to register
> such an algorithm. The netlink interface does not use these
> identifiers, everything should work as expected. So it should be
> possible to use these algorithms with iproute2 and the most modern
> ike daemons.

Maybe it would be cleaner to not mess with pfkeyv2.h at all, but instead mark 
algorithms that do not support pfkey with flag. See patch below.

Then I started looking up if sadb_alg_id is being used somewhere outside pfkey. 
Seems that its value is just being copied around.. but at 
"http://lxr.linux.no/linux+v3.7/net/xfrm/xfrm_policy.c#L1991; it's used as 
bit-index. So do larger values than 31 break some stuff? Can multiple 
algorithms have same sadb_alg_id value? Also in af_key.c, sadb_alg_id being 
used as bit-index.
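
Just to illustrate the bit-index concern (stand-alone example, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int mask = 0;
	unsigned int alg_id = 250;	/* e.g. a value from the "private use" range */

	/* Shifting a 32-bit mask by more than 31 is undefined, so an id
	 * above 31 would either be dropped or collide with a smaller id. */
	mask |= 1u << (alg_id % 32);
	printf("mask = %#x\n", mask);
	return 0;
}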

-Jussi

---
ONLY COMPILE TESTED!
---
 include/net/xfrm.h   |5 +++--
 net/key/af_key.c |   39 +++
 net/xfrm/xfrm_algo.c |   12 ++--
 3 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 421f764..5d5eec2 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1320,6 +1320,7 @@ struct xfrm_algo_desc {
char *name;
char *compat;
u8 available:1;
+   u8 sadb_disabled:1;
union {
struct xfrm_algo_aead_info aead;
struct xfrm_algo_auth_info auth;
@@ -1561,8 +1562,8 @@ extern void xfrm_input_init(void);
 extern int xfrm_parse_spi(struct sk_buff *skb, u8 nexthdr, __be32 *spi, __be32 
*seq);
 
 extern void xfrm_probe_algs(void);
-extern int xfrm_count_auth_supported(void);
-extern int xfrm_count_enc_supported(void);
+extern int xfrm_count_sadb_auth_supported(void);
+extern int xfrm_count_sadb_enc_supported(void);
 extern struct xfrm_algo_desc *xfrm_aalg_get_byidx(unsigned int idx);
 extern struct xfrm_algo_desc *xfrm_ealg_get_byidx(unsigned int idx);
 extern struct xfrm_algo_desc *xfrm_aalg_get_byid(int alg_id);
diff --git a/net/key/af_key.c b/net/key/af_key.c
index 5b426a6..307cf1d 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -816,18 +816,21 @@ static struct sk_buff *__pfkey_xfrm_state2msg(const 
struct xfrm_state *x,
sa->sadb_sa_auth = 0;
if (x->aalg) {
struct xfrm_algo_desc *a = 
xfrm_aalg_get_byname(x->aalg->alg_name, 0);
-   sa->sadb_sa_auth = a ? a->desc.sadb_alg_id : 0;
+   sa->sadb_sa_auth = (a && !a->sadb_disabled) ?
+   a->desc.sadb_alg_id : 0;
}
sa->sadb_sa_encrypt = 0;
BUG_ON(x->ealg && x->calg);
if (x->ealg) {
struct xfrm_algo_desc *a = 
xfrm_ealg_get_byname(x->ealg->alg_name, 0);
-   sa->sadb_sa_encrypt = a ? a->desc.sadb_alg_id : 0;
+   sa->sadb_sa_encrypt = (a && !a->sadb_disabled) ?
+   a->desc.sadb_alg_id : 0;
}
/* KAME compatible: sadb_sa_encrypt is overloaded with calg id */
if (x->calg) {
struct xfrm_algo_desc *a = 
xfrm_calg_get_byname(x->calg->alg_name, 0);
-   sa->sadb_sa_encrypt = a ? a->desc.sadb_alg_id : 0;
+   sa->sadb_sa_encrypt = (a && !a->sadb_disabled) ?
+   a->desc.sadb_alg_id : 0;
}
 
sa->sadb_sa_flags = 0;
@@ -1138,7 +1141,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct 
net *net,
if (sa->sadb_sa_auth) {
int keysize = 0;
struct xfrm_algo_desc *a = xfrm_aalg_get_byid(sa->sadb_sa_auth);
-   if (!a) {
+   if (!a || a->sadb_disabled) {
err = -ENOSYS;
goto out;
}
@@ -1160,7 +1163,7 @@

Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues

2013-01-24 Thread Jussi Kivilinna

Quoting YOSHIFUJI Hideaki :


YOSHIFUJI Hideaki wrote:

Jussi Kivilinna wrote:


diff --git a/include/uapi/linux/pfkeyv2.h
b/include/uapi/linux/pfkeyv2.h
index 0b80c80..d61898e 100644
--- a/include/uapi/linux/pfkeyv2.h
+++ b/include/uapi/linux/pfkeyv2.h
@@ -296,6 +296,7 @@ struct sadb_x_kmaddress {
 #define SADB_X_AALG_SHA2_512HMAC7
 #define SADB_X_AALG_RIPEMD160HMAC8
 #define SADB_X_AALG_AES_XCBC_MAC9
+#define SADB_X_AALG_AES_CMAC_MAC10
 #define SADB_X_AALG_NULL251/* kame */
 #define SADB_AALG_MAX251


Should these values be based on IANA assigned IPSEC AH transform
identifiers?

https://www.iana.org/assignments/isakmp-registry/isakmp-registry.xml#isakmp-registry-6


There is no CMAC entry apparently ... despite the fact that CMAC  
is a proposed RFC standard for IPsec.


It might be safer to move that to 14 since it's currently  
unassigned and then go through whatever channels are required to  
allocate it.  Mostly this affects key setting.  So this means my  
patch would break AH_RSA setkey calls (which the kernel doesn't  
support anyways).




Problem seems to be that PFKEYv2 does not quite work with IKEv2,  
and XFRM API should be used instead. There is new numbers assigned  
for IKEv2:  
https://www.iana.org/assignments/ikev2-parameters/ikev2-parameters.xml#ikev2-parameters-7


For new SADB_X_AALG_*, I'd think you should use value from  
"Reserved for private use" range. Maybe 250?


We can choose any value unless we do not break existing
binaries.  When IKE used, the daemon is responsible
for translation.


I meant, we can choose any values "if" we do not break ...



Ok, so giving '10' to AES-CMAC is fine after all?

And if I'd want to add Camellia-CTR and Camellia-CCM support, I can  
choose next free numbers from SADB_X_EALG_*?


-Jussi


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues

2013-01-23 Thread Jussi Kivilinna

Quoting Tom St Denis :


- Original Message -

From: "Jussi Kivilinna" 
To: "Tom St Denis" 
Cc: linux-kernel@vger.kernel.org, "Herbert Xu"  
, "David Miller" ,
linux-cry...@vger.kernel.org, "Steffen Klassert"  
, net...@vger.kernel.org

Sent: Wednesday, 23 January, 2013 9:36:44 AM
Subject: Re: [PATCH] CMAC support for CryptoAPI, fixed patch  
issues, indent, and testmgr build issues


Quoting Tom St Denis :

> Hey all,
>
> Here's an updated patch which addresses a couple of build issues
> and
> coding style complaints.
>
> I still can't get it to run via testmgr I get
>
> [  162.407807] alg: No test for cmac(aes) (cmac(aes-generic))
>
> Despite the fact I have an entry for cmac(aes) (much like
> xcbc(aes)...).
>
> Here's the patch to bring 3.8-rc4 up with CMAC ...
>
> Signed-off-by: Tom St Denis 
>

> diff --git a/include/uapi/linux/pfkeyv2.h
> b/include/uapi/linux/pfkeyv2.h
> index 0b80c80..d61898e 100644
> --- a/include/uapi/linux/pfkeyv2.h
> +++ b/include/uapi/linux/pfkeyv2.h
> @@ -296,6 +296,7 @@ struct sadb_x_kmaddress {
>  #define SADB_X_AALG_SHA2_512HMAC  7
>  #define SADB_X_AALG_RIPEMD160HMAC 8
>  #define SADB_X_AALG_AES_XCBC_MAC  9
> +#define SADB_X_AALG_AES_CMAC_MAC  10
>  #define SADB_X_AALG_NULL  251 /* kame */
>  #define SADB_AALG_MAX 251

Should these values be based on IANA assigned IPSEC AH transform
identifiers?

https://www.iana.org/assignments/isakmp-registry/isakmp-registry.xml#isakmp-registry-6


There is no CMAC entry apparently ... despite the fact that CMAC is  
a proposed RFC standard for IPsec.


It might be safer to move that to 14 since it's currently unassigned  
and then go through whatever channels are required to allocate it.   
Mostly this affects key setting.  So this means my patch would break  
AH_RSA setkey calls (which the kernel doesn't support anyways).




Problem seems to be that PFKEYv2 does not quite work with IKEv2, and  
XFRM API should be used instead. There is new numbers assigned for  
IKEv2:  
https://www.iana.org/assignments/ikev2-parameters/ikev2-parameters.xml#ikev2-parameters-7


For new SADB_X_AALG_*, I'd think you should use value from "Reserved  
for private use" range. Maybe 250?


But maybe a better solution might be to not make AES-CMAC (or other new  
algorithms) available through the PFKEY API at all, just XFRM?


-Jussi


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] CMAC support for CryptoAPI, fixed patch issues, indent, and testmgr build issues

2013-01-23 Thread Jussi Kivilinna

Quoting Tom St Denis :


Hey all,

Here's an updated patch which addresses a couple of build issues and  
coding style complaints.


I still can't get it to run via testmgr I get

[  162.407807] alg: No test for cmac(aes) (cmac(aes-generic))

Despite the fact I have an entry for cmac(aes) (much like xcbc(aes)...).

Here's the patch to bring 3.8-rc4 up with CMAC ...

Signed-off-by: Tom St Denis 




diff --git a/include/uapi/linux/pfkeyv2.h b/include/uapi/linux/pfkeyv2.h
index 0b80c80..d61898e 100644
--- a/include/uapi/linux/pfkeyv2.h
+++ b/include/uapi/linux/pfkeyv2.h
@@ -296,6 +296,7 @@ struct sadb_x_kmaddress {
 #define SADB_X_AALG_SHA2_512HMAC   7
 #define SADB_X_AALG_RIPEMD160HMAC  8
 #define SADB_X_AALG_AES_XCBC_MAC   9
+#define SADB_X_AALG_AES_CMAC_MAC   10
 #define SADB_X_AALG_NULL   251 /* kame */
 #define SADB_AALG_MAX  251


Should these values be based on IANA assigned IPSEC AH transform identifiers?

https://www.iana.org/assignments/isakmp-registry/isakmp-registry.xml#isakmp-registry-6

-Jussi


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] crypto: fix FTBFS with ARM SHA1-asm and THUMB2_KERNEL

2013-01-21 Thread Jussi Kivilinna

Quoting Jussi Kivilinna :


Quoting Matt Sealey :


This question is to the implementor/committer (Dave McCullough), how
exactly did you measure the benchmark and can we reproduce it on some
other ARM box?

If it's long and laborious and not so important to test the IPsec
tunnel use-case, what would be the simplest possible benchmark to see
if the C vs. assembly version is faster for a particular ARM device? I
can get hold of pretty much any Cortex-A8 or Cortex-A9 that matters, I
have access to a Chromebook for A15, and maybe an i.MX27 or i.MX35 and
a couple Marvell boards (ARMv6) if I set my mind to it... that much
testing implies we find a pretty concise benchmark though with a
fairly common kernel version we can spread around (i.MX, OMAP and the
Chromebook, I can handle, the rest I'm a little wary of bothering to
spend too much time on). I think that could cover a good swath of
not-ARMv5 use cases from lower speeds to quad core monsters.. but I
might stick to i.MX to start with..


There is 'tcrypt' module in crypto/ for quick benchmarking.  
'modprobe tcrypt mode=500 sec=1' tests AES in various cipher-modes,  
using different buffer sizes and outputs results to kernel log.




Actually mode=200 might be better, as mode=500 is for asynchronous  
implementations and might use hardware crypto if such device/module is  
available.


-Jussi


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] crypto: fix FTBFS with ARM SHA1-asm and THUMB2_KERNEL

2013-01-21 Thread Jussi Kivilinna

Quoting Matt Sealey :


This question is to the implementor/committer (Dave McCullough), how
exactly did you measure the benchmark and can we reproduce it on some
other ARM box?

If it's long and laborious and not so important to test the IPsec
tunnel use-case, what would be the simplest possible benchmark to see
if the C vs. assembly version is faster for a particular ARM device? I
can get hold of pretty much any Cortex-A8 or Cortex-A9 that matters, I
have access to a Chromebook for A15, and maybe an i.MX27 or i.MX35 and
a couple Marvell boards (ARMv6) if I set my mind to it... that much
testing implies we find a pretty concise benchmark though with a
fairly common kernel version we can spread around (i.MX, OMAP and the
Chromebook, I can handle, the rest I'm a little wary of bothering to
spend too much time on). I think that could cover a good swath of
not-ARMv5 use cases from lower speeds to quad core monsters.. but I
might stick to i.MX to start with..


There is 'tcrypt' module in crypto/ for quick benchmarking. 'modprobe  
tcrypt mode=500 sec=1' tests AES in various cipher-modes, using  
different buffer sizes and outputs results to kernel log.
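
For example (assuming tcrypt is built as a module; results go to the kernel log):

  modprobe tcrypt mode=500 sec=1
  dmesg | tail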


-Jussi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Remove VLAIS usage from crypto/testmgr.c

2012-10-31 Thread Jussi Kivilinna

Quoting Behan Webster :


From: Jan-Simon Möller 

The use of variable length arrays in structs (VLAIS) in the Linux Kernel code
precludes the use of compilers which don't implement VLAIS (for instance the
Clang compiler). This patch instead allocates the appropriate amount  
of memory

using an char array.

Patch from series at
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20120507/142707.html
by PaX Team.

Signed-off-by: Jan-Simon Möller 
Cc: pagee...@freemail.hu
Signed-off-by: Behan Webster 
---
 crypto/testmgr.c |   23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 941d75c..5b7b3a6 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1578,16 +1578,19 @@ static int alg_test_crc32c(const struct  
alg_test_desc *desc,

}

do {
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(tfm)];
-   } sdesc;
-
-   sdesc.shash.tfm = tfm;
-   sdesc.shash.flags = 0;
-
-   *(u32 *)sdesc.ctx = le32_to_cpu(420553207);
-   err = crypto_shash_final(, (u8 *));
+   char sdesc[sizeof(struct shash_desc)
+   + crypto_shash_descsize(tfm)
+   + CRYPTO_MINALIGN] CRYPTO_MINALIGN_ATTR;
+   struct shash_desc *shash = (struct shash_desc *)sdesc;
+   u32 *ctx = (u32 *)((unsigned long)(sdesc
+   + sizeof(struct shash_desc) + CRYPTO_MINALIGN - 1)
+   & ~(CRYPTO_MINALIGN - 1));


I think you should use '(u32 *)shash_desc_ctx(shash)' instead of  
getting ctx pointer manually.
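
I.e. roughly (untested):

	struct shash_desc *shash = (struct shash_desc *)sdesc;
	u32 *ctx = (u32 *)shash_desc_ctx(shash);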



+
+   shash->tfm = tfm;
+   shash->flags = 0;
+
+   *ctx = le32_to_cpu(420553207);
+   err = crypto_shash_final(shash, (u8 *));
if (err) {
printk(KERN_ERR "alg: crc32c: Operation failed for "
   "%s: %d\n", driver, err);




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.6-rc5

2012-09-09 Thread Jussi Kivilinna

Quoting Herbert Xu :


On Sun, Sep 09, 2012 at 08:35:56AM -0700, Linus Torvalds wrote:

On Sun, Sep 9, 2012 at 5:54 AM, Jussi Kivilinna
 wrote:
>
> Does reverting e46e9a46386bca8e80a6467b5c643dc494861896 help?
>
> That commit added crypto selftest for authenc(hmac(sha1),cbc(aes)) in 3.6,
> and probably made this bug visible (but not directly causing it).

So Romain said it does - where do we go from here? Revert testing it,
or fix the authenc() case? I'd prefer the fix..


I'm working on this right now.  If we don't get anywhere in a
couple of days we can revert the test vector patch.



It seems that authenc is chaining empty assoc scatterlist, which causes
BUG_ON(!sg->length) set off in crypto/scatterwalk.c.

Following fixes the bug and self-test passes, but not sure if it's correct
(note, copy-paste to 'broken' email client, most likely does not apply etc):

diff --git a/crypto/authenc.c b/crypto/authenc.c
index 5ef7ba6..2373af5 100644
--- a/crypto/authenc.c
+++ b/crypto/authenc.c
@@ -336,7 +336,7 @@ static int crypto_authenc_genicv(struct  
aead_request *req, u8 *iv,

cryptlen += ivsize;
}

-   if (sg_is_last(assoc)) {
+   if (req->assoclen > 0 && sg_is_last(assoc)) {
authenc_ahash_fn = crypto_authenc_ahash;
sg_init_table(asg, 2);
sg_set_page(asg, sg_page(assoc), assoc->length,  
assoc->offset);



Also does crypto_authenc_iverify() need same fix?
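
If so, presumably the same one-line guard would apply there too, assuming
crypto_authenc_iverify() has a matching sg_is_last(assoc) branch (untested):

-	if (sg_is_last(assoc)) {
+	if (req->assoclen > 0 && sg_is_last(assoc)) {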

-Jussi


Cheers,
--
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt






--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.6-rc5

2012-09-09 Thread Jussi Kivilinna

Quoting Herbert Xu :



Can you try blacklisting/not loading sha1_ssse3 and aesni_intel
to see which one of them is causing this crash? Of course if you
can still reproduce this without loading either of them that would
also be interesting to know.


This triggers with aes-x86_64 and sha1_generic (and sha256 & sha512)  
too, with following test added to tcrypt:

case 46:
ret += tcrypt_test("authenc(hmac(sha1),cbc(aes))");
ret += tcrypt_test("authenc(hmac(sha256),cbc(aes))");
ret += tcrypt_test("authenc(hmac(sha512),cbc(aes))");
break;

-Jussi



Thanks,
--
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html






--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.6-rc5

2012-09-09 Thread Jussi Kivilinna

Quoting Romain Francoise :


Still seeing this BUG with -rc5, that I originally reported here:
http://marc.info/?l=linux-crypto-vger=134653220530264=2


Does reverting e46e9a46386bca8e80a6467b5c643dc494861896 help?

That commit added crypto selftest for authenc(hmac(sha1),cbc(aes)) in  
3.6, and probably made this bug visible (but not directly causing it).


-Jussi



[   26.362567] [ cut here ]
[   26.362583] kernel BUG at crypto/scatterwalk.c:37!
[   26.362606] invalid opcode:  [#1] SMP
[   26.362622] Modules linked in: authenc xfrm6_mode_tunnel  
xfrm4_mode_tunnel cpufreq_conservative cpufreq_userspace  
cpufreq_powersave cpufreq_stats xfrm_user xfrm4_tunnel tunnel4  
ipcomp xfrm_ipcomp esp4 ah4 binfmt_misc deflate zlib_deflate ctr  
twofish_generic twofish_avx_x86_64 twofish_x86_64_3way  
twofish_x86_64 twofish_common camellia_generic camellia_x86_64  
serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic glue_helper  
lrw xts gf128mul blowfish_generic blowfish_x86_64 blowfish_common  
cast5 des_generic cbc xcbc rmd160 sha512_generic sha1_ssse3  
sha1_generic hmac crypto_null af_key xfrm_algo ip6table_filter  
ip6_tables xt_recent xt_LOG nf_conntrack_ipv4 nf_defrag_ipv4  
xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables  
hwmon_vid msr vhost_net macvtap macvlan tun loop bridge stp llc  
firewire_sbp2 fuse rc_dib0700_rc5 snd_hda_codec_hdmi dvb_usb_dib0700  
dib7000m dib0090 dib8000 dib0070 dib7000p dib3000mc dibx000_common  
dvb_usb snd_hda_codec_realtek dvb_core rc_co

 re snd
_hda_intel snd_hda_codec snd_seq_midi snd_seq_midi_event snd_hwdep  
snd_pcm_oss snd_rawmidi snd_mixer_oss snd_seq snd_pcm snd_seq_device  
radeon snd_timer snd soundcore acpi_cpufreq mperf ttm processor  
drm_kms_helper thermal_sys mxm_wmi drm snd_page_alloc i2c_algo_bit  
i2c_i801 i2c_core lpc_ich button psmouse evdev coretemp serio_raw  
wmi pcspkr mei kvm_intel kvm ext4 crc16 jbd2 mbcache sha256_generic  
usb_storage uas dm_crypt dm_mod raid10 raid1 md_mod sg ata_generic  
sd_mod hid_generic crc_t10dif pata_marvell usbhid hid crc32c_intel  
ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic ablk_helper  
cryptd firewire_ohci microcode ahci firewire_core libahci crc_itu_t  
libata scsi_mod xhci_hcd ehci_hcd usbcore usb_common e1000e

[   26.371138] CPU 5
[   26.371146] Pid: 3704, comm: cryptomgr_test Not tainted  
3.6.0-rc5-ore #1  /DP67BG
[   26.374095] RIP: 0010:[]  []  
scatterwalk_start+0x11/0x20

[   26.375598] RSP: 0018:88040d62b9d8  EFLAGS: 00010246
[   26.377067] RAX:  RBX: 88040b6e3868 RCX:  
0014
[   26.378567] RDX: 0020 RSI: 88040b6e3868 RDI:  
88040d62b9e0
[   26.380028] RBP: 0020 R08: 0001 R09:  
88040b6e39a8
[   26.381494] R10: a06b4000 R11: 88040b6e39fc R12:  
0014
[   26.383023] R13: 0001 R14: 88040b6e38f8 R15:  

[   26.384488] FS:  () GS:88041f54()  
knlGS:

[   26.385973] CS:  0010 DS:  ES:  CR0: 8005003b
[   26.387581] CR2: 7f54c282fbd0 CR3: 0180b000 CR4:  
000407e0
[   26.389057] DR0:  DR1:  DR2:  

[   26.390547] DR3:  DR6: 0ff0 DR7:  
0400
[   26.392015] Process cryptomgr_test (pid: 3704, threadinfo  
88040d62a000, task 88040ca3ab60)

[   26.393558] Stack:
[   26.395103]  811d72fb 88040b6e3868 0020  
88040b6e3868
[   26.396622]  88040b6e3800 88040b6e3868 0020  
88040b6e3868
[   26.398163]  88040919f440 88040d62bcc8 a07edaa0  
88040b6e3930

[   26.399643] Call Trace:
[   26.401112]  [] ? scatterwalk_map_and_copy+0x5b/0xd0
[   26.402714]  [] ?  
crypto_authenc_genicv+0xa0/0x300 [authenc]

[   26.404274]  [] ? test_aead+0x58b/0xcd0
[   26.406082]  [] ? crypto_mod_get+0x10/0x30
[   26.407704]  [] ? crypto_alloc_base+0x53/0xb0
[   26.409267]  [] ?  
cryptd_alloc_ablkcipher+0x80/0xc0 [cryptd]

[   26.410838]  [] ? __kmalloc+0x20d/0x250
[   26.412364]  [] ? crypto_spawn_tfm2+0x31/0x70
[   26.413938]  [] ? ablk_init_common+0x10/0x30  
[ablk_helper]

[   26.415448]  [] ? __crypto_alloc_tfm+0xf9/0x170
[   26.416963]  [] ? crypto_spawn_tfm+0x43/0x90
[   26.418505]  [] ? skcipher_geniv_init+0x1e/0x40
[   26.420046]  [] ? __crypto_alloc_tfm+0xf9/0x170
[   26.421599]  [] ? crypto_spawn_tfm+0x43/0x90
[   26.423228]  [] ? __kmalloc+0x20d/0x250
[   26.424788]  [] ?  
crypto_authenc_init_tfm+0x49/0xc0 [authenc]

[   26.426374]  [] ? __crypto_alloc_tfm+0xf9/0x170
[   26.427999]  [] ? alg_test_aead+0x48/0xb0
[   26.429781]  [] ? alg_test+0xfe/0x310
[   26.431503]  [] ? __schedule+0x2ba/0x700
[   26.433235]  [] ? cryptomgr_probe+0xb0/0xb0
[   26.434918]  [] ? cryptomgr_test+0x38/0x40
[   26.436524]  [] ? kthread+0x85/0x90
[   26.436526]  [] ? kernel_thread_helper+0x4/0x10
[ 


Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-28 Thread Jussi Kivilinna

Quoting Borislav Petkov :


On Wed, Aug 22, 2012 at 10:20:03PM +0300, Jussi Kivilinna wrote:
Actually it does look better, at least for encryption. Decryption  
had different

ordering for test, which appears to be bad on bulldozer as it is on
sandy-bridge.

So, yet another patch then :)


Here you go:


Thanks!

With this patch twofish-avx is faster than twofish-3way for 256, 1k  
and 8k tests.


size    old-vs-new          new-vs-3way         old-vs-3way
        ecb-enc   ecb-dec   ecb-enc   ecb-dec   ecb-enc   ecb-dec
256     1.10x     1.11x     1.01x     1.01x     0.92x     0.91x
1k      1.11x     1.12x     1.08x     1.07x     0.97x     0.96x
8k      1.11x     1.13x     1.10x     1.08x     0.99x     0.97x

-Jussi



[  153.736745]
[  153.736745] testing speed of async ecb(twofish) encryption
[  153.745806] test 0 (128 bit key, 16 byte blocks): 4832343  
operations in 1 seconds (77317488 bytes)
[  154.752525] test 1 (128 bit key, 64 byte blocks): 2049979  
operations in 1 seconds (131198656 bytes)
[  155.755195] test 2 (128 bit key, 256 byte blocks): 620439  
operations in 1 seconds (158832384 bytes)
[  156.761694] test 3 (128 bit key, 1024 byte blocks): 173900  
operations in 1 seconds (178073600 bytes)
[  157.768282] test 4 (128 bit key, 8192 byte blocks): 22366  
operations in 1 seconds (18372 bytes)
[  158.774815] test 5 (192 bit key, 16 byte blocks): 4850741  
operations in 1 seconds (77611856 bytes)
[  159.781498] test 6 (192 bit key, 64 byte blocks): 2046772  
operations in 1 seconds (130993408 bytes)
[  160.788163] test 7 (192 bit key, 256 byte blocks): 619915  
operations in 1 seconds (158698240 bytes)
[  161.794636] test 8 (192 bit key, 1024 byte blocks): 173442  
operations in 1 seconds (177604608 bytes)
[  162.801242] test 9 (192 bit key, 8192 byte blocks): 22083  
operations in 1 seconds (180903936 bytes)
[  163.807793] test 10 (256 bit key, 16 byte blocks): 4862951  
operations in 1 seconds (77807216 bytes)
[  164.814449] test 11 (256 bit key, 64 byte blocks): 2050036  
operations in 1 seconds (131202304 bytes)
[  165.821121] test 12 (256 bit key, 256 byte blocks): 620349  
operations in 1 seconds (158809344 bytes)
[  166.827621] test 13 (256 bit key, 1024 byte blocks): 173917  
operations in 1 seconds (178091008 bytes)
[  167.834218] test 14 (256 bit key, 8192 byte blocks): 22362  
operations in 1 seconds (183189504 bytes)

[  168.840798]
[  168.840798] testing speed of async ecb(twofish) decryption
[  168.849968] test 0 (128 bit key, 16 byte blocks): 4889899  
operations in 1 seconds (78238384 bytes)
[  169.855439] test 1 (128 bit key, 64 byte blocks): 2052293  
operations in 1 seconds (131346752 bytes)
[  170.862113] test 2 (128 bit key, 256 byte blocks): 616979  
operations in 1 seconds (157946624 bytes)
[  171.868631] test 3 (128 bit key, 1024 byte blocks): 172773  
operations in 1 seconds (176919552 bytes)
[  172.875244] test 4 (128 bit key, 8192 byte blocks): 4  
operations in 1 seconds (182059008 bytes)
[  173.881777] test 5 (192 bit key, 16 byte blocks): 4893653  
operations in 1 seconds (78298448 bytes)
[  174.888451] test 6 (192 bit key, 64 byte blocks): 2048078  
operations in 1 seconds (131076992 bytes)
[  175.895131] test 7 (192 bit key, 256 byte blocks): 619204  
operations in 1 seconds (158516224 bytes)
[  176.901651] test 8 (192 bit key, 1024 byte blocks): 172569  
operations in 1 seconds (176710656 bytes)
[  177.908253] test 9 (192 bit key, 8192 byte blocks): 21888  
operations in 1 seconds (179306496 bytes)
[  178.914781] test 10 (256 bit key, 16 byte blocks): 4921751  
operations in 1 seconds (78748016 bytes)
[  179.917481] test 11 (256 bit key, 64 byte blocks): 2051219  
operations in 1 seconds (131278016 bytes)
[  180.920147] test 12 (256 bit key, 256 byte blocks): 618536  
operations in 1 seconds (158345216 bytes)
[  181.926637] test 13 (256 bit key, 1024 byte blocks): 172886  
operations in 1 seconds (177035264 bytes)
[  182.933249] test 14 (256 bit key, 8192 byte blocks): 2  
operations in 1 seconds (182042624 bytes)

[  183.939803]
[  183.939803] testing speed of async cbc(twofish) encryption
[  183.953902] test 0 (128 bit key, 16 byte blocks): 5195403  
operations in 1 seconds (83126448 bytes)
[  184.962487] test 1 (128 bit key, 64 byte blocks): 1912010  
operations in 1 seconds (122368640 bytes)
[  185.969150] test 2 (128 bit key, 256 byte blocks): 540125  
operations in 1 seconds (138272000 bytes)
[  186.975650] test 3 (128 bit key, 1024 byte blocks): 140631  
operations in 1 seconds (144006144 bytes)
[  187.982411] test 4 (128 bit key, 8192 byte blocks): 17737  
operations in 1 seconds (145301504 bytes)
[  188.988782] test 5 (192 bit key, 16 byte blocks): 5182287  
operations in 1 seconds (82916592 bytes)
[  189.995435] test 6 (192 bit key, 64 byte blocks): 1912356  
operations in 1 seconds (122390784 bytes)
[  191.002093] test 7 (192 bit key, 256 byte blocks): 540991  
operations in 1 seconds (138493696 bytes)
[  192.008600] test 8 (192 bit key, 1024 byte blocks): 140791  
operations in 1


Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-23 Thread Jussi Kivilinna

Quoting Jason Garrett-Glaser :


On Wed, Aug 22, 2012 at 12:20 PM, Jussi Kivilinna
 wrote:

Quoting Borislav Petkov :


On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote:

Looks that encryption lost ~0.4% while decryption gained ~1.8%.

For 256 byte test, it's still slightly slower than twofish-3way
(~3%). For 1k
and 8k tests, it's ~5% faster.

Here's very last test-patch, testing different ordering of fpu<->cpu reg
instructions at few places.


Hehe,.

I don't mind testing patches, no worries there. Here are the results
this time, doesn't look better than the last run, AFAICT.



Actually it does look better, at least for encryption. Decryption  
had different

ordering for test, which appears to be bad on bulldozer as it is on
sandy-bridge.

So, yet another patch then :)

Interleaving at some new places (reordered lookup_32bit()s in G-macro) and
doing one of the round rotations one round ahead. Also introduces some
more parallelism inside lookup_32bit.


Outsider looking in here, but avoiding the 256-way lookup tables
entirely might be faster.  Looking at the twofish code, one byte-wise
calculation looks like this:

a0 = x >> 4; b0 = x & 15;
a1 = a0 ^ b0; b1 = ror4[b0] ^ ashx[a0];
a2 = qt0[n][a1]; b2 = qt1[n][b1];
a3 = a2 ^ b2; b3 = ror4[b2] ^ ashx[a2];
a4 = qt2[n][a3]; b4 = qt3[n][b3];
return (b4 << 4) | a4;

This means that you can do something like this pseudocode (Intel
syntax).  pshufb on ymm registers is AVX2, but splitting it into xmm
operations would probably be fine (as would using this for just a pure
SSE implementation!).  On AVX2 you'd have to double the tables for both
ways, naturally.

constants:
pb_0x0f = {0x0f,0x0f,0x0f ... }
ashx: lookup table
ror4: lookup table
qt0[n]: lookup table
qt1[n]: lookup table
qt2[n]: lookup table
qt3[n]: lookup table

vpand    b0, in, pb_0x0f
vpsrlw   a0, in, 4
vpand    a0, a0, pb_0x0f    ; effectively vpsrlb, but that doesn't exist

vpxor    a1, a0, b0
vpshufb  a0,   ashx, a0
vpshufb  b0,   ror4, b0
vpxor    b1, a0, b0

vpshufb  a2, qt0[n], a1
vpshufb  b2, qt1[n], b1

vpxor    a3, a2, b2
vpshufb  a3,   ashx, a2
vpshufb  b3,   ror4, b2
vpxor    b3, a2, b2

vpshufb  a4, qt2[n], a3
vpshufb  b4, qt3[n], b3

vpsllw   b4, b4, 4          ; effectively vpsrlb, but that doesn't exist
vpor     out, a4, b4

That's 15 instructions (plus maybe a move or two) to do 16 lookups for
SSE (~9 cycles by my guessing on a Nehalem).  AVX would run into the
problem of lots of extra vinsert/vextract (just going 16-byte might be
better, might be not, depending on execution units).  AVX2 would be
super fast (15 for 32).

If this works, this could be quite a bit faster with the table-based  
approach.


The above would implement twofish permutations q0 and q1? For  
byte-sliced implementation you would need 8 parallel blocks (16b  
registers, two parallel h-functions for round, 16/2).


In this setup, for double h-function, you need 12 q0/1 operations (for  
128bit key, for 192bit: 16, for 256bit: 20), plus 8 key material xors  
(for 192bit 12, 256bit 16) and MDS matrix multiplication (a lot more  
than 15 instructions, I'd think). We do 16 rounds, so that gives us  
((12*15+8+15)*16)/(8*16) > 25.3 instructions/byte. Usually I get ~2.5  
instructions/cycle for pure SSE2, so that's 10 cycles/byte.
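
(As a rough sanity check of that estimate, using only the per-step counts
assumed above, i.e. ~15 instructions per q0/q1 evaluation and ~15 for the
MDS step, per round over 8 parallel 16-byte blocks:

   12 q-lookups * 15 insns    = 180
   key material xors          =   8
   MDS multiply, estimated    =  15
   total per round            = 203 instructions

   203 * 16 rounds / (8 * 16 bytes) =~ 25.4 instructions/byte
   25.4 / ~2.5 instructions/cycle   =~ 10 cycles/byte)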


After that we have the PHT phase. But now the problem is that the PHT  
phase uses 32-bit additions, so either we move between byte-sliced and  
dword-sliced modes here or move the addition carry over bytes. After the  
PHT there is a 32-bit addition with key material and 32-bit rotations.


I don't think this is going to work. For AVX2, vpgatherdd is going to  
speed up 32-bit lookups anyway.


-Jussi



Jason









Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-22 Thread Jussi Kivilinna
Quoting Borislav Petkov :

> On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote:
>> Looks that encryption lost ~0.4% while decryption gained ~1.8%.
>>
>> For 256 byte test, it's still slightly slower than twofish-3way
>> (~3%). For 1k
>> and 8k tests, it's ~5% faster.
>>
>> Here's very last test-patch, testing different ordering of fpu<->cpu reg
>> instructions at few places.
>
> Hehe,
>
> I don't mind testing patches, no worries there. Here are the results
> this time, doesn't look better than the last run, AFAICT.
>

Actually it does look better, at least for encryption. Decryption had different
ordering for test, which appears to be bad on bulldozer as it is on
sandy-bridge.

So, yet another patch then :)

Interleaving at some new places (reordered lookup_32bit()s in G-macro) and
doing one of the round rotations one round ahead. Also introduces some
more parallelism inside lookup_32bit.

---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S |  227 +--
 1 file changed, 142 insertions(+), 85 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S 
b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..1585abb 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -4,6 +4,8 @@
  * Copyright (C) 2012 Johannes Goetzfried
  * 
  *
+ * Copyright © 2012 Jussi Kivilinna 
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -47,16 +49,22 @@
 #define RC2 %xmm6
 #define RD2 %xmm7
 
-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
+
+#define RX1 %xmm10
+#define RY1 %xmm11
 
-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RK1 %xmm12
+#define RK2 %xmm13
 
-#define RID1  %rax
-#define RID1b %al
-#define RID2  %rbx
-#define RID2b %bl
+#define RT %xmm14
+#define RR %xmm15
+
+#define RID1  %rbp
+#define RID1d %ebp
+#define RID2  %rsi
+#define RID2d %esi
 
 #define RGI1   %rdx
 #define RGI1bl %dl
@@ -65,6 +73,13 @@
 #define RGI2bl %cl
 #define RGI2bh %ch
 
+#define RGI3   %rax
+#define RGI3bl %al
+#define RGI3bh %ah
+#define RGI4   %rbx
+#define RGI4bl %bl
+#define RGI4bh %bh
+
 #define RGS1  %r8
 #define RGS1d %r8d
 #define RGS2  %r9
@@ -73,89 +88,123 @@
 #define RGS3d %r10d
 
 
-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
-   movlt0(CTX, RID1, 4), dst ## d;  \
-   xorlt1(CTX, RID2, 4), dst ## d;  \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
shrq $16,   src; \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+   movlt0(CTX, RID1, 4), dst ## d;  \
+   movlt1(CTX, RID2, 4), RID2d; \
+   movzbl  src ## bl,RID1d; \
+   xorlRID2d,dst ## d;  \
+   movzbl  src ## bh,RID2d; \
+   interleave_op(il_reg);   \
xorlt2(CTX, RID1, 4), dst ## d;  \
xorlt3(CTX, RID2, 4), dst ## d;
 
-#define G(a, x, t0, t1, t2, t3) \
-   vmovq   a,RGI1;   \
-   vpsrldq $8, a,x;  \
-   vmovq   x,RGI2;   \
+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+   shrq $16,   reg;
+
+#define G(gi1, gi2, x, t0, t1, t2, t3) \
+   lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1);  \
+   lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2);  \
+   \
+   lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none);  \
+   shlq $32,   RGS2;\
+   orq RGS1, RGS2;  \
+   lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none);  \
+   shlq $32,   RGS1;\
+   orq RGS1, RGS3;
+
+#define round_head_2(a, b, x1, y1, x2, y2) \
+   vmovq   b ## 1, RGI3;   \
+   vpextrq $1, b ## 1, RGI4;   \
\
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
-   shrq $16,   RGI1; \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
-   shlq $32,   RGS2; \
-   orq RGS1, RGS2;   \
+   G(RGI1, RGI2, x1, s0, s1, s2, s3);  \
+   vmovq   a ## 2, RGI1;   \
+   vpextrq $1, a ## 2, RGI2;   \
+   vmo


Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-21 Thread Jussi Kivilinna
Quoting Borislav Petkov :

> 
> Here you go:
> 
> [   52.282208]
> [   52.282208] testing speed of async ecb(twofish) encryption

Thanks!

Looks that encryption lost ~0.4% while decryption gained ~1.8%.

For 256 byte test, it's still slightly slower than twofish-3way (~3%). For 1k
and 8k tests, it's ~5% faster.

Here's very last test-patch, testing different ordering of fpu<->cpu reg
instructions at few places.

---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S |  232 ++-
 1 file changed, 154 insertions(+), 78 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S 
b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..693963a 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -4,6 +4,8 @@
  * Copyright (C) 2012 Johannes Goetzfried
  * 
  *
+ * Copyright © 2012 Jussi Kivilinna 
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -47,16 +49,21 @@
 #define RC2 %xmm6
 #define RD2 %xmm7
 
-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
+
+#define RX1 %xmm10
+#define RY1 %xmm11
+
+#define RK1 %xmm12
+#define RK2 %xmm13
 
-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RT %xmm14
 
-#define RID1  %rax
-#define RID1b %al
-#define RID2  %rbx
-#define RID2b %bl
+#define RID1  %rbp
+#define RID1d %ebp
+#define RID2  %rsi
+#define RID2d %esi
 
 #define RGI1   %rdx
 #define RGI1bl %dl
@@ -65,6 +72,13 @@
 #define RGI2bl %cl
 #define RGI2bh %ch
 
+#define RGI3   %rax
+#define RGI3bl %al
+#define RGI3bh %ah
+#define RGI4   %rbx
+#define RGI4bl %bl
+#define RGI4bh %bh
+
 #define RGS1  %r8
 #define RGS1d %r8d
 #define RGS2  %r9
@@ -73,40 +87,58 @@
 #define RGS3d %r10d
 
 
-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   shrq $16,   src; \
movlt0(CTX, RID1, 4), dst ## d;  \
xorlt1(CTX, RID2, 4), dst ## d;  \
-   shrq $16,   src; \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   interleave_op(il_reg);   \
xorlt2(CTX, RID1, 4), dst ## d;  \
xorlt3(CTX, RID2, 4), dst ## d;
 
-#define G(a, x, t0, t1, t2, t3) \
-   vmovq   a,RGI1;   \
-   vpsrldq $8, a,x;  \
-   vmovq   x,RGI2;   \
-   \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
-   shrq $16,   RGI1; \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
-   shlq $32,   RGS2; \
-   orq RGS1, RGS2;   \
-   \
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
-   shrq $16,   RGI2; \
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
-   shlq $32,   RGS3; \
-   orq RGS1, RGS3;   \
-   \
-   vmovq   RGS2, x;  \
-   vpinsrq $1, RGS3, x, x;
+#define dummy(d) /* do nothing */
 
-#define encround(a, b, c, d, x, y) \
-   G(a, x, s0, s1, s2, s3);   \
-   G(b, y, s1, s2, s3, s0);   \
+#define shr_next(reg) \
+   shrq $16,   reg;
+
+#define G_enc(gi1, gi2, x, t0, t1, t2, t3) \
+   lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1);  \
+   lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none);  \
+   shlq $32,   RGS2;\
+   orq RGS1, RGS2;  \
+   \
+   lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2);  \
+   lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none);  \
+   shlq $32,   RGS1;\
+   orq RGS1, RGS3;
+
+#define encround_head_2(a, b, c, d, x1, y1, x2, y2) \
+   vmovq   b ## 1, RGI3;   \
+   vpextrq $1, b ## 1, RGI4;   \
+   G_enc(RGI1, RGI2, x1, s0, s1, s2, s3);  \
+   vmovq   a ## 2, RGI1;   \
+   vpextrq $1, a ## 2, RGI2;   \
+   vmovq   RGS2, x1;   \
+   vpinsrq $1, RGS3, x1, x1;   \
+   G_enc(RGI3, RGI4, y1, s1, s2, s3, s0);  \
+   vmovq  


Re: on stack dynamic allocations

2012-08-17 Thread Jussi Kivilinna

Quoting David Daney :


On 08/16/2012 02:20 PM, Kasatkin, Dmitry wrote:

Hello,

Some places in the code uses variable-size allocation on stack..
For example from hmac_setkey():

struct {
struct shash_desc shash;
char ctx[crypto_shash_descsize(hash)];
} desc;


sparse complains

CHECK   crypto/hmac.c
crypto/hmac.c:57:47: error: bad constant expression

I like it instead of kmalloc..

But what is position of kernel community about it?


If you know that the range of crypto_shash_descsize(hash) is  
bounded, just use the upper bound.


If the range of crypto_shash_descsize(hash) is unbounded, then the  
stack will overflow and ... BOOM!




A quick look shows that the largest crypto_shash_descsize() would be with  
hmac+s390/sha512, 16 + 332 = 348. The crypto API also prevents registering  
a shash with descsize larger than (PAGE_SIZE / 8), i.e. 512 bytes with  
4 KiB pages, so that worst case stays well within bounds.


-Jussi



Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-17 Thread Jussi Kivilinna
Quoting Borislav Petkov :

>
> Yep, looks better than the previous run and also a bit better or on par
> with the initial run I did.
>

I made a few further changes, mainly moving/interleaving 'vmovq/vpextrq' ahead
so they should be completed before those target registers are needed. This
only gave a 0.5% increase on Sandy-bridge, but might help more on Bulldozer.

-Jussi

---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S |  205 +--
 1 file changed, 130 insertions(+), 75 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S 
b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..6638a87 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -4,6 +4,8 @@
  * Copyright (C) 2012 Johannes Goetzfried
  * 
  *
+ * Copyright © 2012 Jussi Kivilinna 
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -47,16 +49,21 @@
 #define RC2 %xmm6
 #define RD2 %xmm7
 
-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
+
+#define RX1 %xmm10
+#define RY1 %xmm11
 
-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RK1 %xmm12
+#define RK2 %xmm13
 
-#define RID1  %rax
-#define RID1b %al
-#define RID2  %rbx
-#define RID2b %bl
+#define RT %xmm14
+
+#define RID1  %rbp
+#define RID1d %ebp
+#define RID2  %rsi
+#define RID2d %esi
 
 #define RGI1   %rdx
 #define RGI1bl %dl
@@ -65,6 +72,13 @@
 #define RGI2bl %cl
 #define RGI2bh %ch
 
+#define RGI3   %rax
+#define RGI3bl %al
+#define RGI3bh %ah
+#define RGI4   %rbx
+#define RGI4bl %bl
+#define RGI4bh %bh
+
 #define RGS1  %r8
 #define RGS1d %r8d
 #define RGS2  %r9
@@ -73,40 +87,53 @@
 #define RGS3d %r10d
 
 
-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   shrq $16,   src; \
movlt0(CTX, RID1, 4), dst ## d;  \
xorlt1(CTX, RID2, 4), dst ## d;  \
-   shrq $16,   src; \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   interleave_op(il_reg);   \
xorlt2(CTX, RID1, 4), dst ## d;  \
xorlt3(CTX, RID2, 4), dst ## d;
 
-#define G(a, x, t0, t1, t2, t3) \
-   vmovq   a,RGI1;   \
-   vpsrldq $8, a,x;  \
-   vmovq   x,RGI2;   \
-   \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
-   shrq $16,   RGI1; \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
-   shlq $32,   RGS2; \
-   orq RGS1, RGS2;   \
+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+   shrq $16,   reg;
+
+#define G(gi1, gi2, x, t0, t1, t2, t3) \
+   lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1);  \
+   lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none);  \
+   shlq $32,   RGS2;\
+   orq RGS1, RGS2;  \
\
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
-   shrq $16,   RGI2; \
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
-   shlq $32,   RGS3; \
-   orq RGS1, RGS3;   \
+   lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2);  \
+   lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none);  \
+   shlq $32,   RGS1;\
+   orq RGS1, RGS3;  \
\
-   vmovq   RGS2, x;  \
+   vmovq   RGS2, x; \
vpinsrq $1, RGS3, x, x;
 
-#define encround(a, b, c, d, x, y) \
-   G(a, x, s0, s1, s2, s3);   \
-   G(b, y, s1, s2, s3, s0);   \
+#define encround_head_2(a, b, c, d, x1, y1, x2, y2) \
+   vmovq   b ## 1, RGI3;   \
+   vpextrq $1, b ## 1, RGI4;   \
+   G(RGI1, RGI2, x1, s0, s1, s2, s3);  \
+   vmovq   a ## 2, RGI1;   \
+   vpextrq $1, a ## 2, RGI2;   \
+   G(RGI3, RGI4, y1, s1, s2, s3, s0);  \
+   vmovq   b ## 2, RGI3;   \
+   vpextrq $1, b ## 2, RGI4;   \
+   



Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-16 Thread Jussi Kivilinna

Quoting Borislav Petkov :


On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote:

About ~5% slower, probably because I was tuning for sandy-bridge and
introduced more FPU<=>CPU register moves.

Here's new version of patch, with FPU<=>CPU moves from original
implementation.

(Note: also changes encryption function to inline all code in to main
function, decryption still places common code to separate function to
reduce object size. This is to measure the difference.)


Yep, looks better than the previous run and also a bit better or on par
with the initial run I did.


Thanks again. The speed gained with the patch is ~8%, which is enough to  
let twofish-avx pass twofish-3way.




The thing is, I'm not sure whether optimizing the thing for each uarch
is a workable solution software-wise or maybe having a single version
which performs sufficiently ok on all uarches is easier/better to
maintain without causing code bloat. Hmmm...


Agreed, testing on multiple CPUs to get a single well-working version is  
what I have done in the past. But purchasing all the latest CPUs on  
the market isn't an option for me, and for testing AVX I'm stuck with  
sandy-bridge :)


-Jussi


4th:

ran like 1st.

[ 1014.074150]
[ 1014.074150] testing speed of async ecb(twofish) encryption
[ 1014.083829] test 0 (128 bit key, 16 byte blocks): 4870055  
operations in 1 seconds (77920880 bytes)
[ 1015.092757] test 1 (128 bit key, 64 byte blocks): 2043828  
operations in 1 seconds (130804992 bytes)
[ 1016.099441] test 2 (128 bit key, 256 byte blocks): 606400  
operations in 1 seconds (155238400 bytes)
[ 1017.105939] test 3 (128 bit key, 1024 byte blocks): 168939  
operations in 1 seconds (172993536 bytes)
[ 1018.112517] test 4 (128 bit key, 8192 byte blocks): 21777  
operations in 1 seconds (178397184 bytes)
[ 1019.119035] test 5 (192 bit key, 16 byte blocks): 4882254  
operations in 1 seconds (78116064 bytes)
[ 1020.125716] test 6 (192 bit key, 64 byte blocks): 2043230  
operations in 1 seconds (130766720 bytes)
[ 1021.132391] test 7 (192 bit key, 256 byte blocks): 607477  
operations in 1 seconds (155514112 bytes)
[ 1022.138889] test 8 (192 bit key, 1024 byte blocks): 168743  
operations in 1 seconds (172792832 bytes)
[ 1023.145476] test 9 (192 bit key, 8192 byte blocks): 21442  
operations in 1 seconds (175652864 bytes)
[ 1024.152012] test 10 (256 bit key, 16 byte blocks): 4891863  
operations in 1 seconds (78269808 bytes)
[ 1025.158684] test 11 (256 bit key, 64 byte blocks): 2049390  
operations in 1 seconds (131160960 bytes)
[ 1026.165366] test 12 (256 bit key, 256 byte blocks): 606847  
operations in 1 seconds (155352832 bytes)
[ 1027.171841] test 13 (256 bit key, 1024 byte blocks): 169228  
operations in 1 seconds (173289472 bytes)
[ 1028.178436] test 14 (256 bit key, 8192 byte blocks): 21773  
operations in 1 seconds (178364416 bytes)

[ 1029.184981]
[ 1029.184981] testing speed of async ecb(twofish) decryption
[ 1029.194508] test 0 (128 bit key, 16 byte blocks): 4931065  
operations in 1 seconds (78897040 bytes)
[ 1030.199640] test 1 (128 bit key, 64 byte blocks): 2056931  
operations in 1 seconds (131643584 bytes)
[ 1031.206303] test 2 (128 bit key, 256 byte blocks): 589409  
operations in 1 seconds (150888704 bytes)
[ 1032.212832] test 3 (128 bit key, 1024 byte blocks): 163681  
operations in 1 seconds (167609344 bytes)
[ 1033.219443] test 4 (128 bit key, 8192 byte blocks): 21062  
operations in 1 seconds (172539904 bytes)
[ 1034.225979] test 5 (192 bit key, 16 byte blocks): 4931537  
operations in 1 seconds (78904592 bytes)
[ 1035.232608] test 6 (192 bit key, 64 byte blocks): 2053989  
operations in 1 seconds (131455296 bytes)
[ 1036.239289] test 7 (192 bit key, 256 byte blocks): 589591  
operations in 1 seconds (150935296 bytes)
[ 1037.241784] test 8 (192 bit key, 1024 byte blocks): 163565  
operations in 1 seconds (167490560 bytes)
[ 1038.244387] test 9 (192 bit key, 8192 byte blocks): 20899  
operations in 1 seconds (171204608 bytes)
[ 1039.250923] test 10 (256 bit key, 16 byte blocks): 4937343  
operations in 1 seconds (78997488 bytes)
[ 1040.257589] test 11 (256 bit key, 64 byte blocks): 2050678  
operations in 1 seconds (131243392 bytes)
[ 1041.264262] test 12 (256 bit key, 256 byte blocks): 586869  
operations in 1 seconds (150238464 bytes)
[ 1042.270753] test 13 (256 bit key, 1024 byte blocks): 163548  
operations in 1 seconds (167473152 bytes)
[ 1043.277365] test 14 (256 bit key, 8192 byte blocks): 21053  
operations in 1 seconds (172466176 bytes)

[ 1044.283892]
[ 1044.283892] testing speed of async cbc(twofish) encryption
[ 1044.293349] test 0 (128 bit key, 16 byte blocks): 5186240  
operations in 1 seconds (82979840 bytes)
[ 1045.298534] test 1 (128 bit key, 64 byte blocks): 1921034  
operations in 1 seconds (122946176 bytes)
[ 1046.305207] test 2 (128 bit key, 256 byte blocks): 542787  
operations in 1 seconds (138953472 bytes)
[ 1047.311699] test 3 (128 bit key, 1024 byte blocks): 141399  
ope


Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Borislav Petkov :

> On Wed, Aug 15, 2012 at 05:22:03PM +0300, Jussi Kivilinna wrote:
>
>> Patch replaces 'movb' instructions with 'movzbl' to break false
>> register dependencies and interleaves instructions better for
>> out-of-order scheduling.
>>
>> Also move common round code to separate function to reduce object
>> size.
>
> Ok, redid the first test
>

Thanks.

> $ modprobe twofish-avx-x86_64
> $ modprobe tcrypt mode=504 sec=1
>
> and from quickly juxtaposing the two results, I'd say the patch makes
> things slightly worse but you'd need to run your scripts on it to get
> the accurate results:
>

About ~5% slower, probably because I was tuning for sandy-bridge and introduced
more FPU<=>CPU register moves.

Here's new version of patch, with FPU<=>CPU moves from original implementation.

(Note: also changes encryption function to inline all code in to main function,
decryption still places common code to separate function to reduce object size.
This is to measure the difference.)

-Jussi

---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S |  124 +--
 1 file changed, 77 insertions(+), 47 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S 
b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..d331ab8 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -47,15 +47,22 @@
 #define RC2 %xmm6
 #define RD2 %xmm7
 
-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
 
-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RX1 %xmm10
+#define RY1 %xmm11
+
+#define RK1 %xmm12
+#define RK2 %xmm13
+
+#define RT %xmm14
 
 #define RID1  %rax
+#define RID1d %eax
 #define RID1b %al
 #define RID2  %rbx
+#define RID2d %ebx
 #define RID2b %bl
 
 #define RGI1   %rdx
@@ -73,40 +80,48 @@
 #define RGS3d %r10d
 
 
-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   shrq $16,   src; \
movlt0(CTX, RID1, 4), dst ## d;  \
xorlt1(CTX, RID2, 4), dst ## d;  \
-   shrq $16,   src; \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   interleave_op(il_reg);   \
xorlt2(CTX, RID1, 4), dst ## d;  \
xorlt3(CTX, RID2, 4), dst ## d;
 
+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+   shrq $16,   reg;
+
 #define G(a, x, t0, t1, t2, t3) \
-   vmovq   a,RGI1;   \
-   vpsrldq $8, a,x;  \
-   vmovq   x,RGI2;   \
+   vmovq   a, RGI1;  \
+   vpextrq $1, a, RGI2;  \
\
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
-   shrq $16,   RGI1; \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \
shlq $32,   RGS2; \
orq RGS1, RGS2;   \
\
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
-   shrq $16,   RGI2; \
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
-   shlq $32,   RGS3; \
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, shr_next, RGI2); \
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, dummy, none); \
+   shlq $32,   RGS1; \
orq RGS1, RGS3;   \
\
vmovq   RGS2, x;  \
vpinsrq $1, RGS3, x, x;
 
-#define encround(a, b, c, d, x, y) \
-   G(a, x, s0, s1, s2, s3);   \
-   G(b, y, s1, s2, s3, s0);   \
+#define encround_g1g2(a, b, c, d, x, y) \
+   G(a, x, s0, s1, s2, s3); \
+   G(b, y, s1, s2, s3, s0);
+
+#define encround_end(a, b, c, d, x, y) \
+   vpslld $1,  d, RT; \
+   vpsrld $(32 - 1),   d, d;  \
+   vpord, RT,  d; \
vpaddd  x, y,   x; \
vpaddd  y, x,   y; \
vpaddd  x, RK1, x; \
@@ -115,14 +130,16 @@
vpsrld $1,  c, x;  \
vpslld $(32 - 1),   c, c;  \
vporc, x,   c; \
-   vpslld $1,  d, x;  \
- 

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
> On Wed, Aug 15, 2012 at 04:48:54PM +0300, Jussi Kivilinna wrote:
> > I posted a patch that optimizes twofish-avx a few weeks ago:
> > http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2
> >
> > I'd be interested to know, if this patch helps on Bulldozer.
> 
> Sure, can you inline it here too please. The "Download message RAW" link
> on marc.info gives me a diff but patch says:
> 
> patching file arch/x86/crypto/twofish-avx-x86_64-asm_64.S
> patch unexpectedly ends in middle of line
> 
> Thanks.

Here...


Patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies and interleaves instructions better for out-of-order scheduling.

Also move common round code to separate function to reduce object size.

Tested on Core i5-2450M.

---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S |  144 +--
 1 file changed, 92 insertions(+), 52 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S 
b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..42b27b7 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -47,15 +47,22 @@
 #define RC2 %xmm6
 #define RD2 %xmm7
 
-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
 
-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RX1 %xmm10
+#define RY1 %xmm11
+
+#define RK1 %xmm12
+#define RK2 %xmm13
+
+#define RT %xmm14
 
 #define RID1  %rax
+#define RID1d %eax
 #define RID1b %al
 #define RID2  %rbx
+#define RID2d %ebx
 #define RID2b %bl
 
 #define RGI1   %rdx
@@ -73,40 +80,45 @@
 #define RGS3d %r10d
 
 
-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   shrq $16,   src; \
movlt0(CTX, RID1, 4), dst ## d;  \
xorlt1(CTX, RID2, 4), dst ## d;  \
-   shrq $16,   src; \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   interleave_op(il_reg);   \
xorlt2(CTX, RID1, 4), dst ## d;  \
xorlt3(CTX, RID2, 4), dst ## d;
 
+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+   shrq $16,   reg;
+
 #define G(a, x, t0, t1, t2, t3) \
vmovq   a,RGI1;   \
-   vpsrldq $8, a,x;  \
-   vmovq   x,RGI2;   \
+   vpextrq $1, a,RGI2;   \
\
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
-   shrq $16,   RGI1; \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
-   shlq $32,   RGS2; \
-   orq RGS1, RGS2;   \
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \
+   vmovd   RGS1d, x;\
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \
+   vpinsrd $1, RGS2d, x, x; \
\
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
-   shrq $16,   RGI2; \
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
-   shlq $32,   RGS3; \
-   orq RGS1, RGS3;   \
-   \
-   vmovq   RGS2, x;  \
-   vpinsrq $1, RGS3, x, x;
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, shr_next, RGI2); \
+   vpinsrd $2, RGS1d, x, x; \
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, dummy, none); \
+   vpinsrd $3, RGS3d, x, x;
+
+#define encround_g1g2(a, b, c, d, x, y) \
+   G(a, x, s0, s1, s2, s3); \
+   G(b, y, s1, s2, s3, s0);
 
-#define encround(a, b, c, d, x, y) \
-   G(a, x, s0, s1, s2, s3);   \
-   G(b, y, s1, s2, s3, s0);   \
+#define encround_end(a, b, c, d, x, y) \
+   vpslld $1,  d, RT; \
+   vpsrld $(32 - 1),   d, d;  \
+   vpord, RT,  d; \
vpaddd  x, y,   x; \
vpaddd  y, x,   y; \
vpaddd  x, RK1, x; \
@@ -115,14 +127,16 @@
vpsrld $1,  c, x;  \
vpslld $(32 - 1),   c, c;  \
vporc, x,   c; \
-   vpslld $1,  d, x;  \
-   vpsrld $(32 - 1),   d, d;  \
-   vpord, x,   d; \
vpxor   d, y,   d;
 
-#define decround(a, b, c, d, x, y) \
-   G(a, x

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna

Quoting Borislav Petkov :


Ok, here we go. Raw data below.


Thanks a lot!

Twofish-avx appears somewhat slower than 3way, ~9% slower with 256byte  
blocks to ~3% slower with 8kb blocks.










Let me know if you need more tests.


I posted a patch that optimizes twofish-avx a few weeks ago:  
http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2


I'd be interested to know, if this patch helps on Bulldozer.

-Jussi



HTH.

--
Regards/Gruss,
Boris.








Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna

Quoting Borislav Petkov :


On Wed, Aug 15, 2012 at 11:42:16AM +0300, Jussi Kivilinna wrote:

I started thinking about the performance on AMD Bulldozer.
vmovq/vmovd/vpextr*/vpinsr* between FPU and general purpose registers
on AMD CPU is a lot slower (latencies from 8 to 12 cycles) than on
Intel sandy-bridge (where instructions have latency of 1 to 2). See:
http://www.agner.org/optimize/instruction_tables.pdf

It would be really good, if implementation could be tested on AMD CPU
to determine, if it causes performance regression. However I don't
have access to machine with such CPU.


But I do. :)

And if you tell me exactly how to run the tests and on what kernel, I'll
try to do so.



Twofish-avx (CONFIG_TWOFISH_AVX_X86_64) is available in 3.6-rc1. For  
testing you need CRYPTO_TEST built as a module. You should turn off  
turbo-core, freq-scaling, etc.


Testing twofish-avx ('async twofish' speed test):
 modprobe twofish-avx-x86_64
 modprobe tcrypt mode=504 sec=1

Testing twofish-x86_64-3way ('sync twofish' speed test):
 modprobe twofish-x86_64-3way
 modprobe tcrypt mode=202 sec=1

Loading tcrypt will block until the tests are complete, after which  
modprobe will return with an error. This is expected. The results are in  
the kernel log.
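
A minimal end-to-end sketch of such a run (assuming root; the governor
setting and the output file names are only illustrative, and turbo-core
still has to be disabled separately, e.g. in the BIOS):

 # keep the clocks steady so frequency scaling does not skew the numbers
 for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
 echo performance > "$g"
 done

 dmesg -c > /dev/null   # clear the kernel ring buffer first
 modprobe twofish-avx-x86_64
 modprobe tcrypt mode=504 sec=1 # blocks until done, then returns an error by design
 dmesg > twofish-avx.txt

 dmesg -c > /dev/null
 modprobe twofish-x86_64-3way
 modprobe tcrypt mode=202 sec=1
 dmesg > twofish-3way.txt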


-Jussi


HTH.

--
Regards/Gruss,
Boris.








Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Johannes Goetzfried  
:



This patch adds a x86_64/avx assembler implementation of the Twofish block
cipher. The implementation processes eight blocks in parallel (two 4 block
chunk AVX operations). The table-lookups are done in general-purpose  
registers.

For small blocksizes the 3way-parallel functions from the twofish-x86_64-3way
module are called. A good performance increase is provided for blocksizes
greater or equal to 128B.

Patch has been tested with tcrypt and automated filesystem tests.

Tcrypt benchmark results:

Intel Core i5-2500 CPU (fam:6, model:42, step:7)


I started thinking about the performance on AMD Bulldozer.  
vmovq/vmovd/vpextr*/vpinsr* between FPU and general purpose registers  
on AMD CPU is a lot slower (latencies from 8 to 12 cycles) than on  
Intel sandy-bridge (where instructions have latency of 1 to 2). See:  
http://www.agner.org/optimize/instruction_tables.pdf


It would be really good, if implementation could be tested on AMD CPU  
to determine, if it causes performance regression. However I don't  
have access to machine with such CPU.


-Jussi





Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
 On Wed, Aug 15, 2012 at 04:48:54PM +0300, Jussi Kivilinna wrote:
  I posted a patch that optimizes twofish-avx a few weeks ago:
  http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2

  I'd be interested to know if this patch helps on Bulldozer.

 Sure, can you inline it here too please. The "Download message RAW" link
 on marc.info gives me a diff but patch says:

 patching file arch/x86/crypto/twofish-avx-x86_64-asm_64.S
 patch unexpectedly ends in middle of line

 Thanks.

Here...


Patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies and interleaves instructions better for out-of-order scheduling.

Also move common round code to separate function to reduce object size.

Tested on Core i5-2450M.
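
To illustrate the partial register write point (not part of the patch,
just a rough userspace sketch comparing a dependent movb chain against a
movzbl chain; the size of the difference is very CPU dependent, so the
numbers are only indicative):

/* gcc -O2 -o partialreg partialreg.c */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	static unsigned char table[256];	/* zero-initialized, keeps the index at 0 */
	uint64_t acc = 0, t0, t1, t2;
	long i, iters = 100000000L;

	t0 = rdtsc();
	for (i = 0; i < iters; i++)		/* movb writes only the low byte -> merge with old value */
		__asm__ __volatile__("movb (%1,%q0,1), %b0"
				     : "+r" (acc) : "r" (table) : "memory");
	t1 = rdtsc();
	for (i = 0; i < iters; i++)		/* movzbl writes the full register -> no partial write */
		__asm__ __volatile__("movzbl (%1,%q0,1), %k0"
				     : "+r" (acc) : "r" (table) : "memory");
	t2 = rdtsc();

	printf("movb:   ~%.1f cycles/lookup\n", (double)(t1 - t0) / iters);
	printf("movzbl: ~%.1f cycles/lookup\n", (double)(t2 - t1) / iters);
	return 0;
}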

---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S |  144 +--
 1 file changed, 92 insertions(+), 52 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S 
b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..42b27b7 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -47,15 +47,22 @@
 #define RC2 %xmm6
 #define RD2 %xmm7
 
-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
 
-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RX1 %xmm10
+#define RY1 %xmm11
+
+#define RK1 %xmm12
+#define RK2 %xmm13
+
+#define RT %xmm14
 
 #define RID1  %rax
+#define RID1d %eax
 #define RID1b %al
 #define RID2  %rbx
+#define RID2d %ebx
 #define RID2b %bl
 
 #define RGI1   %rdx
@@ -73,40 +80,45 @@
 #define RGS3d %r10d
 
 
-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   shrq $16,   src; \
movlt0(CTX, RID1, 4), dst ## d;  \
xorlt1(CTX, RID2, 4), dst ## d;  \
-   shrq $16,   src; \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   interleave_op(il_reg);   \
xorlt2(CTX, RID1, 4), dst ## d;  \
xorlt3(CTX, RID2, 4), dst ## d;
 
+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+   shrq $16,   reg;
+
 #define G(a, x, t0, t1, t2, t3) \
vmovq   a,RGI1;   \
-   vpsrldq $8, a,x;  \
-   vmovq   x,RGI2;   \
+   vpextrq $1, a,RGI2;   \
\
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
-   shrq $16,   RGI1; \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
-   shlq $32,   RGS2; \
-   orq RGS1, RGS2;   \
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \
+   vmovd   RGS1d, x;\
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \
+   vpinsrd $1, RGS2d, x, x; \
\
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
-   shrq $16,   RGI2; \
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
-   shlq $32,   RGS3; \
-   orq RGS1, RGS3;   \
-   \
-   vmovq   RGS2, x;  \
-   vpinsrq $1, RGS3, x, x;
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, shr_next, RGI2); \
+   vpinsrd $2, RGS1d, x, x; \
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, dummy, none); \
+   vpinsrd $3, RGS3d, x, x;
+
+#define encround_g1g2(a, b, c, d, x, y) \
+   G(a, x, s0, s1, s2, s3); \
+   G(b, y, s1, s2, s3, s0);
 
-#define encround(a, b, c, d, x, y) \
-   G(a, x, s0, s1, s2, s3);   \
-   G(b, y, s1, s2, s3, s0);   \
+#define encround_end(a, b, c, d, x, y) \
+   vpslld $1,  d, RT; \
+   vpsrld $(32 - 1),   d, d;  \
+   vpord, RT,  d; \
vpaddd  x, y,   x; \
vpaddd  y, x,   y; \
vpaddd  x, RK1, x; \
@@ -115,14 +127,16 @@
vpsrld $1,  c, x;  \
vpslld $(32 - 1),   c, c;  \
vporc, x,   c; \
-   vpslld $1,  d, x;  \
-   vpsrld $(32 - 1),   d, d;  \
-   vpord, x,   d; \
vpxor   d, y,   d;
 
-#define decround(a, b, c, d, x, y) \
-   G(a, x, s0, s1, s2, s3);   \
-   G(b, y, s1, s2, s3, s0);   \
+#define

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Borislav Petkov b...@alien8.de:

 On Wed, Aug 15, 2012 at 05:22:03PM +0300, Jussi Kivilinna wrote:

 Patch replaces 'movb' instructions with 'movzbl' to break false
 register dependencies and interleaves instructions better for
 out-of-order scheduling.

 Also move common round code to separate function to reduce object
 size.

 Ok, redid the first test


Thanks.

 $ modprobe twofish-avx-x86_64
 $ modprobe tcrypt mode=504 sec=1

 and from quickly juxtaposing the two results, I'd say the patch makes
 things slightly worse but you'd need to run your scripts on it to get
 the accurate results:


About 5% slower, probably because I was tuning for Sandy Bridge and introduced
more FPU<->CPU register moves.

Here's a new version of the patch, with the FPU<->CPU moves from the original
implementation.

(Note: this version also changes the encryption function to inline all code
into the main function, while decryption still places the common code in a
separate function to reduce object size. This is to measure the difference.)

-Jussi

---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S |  124 +--
 1 file changed, 77 insertions(+), 47 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S 
b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..d331ab8 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -47,15 +47,22 @@
 #define RC2 %xmm6
 #define RD2 %xmm7
 
-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
 
-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RX1 %xmm10
+#define RY1 %xmm11
+
+#define RK1 %xmm12
+#define RK2 %xmm13
+
+#define RT %xmm14
 
 #define RID1  %rax
+#define RID1d %eax
 #define RID1b %al
 #define RID2  %rbx
+#define RID2d %ebx
 #define RID2b %bl
 
 #define RGI1   %rdx
@@ -73,40 +80,48 @@
 #define RGS3d %r10d
 
 
-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   shrq $16,   src; \
movlt0(CTX, RID1, 4), dst ## d;  \
xorlt1(CTX, RID2, 4), dst ## d;  \
-   shrq $16,   src; \
-   movbsrc ## bl,RID1b; \
-   movbsrc ## bh,RID2b; \
+   movzbl  src ## bl,RID1d; \
+   movzbl  src ## bh,RID2d; \
+   interleave_op(il_reg);   \
xorlt2(CTX, RID1, 4), dst ## d;  \
xorlt3(CTX, RID2, 4), dst ## d;
 
+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+   shrq $16,   reg;
+
 #define G(a, x, t0, t1, t2, t3) \
-   vmovq   a,RGI1;   \
-   vpsrldq $8, a,x;  \
-   vmovq   x,RGI2;   \
+   vmovq   a, RGI1;  \
+   vpextrq $1, a, RGI2;  \
\
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
-   shrq $16,   RGI1; \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \
shlq $32,   RGS2; \
orq RGS1, RGS2;   \
\
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
-   shrq $16,   RGI2; \
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
-   shlq $32,   RGS3; \
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, shr_next, RGI2); \
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, dummy, none); \
+   shlq $32,   RGS1; \
orq RGS1, RGS3;   \
\
vmovq   RGS2, x;  \
vpinsrq $1, RGS3, x, x;
 
-#define encround(a, b, c, d, x, y) \
-   G(a, x, s0, s1, s2, s3);   \
-   G(b, y, s1, s2, s3, s0);   \
+#define encround_g1g2(a, b, c, d, x, y) \
+   G(a, x, s0, s1, s2, s3); \
+   G(b, y, s1, s2, s3, s0);
+
+#define encround_end(a, b, c, d, x, y) \
+   vpslld $1,  d, RT; \
+   vpsrld $(32 - 1),   d, d;  \
+   vpord, RT,  d; \
vpaddd  x, y,   x; \
vpaddd  y, x,   y; \
vpaddd  x, RK1, x; \
@@ -115,14 +130,16 @@
vpsrld $1,  c, x;  \
vpslld $(32 - 1),   c, c;  \
vporc, x,   c; \
-   vpslld $1,  d, x;  \
-   vpsrld $(32 - 1),   d, d;  \
-   vpord, x,   d; \
vpxor

Re: [PATCH] rndis_wlan: Fix potential memory leak in update_pmkid()

2012-08-08 Thread Jussi Kivilinna

Quoting Alexey Khoroshilov :

Do not leak memory by updating pointer with potentially NULL realloc  
return value.


Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov 


Thanks!

Acked-by: Jussi Kivilinna 
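
The idiom the patch applies, shown as a minimal standalone sketch (plain
realloc() standing in for krealloc(); the function below is made up for
illustration):

#include <stdlib.h>

/* Grow *bufp to newlen bytes without losing the old buffer on failure. */
static int grow_buffer(char **bufp, size_t newlen)
{
	char *tmp = realloc(*bufp, newlen);	/* never assign the result straight to *bufp */

	if (!tmp)
		return -1;	/* *bufp is untouched and can still be used or freed */

	*bufp = tmp;
	return 0;
}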


---
 drivers/net/wireless/rndis_wlan.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/rndis_wlan.c  
b/drivers/net/wireless/rndis_wlan.c

index 241162e..7a4ae9e 100644
--- a/drivers/net/wireless/rndis_wlan.c
+++ b/drivers/net/wireless/rndis_wlan.c
@@ -1803,6 +1803,7 @@ static struct ndis_80211_pmkid  
*update_pmkid(struct usbnet *usbdev,

struct cfg80211_pmksa *pmksa,
int max_pmkids)
 {
+   struct ndis_80211_pmkid *new_pmkids;
int i, err, newlen;
unsigned int count;

@@ -1833,11 +1834,12 @@ static struct ndis_80211_pmkid  
*update_pmkid(struct usbnet *usbdev,

/* add new pmkid */
newlen = sizeof(*pmkids) + (count + 1) * sizeof(pmkids->bssid_info[0]);

-   pmkids = krealloc(pmkids, newlen, GFP_KERNEL);
-   if (!pmkids) {
+   new_pmkids = krealloc(pmkids, newlen, GFP_KERNEL);
+   if (!new_pmkids) {
err = -ENOMEM;
goto error;
}
+   pmkids = new_pmkids;

pmkids->length = cpu_to_le32(newlen);
pmkids->bssid_info_count = cpu_to_le32(count + 1);
--
1.7.9.5










Re: linux-next: Tree for July 2 (crypto/hifn_795x)

2012-07-09 Thread Jussi Kivilinna

Quoting Randy Dunlap :


On 07/02/2012 12:23 AM, Stephen Rothwell wrote:


Hi all,

Changes since 20120629:




on i386:


ERROR: "__divdi3" [drivers/crypto/hifn_795x.ko] undefined!



This is caused by commit feb7b7ab928afa97a79a9c424e4e0691f49d63be.  
hifn_795x has "DIV_ROUND_UP(NSEC_PER_SEC, dev->pk_clk_freq)", which  
should be changed to DIV_ROUND_UP_ULL now that NSEC_PER_SEC is 64bit  
on 32bit archs. Patch to fix hifn_795x is attached (only compile  
tested).
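
For reference, a minimal userspace sketch of why __divdi3 shows up (assumes
a 32-bit build, e.g. "gcc -m32 -O2"; the values below are made up). A 64-bit
division with a non-constant divisor is compiled into a call to the libgcc
helper (__divdi3/__udivdi3 depending on signedness), which the kernel does
not link against; DIV_ROUND_UP_ULL avoids that by doing the division through
do_div().

/* gcc -m32 -O2 -o div64 div64.c ; objdump -d div64 | grep divdi3 */
#include <stdint.h>
#include <stdio.h>

static uint64_t div_round_up64(uint64_t n, uint32_t d)
{
	return (n + d - 1) / d;		/* plain 64-bit '/' -> __udivdi3 call on i386 */
}

int main(int argc, char **argv)
{
	uint32_t pk_clk_freq = (uint32_t)argc * 66;	/* made-up, non-constant divisor */

	(void)argv;
	printf("%llu\n", (unsigned long long)
	       (div_round_up64(1000000000ULL, pk_clk_freq) * 256));
	return 0;
}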


-Jussi

crypto: hifn_795x - fix 64bit division and undefined __divdi3 on 32bit archs

From: Jussi Kivilinna 

Commit feb7b7ab928afa97a79a9c424e4e0691f49d63be changed NSEC_PER_SEC to 64-bit
constant, which causes "DIV_ROUND_UP(NSEC_PER_SEC, dev->pk_clk_freq)" to
generate __divdi3 call on 32-bit archs. Fix this by changing DIV_ROUND_UP to
DIV_ROUND_UP_ULL.

Signed-off-by: Jussi Kivilinna 
---
 drivers/crypto/hifn_795x.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/hifn_795x.c b/drivers/crypto/hifn_795x.c
index c9c4bef..df14358 100644
--- a/drivers/crypto/hifn_795x.c
+++ b/drivers/crypto/hifn_795x.c
@@ -821,8 +821,8 @@ static int hifn_register_rng(struct hifn_device *dev)
 	/*
 	 * We must wait at least 256 Pk_clk cycles between two reads of the rng.
 	 */
-	dev->rng_wait_time	= DIV_ROUND_UP(NSEC_PER_SEC, dev->pk_clk_freq) *
-  256;
+	dev->rng_wait_time	= DIV_ROUND_UP_ULL(NSEC_PER_SEC,
+		   dev->pk_clk_freq) * 256;
 
 	dev->rng.name		= dev->name;
 	dev->rng.data_present	= hifn_rng_data_present,

