Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
Hi Max,

2011/8/8 Locktyukhin, Maxim maxim.locktyuk...@intel.com:
> I'd like to note that at Intel we very much appreciate Mathias' effort
> to port/integrate this implementation into the Linux kernel!
>
> $0.02 re the tcrypt perf numbers below: I believe something must be
> terribly broken with the tcrypt measurements. 20 (and more) cycles per
> byte shown below are not reasonable numbers for SHA-1 - ~6 c/b (as can
> be seen in some of the results for Core2) is the expected result. So,
> while the relative improvement seen is sort of consistent, the
> absolute performance numbers are very much off (and yes, Sandy Bridge
> AVX code is expected to be faster than Core2/SSSE3 - ~5.2 c/b vs.
> ~5.8 c/b at the level of the sha1_update() call, to be more precise).
>
> This does not affect the proposed patch in any way; it looks like a
> tcrypt timing problem to me. I'd even venture a guess that it may be
> due to the use of RDTSC, which gets affected significantly by
> Turbo/EIST. The TSC is isotropic in time but not with the core clock
> domain, i.e. RDTSC cannot be used to measure core cycles without at
> least disabling EIST and Turbo, or doing runtime adjustment of the
> actual bus/core clock ratio vs. the standard ratio always used by the
> TSC - I could elaborate more if someone is interested.

I found the Sandy Bridge numbers odd too, but suspected it might be
because of the laptop platform. The SSSE3 numbers on this platform were
slightly lower than the AVX numbers, and both were still way off the
ones for the Core2 system. But your explanation fits well, too. It might
be EIST or Turbo mode that tampered with the numbers. Another, maybe
more likely, cause might be the overhead Andy mentioned.

> thanks again,
> -Max

Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto"
in the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
On Thu, Aug 11, 2011 at 4:50 PM, Andy Lutomirski l...@mit.edu wrote:
> I have vague plans to clean up extended state handling and make
> kernel_fpu_begin work efficiently from any context. (i.e. the first
> kernel_fpu_begin after a context switch could take up to ~60 ns on
> Sandy Bridge, but further calls to kernel_fpu_begin would be a single
> branch.) The current code that handles context switches when user code
> is using extended state is terrible and will almost certainly become
> faster in the near future.

Sounds good! This would not only improve the performance of sha1_ssse3
but of aesni as well.

> Hopefully I'll have patches for 3.2 or 3.3. IOW, please don't
> introduce another thing like the fpu crypto module quite yet unless
> there's a good reason. I'm looking forward to deleting the fpu module
> entirely.

I've no intention to. So please go ahead and do so.

Mathias
Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
On 08/04/2011 02:44 AM, Herbert Xu wrote:
> On Sun, Jul 24, 2011 at 07:53:14PM +0200, Mathias Krause wrote:
>> With this algorithm I was able to increase the throughput of a single
>> IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using
>> the SSSE3 variant -- a speedup of +34.8%.
>
> Were you testing this on the transmit side or the receive side? As the
> IPsec receive code path usually runs in a softirq context, does this
> code have any effect there at all?
>
> This is pretty similar to the situation with the Intel AES code. Over
> there they solved it by using the asynchronous interface and deferring
> the processing to a work queue.

I have vague plans to clean up extended state handling and make
kernel_fpu_begin work efficiently from any context. (i.e. the first
kernel_fpu_begin after a context switch could take up to ~60 ns on
Sandy Bridge, but further calls to kernel_fpu_begin would be a single
branch.) The current code that handles context switches when user code
is using extended state is terrible and will almost certainly become
faster in the near future.

Hopefully I'll have patches for 3.2 or 3.3. IOW, please don't introduce
another thing like the fpu crypto module quite yet unless there's a good
reason. I'm looking forward to deleting the fpu module entirely.

--Andy
Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
On Thu, Aug 11, 2011 at 10:50:49AM -0400, Andy Lutomirski wrote:
>> This is pretty similar to the situation with the Intel AES code. Over
>> there they solved it by using the asynchronous interface and
>> deferring the processing to a work queue.
>
> I have vague plans to clean up extended state handling and make
> kernel_fpu_begin work efficiently from any context. (i.e. the first
> kernel_fpu_begin after a context switch could take up to ~60 ns on
> Sandy Bridge, but further calls to kernel_fpu_begin would be a single
> branch.)

This is all well and good but you still need to deal with the case of
!irq_fpu_usable.

Cheers,
--
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
On Thu, Aug 11, 2011 at 11:08 AM, Herbert Xu
herb...@gondor.hengli.com.au wrote:
> On Thu, Aug 11, 2011 at 10:50:49AM -0400, Andy Lutomirski wrote:
>>> This is pretty similar to the situation with the Intel AES code.
>>> Over there they solved it by using the asynchronous interface and
>>> deferring the processing to a work queue.
>>
>> I have vague plans to clean up extended state handling and make
>> kernel_fpu_begin work efficiently from any context. (i.e. the first
>> kernel_fpu_begin after a context switch could take up to ~60 ns on
>> Sandy Bridge, but further calls to kernel_fpu_begin would be a single
>> branch.)
>
> This is all well and good but you still need to deal with the case of
> !irq_fpu_usable.

I think I can even get rid of that. Of course, until that happens, code
still needs to handle !irq_fpu_usable.

(Also, calling these things kernel_fpu_begin() is dangerous. It's not
actually safe to use floating-point instructions after calling
kernel_fpu_begin. Integer SIMD instructions are okay, though. The issue
is that kernel_fpu_begin doesn't initialize MXCSR, and there are MXCSR
values that will cause any floating-point instruction to trap regardless
of its arguments.)

--Andy
Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
On Mon, Aug 8, 2011 at 1:48 PM, Locktyukhin, Maxim
maxim.locktyuk...@intel.com wrote:
> 20 (and more) cycles per byte shown below are not reasonable numbers
> for SHA-1 - ~6 c/b (as can be seen in some of the results for Core2)
> is the expected results ...

Ten years ago, on Pentium II, one benchmark showed 13 cycles/byte for
SHA-1:
http://www.freeswan.org/freeswan_trees/freeswan-2.06/doc/performance.html#perf.estimate
RE: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
I'd like to note that at Intel we very much appreciate Mathias' effort
to port/integrate this implementation into the Linux kernel!

$0.02 re the tcrypt perf numbers below: I believe something must be
terribly broken with the tcrypt measurements. 20 (and more) cycles per
byte shown below are not reasonable numbers for SHA-1 - ~6 c/b (as can
be seen in some of the results for Core2) is the expected result. So,
while the relative improvement seen is sort of consistent, the absolute
performance numbers are very much off (and yes, Sandy Bridge AVX code is
expected to be faster than Core2/SSSE3 - ~5.2 c/b vs. ~5.8 c/b at the
level of the sha1_update() call, to be more precise).

This does not affect the proposed patch in any way; it looks like a
tcrypt timing problem to me. I'd even venture a guess that it may be due
to the use of RDTSC, which gets affected significantly by Turbo/EIST.
The TSC is isotropic in time but not with the core clock domain, i.e.
RDTSC cannot be used to measure core cycles without at least disabling
EIST and Turbo, or doing runtime adjustment of the actual bus/core clock
ratio vs. the standard ratio always used by the TSC - I could elaborate
more if someone is interested.

thanks again,
-Max

-----Original Message-----
From: Mathias Krause [mailto:mini...@googlemail.com]
Sent: Thursday, August 04, 2011 10:05 AM
To: Herbert Xu
Cc: David S. Miller; linux-crypto@vger.kernel.org; Locktyukhin, Maxim;
linux-ker...@vger.kernel.org
Subject: Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation
for x86-64

On Thu, Aug 4, 2011 at 8:44 AM, Herbert Xu herb...@gondor.apana.org.au
wrote:
> On Sun, Jul 24, 2011 at 07:53:14PM +0200, Mathias Krause wrote:
>> With this algorithm I was able to increase the throughput of a single
>> IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using
>> the SSSE3 variant -- a speedup of +34.8%.
>
> Were you testing this on the transmit side or the receive side?

I was running an iperf test on two directly connected systems. Both
sides showed me those numbers (iperf server and client).

> As the IPsec receive code path usually runs in a softirq context, does
> this code have any effect there at all?

It does. Just have a look at how fpu_available() is implemented:

,----[ arch/x86/include/asm/i387.h ]
| static inline bool irq_fpu_usable(void)
| {
|         struct pt_regs *regs;
|
|         return !in_interrupt() || !(regs = get_irq_regs()) || \
|                user_mode(regs) || (read_cr0() & X86_CR0_TS);
| }
`----

So, it'll fail in softirq context when the softirq interrupted a kernel
thread or TS in CR0 is set. When it interrupted a userland thread that
hasn't the TS flag set in CR0, i.e. the CPU won't generate an exception
when we use the FPU, it'll work in softirq context, too. With a busy
userland making extensive use of the FPU it'll almost always have to
fall back to the generic implementation, right. However, using this
module on an IPsec gateway with no real userland at all, you get a nice
performance gain.

> This is pretty similar to the situation with the Intel AES code. Over
> there they solved it by using the asynchronous interface and deferring
> the processing to a work queue.

Interesting. I'll look into this.

> Now I'm still happy to take this because hashing is very different
> from ciphers in that some users tend to hash small amounts of data all
> the time. Those users will typically use the shash interface that you
> provide here. So I'm interested to know how much of an improvement
> this is for those users (< 64 bytes).

Anything below 64 bytes will (and has to) be padded to a full block,
i.e. 64 bytes.

> If you run the tcrypt speed tests that should provide some useful
> info.

I've summarized the mean values of five consecutive tcrypt runs from two
different systems. The first system is an Intel Core i7 2620M based
notebook running at 2.70 GHz. It's a Sandy Bridge processor, so it could
make use of the AVX variant. The second system was an Intel Core 2 Quad
Xeon system running at 2.40 GHz -- no AVX, but SSSE3.

Since the output of tcrypt is a little awkward to read, I've condensed
it slightly to make it (hopefully) more readable. Please interpret the
table as follows: the triple in the first column is (byte blocks | bytes
per update | updates); c/B is cycles per byte. Here are the numbers for
the first system:

                      sha1-generic             sha1-ssse3 (AVX)
(  16 |   16 |   1):   9.65 MiB/s, 266.2 c/B    12.93 MiB/s, 200.0 c/B
(  64 |   16 |   4):  19.05 MiB/s, 140.2 c/B    25.27 MiB/s, 105.6 c/B
(  64 |   64 |   1):  21.35 MiB/s, 119.2 c/B    29.29 MiB/s,  87.0 c/B
( 256 |   16 |  16):  28.81 MiB/s,  88.8 c/B    37.70 MiB/s,  68.4 c/B
( 256 |   64 |   4):  34.58 MiB/s,  74.0 c/B    47.16 MiB/s,  54.8 c/B
( 256 |  256 |   1):  37.44 MiB/s,  68.0 c/B    69.01 MiB/s,  36.8 c/B
Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
On Sun, Jul 24, 2011 at 07:53:14PM +0200, Mathias Krause wrote:
> With this algorithm I was able to increase the throughput of a single
> IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using
> the SSSE3 variant -- a speedup of +34.8%.

Were you testing this on the transmit side or the receive side? As the
IPsec receive code path usually runs in a softirq context, does this
code have any effect there at all?

This is pretty similar to the situation with the Intel AES code. Over
there they solved it by using the asynchronous interface and deferring
the processing to a work queue. This also avoids the situation where you
have an FPU/SSE using process that also tries to transmit over IPsec,
thrashing the FPU state.

Now I'm still happy to take this because hashing is very different from
ciphers in that some users tend to hash small amounts of data all the
time. Those users will typically use the shash interface that you
provide here. So I'm interested to know how much of an improvement this
is for those users (< 64 bytes).

If you run the tcrypt speed tests that should provide some useful info.

Thanks,
--
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
On Thu, Aug 4, 2011 at 8:44 AM, Herbert Xu herb...@gondor.apana.org.au
wrote:
> On Sun, Jul 24, 2011 at 07:53:14PM +0200, Mathias Krause wrote:
>> With this algorithm I was able to increase the throughput of a single
>> IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using
>> the SSSE3 variant -- a speedup of +34.8%.
>
> Were you testing this on the transmit side or the receive side?

I was running an iperf test on two directly connected systems. Both
sides showed me those numbers (iperf server and client).

> As the IPsec receive code path usually runs in a softirq context, does
> this code have any effect there at all?

It does. Just have a look at how fpu_available() is implemented:

,----[ arch/x86/include/asm/i387.h ]
| static inline bool irq_fpu_usable(void)
| {
|         struct pt_regs *regs;
|
|         return !in_interrupt() || !(regs = get_irq_regs()) || \
|                user_mode(regs) || (read_cr0() & X86_CR0_TS);
| }
`----

So, it'll fail in softirq context when the softirq interrupted a kernel
thread or TS in CR0 is set. When it interrupted a userland thread that
hasn't the TS flag set in CR0, i.e. the CPU won't generate an exception
when we use the FPU, it'll work in softirq context, too. With a busy
userland making extensive use of the FPU it'll almost always have to
fall back to the generic implementation, right. However, using this
module on an IPsec gateway with no real userland at all, you get a nice
performance gain.

> This is pretty similar to the situation with the Intel AES code. Over
> there they solved it by using the asynchronous interface and deferring
> the processing to a work queue. This also avoids the situation where
> you have an FPU/SSE using process that also tries to transmit over
> IPsec, thrashing the FPU state.

Interesting. I'll look into this.

> Now I'm still happy to take this because hashing is very different
> from ciphers in that some users tend to hash small amounts of data all
> the time. Those users will typically use the shash interface that you
> provide here. So I'm interested to know how much of an improvement
> this is for those users (< 64 bytes).

Anything below 64 bytes will (and has to) be padded to a full block,
i.e. 64 bytes.

> If you run the tcrypt speed tests that should provide some useful
> info.

I've summarized the mean values of five consecutive tcrypt runs from two
different systems. The first system is an Intel Core i7 2620M based
notebook running at 2.70 GHz. It's a Sandy Bridge processor, so it could
make use of the AVX variant. The second system was an Intel Core 2 Quad
Xeon system running at 2.40 GHz -- no AVX, but SSSE3.

Since the output of tcrypt is a little awkward to read, I've condensed
it slightly to make it (hopefully) more readable. Please interpret the
table as follows: the triple in the first column is (byte blocks | bytes
per update | updates); c/B is cycles per byte. Here are the numbers for
the first system:

                      sha1-generic             sha1-ssse3 (AVX)
(  16 |   16 |   1):   9.65 MiB/s, 266.2 c/B    12.93 MiB/s, 200.0 c/B
(  64 |   16 |   4):  19.05 MiB/s, 140.2 c/B    25.27 MiB/s, 105.6 c/B
(  64 |   64 |   1):  21.35 MiB/s, 119.2 c/B    29.29 MiB/s,  87.0 c/B
( 256 |   16 |  16):  28.81 MiB/s,  88.8 c/B    37.70 MiB/s,  68.4 c/B
( 256 |   64 |   4):  34.58 MiB/s,  74.0 c/B    47.16 MiB/s,  54.8 c/B
( 256 |  256 |   1):  37.44 MiB/s,  68.0 c/B    69.01 MiB/s,  36.8 c/B
(1024 |   16 |  64):  33.55 MiB/s,  76.2 c/B    43.77 MiB/s,  59.0 c/B
(1024 |  256 |   4):  45.12 MiB/s,  58.0 c/B    88.90 MiB/s,  28.8 c/B
(1024 | 1024 |   1):  46.69 MiB/s,  54.0 c/B   104.39 MiB/s,  25.6 c/B
(2048 |   16 | 128):  34.66 MiB/s,  74.0 c/B    44.93 MiB/s,  57.2 c/B
(2048 |  256 |   8):  46.81 MiB/s,  54.0 c/B    93.83 MiB/s,  27.0 c/B
(2048 | 1024 |   2):  48.28 MiB/s,  52.4 c/B   110.98 MiB/s,  23.0 c/B
(2048 | 2048 |   1):  48.69 MiB/s,  52.0 c/B   114.26 MiB/s,  22.0 c/B
(4096 |   16 | 256):  35.15 MiB/s,  72.6 c/B    45.53 MiB/s,  56.0 c/B
(4096 |  256 |  16):  47.69 MiB/s,  53.0 c/B    96.46 MiB/s,  26.0 c/B
(4096 | 1024 |   4):  49.24 MiB/s,  51.0 c/B   114.36 MiB/s,  22.0 c/B
(4096 | 4096 |   1):  49.77 MiB/s,  51.0 c/B   119.80 MiB/s,  21.0 c/B
(8192 |   16 | 512):  35.46 MiB/s,  72.2 c/B    45.84 MiB/s,  55.8 c/B
(8192 |  256 |  32):  48.15 MiB/s,  53.0 c/B    97.83 MiB/s,  26.0 c/B
(8192 | 1024 |   8):  49.73 MiB/s,  51.0 c/B   116.35 MiB/s,  22.0 c/B
(8192 | 4096 |   2):  50.10 MiB/s,  50.8 c/B   121.66 MiB/s,  21.0 c/B
(8192 | 8192 |   1):  50.25 MiB/s,  50.8 c/B   121.87 MiB/s,  21.0 c/B

For the second system I got the following numbers:

                      sha1-generic             sha1-ssse3 (SSSE3)
(  16 |   16 |   1):  27.23 MiB/s, 106.6 c/B    32.86 MiB/s,  73.8 c/B
(  64 |   16 |   4):  51.67 MiB/s,  54.0 c/B    61.90 MiB/s,  37.8 c/B
(  64 |   64 |   1):  62.44 MiB/s,  44.2 c/B    74.16 MiB/s,  31.6 c/B
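[Editor's note: the fast-path/fallback pattern discussed above (SIMD when
irq_fpu_usable() allows it, the generic C transform otherwise) boils down
to roughly the following glue-code shape. This is a hedged sketch, not
the patch's verbatim code; the helper names mirror the kernel APIs named
in the thread (irq_fpu_usable(), kernel_fpu_begin()/kernel_fpu_end(),
crypto_sha1_update()).]

```c
/* Sketch only -- not the patch's exact code. */
static int sha1_ssse3_update(struct shash_desc *desc, const u8 *data,
                             unsigned int len)
{
        /* In softirq context that interrupted a kernel thread (or with
         * CR0.TS set), the SSE/YMM registers must not be touched. */
        if (!irq_fpu_usable())
                return crypto_sha1_update(desc, data, len); /* generic path */

        kernel_fpu_begin();     /* save user FPU/SSE state; SIMD allowed */
        /* ... process the 64-byte blocks with the SSSE3/AVX transform ... */
        kernel_fpu_end();       /* restore state */
        return 0;
}
```

This is why the gain shows up mainly on gateways with an idle userland:
a busy FPU-using userland pushes the receive path into the fallback.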
Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
On Thu, Aug 4, 2011 at 7:05 PM, Mathias Krause mini...@googlemail.com wrote:
> It does. Just have a look at how fpu_available() is implemented:

read: irq_fpu_usable()
[PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64
This is an assembler implementation of the SHA1 algorithm using the
Supplemental SSE3 (SSSE3) instructions or, when available, the Advanced
Vector Extensions (AVX).

Testing with the tcrypt module shows the raw hash performance is up to
2.3 times faster than the C implementation, using 8k data blocks on a
Core 2 Duo T5500. For the smallest data set (16 byte) it is still 25%
faster.

Since this implementation uses SSE/YMM registers it cannot safely be
used in every situation, e.g. while an IRQ interrupts a kernel thread.
The implementation falls back to the generic SHA1 variant, if using the
SSE/YMM registers is not possible.

With this algorithm I was able to increase the throughput of a single
IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using the
SSSE3 variant -- a speedup of +34.8%.

Saving and restoring SSE/YMM state might make the actual throughput
fluctuate when there are FPU intensive userland applications running.
For example, measuring the performance using iperf2 directly on the
machine under test gives wobbling numbers because iperf2 uses the FPU
for each packet to check if the reporting interval has expired (in the
above test I got min/max/avg: 402/484/464 MBit/s).

Using this algorithm on an IPsec gateway gives much more reasonable and
stable numbers, albeit not as high as in the directly connected case.
Here is the result from an RFC 2544 test run with an EXFO Packet Blazer
FTB-8510:

 frame size   sha1-generic    sha1-ssse3     delta
    64 byte    37.5 MBit/s    37.5 MBit/s    0.0%
   128 byte    56.3 MBit/s    62.5 MBit/s  +11.0%
   256 byte    87.5 MBit/s   100.0 MBit/s  +14.3%
   512 byte   131.3 MBit/s   150.0 MBit/s  +14.2%
  1024 byte   162.5 MBit/s   193.8 MBit/s  +19.3%
  1280 byte   175.0 MBit/s   212.5 MBit/s  +21.4%
  1420 byte   175.0 MBit/s   218.7 MBit/s  +25.0%
  1518 byte   150.0 MBit/s   181.2 MBit/s  +20.8%

The throughput for the largest frame size is lower than for the previous
size because the IP packets need to be fragmented in this case to make
their way through the IPsec tunnel.
Signed-off-by: Mathias Krause mini...@googlemail.com
Cc: Maxim Locktyukhin maxim.locktyuk...@intel.com
---
 arch/x86/crypto/Makefile          |    8 +
 arch/x86/crypto/sha1_ssse3_asm.S  |  558 +
 arch/x86/crypto/sha1_ssse3_glue.c |  240
 arch/x86/include/asm/cpufeature.h |    3 +
 crypto/Kconfig                    |   10 +
 5 files changed, 819 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/crypto/sha1_ssse3_asm.S
 create mode 100644 arch/x86/crypto/sha1_ssse3_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index c04f1b7..57c7f7b 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -13,6 +13,7 @@ obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
 obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
 obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o
+obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o

 aes-i586-y := aes-i586-asm_32.o aes_glue.o
 twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
@@ -25,3 +26,10 @@ salsa20-x86_64-y := salsa20-x86_64-asm_64.o salsa20_glue.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o

 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
+
+# enable AVX support only when $(AS) can actually assemble the instructions
+ifeq ($(call as-instr,vpxor %xmm0$(comma)%xmm1$(comma)%xmm2,yes,no),yes)
+AFLAGS_sha1_ssse3_asm.o += -DSHA1_ENABLE_AVX_SUPPORT
+CFLAGS_sha1_ssse3_glue.o += -DSHA1_ENABLE_AVX_SUPPORT
+endif
+sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
diff --git a/arch/x86/crypto/sha1_ssse3_asm.S b/arch/x86/crypto/sha1_ssse3_asm.S
new file mode 100644
index 000..b2c2f57
--- /dev/null
+++ b/arch/x86/crypto/sha1_ssse3_asm.S
@@ -0,0 +1,558 @@
+/*
+ * This is a SIMD SHA-1 implementation. It requires the Intel(R) Supplemental
+ * SSE3 instruction set extensions introduced in Intel Core Microarchitecture
+ * processors. CPUs supporting Intel(R) AVX extensions will get an additional
+ * boost.
+ *
+ * This work was inspired by the vectorized implementation of Dean Gaudet.
+ * Additional information on it can be found at:
+ *    http://www.arctic.org/~dean/crypto/sha1.html
+ *
+ * It was improved upon with more efficient vectorization of the message
+ * scheduling. This implementation has also been optimized for all current and
+ * several future generations of Intel CPUs.
+ *
+ * See this article for more information about the implementation details:
+ *    http://software.intel.com/en-us/articles/improving-the-performance-of-the-secure-hash-algorithm-1/
+ *
+ * Copyright (C) 2010, Intel Corp.
+ *      Authors: Maxim Locktyukhin maxim.locktyuk...@intel.com
+ *               Ronen Zohar ronen.zo...@intel.com
+ *
+ * Converted to AT&T syntax and adapted for inclusion in the Linux kernel:
+ *      Author: Mathias Krause mini...@googlemail.com
+ *
+ * This program is free