Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-14 Thread Mathias Krause
Hi Max,

2011/8/8 Locktyukhin, Maxim maxim.locktyuk...@intel.com:
 I'd like to note that at Intel we very much appreciate Mathias' effort to 
 port/integrate this implementation into the Linux kernel!


 $0.02 re tcrypt perf numbers below: I believe something must be terribly 
 broken with the tcrypt measurements

 20 (and more) cycles per byte shown below are not reasonable numbers for 
 SHA-1 - ~6 c/b (as can be seen in some of the results for Core2) is the 
 expected result ... so, while the relative improvement seen is sort of 
 consistent, the absolute performance numbers are very much off (and yes, Sandy 
 Bridge AVX code is expected to be faster than Core2/SSSE3 - ~5.2 c/b vs. 
 ~5.8 c/b on the level of the sha1_update() call, to be more precise)

 this does not affect the proposed patch in any way, it looks like tcrypt's 
 timing problem to me - I'd even venture a guess that it may be due to the use 
 of RDTSC (that gets affected significantly by Turbo/EIST, TSC is isotropic in 
 time but not with the core clock domain, i.e. RDTSC cannot be used to measure 
 core cycles without at least disabling EIST and Turbo, or doing runtime 
 adjustment of actual bus/core clock ratio vs. the standard ratio always used 
 by TSC - I could elaborate more if someone is interested)

I found the Sandy Bridge numbers odd too, but suspected it might be
because of the laptop platform. The SSSE3 numbers on this platform
were slightly lower than the AVX numbers, yet both were still way off
the ones for the Core2 system. But your explanation fits well, too. It
might be EIST or Turbo mode that tampered with the numbers. Another,
maybe more likely, cause might be the overhead Andy mentioned.

 thanks again,
 -Max


Mathias
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-14 Thread Mathias Krause
On Thu, Aug 11, 2011 at 4:50 PM, Andy Lutomirski l...@mit.edu wrote:
 I have vague plans to clean up extended state handling and make
 kernel_fpu_begin work efficiently from any context.  (i.e. the first
 kernel_fpu_begin after a context switch could take up to ~60 ns on Sandy
 Bridge, but further calls to kernel_fpu_begin would be a single branch.)

 The current code that handles context switches when user code is using
 extended state is terrible and will almost certainly become faster in the
 near future.

Sounds good! This would not only improve the performance of sha1_ssse3
but of aesni as well.

 Hopefully I'll have patches for 3.2 or 3.3.

 IOW, please don't introduce another thing like the fpu crypto module quite
 yet unless there's a good reason.  I'm looking forward to deleting the fpu
 module entirely.

I've no intention to. So please go ahead and do so.


Mathias


Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-11 Thread Andy Lutomirski

On 08/04/2011 02:44 AM, Herbert Xu wrote:

On Sun, Jul 24, 2011 at 07:53:14PM +0200, Mathias Krause wrote:


With this algorithm I was able to increase the throughput of a single
IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using
the SSSE3 variant -- a speedup of +34.8%.


Were you testing this on the transmit side or the receive side?

As the IPsec receive code path usually runs in a softirq context,
does this code have any effect there at all?

This is pretty similar to the situation with the Intel AES code.
Over there they solved it by using the asynchronous interface and
deferring the processing to a work queue.


I have vague plans to clean up extended state handling and make 
kernel_fpu_begin work efficiently from any context.  (i.e. the first 
kernel_fpu_begin after a context switch could take up to ~60 ns on Sandy 
Bridge, but further calls to kernel_fpu_begin would be a single branch.)


The current code that handles context switches when user code is using 
extended state is terrible and will almost certainly become faster in 
the near future.


Hopefully I'll have patches for 3.2 or 3.3.

IOW, please don't introduce another thing like the fpu crypto module 
quite yet unless there's a good reason.  I'm looking forward to deleting 
the fpu module entirely.


--Andy


Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-11 Thread Herbert Xu
On Thu, Aug 11, 2011 at 10:50:49AM -0400, Andy Lutomirski wrote:

 This is pretty similar to the situation with the Intel AES code.
 Over there they solved it by using the asynchronous interface and
 deferring the processing to a work queue.

 I have vague plans to clean up extended state handling and make  
 kernel_fpu_begin work efficiently from any context.  (i.e. the first  
 kernel_fpu_begin after a context switch could take up to ~60 ns on Sandy  
 Bridge, but further calls to kernel_fpu_begin would be a single branch.)

This is all well and good but you still need to deal with the
case of !irq_fpu_usable.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-11 Thread Andrew Lutomirski
On Thu, Aug 11, 2011 at 11:08 AM, Herbert Xu
herb...@gondor.hengli.com.au wrote:
 On Thu, Aug 11, 2011 at 10:50:49AM -0400, Andy Lutomirski wrote:

 This is pretty similar to the situation with the Intel AES code.
 Over there they solved it by using the asynchronous interface and
 deferring the processing to a work queue.

 I have vague plans to clean up extended state handling and make
 kernel_fpu_begin work efficiently from any context.  (i.e. the first
 kernel_fpu_begin after a context switch could take up to ~60 ns on Sandy
 Bridge, but further calls to kernel_fpu_begin would be a single branch.)

 This is all well and good but you still need to deal with the
 case of !irq_fpu_usable.

I think I can even get rid of that.  Of course, until that happens,
code still needs to handle !irq_fpu_usable.

(Also, calling these things kernel_fpu_begin() is dangerous.  It's not
actually safe to use floating-point instructions after calling
kernel_fpu_begin.  Integer SIMD instructions are okay, though.  The
issue is that kernel_fpu_begin doesn't initialize MXCSR, and there are
MXCSR values that will cause any floating-point instruction to trap
regardless of its arguments.)

--Andy


Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-08 Thread Sandy Harris
On Mon, Aug 8, 2011 at 1:48 PM, Locktyukhin, Maxim
maxim.locktyuk...@intel.com wrote:

 20 (and more) cycles per byte shown below are not reasonable numbers for SHA-1
 - ~6 c/b (as can be seen in some of the results for Core2) is the expected 
 result ...

Ten years ago, on Pentium II, one benchmark showed 13 cycles/byte for SHA-1.
http://www.freeswan.org/freeswan_trees/freeswan-2.06/doc/performance.html#perf.estimate


RE: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-07 Thread Locktyukhin, Maxim
I'd like to note that at Intel we very much appreciate Mathias' effort to 
port/integrate this implementation into the Linux kernel!


$0.02 re tcrypt perf numbers below: I believe something must be terribly broken 
with the tcrypt measurements

20 (and more) cycles per byte shown below are not reasonable numbers for SHA-1 
- ~6 c/b (as can be seen in some of the results for Core2) is the expected 
result ... so, while the relative improvement seen is sort of consistent, the 
absolute performance numbers are very much off (and yes, Sandy Bridge AVX code 
is expected to be faster than Core2/SSSE3 - ~5.2 c/b vs. ~5.8 c/b on the 
level of the sha1_update() call, to be more precise)

this does not affect the proposed patch in any way - it looks like a tcrypt 
timing problem to me. I'd even venture a guess that it may be due to the use 
of RDTSC (which gets affected significantly by Turbo/EIST: the TSC is isotropic 
in time but not with the core clock domain, i.e. RDTSC cannot be used to measure 
core cycles without at least disabling EIST and Turbo, or doing runtime 
adjustment of the actual bus/core clock ratio vs. the standard ratio always used 
by the TSC - I could elaborate more if someone is interested)

thanks again,
-Max



Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-04 Thread Herbert Xu
On Sun, Jul 24, 2011 at 07:53:14PM +0200, Mathias Krause wrote:

 With this algorithm I was able to increase the throughput of a single
 IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using
 the SSSE3 variant -- a speedup of +34.8%.

Were you testing this on the transmit side or the receive side?

As the IPsec receive code path usually runs in a softirq context,
does this code have any effect there at all?

This is pretty similar to the situation with the Intel AES code.
Over there they solved it by using the asynchronous interface and
deferring the processing to a work queue.

This also avoids the situation where you have an FPU/SSE using
process that also tries to transmit over IPsec thrashing the
FPU state.

Now I'm still happy to take this because hashing is very different
from ciphers in that some users tend to hash small amounts of data
all the time.  Those users will typically use the shash interface
that you provide here.

So I'm interested to know how much of an improvement this is for
those users (< 64 bytes).  If you run the tcrypt speed tests that
should provide some useful info.

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-04 Thread Mathias Krause
On Thu, Aug 4, 2011 at 8:44 AM, Herbert Xu herb...@gondor.apana.org.au wrote:
 On Sun, Jul 24, 2011 at 07:53:14PM +0200, Mathias Krause wrote:

 With this algorithm I was able to increase the throughput of a single
 IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using
 the SSSE3 variant -- a speedup of +34.8%.

 Were you testing this on the transmit side or the receive side?

I was running an iperf test on two directly connected systems. Both sides
showed me those numbers (iperf server and client).

 As the IPsec receive code path usually runs in a softirq context,
 does this code have any effect there at all?

It does. Just have a look at how fpu_available() is implemented:

,-[ arch/x86/include/asm/i387.h ]
| static inline bool irq_fpu_usable(void)
| {
| struct pt_regs *regs;
|
| return !in_interrupt() || !(regs = get_irq_regs()) || \
| user_mode(regs) || (read_cr0() & X86_CR0_TS);
| }
`

So, it'll fail in softirq context when the softirq interrupted a kernel thread
or TS in CR0 is set. When it interrupted a userland thread that hasn't the TS
flag set in CR0, i.e. the CPU won't generate an exception when we use the FPU,
it'll work in softirq context, too.

With a busy userland making extensive use of the FPU it'll almost always have
to fall back to the generic implementation, right. However, using this module
on an IPsec gateway with no real userland at all, you get a nice performance
gain.

 This is pretty similar to the situation with the Intel AES code.
 Over there they solved it by using the asynchronous interface and
 deferring the processing to a work queue.

 This also avoids the situation where you have an FPU/SSE using
 process that also tries to transmit over IPsec thrashing the
 FPU state.

Interesting. I'll look into this.

 Now I'm still happy to take this because hashing is very different
 from ciphers in that some users tend to hash small amounts of data
 all the time.  Those users will typically use the shash interface
 that you provide here.

 So I'm interested to know how much of an improvement this is for
 those users (< 64 bytes).

Anything below 64 bytes will (and has to) be padded to a full block, i.e. 64
bytes.

 If you run the tcrypt speed tests that should provide some useful info.

I've summarized the mean values of five consecutive tcrypt runs from two
different systems. The first system is an Intel Core i7 2620M based notebook
running at 2.70 GHz. It's a Sandy Bridge processor so could make use of the
AVX variant. The second system was an Intel Core 2 Quad Xeon system running at
2.40 GHz -- no AVX, but SSSE3.

Since the output of tcrypt is a little awkward to read, I've condensed it
slightly to make it (hopefully) more readable. Please interpret the table as
follows: The triple in the first column is (byte blocks | bytes per update |
updates); c/B is cycles per byte.

Here are the numbers for the first system:

                        sha1-generic              sha1-ssse3 (AVX)
 (  16 |   16 |   1):    9.65 MiB/s, 266.2 c/B    12.93 MiB/s, 200.0 c/B
 (  64 |   16 |   4):   19.05 MiB/s, 140.2 c/B    25.27 MiB/s, 105.6 c/B
 (  64 |   64 |   1):   21.35 MiB/s, 119.2 c/B    29.29 MiB/s,  87.0 c/B
 ( 256 |   16 |  16):   28.81 MiB/s,  88.8 c/B    37.70 MiB/s,  68.4 c/B
 ( 256 |   64 |   4):   34.58 MiB/s,  74.0 c/B    47.16 MiB/s,  54.8 c/B
 ( 256 |  256 |   1):   37.44 MiB/s,  68.0 c/B    69.01 MiB/s,  36.8 c/B
 (1024 |   16 |  64):   33.55 MiB/s,  76.2 c/B    43.77 MiB/s,  59.0 c/B
 (1024 |  256 |   4):   45.12 MiB/s,  58.0 c/B    88.90 MiB/s,  28.8 c/B
 (1024 | 1024 |   1):   46.69 MiB/s,  54.0 c/B   104.39 MiB/s,  25.6 c/B
 (2048 |   16 | 128):   34.66 MiB/s,  74.0 c/B    44.93 MiB/s,  57.2 c/B
 (2048 |  256 |   8):   46.81 MiB/s,  54.0 c/B    93.83 MiB/s,  27.0 c/B
 (2048 | 1024 |   2):   48.28 MiB/s,  52.4 c/B   110.98 MiB/s,  23.0 c/B
 (2048 | 2048 |   1):   48.69 MiB/s,  52.0 c/B   114.26 MiB/s,  22.0 c/B
 (4096 |   16 | 256):   35.15 MiB/s,  72.6 c/B    45.53 MiB/s,  56.0 c/B
 (4096 |  256 |  16):   47.69 MiB/s,  53.0 c/B    96.46 MiB/s,  26.0 c/B
 (4096 | 1024 |   4):   49.24 MiB/s,  51.0 c/B   114.36 MiB/s,  22.0 c/B
 (4096 | 4096 |   1):   49.77 MiB/s,  51.0 c/B   119.80 MiB/s,  21.0 c/B
 (8192 |   16 | 512):   35.46 MiB/s,  72.2 c/B    45.84 MiB/s,  55.8 c/B
 (8192 |  256 |  32):   48.15 MiB/s,  53.0 c/B    97.83 MiB/s,  26.0 c/B
 (8192 | 1024 |   8):   49.73 MiB/s,  51.0 c/B   116.35 MiB/s,  22.0 c/B
 (8192 | 4096 |   2):   50.10 MiB/s,  50.8 c/B   121.66 MiB/s,  21.0 c/B
 (8192 | 8192 |   1):   50.25 MiB/s,  50.8 c/B   121.87 MiB/s,  21.0 c/B

For the second system I got the following numbers:

                        sha1-generic              sha1-ssse3 (SSSE3)
 (  16 |   16 |   1):   27.23 MiB/s, 106.6 c/B    32.86 MiB/s,  73.8 c/B
 (  64 |   16 |   4):   51.67 MiB/s,  54.0 c/B    61.90 MiB/s,  37.8 c/B
 (  64 |   64 |   1):   62.44 MiB/s,  44.2 c/B    74.16 MiB/s,  31.6 c/B

Re: [PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-08-04 Thread Mathias Krause
On Thu, Aug 4, 2011 at 7:05 PM, Mathias Krause mini...@googlemail.com wrote:
 It does. Just have a look at how fpu_available() is implemented:

read: irq_fpu_usable()


[PATCH v2 2/2] crypto, x86: SSSE3 based SHA1 implementation for x86-64

2011-07-24 Thread Mathias Krause
This is an assembler implementation of the SHA1 algorithm using the
Supplemental SSE3 (SSSE3) instructions or, when available, the
Advanced Vector Extensions (AVX).

Testing with the tcrypt module shows the raw hash performance is up to
2.3 times faster than the C implementation, using 8k data blocks on a
Core 2 Duo T5500. For the smallest data set (16 bytes) it is still 25%
faster.

Since this implementation uses SSE/YMM registers it cannot safely be
used in every situation, e.g. while an IRQ interrupts a kernel thread.
The implementation falls back to the generic SHA1 variant, if using
the SSE/YMM registers is not possible.

With this algorithm I was able to increase the throughput of a single
IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using
the SSSE3 variant -- a speedup of +34.8%.

Saving and restoring SSE/YMM state might make the actual throughput
fluctuate when there are FPU intensive userland applications running.
For example, measuring the performance using iperf2 directly on the
machine under test gives wobbling numbers because iperf2 uses the FPU
for each packet to check if the reporting interval has expired (in the
above test I got min/max/avg: 402/484/464 MBit/s).

Using this algorithm on a IPsec gateway gives much more reasonable and
stable numbers, albeit not as high as in the directly connected case.
Here is the result from an RFC 2544 test run with an EXFO Packet Blazer
FTB-8510:

  frame size   sha1-generic    sha1-ssse3    delta
     64 byte    37.5 MBit/s    37.5 MBit/s    0.0%
    128 byte    56.3 MBit/s    62.5 MBit/s  +11.0%
    256 byte    87.5 MBit/s   100.0 MBit/s  +14.3%
    512 byte   131.3 MBit/s   150.0 MBit/s  +14.2%
   1024 byte   162.5 MBit/s   193.8 MBit/s  +19.3%
   1280 byte   175.0 MBit/s   212.5 MBit/s  +21.4%
   1420 byte   175.0 MBit/s   218.7 MBit/s  +25.0%
   1518 byte   150.0 MBit/s   181.2 MBit/s  +20.8%

The throughput for the largest frame size is lower than for the
previous size because the IP packets need to be fragmented in this
case to make their way through the IPsec tunnel.

Signed-off-by: Mathias Krause mini...@googlemail.com
Cc: Maxim Locktyukhin maxim.locktyuk...@intel.com
---
 arch/x86/crypto/Makefile  |8 +
 arch/x86/crypto/sha1_ssse3_asm.S  |  558 +
 arch/x86/crypto/sha1_ssse3_glue.c |  240 
 arch/x86/include/asm/cpufeature.h |3 +
 crypto/Kconfig|   10 +
 5 files changed, 819 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/crypto/sha1_ssse3_asm.S
 create mode 100644 arch/x86/crypto/sha1_ssse3_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index c04f1b7..57c7f7b 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -13,6 +13,7 @@ obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
 obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
 
 obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o
+obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
 
 aes-i586-y := aes-i586-asm_32.o aes_glue.o
 twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
@@ -25,3 +26,10 @@ salsa20-x86_64-y := salsa20-x86_64-asm_64.o salsa20_glue.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
 
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
+
+# enable AVX support only when $(AS) can actually assemble the instructions
+ifeq ($(call as-instr,vpxor %xmm0$(comma)%xmm1$(comma)%xmm2,yes,no),yes)
+AFLAGS_sha1_ssse3_asm.o += -DSHA1_ENABLE_AVX_SUPPORT
+CFLAGS_sha1_ssse3_glue.o += -DSHA1_ENABLE_AVX_SUPPORT
+endif
+sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
diff --git a/arch/x86/crypto/sha1_ssse3_asm.S b/arch/x86/crypto/sha1_ssse3_asm.S
new file mode 100644
index 000..b2c2f57
--- /dev/null
+++ b/arch/x86/crypto/sha1_ssse3_asm.S
@@ -0,0 +1,558 @@
+/*
+ * This is a SIMD SHA-1 implementation. It requires the Intel(R) Supplemental
+ * SSE3 instruction set extensions introduced in Intel Core Microarchitecture
+ * processors. CPUs supporting Intel(R) AVX extensions will get an additional
+ * boost.
+ *
+ * This work was inspired by the vectorized implementation of Dean Gaudet.
+ * Additional information on it can be found at:
+ *http://www.arctic.org/~dean/crypto/sha1.html
+ *
+ * It was improved upon with more efficient vectorization of the message
+ * scheduling. This implementation has also been optimized for all current and
+ * several future generations of Intel CPUs.
+ *
+ * See this article for more information about the implementation details:
+ *   
http://software.intel.com/en-us/articles/improving-the-performance-of-the-secure-hash-algorithm-1/
+ *
+ * Copyright (C) 2010, Intel Corp.
+ *   Authors: Maxim Locktyukhin maxim.locktyuk...@intel.com
+ *Ronen Zohar ronen.zo...@intel.com
+ *
+ * Converted to ATT syntax and adapted for inclusion in the Linux kernel:
+ *   Author: Mathias Krause mini...@googlemail.com
+ *
+ * This program is free