RE: [PATCH v7 0/8] Raid: enable talitos xor offload for improving performance

2012-08-15 Thread Liu Qiang-B32616
-----Original Message----- From: dan.j.willi...@gmail.com [mailto:dan.j.willi...@gmail.com] On Behalf Of Dan Williams Sent: Wednesday, August 15, 2012 4:02 AM To: Liu Qiang-B32616 Cc: dan.j.willi...@intel.com; vinod.k...@intel.com; a...@arndb.de; herb...@gondor.apana.org.au;

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Johannes Goetzfried <johannes.goetzfr...@informatik.stud.uni-erlangen.de>: This patch adds an x86_64/AVX assembler implementation of the Twofish block cipher. The implementation processes eight blocks in parallel (two 4-block chunk AVX operations). The table-lookups are done in

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Borislav Petkov
On Wed, Aug 15, 2012 at 11:42:16AM +0300, Jussi Kivilinna wrote: I started thinking about the performance on AMD Bulldozer. vmovq/vmovd/vpextr*/vpinsr* between FPU and general-purpose registers on AMD CPUs is a lot slower (latencies from 8 to 12 cycles) than on Intel Sandy Bridge (where

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Borislav Petkov <b...@alien8.de>: On Wed, Aug 15, 2012 at 11:42:16AM +0300, Jussi Kivilinna wrote: I started thinking about the performance on AMD Bulldozer. vmovq/vmovd/vpextr*/vpinsr* between FPU and general-purpose registers on AMD CPUs is a lot slower (latencies from 8 to 12 cycles)

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Borislav Petkov
Ok, here we go. Raw data below. On Wed, Aug 15, 2012 at 02:00:16PM +0300, Jussi Kivilinna wrote: And if you tell me exactly how to run the tests and on what kernel, I'll try to do so. Ok, the box is a single-socket Bulldozer: AMD FX(tm)-8100 Eight-Core Processor stepping 02; kernel is

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Borislav Petkov <b...@alien8.de>: Ok, here we go. Raw data below. Thanks a lot! Twofish-avx appears somewhat slower than 3way, from ~9% slower with 256-byte blocks to ~3% slower with 8kb blocks. [snip] Let me know if you need more tests. I posted a patch that optimizes twofish-avx a few

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Borislav Petkov
On Wed, Aug 15, 2012 at 04:48:54PM +0300, Jussi Kivilinna wrote: I posted a patch that optimizes twofish-avx a few weeks ago: http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2 I'd be interested to know if this patch helps on Bulldozer. Sure, can you inline it here too, please? The

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
On Wed, Aug 15, 2012 at 04:48:54PM +0300, Jussi Kivilinna wrote: I posted a patch that optimizes twofish-avx a few weeks ago: http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2 I'd be interested to know if this patch helps on Bulldozer. Sure, can you inline it here too, please?

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Borislav Petkov
On Wed, Aug 15, 2012 at 05:22:03PM +0300, Jussi Kivilinna wrote: The patch replaces 'movb' instructions with 'movzbl' to break false register dependencies and interleaves instructions better for out-of-order scheduling. It also moves the common round code to a separate function to reduce object size.

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Borislav Petkov <b...@alien8.de>: On Wed, Aug 15, 2012 at 05:22:03PM +0300, Jussi Kivilinna wrote: The patch replaces 'movb' instructions with 'movzbl' to break false register dependencies and interleaves instructions better for out-of-order scheduling. It also moves the common round code to