Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-03-22 Thread Ben RUBSON
> On 01 Mar 2017, at 21:57, Conrad Meyer  wrote:
> 
> Hi Bruce,
> 
> On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
> a little worse performance with this patch.

Hi Bruce & Conrad,

I gave both patches a try.
It's a real use case: iSCSI throughput.
Both target and initiator running FreeBSD 11, both patched.

I read 10G from the same raw SSD device 100 times.
Results are in MB/s.
(top10 is the average of the 10 highest values)
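
(Concretely, each pass is something along these lines, with a hypothetical
device name for the iSCSI-backed disk:

    dd if=/dev/da0 of=/dev/null bs=1m count=10240

run 100 times.)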

# Without CRC32:
average : 300.3
std dev :   3.9
top10   : 308.6
max : 311.4

# With software CRC32:
average : 197.6
std dev :   6.5
top10   : 204.2
max : 204.9

# With Conrad's CRC32:
average : 282.4
std dev :   9.1
top10   : 292.9
max : 293.3

# With Bruce's CRC32:
average : 278.4
std dev :   7.8
top10   : 290.8
max : 292.5

The difference between the two patches is, let's say, almost insignificant.
CPU on both sides: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

Ben


Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-03-02 Thread Bruce Evans

On Wed, 1 Mar 2017, Conrad Meyer wrote:


On Wed, Mar 1, 2017 at 9:27 PM, Bruce Evans  wrote:

On Wed, 1 Mar 2017, Conrad Meyer wrote:


On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
a little worse performance with this patch.  Please excuse the ugly
graphs, I don't have a better graphing tool set up at this time:

https://people.freebsd.org/~cem/crc32/sse42_bde.png
https://people.freebsd.org/~cem/crc32/sse42_bde_log.png


Try doubling the loop sizes.  There shouldn't be any significant difference
above size 3*LONG unless LONG is too small.  Apparently it is too small for
older CPUs.

I now have a Sandybridge i5-2xxx laptop to test on, but don't have it set
up for much yet.


Doubling the loop sizes seems to make it slightly worse, actually:

https://people.freebsd.org/~cem/crc32/sse42_bde2.png
https://people.freebsd.org/~cem/crc32/sse42_bde_log2.png

I haven't made any attempt to inspect the generated assembly.  This is
Clang 3.9.1 with -O2.


I tested on Sandybridge (i5-2540M) and get exactly the opposite results
with clang-3.9.0.  It is much slower with intrinsics.  Much slower than
gcc-4.2.1.  Perhaps a bug in one of the test programs (mine is enclosed).
Minimum times with low variance (+-10 msec) for "./z2 size 10" (100G total)
in seconds on an idle system:

 buf_size:       512    3*512  4096   3*4096
 --------------  -----  -----  -----  ------
 ./z2-bde-clang  10.57   8.36   6.85    6.58
 ./z2-bde-gcc    10.99   8.96   7.08    6.58
 ./z2-cur-clang  17.23  11.19   6.97    6.75

Oops, that was with MISALIGN = 1.  Also, I forgot to force alignment of buf,
but checked it was at 0x...40 in all cases.  Now with proper alignment:

 buf size:       512    3*512  4096   3*4096
 --------------  -----  -----  -----  ------
 ./z2-bde-clang   8.96   6.56   6.62    6.42
 ./z2-bde-gcc     8.81   6.51   6.63    6.30
 ./z2-cur-clang  14.70   6.22   6.66    6.13

The number of iterations is adjusted so that buf_size * num_iter = 100G.

This shows that clang-3.9.0 with intrinsics is doing lots of
rearrangement, which is very bad for the misaligned case, helps for
the multiple-of-3 cases (when the SHORT loop is null), and is otherwise
a small pessimization relative to no intrinsics, but still beats gcc,
which does almost no rearrangement.  (I mostly tested with gcc -O3 and
it seemed equally good then.)  The function doesn't use __predict_ugly(),
and clang apparently uses this to optimize the alignment code at
great cost to the main loops when the alignment code executes (perhaps
it removes the alignment code?).  clang also does poorly with buf_size
512 in the aligned case.

Indeed, gcc is much better with -O3 (other flags -static [-msse4 for
intrins]).  clang does excessive optimizations by default, and -O3
makes no difference for it:

 buf size:         512    3*512  4096   3*4096
 ----------------  -----  -----  -----  ------
 ./z2-bde-clangO3   8.96   6.56   6.62    6.42
 ./z2-bde-gccO3     8.95   6.06   6.11    5.80
 ./z2-cur-clangO3  14.70   6.22   6.66    6.13

So we seem to be mainly testing uninteresting compiler pessimizations.
Eventually compilers will understand the code better and not rearrange
it very much (except for the alignment part).

I did a quick test with LONG = SHORT = 128 and gcc -O2.  This was just
slower, even for the ideal loop size of 4096*3 (up from 6.30 to 6.67
seconds).  This change just removes the LONG loop after renaming the
SHORT loop to LONG.  gcc apparently thinks it understands this simpler
version, and pessimizes it.  While testing, I did notice a pessimization
that is not the compiler's fault: when the crc32 instructions are optimized
at the expense of the crc update at the end of the loop, the loop gets out
of sync with the update and the wrong thing can stall.  The code has
subtleties to try to prevent this, but compilers don't really understand
this.  Compiler membars to control the ordering precisely were just
pessimizations.

X #include <stddef.h>
X #include <stdint.h>
X #include <stdio.h>
X #include <stdlib.h>
X 
X #define MISALIGN	1

X #define SIZE  (1024 * 1024)
X 
X uint8_t buf[MISALIGN + SIZE];
X 
X uint32_t sse42_crc32c(uint32_t, const unsigned char *, unsigned);
X 
X int
X main(int argc, char **argv)
X {
X   size_t size;
X   uint32_t crc;
X   int i, j, limit, repeat;
X 
X 	size = argc == 1 ? SIZE : atoi(argv[1]);
X   limit = 10 * 1000 * 1000 * 1000L / size;
X   repeat = argc < 3 ? 10 : atoi(argv[2]);
X   for (i = 0; i < sizeof(buf); i++)
X   buf[i] = rand();
X   crc = 0;
X   for (j = 0; j < repeat; j++)
X   for (i = 0; i < limit; i++)
X   crc = sse42_crc32c(crc, &buf[MISALIGN], size);
X   printf("%#x\n", sse42_crc32c(0, &buf[MISALIGN], size));
X   return (crc == 0 ? 0 : 1);
X }

Loops like this are not very representative of normal use, but I don't
know a better way.
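
For reference, a plausible way to build and run this against the two
crc32_sse42.c variants (flags inferred from the -O2/-O3/-static/-msse4
notes above, plus -DUSERSPACE_TESTING for the #ifdef in the file; not
the exact commands used):

    cc -O2 -static -DUSERSPACE_TESTING -o z2-bde-clang z2.c crc32_sse42.c
    cc -O2 -static -msse4 -DUSERSPACE_TESTING -o z2-cur-clang z2.c crc32_sse42.c
    ./z2-bde-clang 4096 10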

Bruce

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-03-01 Thread Conrad Meyer
On Wed, Mar 1, 2017 at 9:27 PM, Bruce Evans  wrote:
> On Wed, 1 Mar 2017, Conrad Meyer wrote:
>
>> On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
>> a little worse performance with this patch.  Please excuse the ugly
>> graphs, I don't have a better graphing tool set up at this time:
>>
>> https://people.freebsd.org/~cem/crc32/sse42_bde.png
>> https://people.freebsd.org/~cem/crc32/sse42_bde_log.png
>
>
> Try doubling the loop sizes.  There shouldn't be any significant difference
> above size 3*LONG unless LONG is too small.  Apparently it is too small for
> older CPUs.
>
> I now have a Sandybridge i5-2xxx laptop to test on, but don't have it set
> up for much yet.
>
> Bruce


Hi Bruce,

Doubling the loop sizes seems to make it slightly worse, actually:

https://people.freebsd.org/~cem/crc32/sse42_bde2.png
https://people.freebsd.org/~cem/crc32/sse42_bde_log2.png

I haven't made any attempt to inspect the generated assembly.  This is
Clang 3.9.1 with -O2.

Best,
Conrad

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-03-01 Thread Ben RUBSON
> On 02 Mar 2017, at 06:27, Bruce Evans  wrote:
> 
> On Wed, 1 Mar 2017, Conrad Meyer wrote:
> 
>> On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
>> a little worse performance with this patch.  Please excuse the ugly
>> graphs, I don't have a better graphing tool set up at this time:
>> 
>> https://people.freebsd.org/~cem/crc32/sse42_bde.png
>> https://people.freebsd.org/~cem/crc32/sse42_bde_log.png
> 
> Try doubling the loop sizes.  There shouldn't be any significant difference
> above size 3*LONG unless LONG is too small.  Apparently it is too small for
> older CPUs.

Many thanks Bruce & Conrad for your work on this!
My two cents: big-chunk speed sounds (really) important, as it will impact
data-digest-enabled iSCSI throughput.
If needed, I could help with some benchmarks on my servers:
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Thank you!


Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-03-01 Thread Bruce Evans

On Wed, 1 Mar 2017, Conrad Meyer wrote:


On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
a little worse performance with this patch.  Please excuse the ugly
graphs, I don't have a better graphing tool set up at this time:

https://people.freebsd.org/~cem/crc32/sse42_bde.png
https://people.freebsd.org/~cem/crc32/sse42_bde_log.png


Try doubling the loop sizes.  There shouldn't be any significant difference
above size 3*LONG unless LONG is too small.  Apparently it is too small for
older CPUs.

I now have a Sandybridge i5-2xxx laptop to test on, but don't have it set
up for much yet.

Bruce

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-03-01 Thread Conrad Meyer
Hi Bruce,

On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
a little worse performance with this patch.  Please excuse the ugly
graphs, I don't have a better graphing tool set up at this time:

https://people.freebsd.org/~cem/crc32/sse42_bde.png
https://people.freebsd.org/~cem/crc32/sse42_bde_log.png

Best,
Conrad


On Mon, Feb 27, 2017 at 6:27 PM, Bruce Evans  wrote:
> On Mon, 27 Feb 2017, Conrad Meyer wrote:
>
>> On Thu, Feb 2, 2017 at 12:29 PM, Bruce Evans  wrote:
>>>
>>> I've almost finished fixing and optimizing this.  I didn't manage to fix
>>> all the compiler pessimizations, but the result is within 5% of optimal
>>> for buffers larger than a few K.
>>
>>
>> Did you ever get to a final patch that you are satisfied with?  It
>> would be good to get this improvement into the tree.
>
>
> I'm happy with this version (attached and partly enclosed).  You need to
test it in the kernel and commit it (I only did simple correctness tests
> in userland).
>
> X Index: conf/files.amd64
> X ===
> X --- conf/files.amd64  (revision 314363)
> X +++ conf/files.amd64  (working copy)
> X @@ -545,6 +545,9 @@
> X  isa/vga_isa.c		optional	vga
> X  kern/kern_clocksource.c	standard
> X  kern/link_elf_obj.c	standard
> X +libkern/x86/crc32_sse42.c	standard
> X +libkern/memmove.c	standard
> X +libkern/memset.c	standard
>
> Also fix some nearby disorder.
>
> X ...
> X Index: libkern/x86/crc32_sse42.c
> X ===
> X --- libkern/x86/crc32_sse42.c (revision 314363)
> X +++ libkern/x86/crc32_sse42.c (working copy)
> X @@ -31,15 +31,41 @@
> X   */
> X  #ifdef USERSPACE_TESTING
> X  #include <stdint.h>
> X +#include <stddef.h>
> X  #else
> X  #include <sys/param.h>
> X +#include <sys/systm.h>
> X  #include <sys/kernel.h>
> X -#include <sys/libkern.h>
> X -#include <sys/systm.h>
> X  #endif
>
> Also fix minor #include errors.
>
> X 
> X -#include <nmmintrin.h>
> X +static __inline uint32_t
> X +_mm_crc32_u8(uint32_t x, uint8_t y)
> X +{
> X + /*
> X +  * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y)
> X +  * significantly and "r" (y) a lot by copying y to a different
> X +  * local variable (on the stack or in a register), so only use
> X +  * the latter.  This costs a register and an instruction but
> X +  * not a uop.
> X +  */
> X + __asm("crc32b %1,%0" : "+r" (x) : "r" (y));
> X + return (x);
> X +}
>
> Using intrinsics avoids the silly copying via the stack, and allows more
> unrolling.  Old gcc does more unrolling with just asms.  Unrolling is
> almost useless (some details below).
>
> X @@ -47,12 +73,14 @@
> X   * Block sizes for three-way parallel crc computation.  LONG and SHORT
> must
> X   * both be powers of two.
> X   */
> X -#define LONG 8192
> X -#define SHORT256
> X +#define LONG 128
> X +#define SHORT64
>
> These are aggressively low.
>
> Note that small buffers aren't handled very well.  SHORT = 64 means that
> a buffer of size 3 * 64 = 192 is handled entirely by the "SHORT" loop.
> 192 is not very small, but any smaller than that and overheads for
> adjustment at the end of the loop are too large for the "SHORT" loop
> to be worth doing.  Almost any value of LONG larger than 128 works OK
> now, but if LONG is large then it gives too much work for the "SHORT"
> loop, since normal buffer sizes are not a multiple of 3.  E.g., with
> the old LONG and SHORT, a buffer of size 128K was decomposed as 5 * 24K
> (done almost optimally by the "LONG" loop) + 10 * 768 (done a bit less
> optimally by the "SHORT" loop) + 64 * 8 (done pessimally).
>
> I didn't get around to ifdefing this for i386.  On i386, the loops take
> twice as many crc32 instructions for a given byte count, so the timing
> is satisfied by a byte count half as large, so SHORT and LONG can be
> reduced by a factor of 2 to give faster handling for small buffers without
> affecting the speed for large buffers significantly.
>
> X X  /* Tables for hardware crc that shift a crc by LONG and SHORT zeros. */
> X  static uint32_t crc32c_long[4][256];
> X +static uint32_t crc32c_2long[4][256];
> X  static uint32_t crc32c_short[4][256];
> X +static uint32_t crc32c_2short[4][256];
>
> I didn't get around to updating the comment.  2long shifts by 2*LONG zeros,
> etc.
>
> Shifts by 3N are done by adding shifts by 1N and 2N in parallel.  I couldn't
> get the direct 3N shift to run any faster.
>
> X @@ -190,7 +220,11 @@
> X   const size_t align = 4;
> X  #endif
> X   const unsigned char *next, *end;
> X - uint64_t crc0, crc1, crc2;  /* need to be 64 bits for crc32q */
> X +#ifdef __amd64__
> X + uint64_t crc0, crc1, crc2;
> X +#else
> X + uint32_t crc0, crc1, crc2;
> X +#endif
> X X next = buf;
> X   crc0 = crc;
>
> 64 bits of course isn't needed for i386.  It isn't needed for amd64 either.
> I think the 

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-02-27 Thread Bruce Evans

On Mon, 27 Feb 2017, Conrad Meyer wrote:


On Thu, Feb 2, 2017 at 12:29 PM, Bruce Evans  wrote:

I've almost finished fixing and optimizing this.  I didn't manage to fix
all the compiler pessimizations, but the result is within 5% of optimal
for buffers larger than a few K.


Did you ever get to a final patch that you are satisfied with?  It
would be good to get this improvement into the tree.


I'm happy with this version (attached and partly enclosed).  You need to
test it in the kernel and commit it (I only did simple correctness tests
in userland).

X Index: conf/files.amd64
X ===
X --- conf/files.amd64  (revision 314363)
X +++ conf/files.amd64  (working copy)
X @@ -545,6 +545,9 @@
X  isa/vga_isa.c		optional	vga
X  kern/kern_clocksource.c	standard
X  kern/link_elf_obj.c	standard
X +libkern/x86/crc32_sse42.c	standard
X +libkern/memmove.c	standard
X +libkern/memset.c	standard

Also fix some nearby disorder.

X ...
X Index: libkern/x86/crc32_sse42.c
X ===
X --- libkern/x86/crc32_sse42.c (revision 314363)
X +++ libkern/x86/crc32_sse42.c (working copy)
X @@ -31,15 +31,41 @@
X   */
X  #ifdef USERSPACE_TESTING
X  #include <stdint.h>
X +#include <stddef.h>
X  #else
X  #include <sys/param.h>
X +#include <sys/systm.h>
X  #include <sys/kernel.h>
X -#include <sys/libkern.h>
X -#include <sys/systm.h>
X  #endif

Also fix minor #include errors.

X 
X -#include <nmmintrin.h>

X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X + /*
X +  * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y)
X +  * significantly and "r" (y) a lot by copying y to a different
X +  * local variable (on the stack or in a register), so only use
X +  * the latter.  This costs a register and an instruction but
X +  * not a uop.
X +  */
X + __asm("crc32b %1,%0" : "+r" (x) : "r" (y));
X + return (x);
X +}

Using intrinsics avoids the silly copying via the stack, and allows more
unrolling.  Old gcc does more unrolling with just asms.  Unrolling is
almost useless (some details below).

X @@ -47,12 +73,14 @@
X   * Block sizes for three-way parallel crc computation.  LONG and SHORT must
X   * both be powers of two.
X   */
X -#define LONG 8192
X -#define SHORT256
X +#define LONG 128
X +#define SHORT64

These are aggressively low.

Note that small buffers aren't handled very well.  SHORT = 64 means that
a buffer of size 3 * 64 = 192 is handled entirely by the "SHORT" loop.
192 is not very small, but any smaller than that and overheads for
adjustment at the end of the loop are too large for the "SHORT" loop
to be worth doing.  Almost any value of LONG larger than 128 works OK
now, but if LONG is large then it gives too much work for the "SHORT"
loop, since normal buffer sizes are not a multiple of 3.  E.g., with
the old LONG and SHORT, a buffer of size 128K was decomposed as 5 * 24K
(done almost optimally by the "LONG" loop) + 10 * 768 (done a bit less
optimally by the "SHORT" loop) + 64 * 8 (done pessimally).
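
In outline, the decomposition works like this (a simplified sketch of
sse42_crc32c(), with the alignment prologue and the crc bookkeeping
omitted; not the literal code):

	while (len >= 3 * LONG) {
		/* 3 independent crc32 chains over LONG-byte blocks; combine. */
		len -= 3 * LONG;
	}
	while (len >= 3 * SHORT) {
		/* Same, over SHORT-byte blocks. */
		len -= 3 * SHORT;
	}
	while (len >= align) {
		/* Single-chain tail, 8 (amd64) or 4 (i386) bytes at a time. */
		len -= align;
	}
	while (len != 0) {
		/* Byte at a time: the part done pessimally. */
		len--;
	}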

I didn't get around to ifdefing this for i386.  On i386, the loops take
twice as many crc32 instructions for a given byte count, so the timing
is satisfied by a byte count half as large, so SHORT and LONG can be
reduced by a factor of 2 to give faster handling for small buffers without
affecting the speed for large buffers significantly.

X 
X  /* Tables for hardware crc that shift a crc by LONG and SHORT zeros. */

X  static uint32_t crc32c_long[4][256];
X +static uint32_t crc32c_2long[4][256];
X  static uint32_t crc32c_short[4][256];
X +static uint32_t crc32c_2short[4][256];

I didn't get around to updating the comment.  2long shifts by 2*LONG zeros,
etc.

Shifts by 3N are done by adding shifts by 1N and 2N in parallel.  I couldn't
get the direct 3N shift to run any faster.
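
Presumably the recombination at the end of a 3*LONG block looks something
like the following, using the crc32c_shift() helper and the tables above
(the exact expression is assumed, not quoted from the patch).  crc0 covers
the first LONG-byte block, crc1 the second and crc2 the third, so crc0
must be shifted past 2*LONG bytes and crc1 past LONG bytes before xoring:

	crc0 = crc32c_shift(crc32c_2long, crc0) ^
	    crc32c_shift(crc32c_long, crc1) ^ crc2;

The two table lookups are independent, so they can run in parallel,
instead of applying the shift-by-LONG table to crc0 twice in sequence.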

X @@ -190,7 +220,11 @@
X   const size_t align = 4;
X  #endif
X   const unsigned char *next, *end;
X - uint64_t crc0, crc1, crc2;  /* need to be 64 bits for crc32q */
X +#ifdef __amd64__
X + uint64_t crc0, crc1, crc2;
X +#else
X + uint32_t crc0, crc1, crc2;
X +#endif
X 
X  	next = buf;

X   crc0 = crc;

64 bits of course isn't needed for i386.  It isn't needed for amd64 either.
I think the crc32 instruction zeros the top 32 bits so they can be ignored.
However, when I modified the asm to return 32 bits to tell the compiler about
this (which the intrinsic wouldn't be able to do) and used 32 bits here,
this was just slightly slower.
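
A hedged sketch of that 32-bit-return variant (not the committed code):
the asm still executes a 64-bit crc32q, but hands the compiler back a
uint32_t so it can see that the top 32 bits are zero:

	static __inline uint32_t
	_mm_crc32_u64_32(uint32_t x, uint64_t y)
	{
		uint64_t t = x;

		__asm("crc32q %1,%0" : "+r" (t) : "r" (y));
		return ((uint32_t)t);
	}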

For some intermediate crc calculations, only 32 bits are used, and the
compiler can see this.  clang on amd64 optimizes this better than gcc,
starting with all the intermediate crc variables declared as 64 bits.
gcc generates worst code when some of the intermediates are declared as
32 bits.  So keep using 64 bits on amd64 here.

X @@ -202,6 +236,7 @@
X   len--;
X   }
X 
X +#if LONG > 

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-02-27 Thread Conrad Meyer
On Thu, Feb 2, 2017 at 12:29 PM, Bruce Evans  wrote:
> I've almost finished fixing and optimizing this.  I didn't manage to fix
> all the compiler pessimizations, but the result is within 5% of optimal
> for buffers larger than a few K.

Hi Bruce,

Did you ever get to a final patch that you are satisfied with?  It
would be good to get this improvement into the tree.

Thanks,
Conrad


Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-02-23 Thread Ben RUBSON
Hi guys, Conrad, Bruce,

May I ask for some news regarding this, please?

More than 3 weeks now running Conrad's commit on 2 CRC32C-digest-enabled iSCSI
initiators/targets without issue :)
Thank you very much again for this!

Shall we then think about "fixing" the last one or two remaining things before
being able to MFC it, please?

Would really be glad to see this nice kernel feature in 11.1.
Perhaps it could also be backported to 11.0pX as a nice/huge & non-risky
performance improvement.

Thank you very much for your support & cooperation!

Best regards,

Ben



Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-02-02 Thread Bruce Evans

On Thu, 2 Feb 2017, Konstantin Belousov wrote:


On Tue, Jan 31, 2017 at 03:26:32AM +, Conrad E. Meyer wrote:

+	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \


BTW, new gcc has -mcrc32 option, but clang 3.9.1 apparently does not.


I've almost finished fixing and optimizing this.  I didn't manage to fix
all the compiler pessimizations, but the result is within 5% of optimal
for buffers larger than a few K.

See the previous patch for -msse4 removal.  It might be good to add
-fno-unroll-functions to CFLAGS to prevent some clang pessimizations.
Or -Os might work even better.

X Index: libkern/x86/crc32_sse42.c
X ===
X --- libkern/x86/crc32_sse42.c (revision 313090)
X +++ libkern/x86/crc32_sse42.c (working copy)
X @@ -31,15 +31,41 @@
X   */
X  #ifdef USERSPACE_TESTING
X  #include <stdint.h>
X +#include <stddef.h>
X  #else
X  #include <sys/param.h>
X +#include <sys/systm.h>
X  #include <sys/kernel.h>
X -#include <sys/libkern.h>
X -#include <sys/systm.h>
X  #endif
X 
X -#include <nmmintrin.h>

X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X + /*
X +  * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y)
X +  * significantly and "r" (y) a lot by copying y to a different
X +  * local variable (on the stack or in a register), so only use
X +  * the latter.  This costs a register and an instruction but
X +  * not a uop.
X +  */
X + __asm("crc32b %1,%0" : "+r" (x) : "r" (y));
X + return (x);
X +}

I haven't found a better fix.  Even gcc likes to pessimize this by
preferring the "r" constraint in "rm", so "r" should be disparaged.
It does this to try to schedule loop control instructions in between
the memory reference and the operation using it, but it gets this
scheduling mostly wrong.  -Os stops this for gcc, but clang keeps
producing the large and slow explicit load (then a store to a different
place in memory to use there if the constraint allows "m") even with -Os.

X 
X +static __inline uint32_t

X +_mm_crc32_u32(uint32_t x, uint32_t y)
X +{
X + __asm("crc32l %1,%0" : "+r" (x) : "r" (y));
X + return (x);
X +}
X +
X +static __inline uint64_t
X +_mm_crc32_u64(uint64_t x, uint64_t y)
X +{
X + __asm("crc32q %1,%0" : "+r" (x) : "r" (y));
X + return (x);
X +}
X +
X  /* CRC-32C (iSCSI) polynomial in reversed bit order. */
X  #define POLY 0x82f63b78
X 
X @@ -47,12 +73,14 @@

X   * Block sizes for three-way parallel crc computation.  LONG and SHORT must
X   * both be powers of two.
X   */
X -#define LONG 8192
X -#define SHORT256
X +#define LONG 128
X +#define SHORT64

These are now the smallest that work right on Haswell.  Practical values
should be larger in case the timing is too different for the manual
scheduling to work.

Small buffers are still slow, so LONG = SHORT = 1024 (with no LONG loop)
would be a good practical value, except for the annoying problem that
the 3-way optimization doesn't work so well for normal power-of-2 buffer
sizes.  LONG = SHORT = 1024 is perfect for buffer sizes of a multiple
of 3K.  It does 4K as 3K + 1K, with the 3K part optimized and the 1K
part unoptimized and taking as long as the 3K part.  LONG = 1024 with
SHORT = 64 does the residual 1K at almost full speed except for another
residual of 192.

X 
X  /* Tables for hardware crc that shift a crc by LONG and SHORT zeros. */

X  static uint32_t crc32c_long[4][256];
X +static uint32_t crc32c_2long[4][256];
X  static uint32_t crc32c_short[4][256];
X +static uint32_t crc32c_2short[4][256];
X 
X  /*

X   * Multiply a matrix times a vector over the Galois field of two elements,
X @@ -171,7 +199,9 @@
X  crc32c_init_hw(void)
X  {
X   crc32c_zeros(crc32c_long, LONG);
X + crc32c_zeros(crc32c_2long, 2 * LONG);
X   crc32c_zeros(crc32c_short, SHORT);
X + crc32c_zeros(crc32c_2short, 2 * SHORT);
X  }
X  #ifdef _KERNEL
X  SYSINIT(crc32c_sse42, SI_SUB_LOCK, SI_ORDER_ANY, crc32c_init_hw, NULL);
X @@ -190,7 +220,11 @@
X   const size_t align = 4;
X  #endif
X   const unsigned char *next, *end;
X - uint64_t crc0, crc1, crc2;  /* need to be 64 bits for crc32q */
X +#ifdef __amd64__
X + uint64_t crc0, crc1, crc2;
X +#else
X + uint32_t crc0, crc1, crc2;
X +#endif
X 
X  	next = buf;

X   crc0 = crc;

The type system doesn't work well here.  The crc32* instruction returns
a 32-bit value in a register of unclear size.  I think it is zero extended
to full width, but my asms aren't written quite right to say this and neither
are the intrinsics.  The 64-bit intrinsic returns a 64-bit value which just
gets in the way.  The above has to declare the type as uint64_t for amd64,
else pessimizations result.  That gives smaller pessimizations, mainly with
gcc, for expressions like (crc0 >> 24) where the 64-bit declaration doesn't
make it clear that the top bits are zero, so extra zeroing operations may
be necessary.

I have barely tested the 32-bit case, but the 64-bit type is clearly wasted
then (but easier to 

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-02-02 Thread Konstantin Belousov
On Tue, Jan 31, 2017 at 03:26:32AM +, Conrad E. Meyer wrote:
> +	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \

BTW, new gcc has -mcrc32 option, but clang 3.9.1 apparently does not.


Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-02-01 Thread Bruce Evans

On Tue, 31 Jan 2017, Conrad Meyer wrote:


On Tue, Jan 31, 2017 at 7:16 PM, Bruce Evans  wrote:

Another reply to this...

On Tue, 31 Jan 2017, Conrad Meyer wrote:


On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans  wrote:


On Tue, 31 Jan 2017, Bruce Evans wrote:
I think there should be no alignment on entry -- just assume the buffer is
aligned in the usual case, and only run 4% slower when it is misaligned.


Please write such a patch and demonstrate the improvement.


It is easy to demonstrate.  I just put #if 0 around the early alignment
code.  The result seem too good to be true, so maybe I missed some
later dependency on alignment of the addresses:
- for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
  alignment and 1.02 seconds without alignment.
This actually makes sense, 128 bytes can be done with 16 8-byte unaligned
crc32q's.  The alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte.
7 more 3-cycle instructions and overhead too is far more than the cost
of letting the CPU do read-combining.
- for 4096-byte buffers, the difference is insignificant (0.47 seconds for
  10g).


I believe it, especially for newer amd64.  I seem to recall that older
x86 machines had a higher misalignment penalty, but it was largely
reduced in (?)Nehalem.  Why don't you go ahead and commit that change?


Needs more work.

Here are fairly clean patches for the unportabilities.  I can't really
test these in the kernel.  Merge the asms with yours if yours are better
in parts.

X Index: conf/files.amd64
X ===
X --- conf/files.amd64  (revision 313007)
X +++ conf/files.amd64  (working copy)
X @@ -536,6 +536,9 @@
X  isa/vga_isa.c		optional	vga
X  kern/kern_clocksource.c	standard
X  kern/link_elf_obj.c	standard
X +libkern/x86/crc32_sse42.c	standard
X +libkern/memmove.c	standard
X +libkern/memset.c	standard
X  #
X  # IA32 binary support
X  #
X @@ -593,14 +596,6 @@
X  compat/ndis/subr_usbd.c	optional	ndisapi pci
X  compat/ndis/winx64_wrap.S	optional	ndisapi pci
X  #
X -crc32_sse42.o		standard			\
X -	dependency	"$S/libkern/x86/crc32_sse42.c"	\
X -	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
X -	no-implicit-rule					\
X -	clean		"crc32_sse42.o"
X -libkern/memmove.c	standard
X -libkern/memset.c	standard
X -#
X  # x86 real mode BIOS emulator, required by dpms/pci/vesa
X  #
X  compat/x86bios/x86bios.c	optional x86bios | dpms | pci | vesa

This also fixes unsorting by crc32_sse42.o and nearby unsorting by all other
libkern lines (just 2) in files.amd64.  This file is still grossly unsorted
near the end.

X Index: conf/files.i386
X ===
X --- conf/files.i386   (revision 313007)
X +++ conf/files.i386   (working copy)
X @@ -554,11 +554,6 @@
X  kern/imgact_aout.c	optional compat_aout
X  kern/imgact_gzip.c	optional gzip
X  kern/subr_sfbuf.c	standard
X -crc32_sse42.o		standard			\
X -	dependency	"$S/libkern/x86/crc32_sse42.c"	\
X -	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
X -	no-implicit-rule					\
X -	clean		"crc32_sse42.o"
X  libkern/divdi3.c	standard
X  libkern/ffsll.c	standard
X  libkern/flsll.c	standard
X @@ -569,6 +564,7 @@
X  libkern/ucmpdi2.c	standard
X  libkern/udivdi3.c	standard
X  libkern/umoddi3.c	standard
X +libkern/x86/crc32_sse42.c	standard
X  i386/xbox/xbox.c	optional xbox
X  i386/xbox/xboxfb.c	optional xboxfb
X  dev/fb/boot_font.c	optional xboxfb

This also fixes unsorting by crc32_sse42.o.  files.i386 is not as unsorted
as files.amd64.

The rules for sorting generated files are unclear.  They are not in source
directories so are not naturally grouped.  Old ones are mostly placed near
the beginning.  Some are sorted on a pathname of the source file(s) in
scattered places in the make rules.  crc32_sse42.o is partly like that.
It was sorted with libkern sources but not by the x86 part.

X Index: libkern/x86/crc32_sse42.c
X ===
X --- libkern/x86/crc32_sse42.c (revision 313007)
X +++ libkern/x86/crc32_sse42.c (working copy)
X @@ -31,15 +31,41 @@
X   */
X  #ifdef USERSPACE_TESTING
X  #include <stdint.h>
X +#include <stddef.h>

Fix dependency on namespace pollution in <stdint.h> for at least size_t.

X  #else
X  #include <sys/param.h>
X +#include <sys/systm.h>
X  #include <sys/kernel.h>

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-31 Thread Conrad Meyer
On Tue, Jan 31, 2017 at 7:16 PM, Bruce Evans  wrote:
> Another reply to this...
>
> On Tue, 31 Jan 2017, Conrad Meyer wrote:
>
>> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans  wrote:
>>>
>>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>>> I think there should be no alignment on entry -- just assume the buffer is
>>> aligned in the usual case, and only run 4% slower when it is misaligned.
>>
>>
>> Please write such a patch and demonstrate the improvement.
>
>
> It is easy to demonstrate.  I just put #if 0 around the early alignment
> code.  The results seem too good to be true, so maybe I missed some
> later dependency on alignment of the addresses:
> - for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
>   alignment and 1.02 seconds without alignment.
> This actually makes sense, 128 bytes can be done with 16 8-byte unaligned
> crc32q's.  The alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte.
> 7 more 3-cycle instructions and overhead too is far more than the cost
> of letting the CPU do read-combining.
> - for 4096-byte buffers, the difference is insignificant (0.47 seconds for
>   10g).

I believe it, especially for newer amd64.  I seem to recall that older
x86 machines had a higher misalignment penalty, but it was largely
reduced in (?)Nehalem.  Why don't you go ahead and commit that change?

>> perhaps we could just remove the 3x8k table — I'm not sure
>> it adds any benefit over the 3x256 table.
>
>
> Good idea, but the big table is useful.  Ifdefing out the LONG case reduces
> the speed for large buffers from ~0.35 seconds to ~0.43 seconds in the
> setup below.  Ifdefing out the SHORT case only reduces to ~0.39 seconds.

Interesting.

> I hoped that an even shorter SHORT case would work.  I think it now handles
> 768 bytes (3 * SHORT) in the inner loop.

Right.

> That is 32 sets of 3 crc32q's,
> and I would have thought that update at the end would take about as long
> as 1 iteration (3%), but it apparently takes 33%.

The update at the end may be faster with PCLMULQDQ.  There are
versions of this algorithm that use that in place of the lookup-table
combine (for example, Linux has a permissively licensed implementation
here: 
http://lxr.free-electrons.com/source/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
).

Unfortunately, PCLMULQDQ uses FPU state, which is inappropriate most
of the time in kernel mode.  It could be used opportunistically if the
thread is already in FPU-save mode or if the input is "big enough" to
make it worth it.
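
A hedged sketch of what "opportunistically" might look like with the
FreeBSD amd64 FPU-kernel API (fpu_kern_alloc_ctx() and friends are real;
clmul_crc32c() and the 768-byte threshold are hypothetical):

	#include <sys/param.h>
	#include <sys/proc.h>
	#include <machine/fpu.h>

	static uint32_t
	crc32c_maybe_clmul(uint32_t crc, const unsigned char *buf, unsigned len)
	{
		struct fpu_kern_ctx *ctx;

		if (len < 768 ||
		    (ctx = fpu_kern_alloc_ctx(FPU_KERN_NOWAIT)) == NULL)
			return (sse42_crc32c(crc, buf, len));
		fpu_kern_enter(curthread, ctx, FPU_KERN_NORMAL);
		crc = clmul_crc32c(crc, buf, len);	/* hypothetical */
		fpu_kern_leave(curthread, ctx);
		fpu_kern_free_ctx(ctx);
		return (crc);
	}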

>>> Your benchmarks mainly give results for the <= 768 bytes where most of
>>> the manual optimizations don't apply.
>>
>>
>> 0x000400: asm:68 intrins:62 multitable:684  (ns per buf)
>> 0x000800: asm:132 intrins:133  (ns per buf)
>> 0x002000: asm:449 intrins:446  (ns per buf)
>> 0x008000: asm:1501 intrins:1497  (ns per buf)
>> 0x02: asm:5618 intrins:5609  (ns per buf)
>>
>> (All routines are in a separate compilation unit with no full-program
>> optimization, as they are in the kernel.)
>
>
> These seem slow.  I modified my program to test the actual kernel code,
> and get for 10gB on freefall's Xeon (main times in seconds):

Freefall has a Haswell Xeon @ 3.3GHz.  My laptop is a Sandy Bridge
Core i5 @ 2.6 GHz.  That may help explain the difference.

Best,
Conrad

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-31 Thread Bruce Evans

Another reply to this...

On Tue, 31 Jan 2017, Conrad Meyer wrote:


On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans  wrote:

On Tue, 31 Jan 2017, Bruce Evans wrote:
Unrolling (or not) may be helpful or harmful for entry and exit code.


Helpful, per my earlier benchmarks.


I think there should be no alignment on entry -- just assume the buffer is
aligned in the usual case, and only run 4% slower when it is misaligned.


Please write such a patch and demonstrate the improvement.


It is easy to demonstrate.  I just put #if 0 around the early alignment
code.  The results seem too good to be true, so maybe I missed some
later dependency on alignment of the addresses:
- for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
  alignment and 1.02 seconds without alignment.
This actually makes sense, 128 bytes can be done with 16 8-byte unaligned
crc32q's.  The alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte.
7 more 3-cycle instructions and overhead too is far more than the cost
of letting the CPU do read-combining.
- for 4096-byte buffers, the difference is insignificant (0.47 seconds for
  10g).


I don't understand the algorithm for joining crcs -- why doesn't it work
to reduce to 12 or 24 bytes in the main loop?


It would, but I haven't implemented or tested that.  You're welcome to
do so and demonstrate an improvement.  It does add more lookup table
bloat, but perhaps we could just remove the 3x8k table — I'm not sure
it adds any benefit over the 3x256 table.


Good idea, but the big table is useful.  Ifdefing out the LONG case reduces
the speed for large buffers from ~0.35 seconds to ~0.43 seconds in the
setup below.  Ifdefing out the SHORT case only reduces to ~0.39 seconds.
I hoped that an even shorter SHORT case would work.  I think it now handles
768 bytes (3 * SHORT) in the inner loop.  That is 32 sets of 3 crc32q's,
and I would have thought that update at the end would take about as long
as 1 iteration (3%), but it apparently takes 33%.


...
Your benchmarks mainly give results for the <= 768 bytes where most of
the manual optimizations don't apply.


0x000400: asm:68 intrins:62 multitable:684  (ns per buf)
0x000800: asm:132 intrins:133  (ns per buf)
0x002000: asm:449 intrins:446  (ns per buf)
0x008000: asm:1501 intrins:1497  (ns per buf)
0x02: asm:5618 intrins:5609  (ns per buf)

(All routines are in a separate compilation unit with no full-program
optimization, as they are in the kernel.)


These seem slow.  I modified my program to test the actual kernel code,
and get for 10gB on freefall's Xeon (main times in seconds):

0x08: asm(rm):3.41 asm(r):3.07 intrins:6.01 gcc:3.74  (3S = 2.4ns/buf)
0x10: asm(rm):2.05 asm(r):1.70 intrins:2.92 gcc:2.62  (2S = 3.2ns/buf)
0x20: asm(rm):1.63 asm(r):1.58 intrins:1.62 gcc:1.61  (1.6S = 5.12ns/buf)
0x40: asm(rm):1.07 asm(r):1.11 intrins:1.06 gcc:1.14  (1.1S = 7.04ns/buf)
0x80: asm(rm):1.02 asm(r):1.04 intrins:1.03 gcc:1.04  (1.02S = 13.06ns/buf)
0x000100: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.08  (1.02S = 52.22ns/buf)
0x000200: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.02  (1.02S = 104.45ns/buf)
0x000400: asm(rm):0.58 asm(r):0.57 intrins:0.57 gcc:0.57  (.57S = 116.43ns/buf)
0x001000: asm(rm):0.62 asm(r):0.57 intrins:0.57 gcc:0.57  (.57S = 233.44ns/buf)
0x002000: asm(rm):0.48 asm(r):0.46 intrins:0.46 gcc:0.46  (.46S = 376.83ns/buf)
0x004000: asm(rm):0.49 asm(r):0.46 intrins:0.46 gcc:0.46  (.46S = 753.66ns/buf)
0x008000: asm(rm):0.49 asm(r):0.38 intrins:0.38 gcc:0.38  (.38S = 1245.18ns/buf)
0x01: asm(rm):0.47 asm(r):0.38 intrins:0.36 gcc:0.38  (.36S = 2359.30ns/buf)
0x02: asm(rm):0.43 asm(r):1.05 intrins:0.35 gcc:0.36  (.35S = 4587.52ns/buf)

asm(r) is a fix for clang's slowness with inline asms.  Just change the
constraint from "rm" to "r".  This takes an extra register, but no more
uops.

This is for the aligned case with no hacks.

intrins does something bad for small buffers.  Probably just the branch over
the dead unrolling.  Twice 2.4ns/buf for 8-byte buffers is still very fast.
This is 16 cycles.  3 cycles to do 1 crc32q and the rest mainly for 1 function
call and too many branches.

Bruce

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-31 Thread Bruce Evans

On Tue, 31 Jan 2017, Conrad Meyer wrote:


On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans  wrote:

On Tue, 31 Jan 2017, Bruce Evans wrote:
Unrolling (or not) may be helpful or harmful for entry and exit code.


Helpful, per my earlier benchmarks.


I think there should be no alignment on entry -- just assume the buffer is
aligned in the usual case, and only run 4% slower when it is misaligned.


Please write such a patch and demonstrate the improvement.


I now understand the algorithm.

The division into 3 has to keep sequentiality (since CRC input is a bit
string), so it doesn't work to separate the memory accesses of the 3 crc32's
by 8 bytes as in my simple test -- they have to be separated by large
amounts.  Then recombining requires multiplying the polynomials associated
with the CRCs of the higher 2 blocks by X^N and reducing again, where N is
large and related to the block size.  This is done using a large table for
each N needed, and to keep things reasonably simple, only 2 N's are used.


The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
bytes or up to 3 times that like simpler algorithms.  768 is quite
large, and the exit code is quite slow.  It reduces 8 or 4 bytes at a
time without any dependency reduction, and then 1 byte at a time.


Yes, this is the important loop to unroll for small inputs.  Somehow


Not like clang does it.  Unrolling is useless without the 3-way blocking.


with the unrolling, it is only ~19% slower than the by-3 algorithm on
my system — not 66%.  Clang 3.9.1 unrolls both of these trailing
loops; here is the first:


Maybe 3.9.1 pessimizes the 3-way loop by unrolling it.  This would be
fairly easy to do.  Just replicate the 3 crc32q's a few times, say 8,
and do them in a bad order (3 blocks of 8 dependent ones instead of
8 blocks of 3 independent ones).  With enough replication, the code
would be too large for the hardware to reorder.  Inline asm has
another advantage here -- volatile on it prevents reordering and
might prevent unrolling.

Maybe 3.9.1 unpessimizes the inline asms.

But I suspect not getting the 3 times speedup is for another reason.




   0x00401b88 <+584>:   cmp    $0x38,%rbx
   0x00401b8c <+588>:   jae    0x401b93
   0x00401b8e <+590>:   mov    %rsi,%rdx
   0x00401b91 <+593>:   jmp    0x401be1
   0x00401b93 <+595>:   lea    -0x1(%rdi),%rbx
   0x00401b97 <+599>:   sub    %rdx,%rbx
   0x00401b9a <+602>:   mov    %rsi,%rdx
   0x00401b9d <+605>:   nopl   (%rax)
   0x00401ba0 <+608>:   crc32q (%rdx),%rax
   0x00401ba6 <+614>:   crc32q 0x8(%rdx),%rax
   0x00401bad <+621>:   crc32q 0x10(%rdx),%rax
   0x00401bb4 <+628>:   crc32q 0x18(%rdx),%rax
   0x00401bbb <+635>:   crc32q 0x20(%rdx),%rax
   0x00401bc2 <+642>:   crc32q 0x28(%rdx),%rax
   0x00401bc9 <+649>:   crc32q 0x30(%rdx),%rax
   0x00401bd0 <+656>:   crc32q 0x38(%rdx),%rax
   0x00401bd7 <+663>:   add    $0x40,%rdx
   0x00401bdb <+667>:   add    $0x8,%rbx
   0x00401bdf <+671>:   jne    0x401ba0


No, this unrolling is useless.  The crc32q's are dependent on each other,
so they take 3 cycles each.  There are spare resources to run about 12
instructions during that time.  Loop control only takes 3.
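
To spell out the dependency difference (a sketch, not the kernel code;
p, pa, pb and pc are assumed pointers into the buffer):

	/* Naive unrolling: one chain, each crc32q waits for the previous. */
	for (i = 0; i < 3 * LONG / 8; i++)
		crc0 = _mm_crc32_u64(crc0, p[i]);

	/* By-3 blocking: three independent chains, ~1 crc32q per cycle. */
	for (i = 0; i < LONG / 8; i++) {
		crc0 = _mm_crc32_u64(crc0, pa[i]);
		crc1 = _mm_crc32_u64(crc1, pb[i]);
		crc2 = _mm_crc32_u64(crc2, pc[i]);
	}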


I don't understand the algorithm for joining crcs -- why doesn't it work
to reduce to 12 or 24 bytes in the main loop?


It would, but I haven't implemented or tested that.  You're welcome to
do so and demonstrate an improvement.  It does add more lookup table
bloat, but perhaps we could just remove the 3x8k table — I'm not sure
it adds any benefit over the 3x256 table.


Your benchmarks mainly give results for the <= 768 bytes where most of
the manual optimizations don't apply.


Actually, they test only the large buffer case.  They used buffer size
of 1M and 1k and didn't do the entry and exit code that usually dominates
for small buffers.

I re-tested with the correct blocking.  This was about 10% slower
(0.34 -> 0.37 seconds for 10GB), except for clang without intrinsics it
was 20% slower (0.43 -> 0.51 seconds).


0x000400: asm:68 intrins:62 multitable:684  (ns per buf)


I don't see any signs of this in my test:
- a single crc32q in a (C) loop doesn't benefit from unrolling or lose
  to the extra clang instructions without intrinsics.  clang-3.9.0
  unrolls this 8-way in the simpler environment of my test program,
  but this makes no difference.
- similarly for a single crc32b in a loop, except when I forgot to
  change the type of the crc accumulator from uint64_t to uint32_t,
  gcc was 1 cycle slower in the loop (3 instead of 4).  gcc generates
  an extra instruction to zero-extend the crc, and this is more
  expensive than usual since it gives gives another dependency.
  clang optimizes this away.


0x000800: asm:132 intrins:133  (ns per buf)
0x002000: asm:449 intrins:446  (ns 

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-31 Thread Conrad Meyer
On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans  wrote:
> On Tue, 31 Jan 2017, Bruce Evans wrote:
> Unrolling (or not) may be helpful or harmful for entry and exit code.

Helpful, per my earlier benchmarks.

> I think there should be no alignment on entry -- just assume the buffer is
> aligned in the usual case, and only run 4% slower when it is misaligned.

Please write such a patch and demonstrate the improvement.

> The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
> bytes or up to 3 times that like simpler algorithms.  768 is quite
> large, and the exit code is quite slow.  It reduces 8 or 4 bytes at a
> time without any dependency reduction, and then 1 byte at a time.

Yes, this is the important loop to unroll for small inputs.  Somehow
with the unrolling, it is only ~19% slower than the by-3 algorithm on
my system — not 66%.  Clang 3.9.1 unrolls both of these trailing
loops; here is the first:

   0x00401b88 <+584>:   cmp    $0x38,%rbx
   0x00401b8c <+588>:   jae    0x401b93
   0x00401b8e <+590>:   mov    %rsi,%rdx
   0x00401b91 <+593>:   jmp    0x401be1
   0x00401b93 <+595>:   lea    -0x1(%rdi),%rbx
   0x00401b97 <+599>:   sub    %rdx,%rbx
   0x00401b9a <+602>:   mov    %rsi,%rdx
   0x00401b9d <+605>:   nopl   (%rax)
   0x00401ba0 <+608>:   crc32q (%rdx),%rax
   0x00401ba6 <+614>:   crc32q 0x8(%rdx),%rax
   0x00401bad <+621>:   crc32q 0x10(%rdx),%rax
   0x00401bb4 <+628>:   crc32q 0x18(%rdx),%rax
   0x00401bbb <+635>:   crc32q 0x20(%rdx),%rax
   0x00401bc2 <+642>:   crc32q 0x28(%rdx),%rax
   0x00401bc9 <+649>:   crc32q 0x30(%rdx),%rax
   0x00401bd0 <+656>:   crc32q 0x38(%rdx),%rax
   0x00401bd7 <+663>:   add    $0x40,%rdx
   0x00401bdb <+667>:   add    $0x8,%rbx
   0x00401bdf <+671>:   jne    0x401ba0


> I don't understand the algorithm for joining crcs -- why doesn't it work
> to reduce to 12 or 24 bytes in the main loop?

It would, but I haven't implemented or tested that.  You're welcome to
do so and demonstrate an improvement.  It does add more lookup table
bloat, but perhaps we could just remove the 3x8k table — I'm not sure
it adds any benefit over the 3x256 table.

> Your benchmarks mainly give results for the <= 768 bytes where most of
> the manual optimizations don't apply.

0x000400: asm:68 intrins:62 multitable:684  (ns per buf)
0x000800: asm:132 intrins:133  (ns per buf)
0x002000: asm:449 intrins:446  (ns per buf)
0x008000: asm:1501 intrins:1497  (ns per buf)
0x02: asm:5618 intrins:5609  (ns per buf)

(All routines are in a separate compilation unit with no full-program
optimization, as they are in the kernel.)

> Compiler optimizations are more
> likely to help there.  So I looked more closely at the last 2 loops.
> clang indeed only unrolls the last one,

Not in 3.9.1.

> only for the unreachable case
> with more than 8 bytes on amd64.

How is it unreachable?

Best,
Conrad

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-31 Thread Bruce Evans

On Tue, 31 Jan 2017, Bruce Evans wrote:


On Mon, 30 Jan 2017, Conrad Meyer wrote:


On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans  wrote:

On Tue, 31 Jan 2017, Conrad E. Meyer wrote:


Log:
 calculate_crc32c: Add SSE4.2 implementation on x86


This breaks building with gcc-4.2.1,


gcc-4.2.1 is an ancient compiler.  Good riddance.


I prefer it.


It also works better on ordinary asms for crc32.


Added: head/sys/libkern/x86/crc32_sse42.c

==
--- /dev/null   00:00:00 1970   (empty, because file is newly added)
+++ head/sys/libkern/x86/crc32_sse42.c  Tue Jan 31 03:26:32 2017
(r313006)
+
+#include <nmmintrin.h>


...

Inline asm is much less unportable than intrinsics.  kib used the correct
method of .byte's in asms to avoid depending on assembler support for 
newer

instructions.  .byte is still used for clflush on amd64 and i386.  It
used to be used for invpcid on amd64.  I can't find where it is or was
used for xsave stuff.


Konstantin predicted this complaint in code review (phabricator).
Unfortunately, Clang does not automatically unroll asms, even with the
correct mnemonic.  Unrolling is essential to performance below the
by-3 block size (768 bytes in this implementation).  Hand unrolling in
C seems to generate less efficient assembly than the compiler's
unrolling.


Unrolling is almost completely useless if the instruction takes 3 cycles
like you said it does.  The loop calculations run much faster than that,
so they can run in parallel.  However, clang generates horrible code for
inline asms instead of intrinsics.


The unrolling was worse than useless (see previous reply).  It is the
horrible code in the non-intrinsics case that seems to be the main source
of differences (see below).


The left column below is block size.  The measurements are nanoseconds
per buf, per CLOCK_VIRTUAL, averaged over 10^5 loops.  These numbers
do not vary more than +/- 1ns run to run on my idle Sandy Bridge
laptop.  "asm" is using __asm__(), "intrins" using the _mm_crc32
intrinsics that Clang can unroll, and multitable is the older
lookup-table implementation (still used on other architectures).

0x10: asm:0 intrins:0 multitable:0  (ns per buf)
0x20: asm:7 intrins:9 multitable:78  (ns per buf)
0x40: asm:10 intrins:7 multitable:50  (ns per buf)
0x80: asm:15 intrins:9 multitable:91  (ns per buf)
0x000100: asm:25 intrins:17 multitable:178  (ns per buf)
0x000200: asm:55 intrins:38 multitable:347  (ns per buf)
0x000400: asm:61 intrins:62 multitable:684  (ns per buf)

Both implementations are superior to the multitable approach, so it is
unreasonable not to make one of them standard on x86 platforms.

The unrolled intrinsics are consistently better than not unrolled on
objects 0x40-0x200 bytes large.  At 0x400 bytes we pass the first
unroll-by-3 threshold and it stops mattering as much.

At 0x40 bytes, it is the difference between 6.4 GB/s and 9.1 GB/s.  At
0x200 bytes, it is the difference between 9.3 GB/s and 13.5 GB/s.  I
think this justifies some minor ugliness.


If it matters, which I doubt, then the compiler shouldn't be trusted
for unrolling.  It can only do better than handwritten unrolling if
it has tables for the correct amount unrolling for every instruction
for thousands of CPUs (since such tables are too hard to write manually).


It is the 3-way unrolling that makes the main difference.  clang is
clueless about unrolling in this function, but this 3-way unrolling is
in the C code.  It is to reduce dependencies.  The timing seems to be
as I suspected -- the crc32 instruction has a latency of 3 and a throughput
of 3 per 3 cycles, so with 3 dependent instructions it runs 3 times slower
than with 3 independent instructions.  Naive unrolling gives more dependent
instructions, so is equally slow to no unrolling.

Test program for this:

X #include <stdint.h>
X #include <stdio.h>
X 
X #ifndef __clang__


Change this to __clang__X to kill intrinsics for clang.

X static __inline uint32_t
X _mm_crc32_u8(uint32_t x, uint8_t y)
X {
X   __asm("crc32b %1,%0" : "+r" (x) : "rm" (y));
X   return (x);
X }
X 
X static __inline uint32_t __unused

X _mm_crc32_u32(uint32_t x, int32_t y)
X {
X   __asm("crc32l %1,%0" : "+r" (x) : "rm" (y));
X   return (x);
X }
X 
X #ifdef __amd64__

X static __inline uint64_t
X _mm_crc32_u64(uint64_t x, int64_t y)
X {
X   __asm("crc32q %1,%0" : "+r" (x) : "rm" (y));
X   return (x);
X }
X #endif
X #else
X #include <nmmintrin.h>
X #endif
X 
X #define MISALIGN	0


MISALIGN 1 costs about 4% (0.36 seconds instead of 0.34 seconds on
freefall's Xeon).

X #define SIZE  (1024 * 1024)

The large size of 1M is to bust at least the L1 cache.  This makes little
difference, since the speed is CPU-bound (0.34 seconds @ 3.3 GHz = ~34/33
cycles per crc32 instruction).  Actually, I'm not sure about the speeds.
freefall's machdep.tsc_freq is 3.3GHz, but the speed could be anything
due to turbo mode and frequency control 

Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-31 Thread Bruce Evans

On Mon, 30 Jan 2017, Conrad Meyer wrote:


On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans  wrote:

On Tue, 31 Jan 2017, Conrad E. Meyer wrote:


Log:
 calculate_crc32c: Add SSE4.2 implementation on x86


This breaks building with gcc-4.2.1,


gcc-4.2.1 is an ancient compiler.  Good riddance.


I prefer it.


Added: head/sys/libkern/x86/crc32_sse42.c

==
--- /dev/null   00:00:00 1970   (empty, because file is newly added)
+++ head/sys/libkern/x86/crc32_sse42.c  Tue Jan 31 03:26:32 2017
(r313006)
+
+#include 


...

Inline asm is much less unportable than intrinsics.  kib used the correct
method of .byte's in asms to avoid depending on assembler support for newer
instructions.  .byte is still used for clflush on amd64 and i386.  It
used to be used for invpcid on amd64.  I can't find where it is or was
used for xsave stuff.


Konstantin predicted this complaint in code review (phabricator).
Unfortunately, Clang does not automatically unroll asms, even with the
correct mnemonic.  Unrolling is essential to performance below the
by-3 block size (768 bytes in this implementation).  Hand unrolling in
C seems to generate less efficient assembly than the compiler's
unrolling.


Unrolling is almost completely useless if the instruction takes 3 cycles
like you said it does.  The loop calculations run much faster than that,
so they can run in parallel.  However, clang generates horrible code for
inline asms instead of intrinsics.

Simple asms can be compiled even by gcc (apparently old binutils supports
SSE4):

X Index: crc32_sse42.c
X ===
X --- crc32_sse42.c (revision 313007)
X +++ crc32_sse42.c (working copy)
X @@ -38,8 +38,29 @@
X  #include 
X  #endif
X 
X -#include <nmmintrin.h>

X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X + __asm("crc32b %1,%0" : "+r" (x) : "rm" (y));
X + return (x);
X +}
X 
X +static __inline uint32_t __unused

X +_mm_crc32_u32(uint32_t x, int32_t y)
X +{
X + __asm("crc32l %1,%0" : "+r" (x) : "rm" (y));
X + return (x);
X +}
X +
X +#ifdef __amd64__
X +static __inline uint64_t
X +_mm_crc32_u64(uint64_t x, int64_t y)
X +{
X + __asm("crc32q %1,%0" : "+r" (x) : "rm" (y));
X + return (x);
X +}
X +#endif
X +
X  /* CRC-32C (iSCSI) polynomial in reversed bit order. */
X  #define POLY 0x82f63b78

I only tested this at compile time.  clang generates horrible code.
Starting with y in an array in memory, it always copies y to the
stack and does a memory access there.  There is nothing special about
these inline asms, so I think clang does the same pessimizations for
most asms in cpufunc.h.  jkim got annoyed by this for something like
rdtsc() and changed too much to reduce the problem for just one case.
The differences in the generated code look like:

intrinsics (old <) vs inline asm (new >):
28,29c29,33
This Inner Loop Header: Depth=1


Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-30 Thread Conrad Meyer
Hi Bruce,

On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans  wrote:
> On Tue, 31 Jan 2017, Conrad E. Meyer wrote:
>
>> Log:
>>  calculate_crc32c: Add SSE4.2 implementation on x86
>
>
> This breaks building with gcc-4.2.1,

gcc-4.2.1 is an ancient compiler.  Good riddance.

>> Added: head/sys/libkern/x86/crc32_sse42.c
>>
>> ==
>> --- /dev/null   00:00:00 1970   (empty, because file is newly added)
>> +++ head/sys/libkern/x86/crc32_sse42.c  Tue Jan 31 03:26:32 2017
>> (r313006)
>> +
>> +#include <nmmintrin.h>
>
> ...
>
> Inline asm is much less unportable than intrinsics.  kib used the correct
> method of .byte's in asms to avoid depending on assembler support for newer
> instructions.  .byte is still used for clflush on amd64 and i386.  It
> used to be used for invpcid on amd64.  I can't find where it is or was
> used for xsave stuff.

Konstantin predicted this complaint in code review (phabricator).
Unfortunately, Clang does not automatically unroll asms, even with the
correct mnemonic.  Unrolling is essential to performance below the
by-3 block size (768 bytes in this implementation).  Hand unrolling in
C seems to generate less efficient assembly than the compiler's
unrolling.

The left column below is block size.  The measurements are nanoseconds
per buf, per CLOCK_VIRTUAL, averaged over 10^5 loops.  These numbers
do not vary more than +/- 1ns run to run on my idle Sandy Bridge
laptop.  "asm" is using __asm__(), "intrins" using the _mm_crc32
intrinsics that Clang can unroll, and multitable is the older
lookup-table implementation (still used on other architectures).

0x10: asm:0 intrins:0 multitable:0  (ns per buf)
0x20: asm:7 intrins:9 multitable:78  (ns per buf)
0x40: asm:10 intrins:7 multitable:50  (ns per buf)
0x80: asm:15 intrins:9 multitable:91  (ns per buf)
0x000100: asm:25 intrins:17 multitable:178  (ns per buf)
0x000200: asm:55 intrins:38 multitable:347  (ns per buf)
0x000400: asm:61 intrins:62 multitable:684  (ns per buf)

Both implementations are superior to the multitable approach, so it is
unreasonable not to make one of them standard on x86 platforms.

The unrolled intrinsics are consistently better than not unrolled on
objects 0x40-0x200 bytes large.  At 0x400 bytes we pass the first
unroll-by-3 threshold and it stops mattering as much.

At 0x40 bytes, it is the difference between 6.4 GB/s and 9.1 GB/s.  At
0x200 bytes, it is the difference between 9.3 GB/s and 13.5 GB/s.  I
think this justifies some minor ugliness.
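
(Worked out: 0x40 is 64 bytes, and 64 B / 10 ns = 6.4 GB/s versus
64 B / 7 ns = 9.1 GB/s; at 0x200 = 512 bytes, 512 B / 55 ns = 9.3 GB/s
versus 512 B / 38 ns = 13.5 GB/s.)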

Best,
Conrad


Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-30 Thread Bruce Evans

On Tue, 31 Jan 2017, Conrad E. Meyer wrote:


Log:
 calculate_crc32c: Add SSE4.2 implementation on x86


This breaks building with gcc-4.2.1, and depends on using non-kernel clang
headers for clang.


Modified: head/sys/conf/files.amd64
===================================================================
--- head/sys/conf/files.amd64   Tue Jan 31 01:55:29 2017(r313005)
+++ head/sys/conf/files.amd64   Tue Jan 31 03:26:32 2017(r313006)
@@ -593,6 +593,11 @@ compat/ndis/subr_pe.c  optional ndisapi pci
compat/ndis/subr_usbd.c optional ndisapi pci
compat/ndis/winx64_wrap.S   optional ndisapi pci
#
+crc32_sse42.o  standard\


I don't want it, but it is standard.


+   dependency  "$S/libkern/x86/crc32_sse42.c"\
+   compile-with "${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \


-msse4 is not supported by gcc-4.2.1.

Removing -nostdinc pollutes the build with host headers; the header needed
(nmmintrin.h) might not be installed, and it doesn't exist for gcc-4.2.1.

Similarly for i386.


Modified: head/sys/libkern/crc32.c
===================================================================
--- head/sys/libkern/crc32.cTue Jan 31 01:55:29 2017(r313005)
+++ head/sys/libkern/crc32.cTue Jan 31 03:26:32 2017(r313006)
@@ -46,8 +46,14 @@
__FBSDID("$FreeBSD$");

#include <sys/param.h>
+#include <sys/libkern.h>


Style bug.  libkern.h is part of systm.h.


#include <sys/systm.h>


Ordering bug.  systm.h is a prerequisite for all kernel headers except
param.h, since it defines macros which might be used by other headers.


Added: head/sys/libkern/x86/crc32_sse42.c
===================================================================
--- /dev/null   00:00:00 1970   (empty, because file is newly added)
+++ head/sys/libkern/x86/crc32_sse42.c  Tue Jan 31 03:26:32 2017
(r313006)
@@ -0,0 +1,288 @@
...
+#ifdef USERSPACE_TESTING
+#include <stdint.h>
+#else
+#include <sys/param.h>
+#include <sys/kernel.h>
+#include <sys/libkern.h>
+#include <sys/systm.h>
+#endif


Style and ordering bugs, as above.


+
+#include <nmmintrin.h>


This header is outside of the kernel source tree.  It is not even in
/usr/include, but clang finds it in:

crc32_sse42.o:
  /usr/bin/../lib/clang/3.9.0/include/nmmintrin.h \
  /usr/bin/../lib/clang/3.9.0/include/smmintrin.h \
  /usr/bin/../lib/clang/3.9.0/include/tmmintrin.h \
  /usr/bin/../lib/clang/3.9.0/include/pmmintrin.h \
  /usr/bin/../lib/clang/3.9.0/include/emmintrin.h \
  /usr/bin/../lib/clang/3.9.0/include/xmmintrin.h \
  /usr/bin/../lib/clang/3.9.0/include/mmintrin.h \
  /usr/bin/../lib/clang/3.9.0/include/f16cintrin.h \
  /usr/bin/../lib/clang/3.9.0/include/popcntintrin.h

nmmintrin.h doesn't exist for gcc-4.2.1.  gcc-4.2.1 has some of the other
intrin.h files, but they aren't installed in FreeBSD-9, and of course they
don't support newer SSE.

Inline asm is much less unportable than intrinsics.  kib used the correct
method of .byte's in asms to avoid depending on assembler support for newer
instructions.  .byte is still used for clflush on amd64 and i386.  It
used to be used for invpcid on amd64.  I can't find where it is or was
used for xsave stuff.
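For crc32q, the .byte technique would look something like this hypothetical
sketch (not kib's actual code): "crc32q %rsi,%rdi" encodes as
f2 48 0f 38 f1 fe, and the "D"/"S" constraints pin the operands to %rdi/%rsi
so the hand-picked ModRM byte stays valid:

static __inline uint64_t
crc32q_dot_byte(uint64_t x, uint64_t y)
{
	/* crc32q %rsi,%rdi, hand-encoded so old assemblers accept it. */
	__asm(".byte 0xf2, 0x48, 0x0f, 0x38, 0xf1, 0xfe"
	    : "+D" (x) : "S" (y));
	return (x);
}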

Bruce
___
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-30 Thread Conrad Meyer
On Mon, Jan 30, 2017 at 7:26 PM, Conrad E. Meyer  wrote:
> (The CRC instruction takes 1 cycle but has 2-3 cycles of latency.)

My mistake, it's not 2 anywhere.  It's just 3 cycles on all
workstation/server CPUs since Nehalem.  Different on Atom chips and
AMD.
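Back-of-the-envelope, assuming 3-cycle latency and one crc32q issued per
cycle: a single dependency chain retires 8 bytes every 3 cycles, about
8 GB/s at 3 GHz, while three independent chains can retire one crc32q every
cycle, about 24 GB/s.  That factor of three is what the by-3 decomposition
chases.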

Best,
Conrad
___
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

2017-01-30 Thread Conrad E. Meyer
Author: cem
Date: Tue Jan 31 03:26:32 2017
New Revision: 313006
URL: https://svnweb.freebsd.org/changeset/base/313006

Log:
  calculate_crc32c: Add SSE4.2 implementation on x86
  
  Derived from an implementation by Mark Adler.
  
  The fast loop performs three simultaneous CRCs over subsets of the data
  before composing them.  This takes advantage of certain properties of
  the CRC32 implementation in Intel hardware.  (The CRC instruction takes 1
  cycle but has 2-3 cycles of latency.)
  
  The CRC32 instruction does not manipulate FPU state.
  
  i386 does not have the crc32q instruction, so avoid it there.  Otherwise
  the implementation is identical to amd64.
  
  Add basic userland tests to verify correctness on a variety of inputs.
  
  PR:   216467
  Reported by:  Ben RUBSON 
  Reviewed by:  kib@, markj@ (earlier version)
  Sponsored by: Dell EMC Isilon
  Differential Revision:https://reviews.freebsd.org/D9342
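The latency-hiding idea behind that fast loop can be illustrated with three
independent streams; a simplified sketch only, since the committed code
splits a single buffer into three lanes and then composes the three partial
CRCs (the combine step is omitted here):

#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>

static void
crc32c_three_streams(uint32_t crc[3], const uint64_t *a, const uint64_t *b,
    const uint64_t *c, size_t nwords)
{
	uint64_t c0 = crc[0], c1 = crc[1], c2 = crc[2];
	size_t i;

	/*
	 * Three independent dependency chains: one crc32q can issue
	 * each cycle even though each result takes 3 cycles.
	 */
	for (i = 0; i < nwords; i++) {
		c0 = _mm_crc32_u64(c0, a[i]);
		c1 = _mm_crc32_u64(c1, b[i]);
		c2 = _mm_crc32_u64(c2, c[i]);
	}
	crc[0] = c0;
	crc[1] = c1;
	crc[2] = c2;
}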

Added:
  head/sys/libkern/x86/
  head/sys/libkern/x86/crc32_sse42.c   (contents, props changed)
  head/tests/sys/kern/libkern_crc32.c   (contents, props changed)
Modified:
  head/sys/conf/files.amd64
  head/sys/conf/files.i386
  head/sys/libkern/crc32.c
  head/sys/sys/libkern.h
  head/tests/sys/kern/Makefile

Modified: head/sys/conf/files.amd64
===================================================================
--- head/sys/conf/files.amd64   Tue Jan 31 01:55:29 2017(r313005)
+++ head/sys/conf/files.amd64   Tue Jan 31 03:26:32 2017(r313006)
@@ -593,6 +593,11 @@ compat/ndis/subr_pe.c  optional ndisapi pci
 compat/ndis/subr_usbd.c optional ndisapi pci
 compat/ndis/winx64_wrap.S  optional ndisapi pci
 #
+crc32_sse42.o  standard\
+   dependency  "$S/libkern/x86/crc32_sse42.c"  \
+   compile-with "${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
+   no-implicit-rule\
+   clean   "crc32_sse42.o"
 libkern/memmove.c  standard
 libkern/memset.c   standard
 #

Modified: head/sys/conf/files.i386
===================================================================
--- head/sys/conf/files.i386Tue Jan 31 01:55:29 2017(r313005)
+++ head/sys/conf/files.i386Tue Jan 31 03:26:32 2017(r313006)
@@ -554,6 +554,11 @@ kern/kern_clocksource.cstandard
 kern/imgact_aout.c optional compat_aout
 kern/imgact_gzip.c optional gzip
 kern/subr_sfbuf.c  standard
+crc32_sse42.o  standard\
+   dependency  "$S/libkern/x86/crc32_sse42.c"  \
+   compile-with "${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
+   no-implicit-rule\
+   clean   "crc32_sse42.o"
 libkern/divdi3.c   standard
 libkern/ffsll.cstandard
 libkern/flsll.cstandard

Modified: head/sys/libkern/crc32.c
===================================================================
--- head/sys/libkern/crc32.cTue Jan 31 01:55:29 2017(r313005)
+++ head/sys/libkern/crc32.cTue Jan 31 03:26:32 2017(r313006)
@@ -46,8 +46,14 @@
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
+#include <sys/libkern.h>
 #include <sys/systm.h>
 
+#if defined(__amd64__) || defined(__i386__)
+#include <machine/md_var.h>
+#include <machine/specialreg.h>
+#endif
+
 const uint32_t crc32_tab[] = {
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
@@ -749,6 +755,11 @@ calculate_crc32c(uint32_t crc32c,
 const unsigned char *buffer,
 unsigned int length)
 {
+#if defined(__amd64__) || defined(__i386__)
+   if ((cpu_feature2 & CPUID2_SSE42) != 0) {
+   return (sse42_crc32c(crc32c, buffer, length));
+   } else
+#endif
if (length < 4) {
return (singletable_crc32c(crc32c, buffer, length));
} else {
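The dispatch keys off the cpu_feature2 word the kernel caches at boot.  A
userland analogue of the same check, sketched with the GCC/Clang builtin
(the function names in the printfs are just the ones from the patch):

#include <stdio.h>

int
main(void)
{
	/* Same decision the kernel makes with cpu_feature2 & CPUID2_SSE42. */
	if (__builtin_cpu_supports("sse4.2"))
		printf("dispatch to sse42_crc32c()\n");
	else
		printf("fall back to the table-driven crc32c\n");
	return (0);
}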

Added: head/sys/libkern/x86/crc32_sse42.c
===================================================================
--- /dev/null   00:00:00 1970   (empty, because file is newly added)
+++ head/sys/libkern/x86/crc32_sse42.c  Tue Jan 31 03:26:32 2017
(r313006)
@@ -0,0 +1,288 @@
+/*
+ * Derived from crc32c.c version 1.1 by Mark Adler.
+ *
+ * Copyright (C) 2013 Mark Adler
+ *
+ * This software is provided 'as-is', without any express or implied warranty.
+ * In no event will the author be held liable for any damages arising from the
+ * use of this software.
+ *
+ * Permission is granted to anyone to use this software for any purpose,
+ * including commercial applications, and to alter it and redistribute it
+ * freely, subject to the following restrictions:
+ *
+ * 1. The origin of this software must not be