Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
> On 01 Mar 2017, at 21:57, Conrad Meyer wrote:
>
> Hi Bruce,
>
> On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
> a little worse performance with this patch.

Hi Bruce & Conrad,

I gave both patches a try. It's a real use case: iSCSI throughput. Both target and initiator are running FreeBSD 11, both patched. I read 100 times 10G from the same raw SSD device. Results in MB/s (top10 is the average of the 10 highest values).

# Without CRC32:
average : 300.3  std dev : 3.9  top10 : 308.6  max : 311.4
# With software CRC32:
average : 197.6  std dev : 6.5  top10 : 204.2  max : 204.9
# With Conrad's CRC32:
average : 282.4  std dev : 9.1  top10 : 292.9  max : 293.3
# With Bruce's CRC32:
average : 278.4  std dev : 7.8  top10 : 290.8  max : 292.5

The difference between the two patches is, let's say, almost insignificant.

CPU on both sides: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

Ben
___
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Wed, 1 Mar 2017, Conrad Meyer wrote:

> On Wed, Mar 1, 2017 at 9:27 PM, Bruce Evans wrote:
>> On Wed, 1 Mar 2017, Conrad Meyer wrote:
>>> On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
>>> a little worse performance with this patch. Please excuse the ugly
>>> graphs, I don't have a better graphing tool set up at this time:
>>>
>>> https://people.freebsd.org/~cem/crc32/sse42_bde.png
>>> https://people.freebsd.org/~cem/crc32/sse42_bde_log.png
>>
>> Try doubling the loop sizes. There shouldn't be any significant difference
>> above size 3*LONG unless LONG is too small. Apparently it is too small for
>> older CPUs.
>>
>> I now have a Sandybridge i5-2xxx laptop to test on, but don't have it set
>> up for much yet.
>
> Doubling the loop sizes seems to make it slightly worse, actually:
>
> https://people.freebsd.org/~cem/crc32/sse42_bde2.png
> https://people.freebsd.org/~cem/crc32/sse42_bde_log2.png
>
> I haven't made any attempt to inspect the generated assembly. This is
> Clang 3.9.1 with -O2.

I tested on Sandybridge (i5-2540M) and get exactly the opposite results with clang-3.9.0. It is much slower with intrinsics. Much slower than gcc-4.2.1. Perhaps a bug in one of the test programs (mine is enclosed). Minimum times with low variance (±10 msec) for "./z2 size 10" (100G total), in seconds on an idle system:

buf_size:         512  3*512  4096  3*4096
                -----  -----  ----  ------
./z2-bde-clang  10.57   8.36  6.85    6.58
./z2-bde-gcc    10.99   8.96  7.08    6.58
./z2-cur-clang  17.23  11.19  6.97    6.75

Oops, that was with MISALIGN = 1. Also, I forgot to force alignment of buf, but checked it was at 0x...40 in all cases. Now with proper alignment:

buf_size:         512  3*512  4096  3*4096
                -----  -----  ----  ------
./z2-bde-clang   8.96   6.56  6.62    6.42
./z2-bde-gcc     8.81   6.51  6.63    6.30
./z2-cur-clang  14.70   6.22  6.66    6.13

The number of iterations is adjusted so that buf_size * num_iter = 100G.
This shows that clang-3.9.0 with intrinsics is doing lots of rearrangement which is very bad for the misaligned case, otherwise helps for the multiple-of-3 cases (when the SHORT loop is null), and otherwise is a small pessimization relative to no intrinsics, but beats gcc, while gcc does almost none. (I mostly tested with gcc -O3 and it seemed equally good then.) The function doesn't use __predict_ugly(), and clang apparently uses this to optimize the alignment code at great cost to the main loops when the alignment code executes (perhaps it removes the alignment code?). clang also does poorly with buf_size 512 in the aligned case.

Indeed, gcc is much better with -O3 (other flags -static [-msse4 for intrins]). clang does excessive optimizations by default, and -O3 makes no difference for it:

buf_size:           512  3*512  4096  3*4096
                  -----  -----  ----  ------
./z2-bde-clangO3   8.96   6.56  6.62    6.42
./z2-bde-gccO3     8.95   6.06  6.11    5.80
./z2-cur-clangO3  14.70   6.22  6.66    6.13

So we seem to be mainly testing uninteresting compiler pessimizations. Eventually compilers will understand the code better and not rearrange it very much (except for the alignment part).

I did a quick test with LONG = SHORT = 128 and gcc -O2. This was just slower, even for the ideal loop size of 4096*3 (up from 6.30 to 6.67 seconds). This change just removes the LONG loop after renaming the SHORT loop to LONG. gcc apparently thinks it understands this simpler version, and pessimizes it.

While testing, I did notice a pessimization that is not the compiler's fault: when the crc32 instructions are optimized at the expense of the crc update at the end of the loop, the loop gets out of sync with the update and the wrong thing can stall. The code has subtleties to try to prevent this, but compilers don't really understand them. Compiler membars to control the ordering precisely were just pessimizations.
X #include <stddef.h>
X #include <stdint.h>
X #include <stdio.h>
X #include <stdlib.h>
X
X #define MISALIGN 1
X #define SIZE (1024 * 1024)
X
X uint8_t buf[MISALIGN + SIZE];
X
X uint32_t sse42_crc32c(uint32_t, const unsigned char *, unsigned);
X
X int
X main(int argc, char **argv)
X {
X 	size_t size;
X 	uint32_t crc;
X 	int i, j, limit, repeat;
X
X 	size = argc == 1 ? SIZE : atoi(argv[1]);
X 	limit = 10L * 1024 * 1024 * 1024 / size;	/* 10G per pass; 100G total with repeat = 10 */
X 	repeat = argc < 3 ? 10 : atoi(argv[2]);
X 	for (i = 0; i < sizeof(buf); i++)
X 		buf[i] = rand();
X 	crc = 0;
X 	for (j = 0; j < repeat; j++)
X 		for (i = 0; i < limit; i++)
X 			crc = sse42_crc32c(crc, &buf[MISALIGN], size);
X 	printf("%#x\n", sse42_crc32c(0, &buf[MISALIGN], size));
X 	return (crc == 0 ? 0 : 1);
X }

Loops like this are not very representative of normal use, but I don't know a better way.

Bruce
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Wed, Mar 1, 2017 at 9:27 PM, Bruce Evans wrote:
> On Wed, 1 Mar 2017, Conrad Meyer wrote:
>
>> On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
>> a little worse performance with this patch. Please excuse the ugly
>> graphs, I don't have a better graphing tool set up at this time:
>>
>> https://people.freebsd.org/~cem/crc32/sse42_bde.png
>> https://people.freebsd.org/~cem/crc32/sse42_bde_log.png
>
> Try doubling the loop sizes. There shouldn't be any significant difference
> above size 3*LONG unless LONG is too small. Apparently it is too small for
> older CPUs.
>
> I now have a Sandybridge i5-2xxx laptop to test on, but don't have it set
> up for much yet.
>
> Bruce

Hi Bruce,

Doubling the loop sizes seems to make it slightly worse, actually:

https://people.freebsd.org/~cem/crc32/sse42_bde2.png
https://people.freebsd.org/~cem/crc32/sse42_bde_log2.png

I haven't made any attempt to inspect the generated assembly. This is Clang 3.9.1 with -O2.

Best,
Conrad
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
> On 02 Mar 2017, at 06:27, Bruce Evans wrote:
>
> On Wed, 1 Mar 2017, Conrad Meyer wrote:
>
>> On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
>> a little worse performance with this patch. Please excuse the ugly
>> graphs, I don't have a better graphing tool set up at this time:
>>
>> https://people.freebsd.org/~cem/crc32/sse42_bde.png
>> https://people.freebsd.org/~cem/crc32/sse42_bde_log.png
>
> Try doubling the loop sizes. There shouldn't be any significant difference
> above size 3*LONG unless LONG is too small. Apparently it is too small for
> older CPUs.

Many thanks Bruce & Conrad for your work on this!

My two cents: speed on big chunks sounds (really) important, as it will impact the throughput of iSCSI with data digests enabled. If needed, I could help with some benchmarks on my servers: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

Thank you!
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Wed, 1 Mar 2017, Conrad Meyer wrote:

> On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see
> a little worse performance with this patch. Please excuse the ugly
> graphs, I don't have a better graphing tool set up at this time:
>
> https://people.freebsd.org/~cem/crc32/sse42_bde.png
> https://people.freebsd.org/~cem/crc32/sse42_bde_log.png

Try doubling the loop sizes. There shouldn't be any significant difference above size 3*LONG unless LONG is too small. Apparently it is too small for older CPUs.

I now have a Sandybridge i5-2xxx laptop to test on, but don't have it set up for much yet.

Bruce
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Hi Bruce,

On my laptop (Intel(R) Core(TM) i5-3320M CPU — Ivy Bridge) I still see a little worse performance with this patch. Please excuse the ugly graphs, I don't have a better graphing tool set up at this time:

https://people.freebsd.org/~cem/crc32/sse42_bde.png
https://people.freebsd.org/~cem/crc32/sse42_bde_log.png

Best,
Conrad

On Mon, Feb 27, 2017 at 6:27 PM, Bruce Evans wrote:
> On Mon, 27 Feb 2017, Conrad Meyer wrote:
>
>> On Thu, Feb 2, 2017 at 12:29 PM, Bruce Evans wrote:
>>>
>>> I've almost finished fixing and optimizing this. I didn't manage to fix
>>> all the compiler pessimizations, but the result is within 5% of optimal
>>> for buffers larger than a few K.
>>
>> Did you ever get to a final patch that you are satisfied with? It
>> would be good to get this improvement into the tree.
>
> I'm happy with this version (attached and partly enclosed). You need to
> test it in the kernel and commit it (I only did simple correctness tests
> in userland).
>
> X Index: conf/files.amd64
> X ===
> X --- conf/files.amd64 (revision 314363)
> X +++ conf/files.amd64 (working copy)
> X @@ -545,6 +545,9 @@
> X  isa/vga_isa.c	optional	vga
> X  kern/kern_clocksource.c	standard
> X  kern/link_elf_obj.c	standard
> X +libkern/x86/crc32_sse42.c	standard
> X +libkern/memmove.c	standard
> X +libkern/memset.c	standard
>
> Also fix some nearby disorder.
>
> X ...
> X Index: libkern/x86/crc32_sse42.c
> X ===
> X --- libkern/x86/crc32_sse42.c (revision 314363)
> X +++ libkern/x86/crc32_sse42.c (working copy)
> X @@ -31,15 +31,41 @@
> X  */
> X #ifdef USERSPACE_TESTING
> X #include
> X +#include
> X #else
> X #include
> X +#include
> X #include
> X -#include
> X -#include
> X #endif
>
> Also fix minor #include errors.
> X -#include
> X +static __inline uint32_t
> X +_mm_crc32_u8(uint32_t x, uint8_t y)
> X +{
> X +	/*
> X +	 * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y)
> X +	 * significantly and "r" (y) a lot by copying y to a different
> X +	 * local variable (on the stack or in a register), so only use
> X +	 * the latter.  This costs a register and an instruction but
> X +	 * not a uop.
> X +	 */
> X +	__asm("crc32b %1,%0" : "+r" (x) : "r" (y));
> X +	return (x);
> X +}
>
> Using intrinsics avoids the silly copying via the stack, and allows more
> unrolling. Old gcc does more unrolling with just asms. Unrolling is
> almost useless (some details below).
>
> X @@ -47,12 +73,14 @@
> X  * Block sizes for three-way parallel crc computation. LONG and SHORT must
> X  * both be powers of two.
> X  */
> X -#define LONG	8192
> X -#define SHORT	256
> X +#define LONG	128
> X +#define SHORT	64
>
> These are aggressively low.
>
> Note that small buffers aren't handled very well. SHORT = 64 means that
> a buffer of size 3 * 64 = 192 is handled entirely by the "SHORT" loop.
> 192 is not very small, but any smaller than that and overheads for
> adjustment at the end of the loop are too large for the "SHORT" loop
> to be worth doing. Almost any value of LONG larger than 128 works OK
> now, but if LONG is large then it gives too much work for the "SHORT"
> loop, since normal buffer sizes are not a multiple of 3. E.g., with
> the old LONG and SHORT, a buffer of size 128K was decomposed as 5 * 24K
> (done almost optimally by the "LONG" loop) + 10 * 768 (done a bit less
> optimally by the "SHORT" loop) + 64 * 8 (done pessimally).
>
> I didn't get around to ifdefing this for i386. On i386, the loops take
> twice as many crc32 instructions for a given byte count, so the timing
> is satisfied by a byte count half as large, so SHORT and LONG can be
> reduced by a factor of 2 to give faster handling for small buffers without
> affecting the speed for large buffers significantly.
> > X X /* Tables for hardware crc that shift a crc by LONG and SHORT zeros. */ > X static uint32_t crc32c_long[4][256]; > X +static uint32_t crc32c_2long[4][256]; > X static uint32_t crc32c_short[4][256]; > X +static uint32_t crc32c_2short[4][256]; > > I didn't get around to updating the comment. 2long shifts by 2*LONG zeros, > etc. > > Shifts by 3N are done by adding shifts by 1N and 2N in parallel. I couldn't > get the direct 3N shift to run any faster. > > X @@ -190,7 +220,11 @@ > X const size_t align = 4; > X #endif > X const unsigned char *next, *end; > X - uint64_t crc0, crc1, crc2; /* need to be 64 bits for crc32q */ > X +#ifdef __amd64__ > X + uint64_t crc0, crc1, crc2; > X +#else > X + uint32_t crc0, crc1, crc2; > X +#endif > X X next = buf; > X crc0 = crc; > > 64 bits of course isn't needed for i386. It isn't needed for amd64 either. > I think the
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Mon, 27 Feb 2017, Conrad Meyer wrote:

> On Thu, Feb 2, 2017 at 12:29 PM, Bruce Evans wrote:
>> I've almost finished fixing and optimizing this. I didn't manage to fix
>> all the compiler pessimizations, but the result is within 5% of optimal
>> for buffers larger than a few K.
>
> Did you ever get to a final patch that you are satisfied with? It
> would be good to get this improvement into the tree.

I'm happy with this version (attached and partly enclosed). You need to test it in the kernel and commit it (I only did simple correctness tests in userland).

X Index: conf/files.amd64
X ===
X --- conf/files.amd64 (revision 314363)
X +++ conf/files.amd64 (working copy)
X @@ -545,6 +545,9 @@
X  isa/vga_isa.c	optional	vga
X  kern/kern_clocksource.c	standard
X  kern/link_elf_obj.c	standard
X +libkern/x86/crc32_sse42.c	standard
X +libkern/memmove.c	standard
X +libkern/memset.c	standard

Also fix some nearby disorder.

X ...
X Index: libkern/x86/crc32_sse42.c
X ===
X --- libkern/x86/crc32_sse42.c (revision 314363)
X +++ libkern/x86/crc32_sse42.c (working copy)
X @@ -31,15 +31,41 @@
X  */
X #ifdef USERSPACE_TESTING
X #include
X +#include
X #else
X #include
X +#include
X #include
X -#include
X -#include
X #endif

Also fix minor #include errors.

X
X -#include
X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X +	/*
X +	 * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y)
X +	 * significantly and "r" (y) a lot by copying y to a different
X +	 * local variable (on the stack or in a register), so only use
X +	 * the latter.  This costs a register and an instruction but
X +	 * not a uop.
X +	 */
X +	__asm("crc32b %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}

Using intrinsics avoids the silly copying via the stack, and allows more unrolling. Old gcc does more unrolling with just asms. Unrolling is almost useless (some details below).

X @@ -47,12 +73,14 @@
X  * Block sizes for three-way parallel crc computation. LONG and SHORT must
X  * both be powers of two.
X  */
X -#define LONG	8192
X -#define SHORT	256
X +#define LONG	128
X +#define SHORT	64

These are aggressively low.

Note that small buffers aren't handled very well. SHORT = 64 means that a buffer of size 3 * 64 = 192 is handled entirely by the "SHORT" loop. 192 is not very small, but any smaller than that and overheads for adjustment at the end of the loop are too large for the "SHORT" loop to be worth doing. Almost any value of LONG larger than 128 works OK now, but if LONG is large then it gives too much work for the "SHORT" loop, since normal buffer sizes are not a multiple of 3. E.g., with the old LONG and SHORT, a buffer of size 128K was decomposed as 5 * 24K (done almost optimally by the "LONG" loop) + 10 * 768 (done a bit less optimally by the "SHORT" loop) + 64 * 8 (done pessimally).

I didn't get around to ifdefing this for i386. On i386, the loops take twice as many crc32 instructions for a given byte count, so the timing is satisfied by a byte count half as large, so SHORT and LONG can be reduced by a factor of 2 to give faster handling for small buffers without affecting the speed for large buffers significantly.

X
X  /* Tables for hardware crc that shift a crc by LONG and SHORT zeros. */
X  static uint32_t crc32c_long[4][256];
X +static uint32_t crc32c_2long[4][256];
X  static uint32_t crc32c_short[4][256];
X +static uint32_t crc32c_2short[4][256];

I didn't get around to updating the comment. 2long shifts by 2*LONG zeros, etc.

Shifts by 3N are done by adding shifts by 1N and 2N in parallel. I couldn't get the direct 3N shift to run any faster.

X @@ -190,7 +220,11 @@
X  	const size_t align = 4;
X  #endif
X  	const unsigned char *next, *end;
X -	uint64_t crc0, crc1, crc2;	/* need to be 64 bits for crc32q */
X +#ifdef __amd64__
X +	uint64_t crc0, crc1, crc2;
X +#else
X +	uint32_t crc0, crc1, crc2;
X +#endif
X
X  	next = buf;
X  	crc0 = crc;

64 bits of course isn't needed for i386. It isn't needed for amd64 either.
I think the crc32 instruction zeros the top 32 bits so they can be ignored. However, when I modified the asm to return 32 bits to tell the compiler about this (which the intrinsic wouldn't be able to do) and used 32 bits here, this was just slightly slower. For some intermediate crc calculations, only 32 bits are used, and the compiler can see this. clang on amd64 optimizes this better than gcc, starting with all the intermediate crc variables declared as 64 bits. gcc generates worse code when some of the intermediates are declared as 32 bits. So keep using 64 bits on amd64 here.

X @@ -202,6 +236,7 @@
X  	len--;
X  }
X
X +#if LONG >
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Thu, Feb 2, 2017 at 12:29 PM, Bruce Evans wrote:
> I've almost finished fixing and optimizing this. I didn't manage to fix
> all the compiler pessimizations, but the result is within 5% of optimal
> for buffers larger than a few K.

Hi Bruce,

Did you ever get to a final patch that you are satisfied with? It would be good to get this improvement into the tree.

Thanks,
Conrad
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Hi guys, Conrad, Bruce,

May I ask for some news regarding this, please? It has been more than 3 weeks now running Conrad's commit on 2 CRC32C-digest-enabled iSCSI initiators/targets without issue :)

Thank you very much again for this! Shall we then think about "fixing" the last one or two remaining things before being able to MFC it, please? I would really be glad to see this nice kernel feature in 11.1. Perhaps it could also be backported to 11.0pX as a nice/huge & non-risky performance improvement.

Thank you very much for your support & cooperation!

Best regards,
Ben
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Thu, 2 Feb 2017, Konstantin Belousov wrote:

> On Tue, Jan 31, 2017 at 03:26:32AM +0000, Conrad E. Meyer wrote:
>> +	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
> BTW, new gcc has -mcrc32 option, but clang 3.9.1 apparently does not.

I've almost finished fixing and optimizing this. I didn't manage to fix all the compiler pessimizations, but the result is within 5% of optimal for buffers larger than a few K.

See the previous patch for -msse4 removal. It might be good to add -fno-unroll-functions to CFLAGS to prevent some clang pessimizations. Or -Os might work even better.

X Index: libkern/x86/crc32_sse42.c
X ===
X --- libkern/x86/crc32_sse42.c (revision 313090)
X +++ libkern/x86/crc32_sse42.c (working copy)
X @@ -31,15 +31,41 @@
X  */
X #ifdef USERSPACE_TESTING
X #include
X +#include
X #else
X #include
X +#include
X #include
X -#include
X -#include
X #endif
X
X -#include
X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X +	/*
X +	 * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y)
X +	 * significantly and "r" (y) a lot by copying y to a different
X +	 * local variable (on the stack or in a register), so only use
X +	 * the latter.  This costs a register and an instruction but
X +	 * not a uop.
X +	 */
X +	__asm("crc32b %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}

I haven't found a better fix. Even gcc likes to pessimize this by preferring the "r" constraint in "rm", so "r" should be disparaged. It does this to try to schedule loop control instructions in between the memory reference and the operation using it, but it gets this scheduling mostly wrong. -Os stops this for gcc, but clang keeps producing the large and slow explicit load (then a store to a different place in memory to use there if the constraint allows "m") even with -Os.
X
X +static __inline uint32_t
X +_mm_crc32_u32(uint32_t x, uint32_t y)
X +{
X +	__asm("crc32l %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}
X +
X +static __inline uint64_t
X +_mm_crc32_u64(uint64_t x, uint64_t y)
X +{
X +	__asm("crc32q %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}
X +
X  /* CRC-32C (iSCSI) polynomial in reversed bit order. */
X  #define POLY	0x82f63b78
X
X @@ -47,12 +73,14 @@
X  * Block sizes for three-way parallel crc computation. LONG and SHORT must
X  * both be powers of two.
X  */
X -#define LONG	8192
X -#define SHORT	256
X +#define LONG	128
X +#define SHORT	64

These are now the smallest that work right on Haswell. Practical values should be larger in case the timing is too different for the manual scheduling to work. Small buffers are still slow, so LONG = SHORT = 1024 (with no LONG loop) would be a good practical value, except for the annoying problem that the 3-way optimization doesn't work so well for normal power-of-2 buffer sizes. LONG = SHORT = 1024 is perfect for buffer sizes of a multiple of 3K. It does 4K as 3K + 1K, with the 3K part optimized and the 1K part unoptimized and taking as long as the 3K part. LONG = 1024 with SHORT = 64 does the residual 1K at almost full speed except for another residual of 192.

X
X  /* Tables for hardware crc that shift a crc by LONG and SHORT zeros.
*/ X static uint32_t crc32c_long[4][256]; X +static uint32_t crc32c_2long[4][256]; X static uint32_t crc32c_short[4][256]; X +static uint32_t crc32c_2short[4][256]; X X /* X * Multiply a matrix times a vector over the Galois field of two elements, X @@ -171,7 +199,9 @@ X crc32c_init_hw(void) X { X crc32c_zeros(crc32c_long, LONG); X + crc32c_zeros(crc32c_2long, 2 * LONG); X crc32c_zeros(crc32c_short, SHORT); X + crc32c_zeros(crc32c_2short, 2 * SHORT); X } X #ifdef _KERNEL X SYSINIT(crc32c_sse42, SI_SUB_LOCK, SI_ORDER_ANY, crc32c_init_hw, NULL); X @@ -190,7 +220,11 @@ X const size_t align = 4; X #endif X const unsigned char *next, *end; X - uint64_t crc0, crc1, crc2; /* need to be 64 bits for crc32q */ X +#ifdef __amd64__ X + uint64_t crc0, crc1, crc2; X +#else X + uint32_t crc0, crc1, crc2; X +#endif X X next = buf; X crc0 = crc; The type system doesn't work well here. The crc32* instruction returns a 32-bit value in a register of unclear size. I think it is zero extended to full width, but my asms aren't written quite right to say this and neither are the instrinsics. The 64-bit intrinsic returns a 64-bit value which just gets in the way. The above has to declare the type as uint64_t for amd64, else pessimizations result. That gives smaller pessimizations, mainly with gcc, for expressions like (crc0 >> 24) where the 64-bit declaration doesn't make it clear that the top bits are zero, so extra zeroing operations may be necessary. I have barely tested the 32-bit case, but the 64-bit type is clearly wasted then (but easier to
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Tue, Jan 31, 2017 at 03:26:32AM +0000, Conrad E. Meyer wrote:
> +	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \

BTW, new gcc has -mcrc32 option, but clang 3.9.1 apparently does not.
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Tue, 31 Jan 2017, Conrad Meyer wrote:

> On Tue, Jan 31, 2017 at 7:16 PM, Bruce Evans wrote:
>> Another reply to this...
>>
>> On Tue, 31 Jan 2017, Conrad Meyer wrote:
>>> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans wrote:
>>>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>>>> I think there should be no alignment on entry -- just assume the buffer
>>>> is aligned in the usual case, and only run 4% slower when it is
>>>> misaligned.
>>>
>>> Please write such a patch and demonstrate the improvement.
>>
>> It is easy to demonstrate. I just put #if 0 around the early alignment
>> code. The result seems too good to be true, so maybe I missed some
>> later dependency on alignment of the addresses:
>> - for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
>>   alignment and 1.02 seconds without alignment. This actually makes
>>   sense: 128 bytes can be done with 16 8-byte unaligned crc32q's. The
>>   alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte. 7 more
>>   3-cycle instructions and overhead too is far more than the cost of
>>   letting the CPU do read-combining.
>> - for 4096-byte buffers, the difference is insignificant (0.47 seconds
>>   for 10g).
>
> I believe it, especially for newer amd64. I seem to recall that older
> x86 machines had a higher misalignment penalty, but it was largely
> reduced in (?)Nehalem. Why don't you go ahead and commit that change?

Needs more work. Here are fairly clean patches for the unportabilities. I can't really test these in the kernel. Merge the asms with yours if yours are better in parts.
X Index: conf/files.amd64
X ===
X --- conf/files.amd64 (revision 313007)
X +++ conf/files.amd64 (working copy)
X @@ -536,6 +536,9 @@
X  isa/vga_isa.c	optional	vga
X  kern/kern_clocksource.c	standard
X  kern/link_elf_obj.c	standard
X +libkern/x86/crc32_sse42.c	standard
X +libkern/memmove.c	standard
X +libkern/memset.c	standard
X  #
X  # IA32 binary support
X  #
X @@ -593,14 +596,6 @@
X  compat/ndis/subr_usbd.c	optional	ndisapi pci
X  compat/ndis/winx64_wrap.S	optional	ndisapi pci
X  #
X -crc32_sse42.o	standard \
X -	dependency	"$S/libkern/x86/crc32_sse42.c"	\
X -	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
X -	no-implicit-rule	\
X -	clean	"crc32_sse42.o"
X -libkern/memmove.c	standard
X -libkern/memset.c	standard
X -#
X  # x86 real mode BIOS emulator, required by dpms/pci/vesa
X  #
X  compat/x86bios/x86bios.c	optional x86bios | dpms | pci | vesa

This also fixes unsorting by crc32_sse42.o and nearby unsorting by all other libkern lines (just 2) in files.amd64. This file is still grossly unsorted near the end.

X Index: conf/files.i386
X ===
X --- conf/files.i386 (revision 313007)
X +++ conf/files.i386 (working copy)
X @@ -554,11 +554,6 @@
X  kern/imgact_aout.c	optional compat_aout
X  kern/imgact_gzip.c	optional gzip
X  kern/subr_sfbuf.c	standard
X -crc32_sse42.o	standard \
X -	dependency	"$S/libkern/x86/crc32_sse42.c"	\
X -	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
X -	no-implicit-rule	\
X -	clean	"crc32_sse42.o"
X  libkern/divdi3.c	standard
X  libkern/ffsll.c	standard
X  libkern/flsll.c	standard
X @@ -569,6 +564,7 @@
X  libkern/ucmpdi2.c	standard
X  libkern/udivdi3.c	standard
X  libkern/umoddi3.c	standard
X +libkern/x86/crc32_sse42.c	standard
X  i386/xbox/xbox.c	optional xbox
X  i386/xbox/xboxfb.c	optional xboxfb
X  dev/fb/boot_font.c	optional xboxfb

This also fixes unsorting by crc32_sse42.o. files.i386 is not as unsorted as files.amd64. The rules for sorting generated files are unclear.
They are not in source directories so are not naturally grouped. Old ones are mostly placed near the beginning. Some are sorted on a pathname of the source file(s) in scattered places in the make rules. crc32_sse42.o is partly like that. It was sorted with libkern sources but not by the x86 part.

X Index: libkern/x86/crc32_sse42.c
X ===
X --- libkern/x86/crc32_sse42.c (revision 313007)
X +++ libkern/x86/crc32_sse42.c (working copy)
X @@ -31,15 +31,41 @@
X  */
X #ifdef USERSPACE_TESTING
X #include
X +#include

Fix dependency on namespace pollution in for at least size_t.

X #else
X #include
X +#include
X #include
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Tue, Jan 31, 2017 at 7:16 PM, Bruce Evans wrote:
> Another reply to this...
>
> On Tue, 31 Jan 2017, Conrad Meyer wrote:
>
>> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans wrote:
>>>
>>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>>> I think there should be no alignment on entry -- just assume the buffer is
>>> aligned in the usual case, and only run 4% slower when it is misaligned.
>>
>> Please write such a patch and demonstrate the improvement.
>
> It is easy to demonstrate. I just put #if 0 around the early alignment
> code. The result seems too good to be true, so maybe I missed some
> later dependency on alignment of the addresses:
> - for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
>   alignment and 1.02 seconds without alignment.
>   This actually makes sense: 128 bytes can be done with 16 8-byte unaligned
>   crc32q's. The alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte.
>   7 more 3-cycle instructions and overhead too is far more than the cost
>   of letting the CPU do read-combining.
> - for 4096-byte buffers, the difference is insignificant (0.47 seconds for
>   10g).

I believe it, especially for newer amd64. I seem to recall that older x86 machines had a higher misalignment penalty, but it was largely reduced in (?)Nehalem. Why don't you go ahead and commit that change?

>> perhaps we could just remove the 3x8k table — I'm not sure
>> it adds any benefit over the 3x256 table.
>
> Good idea, but the big table is useful. Ifdefing out the LONG case reduces
> the speed for large buffers from ~0.35 seconds to ~0.43 seconds in the
> setup below. Ifdefing out the SHORT case only reduces to ~0.39 seconds.

Interesting.

> I hoped that an even shorter SHORT case would work. I think it now handles
> 768 bytes (3 * SHORT) in the inner loop.

Right.

> That is 32 sets of 3 crc32q's,
> and I would have thought that update at the end would take about as long
> as 1 iteration (3%), but it apparently takes 33%.
The update at the end may be faster with PCLMULQDQ. There are versions of this algorithm that use that in place of the lookup-table combine (for example, Linux has a permissively licensed implementation here: http://lxr.free-electrons.com/source/arch/x86/crypto/crc32c-pcl-intel-asm_64.S ). Unfortunately, PCLMULQDQ uses FPU state, which is inappropriate most of the time in kernel mode. It could be used opportunistically if the thread is already in FPU-save mode or if the input is "big enough" to make it worth it.

>>> Your benchmarks mainly give results for the <= 768 bytes where most of
>>> the manual optimizations don't apply.
>>
>> 0x000400: asm:68 intrins:62 multitable:684 (ns per buf)
>> 0x000800: asm:132 intrins:133 (ns per buf)
>> 0x002000: asm:449 intrins:446 (ns per buf)
>> 0x008000: asm:1501 intrins:1497 (ns per buf)
>> 0x020000: asm:5618 intrins:5609 (ns per buf)
>>
>> (All routines are in a separate compilation unit with no full-program
>> optimization, as they are in the kernel.)
>
> These seem slow. I modified my program to test the actual kernel code,
> and get for 10gB on freefall's Xeon (main times in seconds):

Freefall has a Haswell Xeon @ 3.3GHz. My laptop is a Sandy Bridge Core i5 @ 2.6 GHz. That may help explain the difference.

Best,
Conrad
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Another reply to this... On Tue, 31 Jan 2017, Conrad Meyer wrote: On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans wrote: On Tue, 31 Jan 2017, Bruce Evans wrote: Unrolling (or not) may be helpful or harmful for entry and exit code. Helpful, per my earlier benchmarks. I think there should be no alignment on entry -- just assume the buffer is aligned in the usual case, and only run 4% slower when it is misaligned. Please write such a patch and demonstrate the improvement. It is easy to demonstrate. I just put #if 0 around the early alignment code. The results seem too good to be true, so maybe I missed some later dependency on alignment of the addresses: - for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with alignment and 1.02 seconds without alignment. This actually makes sense, 128 bytes can be done with 16 8-byte unaligned crc32q's. The alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte. 7 more 3-cycle instructions and overhead too are far more than the cost of letting the CPU do read-combining. - for 4096-byte buffers, the difference is insignificant (0.47 seconds for 10g). I don't understand the algorithm for joining crcs -- why doesn't it work to reduce to 12 or 24 bytes in the main loop? It would, but I haven't implemented or tested that. You're welcome to do so and demonstrate an improvement. It does add more lookup table bloat, but perhaps we could just remove the 3x8k table -- I'm not sure it adds any benefit over the 3x256 table. Good idea, but the big table is useful. Ifdefing out the LONG case reduces the speed for large buffers from ~0.35 seconds to ~0.43 seconds in the setup below. Ifdefing out the SHORT case only reduces to ~0.39 seconds. I hoped that an even shorter SHORT case would work. I think it now handles 768 bytes (3 * SHORT) in the inner loop. That is 32 sets of 3 crc32q's, and I would have thought that update at the end would take about as long as 1 iteration (3%), but it apparently takes 33%. ... 
Your benchmarks mainly give results for the <= 768 bytes where most of the manual optimizations don't apply.

0x000400: asm:68 intrins:62 multitable:684 (ns per buf)
0x000800: asm:132 intrins:133 (ns per buf)
0x002000: asm:449 intrins:446 (ns per buf)
0x008000: asm:1501 intrins:1497 (ns per buf)
0x02: asm:5618 intrins:5609 (ns per buf)

(All routines are in a separate compilation unit with no full-program optimization, as they are in the kernel.)

These seem slow. I modified my program to test the actual kernel code, and get for 10gB on freefall's Xeon (main times in seconds):

0x08: asm(rm):3.41 asm(r):3.07 intrins:6.01 gcc:3.74 (3S = 2.4ns/buf)
0x10: asm(rm):2.05 asm(r):1.70 intrins:2.92 gcc:2.62 (2S = 3/2ns/buf)
0x20: asm(rm):1.63 asm(r):1.58 intrins:1.62 gcc:1.61 (1.6S = 5.12ns/buf)
0x40: asm(rm):1.07 asm(r):1.11 intrins:1.06 gcc:1.14 (1.1S = 7.04ns/buf)
0x80: asm(rm):1.02 asm(r):1.04 intrins:1.03 gcc:1.04 (1.02S = 13.06ns/buf)
0x000100: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.08 (1.02S = 52.22ns/buf)
0x000200: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.02 (1.02S = 104.45ns/buf)
0x000400: asm(rm):0.58 asm(r):0.57 intrins:0.57 gcc:0.57 (.57S = 116.43ns/buf)
0x001000: asm(rm):0.62 asm(r):0.57 intrins:0.57 gcc:0.57 (.57S = 233.44ns/buf)
0x002000: asm(rm):0.48 asm(r):0.46 intrins:0.46 gcc:0.46 (.46S = 376.83ns/buf)
0x004000: asm(rm):0.49 asm(r):0.46 intrins:0.46 gcc:0.46 (.46S = 753.66ns/buf)
0x008000: asm(rm):0.49 asm(r):0.38 intrins:0.38 gcc:0.38 (.38S = 1245.18ns/buf)
0x01: asm(rm):0.47 asm(r):0.38 intrins:0.36 gcc:0.38 (.36S = 2359.30ns/buf)
0x02: asm(rm):0.43 asm(r):1.05 intrins:0.35 gcc:0.36 (.35S = 4587.52ns/buf)

asm(r) is a fix for clang's slowness with inline asms. Just change the constraint from "rm" to "r". This takes an extra register, but no more uops. This is for the aligned case with no hacks. intrins does something bad for small buffers. Probably just the branch over the dead unrolling. Twice 2.4ns/buf for 8-byte buffers is still very fast. 
This is 16 cycles: 3 cycles to do 1 crc32q and the rest mainly for 1 function call and too many branches. Bruce
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Tue, 31 Jan 2017, Conrad Meyer wrote: On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans wrote: On Tue, 31 Jan 2017, Bruce Evans wrote: Unrolling (or not) may be helpful or harmful for entry and exit code. Helpful, per my earlier benchmarks. I think there should be no alignment on entry -- just assume the buffer is aligned in the usual case, and only run 4% slower when it is misaligned. Please write such a patch and demonstrate the improvement. I now understand the algorithm. The division into 3 has to keep sequentiality (since CRC input is a bit string), so it doesn't work to separate the memory accesses of the 3 crc32's by 8 bytes as in my simple test -- they have to be separated by large amounts. Then recombining requires multiplying the polynomials associated with the CRCs of the higher 2 blocks by X^N and reducing again, where N is large and related to the block size. This is done using a large table for each N needed, and to keep things reasonably simple, only 2 N's are used. The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8 bytes or up to 3 times that like simpler algorithms. 768 is quite large, and the exit code is quite slow. It reduces 8 or 4 bytes at a time without any dependency reduction, and then 1 byte at a time. Yes, this is the important loop to unroll for small inputs. Somehow with the unrolling, it is only ~19% slower than the by-3 algorithm on my system -- not 66%. Not like clang does it. Unrolling is useless without the 3-way blocking. Clang 3.9.1 unrolls both of these trailing loops; here is the first: Maybe 3.9.1 pessimizes the 3-way loop by unrolling it. This would be fairly easy to do. Just replicate the 3 crc32q's a few times, say 8, and do them in a bad order (3 blocks of 8 dependent ones instead of 8 blocks of 3 independent ones). With enough replication, the code would be too large for the hardware to reorder. Inline asm has another advantage here -- volatile on it prevents reordering and might prevent unrolling. 
Maybe 3.9.1 unpessimizes the inline asms. But I suspect not getting the 3 times speedup is for another reason.

0x00401b88 <+584>: cmp    $0x38,%rbx
0x00401b8c <+588>: jae    0x401b93
0x00401b8e <+590>: mov    %rsi,%rdx
0x00401b91 <+593>: jmp    0x401be1
0x00401b93 <+595>: lea    -0x1(%rdi),%rbx
0x00401b97 <+599>: sub    %rdx,%rbx
0x00401b9a <+602>: mov    %rsi,%rdx
0x00401b9d <+605>: nopl   (%rax)
0x00401ba0 <+608>: crc32q (%rdx),%rax
0x00401ba6 <+614>: crc32q 0x8(%rdx),%rax
0x00401bad <+621>: crc32q 0x10(%rdx),%rax
0x00401bb4 <+628>: crc32q 0x18(%rdx),%rax
0x00401bbb <+635>: crc32q 0x20(%rdx),%rax
0x00401bc2 <+642>: crc32q 0x28(%rdx),%rax
0x00401bc9 <+649>: crc32q 0x30(%rdx),%rax
0x00401bd0 <+656>: crc32q 0x38(%rdx),%rax
0x00401bd7 <+663>: add    $0x40,%rdx
0x00401bdb <+667>: add    $0x8,%rbx
0x00401bdf <+671>: jne    0x401ba0

No, this unrolling is useless. The crc32q's are dependent on each other, so they take 3 cycles each. There are spare resources to run about 12 instructions during that time. Loop control only takes 3. I don't understand the algorithm for joining crcs -- why doesn't it work to reduce to 12 or 24 bytes in the main loop? It would, but I haven't implemented or tested that. You're welcome to do so and demonstrate an improvement. It does add more lookup table bloat, but perhaps we could just remove the 3x8k table -- I'm not sure it adds any benefit over the 3x256 table. Your benchmarks mainly give results for the <= 768 bytes where most of the manual optimizations don't apply. Actually, they test only the large buffer case. They used buffer size of 1M and 1k and didn't do the entry and exit code that usually dominates for small buffers. I re-tested with the correct blocking. This was about 10% slower (0.34 -> 0.37 seconds for 10GB), except for clang without intrinsics it was 20% slower (0.43 -> 0.51 seconds). 
0x000400: asm:68 intrins:62 multitable:684 (ns per buf) I don't see any signs of this in my test: - a single crc32q in a (C) loop doesn't benefit from unrolling or lose to the extra clang instructions without intrinsics. clang-3.9.0 unrolls this 8-way in the simpler environment of my test program, but this makes no difference. - similarly for a single crc32b in a loop, except when I forgot to change the type of the crc accumulator from uint64_t to uint32_t, gcc was 1 cycle slower in the loop (4 cycles instead of 3). gcc generates an extra instruction to zero-extend the crc, and this is more expensive than usual since it gives another dependency. clang optimizes this away. 0x000800: asm:132 intrins:133 (ns per buf) 0x002000: asm:449 intrins:446 (ns
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans wrote: > On Tue, 31 Jan 2017, Bruce Evans wrote: > Unrolling (or not) may be helpful or harmful for entry and exit code. Helpful, per my earlier benchmarks. > I > think there should be no alignment on entry -- just assume the buffer is > aligned in the usual case, and only run 4% slower when it is misaligned. Please write such a patch and demonstrate the improvement. > The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8 > bytes or up to 3 times that like simpler algorithms. 768 is quite > large, and the exit code is quite slow. It reduces 8 or 4 bytes at a > time without any dependency reduction, and then 1 byte at a time. Yes, this is the important loop to unroll for small inputs. Somehow with the unrolling, it is only ~19% slower than the by-3 algorithm on my system — not 66%. Clang 3.9.1 unrolls both of these trailing loops; here is the first:

0x00401b88 <+584>: cmp    $0x38,%rbx
0x00401b8c <+588>: jae    0x401b93
0x00401b8e <+590>: mov    %rsi,%rdx
0x00401b91 <+593>: jmp    0x401be1
0x00401b93 <+595>: lea    -0x1(%rdi),%rbx
0x00401b97 <+599>: sub    %rdx,%rbx
0x00401b9a <+602>: mov    %rsi,%rdx
0x00401b9d <+605>: nopl   (%rax)
0x00401ba0 <+608>: crc32q (%rdx),%rax
0x00401ba6 <+614>: crc32q 0x8(%rdx),%rax
0x00401bad <+621>: crc32q 0x10(%rdx),%rax
0x00401bb4 <+628>: crc32q 0x18(%rdx),%rax
0x00401bbb <+635>: crc32q 0x20(%rdx),%rax
0x00401bc2 <+642>: crc32q 0x28(%rdx),%rax
0x00401bc9 <+649>: crc32q 0x30(%rdx),%rax
0x00401bd0 <+656>: crc32q 0x38(%rdx),%rax
0x00401bd7 <+663>: add    $0x40,%rdx
0x00401bdb <+667>: add    $0x8,%rbx
0x00401bdf <+671>: jne    0x401ba0

> I > don't understand the algorithm for joining crcs -- why doesn't it work > to reduce to 12 or 24 bytes in the main loop? It would, but I haven't implemented or tested that. You're welcome to do so and demonstrate an improvement. It does add more lookup table bloat, but perhaps we could just remove the 3x8k table — I'm not sure it adds any benefit over the 3x256 table. 
> Your benchmarks mainly give results for the <= 768 bytes where most of > the manual optimizations don't apply.

0x000400: asm:68 intrins:62 multitable:684 (ns per buf)
0x000800: asm:132 intrins:133 (ns per buf)
0x002000: asm:449 intrins:446 (ns per buf)
0x008000: asm:1501 intrins:1497 (ns per buf)
0x02: asm:5618 intrins:5609 (ns per buf)

(All routines are in a separate compilation unit with no full-program optimization, as they are in the kernel.) > Compiler optimizations are more > likely to help there. So I looked more closely at the last 2 loops. > clang indeed only unrolls the last one, Not in 3.9.1. > only for the unreachable case > with more than 8 bytes on amd64. How is it unreachable? Best, Conrad 
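For anyone reproducing comparisons like the asm/intrins/multitable numbers above, a self-contained oracle helps: every variant must agree with the plain single-table byte-at-a-time CRC-32C. Here is a minimal sketch of that fallback style (the names are mine, not the libkern ones, and the table is built at run time instead of being a static array as in the kernel):

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-32C polynomial in reversed bit order, as in the committed code. */
#define POLY	0x82f63b78u

static uint32_t crc32c_table[256];

/* Build the 256-entry table used by the byte-at-a-time fallback. */
static void
crc32c_init(void)
{
	uint32_t crc;
	int n, k;

	for (n = 0; n < 256; n++) {
		crc = (uint32_t)n;
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (POLY & -(crc & 1));
		crc32c_table[n] = crc;
	}
}

/*
 * Byte-at-a-time CRC-32C: one table lookup per byte.  This is the kind
 * of trailing loop the thread discusses unrolling, and the slow path
 * the SSE4.2 implementations must agree with bit for bit.
 */
static uint32_t
crc32c_1byte(uint32_t crc, const unsigned char *buf, size_t len)
{
	size_t i;

	crc = ~crc;
	for (i = 0; i < len; i++)
		crc = (crc >> 8) ^ crc32c_table[(crc ^ buf[i]) & 0xff];
	return (~crc);
}
```

The standard check value makes a quick sanity test: CRC-32C of the ASCII string "123456789" is 0xe3069283, so any variant that disagrees on that input is broken before benchmarking even starts.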
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Tue, 31 Jan 2017, Bruce Evans wrote: On Mon, 30 Jan 2017, Conrad Meyer wrote: On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans wrote: On Tue, 31 Jan 2017, Conrad E. Meyer wrote: Log: calculate_crc32c: Add SSE4.2 implementation on x86 This breaks building with gcc-4.2.1, gcc-4.2.1 is an ancient compiler. Good riddance. I prefer it. It also works better on ordinary asms for crc32. Added: head/sys/libkern/x86/crc32_sse42.c == --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ head/sys/libkern/x86/crc32_sse42.c Tue Jan 31 03:26:32 2017 (r313006) + +#include ... Inline asm is much less unportable than intrinsics. kib used the correct method of .byte's in asms to avoid depending on assembler support for newer instructions. .byte is still used for clflush on amd64 and i386. It used to be used for invpcid on amd64. I can't find where it is or was used for xsave stuff. Konstantin predicted this complaint in code review (phabricator). Unfortunately, Clang does not automatically unroll asms, even with the correct mnemonic. Unrolling is essential to performance below the by-3 block size (768 bytes in this implementation). Hand unrolling in C seems to generate less efficient assembly than the compiler's unrolling. Unrolling is almost completely useless if the instruction takes 3 cycles like you said it does. The loop calculations run much faster than that, so they can run in parallel. However, clang generates horrible code for inline asms instead of intrinsics. The unrolling was worse than useless (see previous reply). It is the horrible code in the non-intrinsics case that seems to be the main source of differences (see below). The left column below is block size. The measurements are nanoseconds per buf, per CLOCK_VIRTUAL, averaged over 10^5 loops. These numbers do not vary more than +/- 1ns run to run on my idle Sandy Bridge laptop. 
"asm" is using __asm__(), "intrins" using the _mm_crc32 intrinsics that Clang can unroll, and multitable is the older lookup-table implementation (still used on other architectures). 0x10: asm:0 intrins:0 multitable:0 (ns per buf) 0x20: asm:7 intrins:9 multitable:78 (ns per buf) 0x40: asm:10 intrins:7 multitable:50 (ns per buf) 0x80: asm:15 intrins:9 multitable:91 (ns per buf) 0x000100: asm:25 intrins:17 multitable:178 (ns per buf) 0x000200: asm:55 intrins:38 multitable:347 (ns per buf) 0x000400: asm:61 intrins:62 multitable:684 (ns per buf) Both implementations are superior to the multitable approach, so it is unreasonable not to make one of them standard on x86 platforms. The unrolled intrinsics are consistently better than not unrolled on objects 0x40-0x200 bytes large. At 0x400 bytes we pass the first unroll-by-3 threshold and it stops mattering as much. At 0x40 bytes, it is the difference between 6.4 GB/s and 9.1 GB/s. At 0x200 bytes, it is the difference between 9.3 GB/s and 13.5 GB/s. I think this justifies some minor ugliness. If it matters, which I doubt, then the compiler shouldn't be trusted for unrolling. It can only do better than handwritten unrolling if it has tables for the correct amount unrolling for every instruction for thousands of CPUs (since such tables are too hard to write manually). It is the 3-way unrolling that makes the main difference. clang is clueless about unrolling in this function, but this 3-way unrolling is in the C code. It is to reduce dependencies. The timing seems to be as I suspected -- the crc32 instruction has a latency of 3 and a throughput of 3 per 3 cycles, so with 3 dependent instructions it runs 3 times slower than with 3 independent instructions. Naive unrolling gives more dependent instructions, so is equally slow to no unrolling. Test program for this: X #include X #include X X #ifndef __clang__ Change this to __clang__X to kill intrinsics for clang. 
X static __inline uint32_t
X _mm_crc32_u8(uint32_t x, uint8_t y)
X {
X 	__asm("crc32b %1,%0" : "+r" (x) : "rm" (y));
X 	return (x);
X }
X
X static __inline uint32_t __unused
X _mm_crc32_u32(uint32_t x, int32_t y)
X {
X 	__asm("crc32l %1,%0" : "+r" (x) : "rm" (y));
X 	return (x);
X }
X
X #ifdef __amd64__
X static __inline uint64_t
X _mm_crc32_u64(uint64_t x, int64_t y)
X {
X 	__asm("crc32q %1,%0" : "+r" (x) : "rm" (y));
X 	return (x);
X }
X #endif
X #else
X #include
X #endif
X
X #define MISALIGN 0

MISALIGN 1 costs about 4% (0.36 seconds instead of 0.34 seconds on freefall's Xeon).

X #define SIZE 100

The large size of 1M is to bust at least the L1 cache. This makes little difference, since the speed is CPU-bound (0.34 seconds @ 3.3 GHz = ~34/33 cycles per crc32 instruction). Actually, I'm not sure about the speeds. freefall's machdep.tsc_freq is 3.3GHz, but the speed could be anything due to turbo mode and frequency control
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Mon, 30 Jan 2017, Conrad Meyer wrote: On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans wrote: On Tue, 31 Jan 2017, Conrad E. Meyer wrote: Log: calculate_crc32c: Add SSE4.2 implementation on x86 This breaks building with gcc-4.2.1, gcc-4.2.1 is an ancient compiler. Good riddance. I prefer it. Added: head/sys/libkern/x86/crc32_sse42.c == --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ head/sys/libkern/x86/crc32_sse42.c Tue Jan 31 03:26:32 2017 (r313006) + +#include ... Inline asm is much less unportable than intrinsics. kib used the correct method of .byte's in asms to avoid depending on assembler support for newer instructions. .byte is still used for clflush on amd64 and i386. It used to be used for invpcid on amd64. I can't find where it is or was used for xsave stuff. Konstantin predicted this complaint in code review (phabricator). Unfortunately, Clang does not automatically unroll asms, even with the correct mnemonic. Unrolling is essential to performance below the by-3 block size (768 bytes in this implementation). Hand unrolling in C seems to generate less efficient assembly than the compiler's unrolling. Unrolling is almost completely useless if the instruction takes 3 cycles like you said it does. The loop calculations run much faster than that, so they can run in parallel. However, clang generates horrible code for inline asms instead of intrinsics. 
Simple asms can be compiled even by gcc (apparently old binutils supports SSE4):

X Index: crc32_sse42.c
X ===
X --- crc32_sse42.c	(revision 313007)
X +++ crc32_sse42.c	(working copy)
X @@ -38,8 +38,29 @@
X  #include
X  #endif
X
X -#include
X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X +	__asm("crc32b %1,%0" : "+r" (x) : "rm" (y));
X +	return (x);
X +}
X
X +static __inline uint32_t __unused
X +_mm_crc32_u32(uint32_t x, int32_t y)
X +{
X +	__asm("crc32l %1,%0" : "+r" (x) : "rm" (y));
X +	return (x);
X +}
X +
X +#ifdef __amd64__
X +static __inline uint64_t
X +_mm_crc32_u64(uint64_t x, int64_t y)
X +{
X +	__asm("crc32q %1,%0" : "+r" (x) : "rm" (y));
X +	return (x);
X +}
X +#endif
X +
X  /* CRC-32C (iSCSI) polynomial in reversed bit order. */
X  #define POLY	0x82f63b78

I only tested this at compile time. clang generates horrible code. Starting with y in an array in memory, it always copies y to the stack and does a memory access there. There is nothing special about these inline asms, so I think clang does the same pessimizations for most asms in cpufunc.h. jkim got annoyed by this for something like rdtsc() and changed too much to reduce the problem for just one case. The differences in the generated code look like: intrinsics (old <) vs inline asm (new >): 28,29c29,33 This Inner Loop Header: Depth=1
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Hi Bruce, On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans wrote: > On Tue, 31 Jan 2017, Conrad E. Meyer wrote: > >> Log: >> calculate_crc32c: Add SSE4.2 implementation on x86 > > > This breaks building with gcc-4.2.1, gcc-4.2.1 is an ancient compiler. Good riddance. >> Added: head/sys/libkern/x86/crc32_sse42.c >> >> == >> --- /dev/null 00:00:00 1970 (empty, because file is newly added) >> +++ head/sys/libkern/x86/crc32_sse42.c Tue Jan 31 03:26:32 2017 >> (r313006) >> + >> +#include > > ... > > Inline asm is much less unportable than intrinsics. kib used the correct > method of .byte's in asms to avoid depending on assembler support for newer > instructions. .byte is still used for clflush on amd64 and i386. It > used to be used for invpcid on amd64. I can't find where it is or was > used for xsave stuff. Konstantin predicted this complaint in code review (phabricator). Unfortunately, Clang does not automatically unroll asms, even with the correct mnemonic. Unrolling is essential to performance below the by-3 block size (768 bytes in this implementation). Hand unrolling in C seems to generate less efficient assembly than the compiler's unrolling. The left column below is block size. The measurements are nanoseconds per buf, per CLOCK_VIRTUAL, averaged over 10^5 loops. These numbers do not vary more than +/- 1ns run to run on my idle Sandy Bridge laptop. "asm" is using __asm__(), "intrins" using the _mm_crc32 intrinsics that Clang can unroll, and multitable is the older lookup-table implementation (still used on other architectures). 
0x10: asm:0 intrins:0 multitable:0 (ns per buf)
0x20: asm:7 intrins:9 multitable:78 (ns per buf)
0x40: asm:10 intrins:7 multitable:50 (ns per buf)
0x80: asm:15 intrins:9 multitable:91 (ns per buf)
0x000100: asm:25 intrins:17 multitable:178 (ns per buf)
0x000200: asm:55 intrins:38 multitable:347 (ns per buf)
0x000400: asm:61 intrins:62 multitable:684 (ns per buf)

Both implementations are superior to the multitable approach, so it is unreasonable not to make one of them standard on x86 platforms. The unrolled intrinsics are consistently better than not unrolled on objects 0x40-0x200 bytes large. At 0x400 bytes we pass the first unroll-by-3 threshold and it stops mattering as much. At 0x40 bytes, it is the difference between 6.4 GB/s and 9.1 GB/s. At 0x200 bytes, it is the difference between 9.3 GB/s and 13.5 GB/s. I think this justifies some minor ugliness. Best, Conrad 
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Tue, 31 Jan 2017, Conrad E. Meyer wrote: Log: calculate_crc32c: Add SSE4.2 implementation on x86 This breaks building with gcc-4.2.1, and depends on using non-kernel clang headers for clang.

Modified: head/sys/conf/files.amd64
==
--- head/sys/conf/files.amd64	Tue Jan 31 01:55:29 2017	(r313005)
+++ head/sys/conf/files.amd64	Tue Jan 31 03:26:32 2017	(r313006)
@@ -593,6 +593,11 @@
 compat/ndis/subr_pe.c		optional ndisapi
 compat/ndis/subr_usbd.c	optional ndisapi pci
 compat/ndis/winx64_wrap.S	optional ndisapi pci
 #
+crc32_sse42.o	standard	\

I don't want it, but it is standard.

+	dependency	"$S/libkern/x86/crc32_sse42.c"	\
+	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \

-msse4 is not supported by gcc-4.2.1. Removing nostdinc pollutes the build with host headers, and the one needed might not be installed, and it doesn't exist for gcc-4.2.1. Similarly for i386.

Modified: head/sys/libkern/crc32.c
==
--- head/sys/libkern/crc32.c	Tue Jan 31 01:55:29 2017	(r313005)
+++ head/sys/libkern/crc32.c	Tue Jan 31 03:26:32 2017	(r313006)
@@ -46,8 +46,14 @@ __FBSDID("$FreeBSD$");
 #include
+#include

Style bug. libkern.h is part of systm.h.

 #include

Ordering bug. systm.h is a prerequisite for all kernel headers except param.h, since it defines macros which might be used by other headers.

Added: head/sys/libkern/x86/crc32_sse42.c
==
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/sys/libkern/x86/crc32_sse42.c	Tue Jan 31 03:26:32 2017	(r313006)
@@ -0,0 +1,288 @@
...
+#ifdef USERSPACE_TESTING
+#include
+#else
+#include
+#include
+#include
+#include
+#endif

Style and ordering bugs, as above.

+
+#include

This header is outside of the kernel source tree. 
It is not even in /usr/include, but clang finds it in:

crc32_sse42.o: /usr/bin/../lib/clang/3.9.0/include/nmmintrin.h \
 /usr/bin/../lib/clang/3.9.0/include/smmintrin.h \
 /usr/bin/../lib/clang/3.9.0/include/tmmintrin.h \
 /usr/bin/../lib/clang/3.9.0/include/pmmintrin.h \
 /usr/bin/../lib/clang/3.9.0/include/emmintrin.h \
 /usr/bin/../lib/clang/3.9.0/include/xmmintrin.h \
 /usr/bin/../lib/clang/3.9.0/include/mmintrin.h \
 /usr/bin/../lib/clang/3.9.0/include/f16cintrin.h \
 /usr/bin/../lib/clang/3.9.0/include/popcntintrin.h

nmmintrin.h doesn't exist for gcc-4.2.1. gcc-4.2.1 has some of the other intrin.h files, but they aren't installed in FreeBSD-9, and of course they don't support newer SSE. Inline asm is much less unportable than intrinsics. kib used the correct method of .byte's in asms to avoid depending on assembler support for newer instructions. .byte is still used for clflush on amd64 and i386. It used to be used for invpcid on amd64. I can't find where it is or was used for xsave stuff. Bruce 
Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
On Mon, Jan 30, 2017 at 7:26 PM, Conrad E. Meyer wrote: > (The CRC instruction takes 1 cycle but has 2-3 cycles of latency.) My mistake, it's not 2 anywhere. It's just 3 cycles on all workstation/server CPUs since Nehalem. Different on Atom chips and AMD. Best, Conrad 
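That 3-cycle latency with 1-per-cycle throughput is exactly why the committed loop runs three CRC lanes at once: three updates with no dependencies on each other keep one crc32q in flight every cycle. The shape of the trick can be sketched in portable C, with a software bit-loop standing in for the crc32q instruction and three separate buffers instead of thirds of one buffer (so this demo needs no combine step; names are mine, not the kernel's):

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-32C polynomial in reversed bit order. */
#define POLY	0x82f63b78u

/* One byte of CRC-32C on the raw (non-finalized) register; in the kernel
 * loop a single crc32q instruction plays this role for 8 bytes. */
static uint32_t
crc32c_step(uint32_t crc, unsigned char b)
{
	int k;

	crc ^= b;
	for (k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (POLY & -(crc & 1));
	return (crc);
}

/* Serial CRC-32C over one buffer: one long dependency chain. */
static uint32_t
crc32c_serial(const unsigned char *buf, size_t len)
{
	uint32_t crc = 0xffffffffu;
	size_t i;

	for (i = 0; i < len; i++)
		crc = crc32c_step(crc, buf[i]);
	return (~crc);
}

/*
 * Advance three independent CRC streams in lockstep.  The three updates
 * in the loop body do not depend on each other, so hardware with a
 * 3-cycle crc32q latency can overlap them -- the point of the by-3
 * blocking.  (The real code splits one buffer into thirds and then
 * recombines the three CRCs; independence of the updates is the same.)
 */
static void
crc32c_3lanes(const unsigned char *bufs[3], size_t len, uint32_t out[3])
{
	uint32_t c0 = 0xffffffffu, c1 = 0xffffffffu, c2 = 0xffffffffu;
	size_t i;

	for (i = 0; i < len; i++) {
		c0 = crc32c_step(c0, bufs[0][i]);
		c1 = crc32c_step(c1, bufs[1][i]);
		c2 = crc32c_step(c2, bufs[2][i]);
	}
	out[0] = ~c0;
	out[1] = ~c1;
	out[2] = ~c2;
}
```

Interleaving changes nothing about the results, only the dependency structure: each lane computes exactly what the serial loop would, which is easy to check against the serial function.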
svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Author: cem
Date: Tue Jan 31 03:26:32 2017
New Revision: 313006
URL: https://svnweb.freebsd.org/changeset/base/313006

Log:
  calculate_crc32c: Add SSE4.2 implementation on x86

  Derived from an implementation by Mark Adler. The fast loop performs three simultaneous CRCs over subsets of the data before composing them. This takes advantage of certain properties of the CRC32 implementation in Intel hardware. (The CRC instruction takes 1 cycle but has 2-3 cycles of latency.) The CRC32 instruction does not manipulate FPU state.

  i386 does not have the crc32q instruction, so avoid it there. Otherwise the implementation is identical to amd64.

  Add basic userland tests to verify correctness on a variety of inputs.

  PR: 216467
  Reported by: Ben RUBSON
  Reviewed by: kib@, markj@ (earlier version)
  Sponsored by: Dell EMC Isilon
  Differential Revision: https://reviews.freebsd.org/D9342

Added:
  head/sys/libkern/x86/
  head/sys/libkern/x86/crc32_sse42.c (contents, props changed)
  head/tests/sys/kern/libkern_crc32.c (contents, props changed)
Modified:
  head/sys/conf/files.amd64
  head/sys/conf/files.i386
  head/sys/libkern/crc32.c
  head/sys/sys/libkern.h
  head/tests/sys/kern/Makefile

Modified: head/sys/conf/files.amd64
==
--- head/sys/conf/files.amd64	Tue Jan 31 01:55:29 2017	(r313005)
+++ head/sys/conf/files.amd64	Tue Jan 31 03:26:32 2017	(r313006)
@@ -593,6 +593,11 @@
 compat/ndis/subr_pe.c		optional ndisapi
 compat/ndis/subr_usbd.c	optional ndisapi pci
 compat/ndis/winx64_wrap.S	optional ndisapi pci
 #
+crc32_sse42.o	standard	\
+	dependency	"$S/libkern/x86/crc32_sse42.c"	\
+	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
+	no-implicit-rule	\
+	clean	"crc32_sse42.o"
 libkern/memmove.c	standard
 libkern/memset.c	standard
 #

Modified: head/sys/conf/files.i386
==
--- head/sys/conf/files.i386	Tue Jan 31 01:55:29 2017	(r313005)
+++ head/sys/conf/files.i386	Tue Jan 31 03:26:32 2017	(r313006)
@@ -554,6 +554,11 @@
 kern/kern_clocksource.c	standard
 kern/imgact_aout.c	optional compat_aout 
 kern/imgact_gzip.c	optional gzip
 kern/subr_sfbuf.c	standard
+crc32_sse42.o	standard	\
+	dependency	"$S/libkern/x86/crc32_sse42.c"	\
+	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
+	no-implicit-rule	\
+	clean	"crc32_sse42.o"
 libkern/divdi3.c	standard
 libkern/ffsll.c	standard
 libkern/flsll.c	standard

Modified: head/sys/libkern/crc32.c
==
--- head/sys/libkern/crc32.c	Tue Jan 31 01:55:29 2017	(r313005)
+++ head/sys/libkern/crc32.c	Tue Jan 31 03:26:32 2017	(r313006)
@@ -46,8 +46,14 @@ __FBSDID("$FreeBSD$");
 #include
+#include
 #include
+#if defined(__amd64__) || defined(__i386__)
+#include
+#include
+#endif
+
 const uint32_t crc32_tab[] = {
 	0x, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
 	0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
@@ -749,6 +755,11 @@
 calculate_crc32c(uint32_t crc32c, const unsigned char *buffer, unsigned int length)
 {
+#if defined(__amd64__) || defined(__i386__)
+	if ((cpu_feature2 & CPUID2_SSE42) != 0) {
+		return (sse42_crc32c(crc32c, buffer, length));
+	} else
+#endif
 	if (length < 4) {
 		return (singletable_crc32c(crc32c, buffer, length));
 	} else {

Added: head/sys/libkern/x86/crc32_sse42.c
==
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/sys/libkern/x86/crc32_sse42.c	Tue Jan 31 03:26:32 2017	(r313006)
@@ -0,0 +1,288 @@
+/*
+ * Derived from crc32c.c version 1.1 by Mark Adler.
+ *
+ * Copyright (C) 2013 Mark Adler
+ *
+ * This software is provided 'as-is', without any express or implied warranty.
+ * In no event will the author be held liable for any damages arising from the
+ * use of this software.
+ *
+ * Permission is granted to anyone to use this software for any purpose,
+ * including commercial applications, and to alter it and redistribute it
+ * freely, subject to the following restrictions:
+ *
+ * 1. The origin of this software must not be