On Tue, 31 Jan 2017, Conrad Meyer wrote:

On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <b...@optusnet.com.au> wrote:
On Tue, 31 Jan 2017, Bruce Evans wrote:
Unrolling (or not) may be helpful or harmful for entry and exit code.

Helpful, per my earlier benchmarks.

I think there should be no alignment on entry -- just assume the buffer is
aligned in the usual case, and only run 4% slower when it is misaligned.

Please write such a patch and demonstrate the improvement.

I now understand the algorithm.

The division into 3 has to keep sequentiality (since CRC input is a bit
string), so it doesn't work to separate the memory accesses of the 3 crc32's
by 8 bytes as in my simple test -- they have to be separated by large
amounts.  Then recombining requires multiplying the polynomials associated
with the CRCs of the higher 2 blocks by X^N and reducing again, where N is
large and related to the block size.  This is done using a large table for
each N needed, and to keep things reasonably simple, only 2 N's are used.
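
To make the recombination concrete, here is a minimal sketch (not the
kernel code) that exploits the linearity of the crc32 instruction:
shifting a CRC state through N zero bytes is exactly the
multiply-by-X^(8N)-and-reduce step, so the table lookup can be modelled
by feeding zeros.  crc32c_shift_zeros() and crc32c_combine() are
hypothetical helpers; the real code replaces the zero-feeding loop with
table lookups:

#include <nmmintrin.h>		/* SSE4.2 crc32 intrinsics; -msse4.2 */
#include <stddef.h>
#include <stdint.h>

/*
 * Advance a CRC state through len zero bytes, i.e., multiply the
 * state's polynomial by X^(8*len) and reduce mod the CRC32C poly.
 * The production code does this with a precomputed table per
 * supported len; this loop only illustrates the algebra.
 */
static uint32_t
crc32c_shift_zeros(uint32_t crc, size_t len)
{
	uint64_t c = crc;

	for (; len >= 8; len -= 8)
		c = _mm_crc32_u64(c, 0);
	for (; len > 0; len--)
		c = _mm_crc32_u8((uint32_t)c, 0);
	return ((uint32_t)c);
}

/*
 * If crc0 is the state after block A and crc1 is the CRC of block B
 * computed from initial state 0, then by linearity the state after
 * A|B is shift(crc0, len(B)) ^ crc1.
 */
static uint32_t
crc32c_combine(uint32_t crc0, uint32_t crc1, size_t lenB)
{
	return (crc32c_shift_zeros(crc0, lenB) ^ crc1);
}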

The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
bytes or up to 3 times that like simpler algorithms.  768 is quite
large, and the exit code is quite slow.  It reduces 8 or 4 bytes at a
time without any dependency reduction, and then 1 byte at a time.
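
The shape of that exit code is roughly the following (a sketch using
the same headers as above plus <string.h>, not the kernel source).
Note that every step depends on the previous one, so each crc32q costs
its full latency:

#include <string.h>		/* memcpy */

/* Dependent tail reduction: 8 bytes at a time, then 1 at a time. */
static uint32_t
crc32c_tail(uint32_t crc, const unsigned char *p, size_t len)
{
	uint64_t c = crc;
	uint64_t word;

	for (; len >= 8; len -= 8, p += 8) {
		memcpy(&word, p, 8);		/* unaligned load */
		c = _mm_crc32_u64(c, word);	/* serial: 3-cycle latency */
	}
	for (; len > 0; len--, p++)
		c = _mm_crc32_u8((uint32_t)c, *p);
	return ((uint32_t)c);
}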

Yes, this is the important loop to unroll for small inputs.  Somehow

Not like clang does it.  Unrolling is useless without the 3-way blocking.

with the unrolling, it is only ~19% slower than the by-3 algorithm on
my system -- not 66%.  Clang 3.9.1 unrolls both of these trailing
loops; here is the first:

Maybe 3.9.1 pessimizes the 3-way loop by unrolling it.  This would be
fairly easy to do.  Just replicate the 3 crc32q's a few times, say 8,
and do them in a bad order (3 blocks of 8 dependent ones instead of
8 blocks of 3 independent ones).  With enough replication, the code
would be too large for the hardware to reorder.  Inline asm has
another advantage here -- volatile on it prevents reordering and
might prevent unrolling.
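
For example, a hypothetical asm wrapper along these lines (not the
committed one) gives the compiler much less freedom than the intrinsic:

/*
 * One crc32q step as inline asm.  The volatile qualifier keeps the
 * compiler from reordering or duplicating the instruction; whether
 * it also defeats unrolling is compiler-dependent.
 */
static inline uint64_t
crc32q_asm(uint64_t crc, uint64_t word)
{
	__asm__ __volatile__("crc32q %1, %0"
	    : "+r" (crc)
	    : "rm" (word));
	return (crc);
}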

Maybe 3.9.1 unpessimizes the inline asms.

But I suspect not getting the 3 times speedup is for another reason.



  0x0000000000401b88 <+584>:   cmp    $0x38,%rbx
  0x0000000000401b8c <+588>:   jae    0x401b93 <sse42_crc32c+595>
  0x0000000000401b8e <+590>:   mov    %rsi,%rdx
  0x0000000000401b91 <+593>:   jmp    0x401be1 <sse42_crc32c+673>
  0x0000000000401b93 <+595>:   lea    -0x1(%rdi),%rbx
  0x0000000000401b97 <+599>:   sub    %rdx,%rbx
  0x0000000000401b9a <+602>:   mov    %rsi,%rdx
  0x0000000000401b9d <+605>:   nopl   (%rax)
  0x0000000000401ba0 <+608>:   crc32q (%rdx),%rax
  0x0000000000401ba6 <+614>:   crc32q 0x8(%rdx),%rax
  0x0000000000401bad <+621>:   crc32q 0x10(%rdx),%rax
  0x0000000000401bb4 <+628>:   crc32q 0x18(%rdx),%rax
  0x0000000000401bbb <+635>:   crc32q 0x20(%rdx),%rax
  0x0000000000401bc2 <+642>:   crc32q 0x28(%rdx),%rax
  0x0000000000401bc9 <+649>:   crc32q 0x30(%rdx),%rax
  0x0000000000401bd0 <+656>:   crc32q 0x38(%rdx),%rax
  0x0000000000401bd7 <+663>:   add    $0x40,%rdx
  0x0000000000401bdb <+667>:   add    $0x8,%rbx
  0x0000000000401bdf <+671>:   jne    0x401ba0 <sse42_crc32c+608>

No, this unrolling is useless.  The crc32q's are dependent on each other,
so they take 3 cycles each.  There are spare resources to run about 12
instructions during that time.  Loop control only takes 3.
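
Compare the useful shape of the unrolling: 3 independent chains per
iteration, so the 3-cycle latencies overlap.  A sketch, assuming an
8-byte-aligned buffer, where LONGBLK is a stand-in for the real block
size and crc32c_combine() is the helper sketched earlier:

#define	LONGBLK	8192	/* assumed block size, not the real LONG */

static uint32_t
crc32c_3way_block(uint32_t crc, const unsigned char *buf)
{
	const uint64_t *p0 = (const uint64_t *)buf;
	const uint64_t *p1 = (const uint64_t *)(buf + LONGBLK);
	const uint64_t *p2 = (const uint64_t *)(buf + 2 * LONGBLK);
	uint64_t c0 = crc, c1 = 0, c2 = 0;
	size_t i;

	for (i = 0; i < LONGBLK / 8; i++) {
		/* 3 independent chains: ~1 crc32q retired per cycle. */
		c0 = _mm_crc32_u64(c0, p0[i]);
		c1 = _mm_crc32_u64(c1, p1[i]);
		c2 = _mm_crc32_u64(c2, p2[i]);
	}
	/* Recombine sequentially, as sketched earlier. */
	crc = crc32c_combine((uint32_t)c0, (uint32_t)c1, LONGBLK);
	crc = crc32c_combine(crc, (uint32_t)c2, LONGBLK);
	return (crc);
}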

I don't understand the algorithm for joining CRCs -- why doesn't it work
to reduce to 12 or 24 bytes in the main loop?

It would, but I haven't implemented or tested that.  You're welcome to
do so and demonstrate an improvement.  It does add more lookup table
bloat, but perhaps we could just remove the 3x8k table -- I'm not sure
it adds any benefit over the 3x256 table.

Your benchmarks mainly give results for the <= 768 bytes where most of
the manual optimizations don't apply.

Actually, they test only the large buffer case.  They used buffer sizes
of 1M and 1k and didn't do the entry and exit code that usually dominates
for small buffers.

I re-tested with the correct blocking.  This was about 10% slower
(0.34 -> 0.37 seconds for 10GB), except for clang without intrinsics it
was 20% slower (0.43 -> 0.51 seconds).

0x000400: asm:68 intrins:62 multitable:684  (ns per buf)

I don't see any signs of this in my test:
- a single crc32q in a (C) loop doesn't benefit from unrolling or lose
  to the extra clang instructions without intrinsics.  clang-3.9.0
  unrolls this 8-way in the simpler environment of my test program,
  but this makes no difference.
- similarly for a single crc32b in a loop, except when I forgot to
  change the type of the crc accumulator from uint64_t to uint32_t,
  gcc was 1 cycle slower in the loop (4 instead of 3).  gcc generates
  an extra instruction to zero-extend the crc, and this is more
  expensive than usual since it gives another dependency.
  clang optimizes this away.
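
The fixed version of that loop is just the following; the point is the
uint32_t state, since with a uint64_t accumulator gcc re-extends the
32-bit result on the dependent chain every iteration:

static uint32_t
crc32c_bytewise(uint32_t crc, const unsigned char *p, size_t len)
{
	/* uint32_t state: no zero-extension on the dependency chain. */
	while (len-- > 0)
		crc = _mm_crc32_u8(crc, *p++);
	return (crc);
}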

0x000800: asm:132 intrins:133  (ns per buf)
0x002000: asm:449 intrins:446  (ns per buf)
0x008000: asm:1501 intrins:1497  (ns per buf)
0x020000: asm:5618 intrins:5609  (ns per buf)

Now it is mostly in the 3-way optimized case and the differences are
in the noise.

(All routines are in a separate compilation unit with no full-program
optimization, as they are in the kernel.)

Compiler optimizations are more
likely to help there.  So I looked more closely at the last 2 loops.
clang indeed only unrolls the last one,

Not in 3.9.1.

3.9.1 seems to only extend the useless unrolling.

only for the unreachable case
with more than 8 bytes on amd64.

How is it unreachable?

Because the loop doing 8-byte words at a time reduces the count below 8.

Bruce