Is zlib's crc32() function the C implementation you're comparing
against? From a quick look at the zlib source, it does a fair bit of
manual loop unrolling and also processes the input 4 bytes at a time
rather than one byte per step. Given those differences, the speed gap
might not be so surprising.
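
To make the "4 bytes at a time" point concrete, here is a rough sketch of the general table-slicing idea next to the plain byte-at-a-time loop. It is not zlib's code and not anything from CRC.jl; the names (make_tables, crc32_bytewise, crc32_slice4) are made up for illustration, and it is in Python only to show the structure, not the speed. The one assumption is the standard reflected CRC-32 polynomial 0xEDB88320.

```python
import zlib  # used only to cross-check the sketch against a reference

POLY = 0xEDB88320  # reflected CRC-32 polynomial (assumed; the usual zlib one)


def make_tables(n=4):
    """Build n lookup tables; tables[0] is the classic byte-wise table."""
    t0 = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ POLY if c & 1 else c >> 1
        t0.append(c)
    tables = [t0]
    for k in range(1, n):
        prev = tables[k - 1]
        # tables[k][i] is tables[k-1][i] advanced by one extra zero byte
        tables.append([(prev[i] >> 8) ^ t0[prev[i] & 0xFF] for i in range(256)])
    return tables


TABLES = make_tables()


def crc32_bytewise(data, crc=0):
    """One table lookup per input byte."""
    t0 = TABLES[0]
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ t0[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF


def crc32_slice4(data, crc=0):
    """Four table lookups per 32-bit word, then a byte-wise tail."""
    t0, t1, t2, t3 = TABLES
    crc ^= 0xFFFFFFFF
    n4 = len(data) - len(data) % 4
    for i in range(0, n4, 4):
        # fold a little-endian 32-bit word into the running CRC
        crc ^= int.from_bytes(data[i:i + 4], "little")
        crc = (t3[crc & 0xFF] ^ t2[(crc >> 8) & 0xFF]
               ^ t1[(crc >> 16) & 0xFF] ^ t0[crc >> 24])
    for b in data[n4:]:  # leftover 0-3 bytes
        crc = (crc >> 8) ^ t0[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    msg = b"The quick brown fox jumps over the lazy dog"
    print(crc32_bytewise(msg) == crc32_slice4(msg) == zlib.crc32(msg))
```

In a compiled language the word-wise version has a quarter as many loop iterations and its four lookups are independent of each other, which is presumably where much of the gain comes from.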
On Thu,
The fastest routine at
https://github.com/andrewcooke/CRC.jl/blob/master/test/speed.jl is 2.6x
slower than C code.
I've tried to isolate things so it's easy to hack and experiment with. If
anyone can beat my best code (which - credit to Julia - is also the
simplest; anything I try to make