Re: [xz-devel] java crc64 implementation
This had /way/ more impact than I expected on overall decompression performance. Here are the baseline numbers for 1.8 (jdk 11 64bit): Benchmark (file) Mode Cnt Score Error Units XZDecompressionBenchmark.decompress ihe_ovly_pr.dcm avgt3 0.731 ± 0.010 ms/op XZDecompressionBenchmark.decompress image1.dcm avgt3 445.394 ± 7.052 ms/op XZDecompressionBenchmark.decompresslarge.xml avgt3 283.013 ± 5.004 ms/op XZ2DecompressionBenchmark.decompress 1 byte rep avgt3 587.127 ± 7.254 ms/op Here are numbers with current trunk, which is basically just the crc64 change: Benchmark (file) Mode Cnt ScoreError Units XZDecompressionBenchmark.decompress ihe_ovly_pr.dcm avgt3 0.662 ± 0.012 ms/op XZDecompressionBenchmark.decompress image1.dcm avgt3 391.644 ± 13.871 ms/op XZDecompressionBenchmark.decompresslarge.xml avgt3 225.456 ± 16.265 ms/op XZ2DecompressionBenchmark.decompress 1 byte rep avgt3 387.347 ± 18.811 ms/op And the LZDecoder change gets it down to: Benchmark (file) Mode Cnt ScoreError Units XZDecompressionBenchmark.decompress ihe_ovly_pr.dcm avgt3 0.607 ± 0.187 ms/op XZDecompressionBenchmark.decompress image1.dcm avgt3 369.419 ± 32.400 ms/op XZDecompressionBenchmark.decompresslarge.xml avgt3 190.580 ± 7.856 ms/op XZ2DecompressionBenchmark.decompress 1 byte rep avgt3 220.554 ± 8.812 ms/op Brett
Re: [xz-devel] java crc64 implementation
On 2021-02-05 Brett Okken wrote: > On Fri, Feb 5, 2021 at 11:07 AM Lasse Collin > wrote: > > Also, does it really help to unroll the loop? With 8191-byte > > buffers I see no significant difference (in a quick > > not-very-accurate test) if the switch-statement is replaced with a > > while-loop. > > The differences are pretty minimal. My observation was switch a bit > faster than for loop, which was a bit faster than a while loop. But > the differences in averages were less than the confidence interval for > the given tests. OK, smaller code wins then. > > With these two changes the code becomes functionally identical to > > the version I posted with the name "Modified slicing-by-4". Is that > > an OK version to commit? > > Yes. OK. > > Is the following fine to you as the file header? Your email address > > can be omitted if you prefer that. I will mention in the commit > > message that you adapted the code from XZ Utils and benchmarked it. > > > > /* > > * CRC64 > > * > > * Authors: Brett Okken > > * Lasse Collin > > * > > * This file has been put into the public domain. > > * You can do whatever you want with this file. > > */ > > That is fine. You can include my e-mail. OK. :-) I have committed it. Thank you! The LZDecoder changes I may still look at before the next release. Then I will go back to the XZ Utils code. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java crc64 implementation
On Fri, Feb 5, 2021 at 11:07 AM Lasse Collin wrote: > > On 2021-02-02 Brett Okken wrote: > > Thus far I have only tested on jdk 11 64bit windows, but the fairly > > clear winner is: > > > > public void update(byte[] buf, int off, int len) { > > final int end = off + len; > > int i=off; > > if (len > 3) { > > switch (i & 3) { > > case 3: > > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ > > (crc >>> 8); > > case 2: > > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ > > (crc >>> 8); > > case 1: > > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ > > (crc >>> 8); > > } > > To ensure (i & 3) == 0 when entering the main loop, the case-labels > should be 1-2-3, not 3-2-1. This may have messed up your tests. :-( Egads. I clearly just copied/pasted and did not think about it enough. The results were still correct, but definitely only got aligned if the offset was divisible by 2. > With a very quick test I didn't see much difference if I changed the > case-label order. I re-ran with this fixed and the changes are not significant. > > On 2021-02-02 Brett Okken wrote: > > I tested jdk 15 64bit and jdk 11 32bit, client and server and the > > above implementation is consistently quite good. > > The alternate in running does not do the leading alignment. This > > version is really close in 64 bit testing and slightly faster for 32 > > bit. The differences are pretty small, and both are noticeably better > > than my original proposal (and all 3 are significantly faster than > > current). I think I would lead towards the simplicity of not doing the > > leading alignment, but I do not have a strong opinion. > > Let's go with the simpler option. > > > switch (len & 3) { > > case 3: > > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ > > (crc >>> 8); Sounds good. > I suppose this should use the same (faster) array indexing style as the > main loop: > > crc = TABLE[0][(buf[off++] & 0xFF) ^ ((int)crc & 0xFF)] > ^ (crc >>> 8); Yes. > Also, does it really help to unroll the loop? With 8191-byte buffers I > see no significant difference (in a quick not-very-accurate test) if > the switch-statement is replaced with a while-loop. The differences are pretty minimal. My observation was switch a bit faster than for loop, which was a bit faster than a while loop. But the differences in averages were less than the confidence interval for the given tests. > With these two changes the code becomes functionally identical to the > version I posted with the name "Modified slicing-by-4". Is that an OK > version to commit? Yes. > Is the following fine to you as the file header? Your email address can > be omitted if you prefer that. I will mention in the commit message > that you adapted the code from XZ Utils and benchmarked it. > > /* > * CRC64 > * > * Authors: Brett Okken > * Lasse Collin > * > * This file has been put into the public domain. > * You can do whatever you want with this file. > */ That is fine. You can include my e-mail. Brett
Re: [xz-devel] java crc64 implementation
On 2021-02-02 Brett Okken wrote: > Thus far I have only tested on jdk 11 64bit windows, but the fairly > clear winner is: > > public void update(byte[] buf, int off, int len) { > final int end = off + len; > int i=off; > if (len > 3) { > switch (i & 3) { > case 3: > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ > (crc >>> 8); > case 2: > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ > (crc >>> 8); > case 1: > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ > (crc >>> 8); > } To ensure (i & 3) == 0 when entering the main loop, the case-labels should be 1-2-3, not 3-2-1. This may have messed up your tests. :-( With a very quick test I didn't see much difference if I changed the case-label order. On 2021-02-02 Brett Okken wrote: > I tested jdk 15 64bit and jdk 11 32bit, client and server and the > above implementation is consistently quite good. > The alternate in running does not do the leading alignment. This > version is really close in 64 bit testing and slightly faster for 32 > bit. The differences are pretty small, and both are noticeably better > than my original proposal (and all 3 are significantly faster than > current). I think I would lead towards the simplicity of not doing the > leading alignment, but I do not have a strong opinion. Let's go with the simpler option. > switch (len & 3) { > case 3: > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ > (crc >>> 8); I suppose this should use the same (faster) array indexing style as the main loop: crc = TABLE[0][(buf[off++] & 0xFF) ^ ((int)crc & 0xFF)] ^ (crc >>> 8); Also, does it really help to unroll the loop? With 8191-byte buffers I see no significant difference (in a quick not-very-accurate test) if the switch-statement is replaced with a while-loop. With these two changes the code becomes functionally identical to the version I posted with the name "Modified slicing-by-4". Is that an OK version to commit? Is the following fine to you as the file header? Your email address can be omitted if you prefer that. I will mention in the commit message that you adapted the code from XZ Utils and benchmarked it. /* * CRC64 * * Authors: Brett Okken * Lasse Collin * * This file has been put into the public domain. * You can do whatever you want with this file. */ Thanks! -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java crc64 implementation
I tested jdk 15 64bit and jdk 11 32bit, client and server and the above implementation is consistently quite good. The alternate in running does not do the leading alignment. This version is really close in 64 bit testing and slightly faster for 32 bit. The differences are pretty small, and both are noticeably better than my original proposal (and all 3 are significantly faster than current). I think I would lead towards the simplicity of not doing the leading alignment, but I do not have a strong opinion. public void update(byte[] buf, int off, int len) { final int end = off + len; int i=off; for (int j = end - 3; i < j; i += 4) { final int tmp = (int)crc; crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^ TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^ (crc >>> 32) ^ TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^ TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)]; } switch (len & 3) { case 3: crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8); case 2: crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8); case 1: crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8); } }
Re: [xz-devel] java crc64 implementation
I accidentally hit reply instead of reply all. > > Shouldn't that be (i & 3) != 0? > > An offset of 0 should not enter this loop, but 0 & 3 does not equal 1. > > The idea really is that offset of 1 doesn't enter the loop, thus the > main slicing-by-4 loop is misaligned. I don't know why it makes a > difference and I'm no longer even sure why I decided to try it. You can > try different (i & 3) != { 0, 1, 2, 3 } combinations. I misunderstood your intent. I thought you were intending to get the for loop onto 4 byte alignment. I updated the benchmark to test with offsets [0,1,2] and also reducing the length by an additional [0,1,2]. This should provide a good mix of content which could require alignment at beginning and extra bytes at the end. Thus far I have only tested on jdk 11 64bit windows, but the fairly clear winner is: public void update(byte[] buf, int off, int len) { final int end = off + len; int i=off; if (len > 3) { switch (i & 3) { case 3: crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8); case 2: crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8); case 1: crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8); } for (int j = end - 3; i < j; i += 4) { final int tmp = (int)crc; crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^ TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^ (crc >>> 32) ^ TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^ TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)]; } } switch ((end-i) & 3) { case 3: crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8); case 2: crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8); case 1: crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8); } } Brett
Re: [xz-devel] java crc64 implementation
I assume you accidentally didn't post to the list so I'm quoting your email in full. On 2021-02-02 Brett Okken wrote: > > while ((i & 3) != 1 && i < end) > > Shouldn't that be (i & 3) != 0? > An offset of 0 should not enter this loop, but 0 & 3 does not equal 1. The idea really is that offset of 1 doesn't enter the loop, thus the main slicing-by-4 loop is misaligned. I don't know why it makes a difference and I'm no longer even sure why I decided to try it. You can try different (i & 3) != { 0, 1, 2, 3 } combinations. > > If I change the buffer size from 8192 to 8191 in XZDecDemo.java, > > then "Modified slicing-by-4" somehow becomes as fast as the > > "Misaligned slicing-by-4". On the surface it sounds weird because > > the buffer still has the same alignment, it's just one byte smaller > > at the end. > > My guess is that this has to do with how many while loops need to be > executed/optimized. > Making it one byte smaller guarantees one of the additional while > loops actually has to execute. Depending on the initial offset, > potentially both need to execute. Maybe you are right, but the confusing thing is that those while-loops are supposedly slower than the for-loop. :-) > > It would be nice if you could compare these too and suggest what > > should be committed. Maybe you can figure out an even better > > version. Different CPU or 32-bit Java or other things may give > > quite different results. > > Truncating the crc to an int 1 time in the loop seems like a clear > winner. I will play with this in my benchmark. > My benchmark is calculating the crc64 of 8k of random bytes. I will > change it to include misaligned read as well. Thanks. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java crc64 implementation
Hello! I need to make a new release in the near future so that a minor problem can be fixed in .7z support in Apache Commons Compress. I thought I could include simpler and safer changes from your long list of patches and the CRC64 improvement might be such. On 2021-01-21 Brett Okken wrote: > Here is a slice by 4 implementation. It goes byte by byte to easily be > compatible with older jdks. Performance wise, it is pretty comparable > to the java port of Adler's stackoverflow implementation: > > Benchmark Mode Cnt Score Error Units > Hash64Benchmark.adler avgt5 6850.172 ± 251.528 ns/op > Hash64Benchmark.crc64 avgt5 16347.986 ± 53.702 ns/op > Hash64Benchmark.slice4avgt5 6842.010 ± 393.149 ns/op Thank you! I played around a bit. Seems that the code is *really* sensitive to tiny changes. It's possible that it depends on the computer and such things too; I only tried on one machine. I timed decompression of gigabyte of null bytes using XZDecDemo and OpenJDK 15 on x86-64. This isn't very accurate but it's enough to sort them: Original6.8 s Modified original 6.2 s Your slicing-by-4 5.8 s Modified slicing-by-4 5.6 s Misaligned slicing-by-4 5.2 s xz -t 3.6 s Modified original: --- a/src/org/tukaani/xz/check/CRC64.java +++ b/src/org/tukaani/xz/check/CRC64.java @@ -38,7 +38,8 @@ public class CRC64 extends Check { int end = off + len; while (off < end) -crc = crcTable[(buf[off++] ^ (int)crc) & 0xFF] ^ (crc >>> 8); +crc = crcTable[(buf[off++] & 0xFF) ^ ((int)crc & 0xFF)] + ^ (crc >>> 8); } public byte[] finish() { Modified slicing-by-4: public void update(byte[] buf, int off, int len) { final int end = off + len; int i = off; for (int end4 = end - 3; i < end4; i += 4) { final int tmp = (int)crc; crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^ TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^ (crc >>> 32) ^ TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^ TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)]; } while (i < end) crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)] ^ (crc >>> 8); } Misaligned slicing-by-4 adds an extra while-loop to the beginning: public void update(byte[] buf, int off, int len) { final int end = off + len; int i = off; while ((i & 3) != 1 && i < end) crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)] ^ (crc >>> 8); for (int end4 = end - 3; i < end4; i += 4) { final int tmp = (int)crc; crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^ TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^ (crc >>> 32) ^ TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^ TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)]; } while (i < end) crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)] ^ (crc >>> 8); } If I change the buffer size from 8192 to 8191 in XZDecDemo.java, then "Modified slicing-by-4" somehow becomes as fast as the "Misaligned slicing-by-4". On the surface it sounds weird because the buffer still has the same alignment, it's just one byte smaller at the end. The same thing happens too if the buffer size is kept at 8192 but first byte isn't used (making the beginning of the buffer misaligned). Moving the "(crc32 >> 32)" to a different position in the xor sequence can affect things too... it's almost spooky. ;-) It would be nice if you could compare these too and suggest what should be committed. Maybe you can figure out an even better version. Different CPU or 32-bit Java or other things may give quite different results. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java crc64 implementation
Here is a slice by 4 implementation. It goes byte by byte to easily be compatible with older jdks. Performance wise, it is pretty comparable to the java port of Adler's stackoverflow implementation: Benchmark Mode Cnt Score Error Units Hash64Benchmark.adler avgt5 6850.172 ± 251.528 ns/op Hash64Benchmark.crc64 avgt5 16347.986 ± 53.702 ns/op Hash64Benchmark.slice4avgt5 6842.010 ± 393.149 ns/op package org.tukaani.xz.check; public class CRC64 extends Check { private static final long[][] TABLE = new long[4][256]; static { final long poly64 = 0xC96C5795D7870F42L; for (int s = 0; s < 4; ++s) { for (int b = 0; b < 256; ++b) { long r = s == 0 ? b : TABLE[s-1][b]; for (int i=0; i< 8; ++i) { if ((r & 1) == 1) { r = (r >>> 1) ^ poly64; } else { r >>>= 1; } } TABLE[s][b] = r; } } } private long crc = -1; public CRC64() { size = 8; name = "CRC64"; } @Override public void update(byte[] buf, int off, int len) { final int end = off + len; int i=off; for (int j = end-3; i>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF))] ^ (crc >>> 32) ^ TABLE[1][(int) (((crc >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF))] ^ TABLE[0][(int) (((crc >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF))]; } for (; i>> 8); } } @Override public byte[] finish() { long value = ~crc; crc = -1; byte[] buf = new byte[8]; for (int i = 0; i < buf.length; ++i) buf[i] = (byte)(value >> (i * 8)); return buf; } }
Re: [xz-devel] java crc64 implementation
On 2021-01-13 Brett Okken wrote: > Mark Adler has posted an optimized crc64 implementation on > stackoverflow[1]. This can be reasonably easily ported to java (that > post has a link to java impl on github[2] which warrants a little > clean up, but gives a decent idea). > > I did a quick benchmark calculating the crc64 over 8KB and the results > were impressive: > > Benchmark Mode Cnt ScoreError Units > Hash64Benchmark.adler avgt5 6908.677 ± 47.790 ns/op > Hash64Benchmark.crc64 avgt5 16343.091 ± 64.089 ns/op The CRC64 implementation in XZ for Java is indeed a basic version. I wanted to keep things simple in the beginning and didn't think about it much later since the Java version of XZ is slower than C version for other reasons anyway. In XZ Utils, slicing-by-4 method is used for CRC64 and slicing-by-8 for CRC32. A reason for not using by-8 for CRC64 is to reduce CPU L1 cache usage: by-4 with CRC64 needs 8 KiB lookup table, by-8 needs 16 KiB. Micro-benchmarking with big table can look good but when the CRC is just a small part of the application the results are more complicated (more cache misses to load the bigger table, more other data pushed out of cache). It is essential to note that the decisions about table sizes were made over a decade ago with 32-bit CPUs and it's very much possible that different decisions would be better nowadays. The version by Mark Adler [1] uses slicing-by-8 with CRC64. It also includes a method to combine the CRC values of two blocks which is great if one uses threads to compute a CRC. Threaded CRC doesn't sound useful with XZ since LZMA isn't that fast anyway. A side note: GNU gzip uses the basic method for CRC32 [3] while zlib uses slicing-by-8. Since Deflate is fast to decode, replacing the CRC32 in GNU gzip would make a clear difference in decompression speed. [3] http://git.savannah.gnu.org/cgit/gzip.git/tree/util.c#n126 > [1] - > https://stackoverflow.com/questions/20562546/how-to-get-crc64-distributed-calculation-use-its-linearity-property/20579405#20579405 > > [2] - > https://github.com/MrBuddyCasino/crc-64/blob/master/crc-64/src/main/java/net/boeckling/crc/CRC64.java I didn't find license information from the [2] repository. XZ for Java is public domain so the license likely wouldn't match anyway. Porting from XZ Utils shouldn't be too hard, depending on how much one wishes to optimize it. - src/liblzma/check/crc64_fast.c - src/liblzma/check/crc_macros.h - src/liblzma/check/crc64_tablegen.c (or should it just include pre-computed tables like liblzma and zlib do?) Unlike the C version in [1], the Java version in [2] reads the input byte[] array byte-by-byte. Using a fast method to read 8 *aligned* bytes at a time in native byte order should give more speed; after all, it's one of the benefits of this method that one can read multiple input bytes at a time. A public domain patch for a faster CRC64 to XZ for Java is welcome. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode