Re: [xz-devel] java crc64 implementation

2021-02-05 Thread Brett Okken
This had /way/ more impact than I expected on overall decompression performance.
Here are the baseline numbers for 1.8 (jdk 11 64bit):

Benchmark (file)  Mode  Cnt
Score   Error  Units
XZDecompressionBenchmark.decompress  ihe_ovly_pr.dcm  avgt3
0.731 ± 0.010  ms/op
XZDecompressionBenchmark.decompress   image1.dcm  avgt3
445.394 ± 7.052  ms/op
XZDecompressionBenchmark.decompresslarge.xml  avgt3
283.013 ± 5.004  ms/op
XZ2DecompressionBenchmark.decompress  1 byte rep  avgt3
587.127 ± 7.254  ms/op


Here are numbers with current trunk, which is basically just the crc64 change:

Benchmark (file)  Mode  Cnt
ScoreError  Units
XZDecompressionBenchmark.decompress  ihe_ovly_pr.dcm  avgt3
0.662 ±  0.012  ms/op
XZDecompressionBenchmark.decompress   image1.dcm  avgt3
391.644 ± 13.871  ms/op
XZDecompressionBenchmark.decompresslarge.xml  avgt3
225.456 ± 16.265  ms/op
XZ2DecompressionBenchmark.decompress  1 byte rep  avgt3
387.347 ± 18.811  ms/op


And the LZDecoder change gets it down to:

Benchmark (file)  Mode  Cnt
ScoreError  Units
XZDecompressionBenchmark.decompress  ihe_ovly_pr.dcm  avgt3
0.607 ±  0.187  ms/op
XZDecompressionBenchmark.decompress   image1.dcm  avgt3
369.419 ± 32.400  ms/op
XZDecompressionBenchmark.decompresslarge.xml  avgt3
190.580 ±  7.856  ms/op
XZ2DecompressionBenchmark.decompress  1 byte rep  avgt3
220.554 ±  8.812  ms/op

Brett



Re: [xz-devel] java crc64 implementation

2021-02-05 Thread Lasse Collin
On 2021-02-05 Brett Okken wrote:
> On Fri, Feb 5, 2021 at 11:07 AM Lasse Collin
>  wrote:
> > Also, does it really help to unroll the loop? With 8191-byte
> > buffers I see no significant difference (in a quick
> > not-very-accurate test) if the switch-statement is replaced with a
> > while-loop.  
> 
> The differences are pretty minimal. My observation was switch a bit
> faster than for loop, which was a bit faster than a while loop. But
> the differences in averages were less than the confidence interval for
> the given tests.

OK, smaller code wins then.

> > With these two changes the code becomes functionally identical to
> > the version I posted with the name "Modified slicing-by-4". Is that
> > an OK version to commit?  
> 
> Yes.

OK.

> > Is the following fine to you as the file header? Your email address
> > can be omitted if you prefer that. I will mention in the commit
> > message that you adapted the code from XZ Utils and benchmarked it.
> >
> > /*
> >  * CRC64
> >  *
> >  * Authors: Brett Okken 
> >  *  Lasse Collin 
> >  *
> >  * This file has been put into the public domain.
> >  * You can do whatever you want with this file.
> >  */  
> 
> That is fine. You can include my e-mail.

OK. :-) I have committed it. Thank you!

The LZDecoder changes I may still look at before the next release. Then
I will go back to the XZ Utils code.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] java crc64 implementation

2021-02-05 Thread Brett Okken
On Fri, Feb 5, 2021 at 11:07 AM Lasse Collin  wrote:
>
> On 2021-02-02 Brett Okken wrote:
> > Thus far I have only tested on jdk 11 64bit windows, but the fairly
> > clear winner is:
> >
> > public void update(byte[] buf, int off, int len) {
> > final int end = off + len;
> > int i=off;
> > if (len > 3) {
> > switch (i & 3) {
> > case 3:
> > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^
> >   (crc >>> 8);
> > case 2:
> > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^
> >   (crc >>> 8);
> > case 1:
> > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^
> >   (crc >>> 8);
> > }
>
> To ensure (i & 3) == 0 when entering the main loop, the case-labels
> should be 1-2-3, not 3-2-1. This may have messed up your tests. :-(

Egads. I clearly just copied/pasted and did not think about it enough.
The results were still correct, but definitely only got aligned if the
offset was divisible by 2.


> With a very quick test I didn't see much difference if I changed the
> case-label order.

I re-ran with this fixed and the changes are not significant.

>
> On 2021-02-02 Brett Okken wrote:
> > I tested jdk 15 64bit and jdk 11 32bit, client and server and the
> > above implementation is consistently quite good.
> > The alternate in running does not do the leading alignment. This
> > version is really close in 64 bit testing and slightly faster for 32
> > bit. The differences are pretty small, and both are noticeably better
> > than my original proposal (and all 3 are significantly faster than
> > current). I think I would lead towards the simplicity of not doing the
> > leading alignment, but I do not have a strong opinion.
>
> Let's go with the simpler option.
>
> > switch (len & 3) {
> > case 3:
> > crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^
> >   (crc >>> 8);

Sounds good.

> I suppose this should use the same (faster) array indexing style as the
> main loop:
>
> crc = TABLE[0][(buf[off++] & 0xFF) ^ ((int)crc & 0xFF)]
>   ^ (crc >>> 8);

Yes.

> Also, does it really help to unroll the loop? With 8191-byte buffers I
> see no significant difference (in a quick not-very-accurate test) if
> the switch-statement is replaced with a while-loop.

The differences are pretty minimal. My observation was switch a bit
faster than for loop, which was a bit faster than a while loop. But
the differences in averages were less than the confidence interval for
the given tests.

> With these two changes the code becomes functionally identical to the
> version I posted with the name "Modified slicing-by-4". Is that an OK
> version to commit?

Yes.

> Is the following fine to you as the file header? Your email address can
> be omitted if you prefer that. I will mention in the commit message
> that you adapted the code from XZ Utils and benchmarked it.
>
> /*
>  * CRC64
>  *
>  * Authors: Brett Okken 
>  *  Lasse Collin 
>  *
>  * This file has been put into the public domain.
>  * You can do whatever you want with this file.
>  */

That is fine. You can include my e-mail.

Brett



Re: [xz-devel] java crc64 implementation

2021-02-05 Thread Lasse Collin
On 2021-02-02 Brett Okken wrote:
> Thus far I have only tested on jdk 11 64bit windows, but the fairly
> clear winner is:
> 
> public void update(byte[] buf, int off, int len) {
> final int end = off + len;
> int i=off;
> if (len > 3) {
> switch (i & 3) {
> case 3:
> crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^
>   (crc >>> 8);
> case 2:
> crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^
>   (crc >>> 8);
> case 1:
> crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^
>   (crc >>> 8);
> }

To ensure (i & 3) == 0 when entering the main loop, the case-labels
should be 1-2-3, not 3-2-1. This may have messed up your tests. :-(

With a very quick test I didn't see much difference if I changed the
case-label order.

On 2021-02-02 Brett Okken wrote:
> I tested jdk 15 64bit and jdk 11 32bit, client and server and the
> above implementation is consistently quite good.
> The alternate in running does not do the leading alignment. This
> version is really close in 64 bit testing and slightly faster for 32
> bit. The differences are pretty small, and both are noticeably better
> than my original proposal (and all 3 are significantly faster than
> current). I think I would lead towards the simplicity of not doing the
> leading alignment, but I do not have a strong opinion.

Let's go with the simpler option.

> switch (len & 3) {
> case 3:
> crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^
>   (crc >>> 8);

I suppose this should use the same (faster) array indexing style as the
main loop:

crc = TABLE[0][(buf[off++] & 0xFF) ^ ((int)crc & 0xFF)]
  ^ (crc >>> 8);

Also, does it really help to unroll the loop? With 8191-byte buffers I
see no significant difference (in a quick not-very-accurate test) if
the switch-statement is replaced with a while-loop.

With these two changes the code becomes functionally identical to the
version I posted with the name "Modified slicing-by-4". Is that an OK
version to commit?

Is the following fine to you as the file header? Your email address can
be omitted if you prefer that. I will mention in the commit message
that you adapted the code from XZ Utils and benchmarked it.

/*
 * CRC64
 *
 * Authors: Brett Okken 
 *  Lasse Collin 
 *
 * This file has been put into the public domain.
 * You can do whatever you want with this file.
 */

Thanks!

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] java crc64 implementation

2021-02-02 Thread Brett Okken
I tested jdk 15 64bit and jdk 11 32bit, client and server and the
above implementation is consistently quite good.
The alternate in running does not do the leading alignment. This
version is really close in 64 bit testing and slightly faster for 32
bit. The differences are pretty small, and both are noticeably better
than my original proposal (and all 3 are significantly faster than
current). I think I would lead towards the simplicity of not doing the
leading alignment, but I do not have a strong opinion.

public void update(byte[] buf, int off, int len)
{
final int end = off + len;
int i=off;
for (int j = end - 3; i < j; i += 4) {
final int tmp = (int)crc;
crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^
  TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^
  (crc >>> 32) ^
  TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^
  TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)];
}
switch (len & 3) {
case 3:
crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
case 2:
crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
case 1:
crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
}
}



Re: [xz-devel] java crc64 implementation

2021-02-02 Thread Brett Okken
I accidentally hit reply instead of reply all.

> > Shouldn't that be (i & 3) != 0?
> > An offset of 0 should not enter this loop, but 0 & 3 does not equal 1.
>
> The idea really is that offset of 1 doesn't enter the loop, thus the
> main slicing-by-4 loop is misaligned. I don't know why it makes a
> difference and I'm no longer even sure why I decided to try it. You can
> try different (i & 3) != { 0, 1, 2, 3 } combinations.

I misunderstood your intent. I thought you were intending to get the
for loop onto 4 byte alignment.

I updated the benchmark to test with offsets [0,1,2] and also reducing
the length by an additional [0,1,2]. This should provide a good mix of
content which could require alignment at beginning and extra bytes at
the end.

Thus far I have only tested on jdk 11 64bit windows, but the fairly
clear winner is:

public void update(byte[] buf, int off, int len) {
final int end = off + len;
int i=off;
if (len > 3) {
switch (i & 3) {
case 3:
crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
case 2:
crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
case 1:
crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
}
for (int j = end - 3; i < j; i += 4) {
final int tmp = (int)crc;
crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^
  TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^
  (crc >>> 32) ^
  TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^
  TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)];
}
}
switch ((end-i) & 3) {
case 3:
crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
case 2:
crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
case 1:
crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
}
}


Brett



Re: [xz-devel] java crc64 implementation

2021-02-02 Thread Lasse Collin
I assume you accidentally didn't post to the list so I'm quoting your
email in full.

On 2021-02-02 Brett Okken wrote:
> > while ((i & 3) != 1 && i < end)  
> 
> Shouldn't that be (i & 3) != 0?
> An offset of 0 should not enter this loop, but 0 & 3 does not equal 1.

The idea really is that offset of 1 doesn't enter the loop, thus the
main slicing-by-4 loop is misaligned. I don't know why it makes a
difference and I'm no longer even sure why I decided to try it. You can
try different (i & 3) != { 0, 1, 2, 3 } combinations.

> > If I change the buffer size from 8192 to 8191 in XZDecDemo.java,
> > then "Modified slicing-by-4" somehow becomes as fast as the
> > "Misaligned slicing-by-4". On the surface it sounds weird because
> > the buffer still has the same alignment, it's just one byte smaller
> > at the end.  
> 
> My guess is that this has to do with how many while loops need to be
> executed/optimized.
> Making it one byte smaller guarantees one of the additional while
> loops actually has to execute. Depending on the initial offset,
> potentially both need to execute.

Maybe you are right, but the confusing thing is that those while-loops
are supposedly slower than the for-loop. :-)

> > It would be nice if you could compare these too and suggest what
> > should be committed. Maybe you can figure out an even better
> > version. Different CPU or 32-bit Java or other things may give
> > quite different results.  
> 
> Truncating the crc to an int 1 time in the loop seems like a clear
> winner. I will play with this in my benchmark.
> My benchmark is calculating the crc64 of 8k of random bytes. I will
> change it to include misaligned read as well.

Thanks.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] java crc64 implementation

2021-02-02 Thread Lasse Collin
Hello!

I need to make a new release in the near future so that a minor problem
can be fixed in .7z support in Apache Commons Compress. I thought I
could include simpler and safer changes from your long list of patches
and the CRC64 improvement might be such.

On 2021-01-21 Brett Okken wrote:
> Here is a slice by 4 implementation. It goes byte by byte to easily be
> compatible with older jdks. Performance wise, it is pretty comparable
> to the java port of Adler's stackoverflow implementation:
> 
> Benchmark Mode  Cnt  Score Error  Units
> Hash64Benchmark.adler avgt5   6850.172 ± 251.528  ns/op
> Hash64Benchmark.crc64 avgt5  16347.986 ±  53.702  ns/op
> Hash64Benchmark.slice4avgt5   6842.010 ± 393.149  ns/op

Thank you!

I played around a bit. Seems that the code is *really* sensitive to tiny
changes. It's possible that it depends on the computer and such things
too; I only tried on one machine.

I timed decompression of gigabyte of null bytes using XZDecDemo and
OpenJDK 15 on x86-64. This isn't very accurate but it's enough to sort
them:

Original6.8 s
Modified original   6.2 s
Your slicing-by-4   5.8 s
Modified slicing-by-4   5.6 s
Misaligned slicing-by-4 5.2 s
xz -t   3.6 s

Modified original:

--- a/src/org/tukaani/xz/check/CRC64.java
+++ b/src/org/tukaani/xz/check/CRC64.java
@@ -38,7 +38,8 @@ public class CRC64 extends Check {
 int end = off + len;
 
 while (off < end)
-crc = crcTable[(buf[off++] ^ (int)crc) & 0xFF] ^ (crc >>> 8);
+crc = crcTable[(buf[off++] & 0xFF) ^ ((int)crc & 0xFF)]
+  ^ (crc >>> 8);
 }
 
 public byte[] finish() {

Modified slicing-by-4:

public void update(byte[] buf, int off, int len) {
final int end = off + len;
int i = off;

for (int end4 = end - 3; i < end4; i += 4) {
final int tmp = (int)crc;
crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^
  TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^
  (crc >>> 32) ^
  TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^
  TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)];
}

while (i < end)
crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)] ^
  (crc >>> 8);
}

Misaligned slicing-by-4 adds an extra while-loop to the beginning:

public void update(byte[] buf, int off, int len) {
final int end = off + len;
int i = off;

while ((i & 3) != 1 && i < end)
crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)] ^
  (crc >>> 8);

for (int end4 = end - 3; i < end4; i += 4) {
final int tmp = (int)crc;
crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^
  TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^
  (crc >>> 32) ^
  TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^
  TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)];
}

while (i < end)
crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)] ^
  (crc >>> 8);
}

If I change the buffer size from 8192 to 8191 in XZDecDemo.java, then
"Modified slicing-by-4" somehow becomes as fast as the "Misaligned
slicing-by-4". On the surface it sounds weird because the buffer still
has the same alignment, it's just one byte smaller at the end.

The same thing happens too if the buffer size is kept at 8192 but first
byte isn't used (making the beginning of the buffer misaligned).

Moving the "(crc32 >> 32)" to a different position in the xor sequence
can affect things too... it's almost spooky. ;-)

It would be nice if you could compare these too and suggest what should
be committed. Maybe you can figure out an even better version.
Different CPU or 32-bit Java or other things may give quite different
results.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] java crc64 implementation

2021-01-21 Thread Brett Okken
Here is a slice by 4 implementation. It goes byte by byte to easily be
compatible with older jdks. Performance wise, it is pretty comparable
to the java port of Adler's stackoverflow implementation:

Benchmark Mode  Cnt  Score Error  Units
Hash64Benchmark.adler avgt5   6850.172 ± 251.528  ns/op
Hash64Benchmark.crc64 avgt5  16347.986 ±  53.702  ns/op
Hash64Benchmark.slice4avgt5   6842.010 ± 393.149  ns/op

package org.tukaani.xz.check;

public class CRC64 extends Check {
private static final long[][] TABLE = new long[4][256];

static {
final long poly64 = 0xC96C5795D7870F42L;
for (int s = 0; s < 4; ++s) {
for (int b = 0; b < 256; ++b) {
long r = s == 0 ? b : TABLE[s-1][b];
for (int i=0; i< 8; ++i) {
if ((r & 1) == 1) {
r = (r >>> 1) ^ poly64;
} else {
r >>>= 1;
}
}
TABLE[s][b] = r;
}
}
}

private long crc = -1;

public CRC64() {
size = 8;
name = "CRC64";
}

@Override
public void update(byte[] buf, int off, int len) {
final int end = off + len;
int i=off;
for (int j = end-3; i>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF))] ^
  (crc >>> 32) ^
  TABLE[1][(int) (((crc >>> 16) & 0xFF) ^ (buf[i + 2]
& 0XFF))] ^
  TABLE[0][(int) (((crc >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF))];
}
for (; i>> 8);
}
}

@Override
public byte[] finish() {
long value = ~crc;
crc = -1;

byte[] buf = new byte[8];
for (int i = 0; i < buf.length; ++i)
buf[i] = (byte)(value >> (i * 8));

return buf;
}
}



Re: [xz-devel] java crc64 implementation

2021-01-19 Thread Lasse Collin
On 2021-01-13 Brett Okken wrote:
> Mark Adler has posted an optimized crc64 implementation on
> stackoverflow[1]. This can be reasonably easily ported to java (that
> post has a link to java impl on github[2] which warrants a little
> clean up, but gives a decent idea).
> 
> I did a quick benchmark calculating the crc64 over 8KB and the results
> were impressive:
> 
> Benchmark  Mode  Cnt  ScoreError  Units
> Hash64Benchmark.adler  avgt5   6908.677 ± 47.790  ns/op
> Hash64Benchmark.crc64  avgt5  16343.091 ± 64.089  ns/op

The CRC64 implementation in XZ for Java is indeed a basic version. I
wanted to keep things simple in the beginning and didn't think about it
much later since the Java version of XZ is slower than C version for
other reasons anyway.

In XZ Utils, slicing-by-4 method is used for CRC64 and slicing-by-8
for CRC32. A reason for not using by-8 for CRC64 is to reduce CPU L1
cache usage: by-4 with CRC64 needs 8 KiB lookup table, by-8 needs 16
KiB. Micro-benchmarking with big table can look good but when the CRC
is just a small part of the application the results are more
complicated (more cache misses to load the bigger table, more other data
pushed out of cache). It is essential to note that the decisions about
table sizes were made over a decade ago with 32-bit CPUs and it's very
much possible that different decisions would be better nowadays.

The version by Mark Adler [1] uses slicing-by-8 with CRC64. It also
includes a method to combine the CRC values of two blocks which is
great if one uses threads to compute a CRC. Threaded CRC doesn't sound
useful with XZ since LZMA isn't that fast anyway.

A side note: GNU gzip uses the basic method for CRC32 [3] while zlib
uses slicing-by-8. Since Deflate is fast to decode, replacing the CRC32
in GNU gzip would make a clear difference in decompression speed.

[3] http://git.savannah.gnu.org/cgit/gzip.git/tree/util.c#n126

> [1] -
> https://stackoverflow.com/questions/20562546/how-to-get-crc64-distributed-calculation-use-its-linearity-property/20579405#20579405
> 
> [2] -
> https://github.com/MrBuddyCasino/crc-64/blob/master/crc-64/src/main/java/net/boeckling/crc/CRC64.java

I didn't find license information from the [2] repository. XZ for Java
is public domain so the license likely wouldn't match anyway.

Porting from XZ Utils shouldn't be too hard, depending on how much one
wishes to optimize it.
  - src/liblzma/check/crc64_fast.c
  - src/liblzma/check/crc_macros.h
  - src/liblzma/check/crc64_tablegen.c (or should it just include
pre-computed tables like liblzma and zlib do?)

Unlike the C version in [1], the Java version in [2] reads the input
byte[] array byte-by-byte. Using a fast method to read 8 *aligned*
bytes at a time in native byte order should give more speed; after all,
it's one of the benefits of this method that one can read multiple
input bytes at a time.

A public domain patch for a faster CRC64 to XZ for Java is welcome.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode