Re: [xz-devel] Re: java LZDecoder small improvement

2021-03-08 Thread Lasse Collin
On 2021-03-01 Brett Okken wrote:
> > One thing that confuses me in your version is the special handling
> > of the first byte:
> >
> > buf[pos++] = buf[back++];
> > --left;
> >
> > If there are two bytes to copy, then one will be copied above and
> > the other with arraycopy later. If there are more bytes to copy and
> > distance is very small, incrementing "back" above can mean that an
> > extra arraycopy call might be needed in the loop because the first
> > copy will be one byte smaller.
> >
> > I understand that it might help when there is just one byte to
> > repeat because then the while-loop will be skipped. In all other
> > situations it sounds like the special handling of the first
> > byte would in theory be harmful. Note that I don't doubt your test
> > results; I already saw with the CRC64 code that some changes in the
> > code can affect performance in weird ways.  
> 
> image1.dcm is the file most impacted by this optimization. Again, this
> file is basically a large greyscale bmp. This results in a significant
> number of single-byte repeats. Optimizing for the single byte improves
> performance in that file by 3-5%, while having smaller effects on the
> other 2 files (ihe_ovly_pr.dcm slightly slower, large.xml slightly
> faster).

OK, that is an interesting test case.

> I agree your approach is more readable. From your version of it, I was
> expecting that simplicity in reading would translate into better
> performance.
> This latest version actually does appear to do that. The image1.dcm
> performance matches my version and the other 2 are a bit faster.
> Adding the single-byte optimization still speeds up image1.dcm (~8ms,
> ~2%) and large.xml (~3ms, ~2%), while slowing ihe_ovly_pr.dcm
> (~0.008ms, ~1%).
[...]
> Version 3 is better for all 3 files.

With these results I now plan to include version 3 in the next release.
It sounds like the single-byte optimization has a fairly small effect.
Omitting it keeps the code a tiny bit simpler.

I have committed the change. I think xz-java.git should now be almost
ready for a release. I just need to add NEWS and bump the version
number.

Thanks for your help!

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Re: java LZDecoder small improvement

2021-03-01 Thread Brett Okken
> With a quick try I got a feeling that my worry about short repeats was
> wrong. It doesn't matter because decoding each LZMA symbol is much more
> expensive. What matters is avoiding multiple tiny arraycopy calls
> within a single run of the repeat method, and that problem was already
> solved.

My observation is basically the same. Individual tiny array copy calls
do not really matter, but tiny array copy calls in a tight loop are
quite bad.
My /guess/ is that the optimizer is not able to unroll these loops
effectively, possibly because of the arraycopy intrinsic.
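
To make this concrete, here is a reduced sketch (illustrative only, not
code from any of the patches) of the two loop shapes being compared.
With dist == 0, the arraycopy variant performs one intrinsic call per
output byte:

// tmp stays at dist + 1 when "back" is advanced, so a one-byte repeat
// turns into a tight loop of one-byte arraycopy calls.
static void repeatWithArraycopy(byte[] buf, int pos, int back, int left) {
    while (left > 0) {
        int tmp = pos - back;
        System.arraycopy(buf, back, buf, pos, tmp);
        pos += tmp;
        back += tmp;
        left -= tmp;
    }
}

// The plain byte loop the JIT can unroll (essentially what trunk does,
// minus the buffer-wrap check).
static void repeatByteByByte(byte[] buf, int pos, int back, int left) {
    do {
        buf[pos++] = buf[back++];
    } while (--left > 0);
}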

>
> > > I came up with the following. I haven't decided yet if I like it.
> >
> > On the 3 files I have been testing with, this change is a mixed bag.
> > Compared to trunk, 1 regresses by ~8%. While the other 2 do improve,
> > neither is better than my last patch.
>
> OK, thanks. So it isn't great. I wonder which details make the
> difference.

I think some of the problem is too many branches, making
prediction/speculative execution less useful.
Another issue is the absence of the single-byte optimization, which I
will address more below.

> One thing that confuses me in your version is the special handling of
> the first byte:
>
> buf[pos++] = buf[back++];
> --left;
>
> If there are two bytes to copy, then one will be copied above and the
> other with arraycopy later. If there are more bytes to copy and
> distance is very small, incrementing "back" above can mean that an
> extra arraycopy call might be needed in the loop because the first copy
> will be one byte smaller.
>
> I understand that it might help when there is just one byte to repeat
> because then the while-loop will be skipped. In all other situations it
> sounds like the special handling of the first byte would in theory
> be harmful. Note that I don't doubt your test results; I already saw
> with the CRC64 code that some changes in the code can affect
> performance in weird ways.

image1.dcm is the file most impacted by this optimization. Again, this
file is basically a large greyscale bmp. This results in a significant
number of single-byte repeats. Optimizing for the single byte improves
performance in that file by 3-5%, while having smaller effects on the
other 2 files (ihe_ovly_pr.dcm slightly slower, large.xml slightly
faster).

> Your code needs
>
> if (back == bufSize)
> back = 0;
>
> at the beginning of the while-loop and later checking for tmp > 0. My
> version avoids these branches by handling those cases under "if (back <
> 0)" (which is equivalent to "if (dist >= pos)"). On the other hand,
> under "if (back < 0)" all copies, including tiny copies, are done with
> arraycopy.
>
> Another tiny difference is that your code uses left shift to double the
> copy size in the loop while I used Math.min(pos - back, left).
>
> > I was able to improve this a bit by pulling the handling of small
> > copies outside of the while loop. This eliminates the regressions
> > compared to trunk, but still does not feel like an improvement over my
> > last patch.
>
> Yeah, the switch isn't worth it. If I understand it correctly now,
> trying to avoid arraycopy for the tiny copies wasn't a useful idea in
> the first place. So the code can be simplified ("version 3"):
>
> int back = pos - dist - 1;
> if (back < 0) {
>     // The distance wraps around to the end of the cyclic dictionary
>     // buffer. We cannot get here if the dictionary isn't full.
>     assert full == bufSize;
>     back += bufSize;
>
>     // Here we will never copy more than dist + 1 bytes and
>     // so the copying won't repeat from its own output.
>     // Thus, we can always use arraycopy safely.
>     int copySize = Math.min(bufSize - back, left);
>     assert copySize <= dist + 1;
>
>     System.arraycopy(buf, back, buf, pos, copySize);
>     pos += copySize;
>     back = 0;
>     left -= copySize;
>
>     if (left == 0)
>         return;
> }
>
> assert back < pos;
> assert left > 0;
>
> do {
>     // Determine the number of bytes to copy on this loop iteration:
>     // copySize is set so that the source and destination ranges
>     // don't overlap. If "left" is large enough, the destination
>     // range will start right after the last byte of the source
>     // range. This way we don't need to advance "back" which
>     // allows the next iteration of this loop to copy (up to)
>     // twice the number of bytes.
>     int copySize = Math.min(left, pos - back);
>     System.arraycopy(buf, back, buf, pos, copySize);
>     pos += copySize;
>     left -= copySize;
> } while (left > 0);
>
> I know I may sound stubborn for not accepting your version as is but
> the special handling of the first byte and the 

Re: [xz-devel] Re: java LZDecoder small improvement

2021-02-27 Thread Lasse Collin
On 2021-02-13 Brett Okken wrote:
> On Thu, Feb 11, 2021 at 12:51 PM Lasse Collin wrote:
> > I still worry about short copies. If the file is full of tiny
> > matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such
> > files aren't typical at all but I don't want to add a corner case
> > where the performance drops too much.  
> 
> Do you have examples of such files, or code on how to generate one?

Use the patch below and compress with this:

java -jar build/jar/XZEncDemo.jar 2 < infile > outfile.xz

Adjust LIMIT to get longer matches.

diff --git a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
index f8230ee..cd92ca6 100644
--- a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
+++ b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
@@ -44,6 +44,8 @@ final class LZMAEncoderFast extends LZMAEncoder {
         return smallDist < (bigDist >>> 7);
     }
 
+    private static final int LIMIT = 2;
+
     int getNextSymbol() {
         // Get the matches for the next byte unless readAhead indicates
         // that we already got the new matches during the previous call
@@ -66,11 +68,13 @@ final class LZMAEncoderFast extends LZMAEncoder {
         int bestRepIndex = 0;
         for (int rep = 0; rep < REPS; ++rep) {
             int len = lz.getMatchLen(reps[rep], avail);
+            if (len > LIMIT)
+                len = LIMIT;
             if (len < MATCH_LEN_MIN)
                 continue;
 
             // If it is long enough, return it.
-            if (len >= niceLen) {
+            if (len >= LIMIT) {
                 back = rep;
                 skip(len - 1);
                 return len;
@@ -88,9 +92,11 @@ final class LZMAEncoderFast extends LZMAEncoder {
 
         if (matches.count > 0) {
             mainLen = matches.len[matches.count - 1];
+            if (mainLen > LIMIT)
+                mainLen = LIMIT;
             mainDist = matches.dist[matches.count - 1];
 
-            if (mainLen >= niceLen) {
+            if (mainLen >= LIMIT) {
                 back = mainDist + REPS;
                 skip(mainLen - 1);
                 return mainLen;

With a quick try I got a feeling that my worry about short repeats was
wrong. It doesn't matter because decoding each LZMA symbol is much more
expensive. What matters is avoiding multiple tiny arraycopy calls
within a single run of the repeat method, and that problem was already
solved.

> > I came up with the following. I haven't decided yet if I like it.  
> 
> On the 3 files I have been testing with, this change is a mixed bag.
> Compared to trunk, 1 regresses by ~8%. While the other 2 do improve,
> neither is better than my last patch.

OK, thanks. So it isn't great. I wonder which details make the
difference.

One thing that confuses me in your version is the special handling of
the first byte:

buf[pos++] = buf[back++];
--left;

If there are two bytes to copy, then one will be copied above and the
other with arraycopy later. If there are more bytes to copy and
distance is very small, incrementing "back" above can mean that an
extra arraycopy call might be needed in the loop because the first copy
will be one byte smaller.

I understand that it might help when there is just one byte to repeat
because then the while-loop will be skipped. In all other situations it
sounds like the special handling of the first byte would in theory
be harmful. Note that I don't doubt your test results; I already saw
with the CRC64 code that some changes in the code can affect
performance in weird ways.

Your code needs

if (back == bufSize)
back = 0;

at the beginning of the while-loop and later checking for tmp > 0. My
version avoids these branches by handling those cases under "if (back <
0)" (which is equivalent to "if (dist >= pos)"). On the other hand,
under "if (back < 0)" all copies, including tiny copies, are done with
arraycopy.

Another tiny difference is that your code uses left shift to double the
copy size in the loop while I used Math.min(pos - back, left).

> I was able to improve this a bit by pulling the handling of small
> copies outside of the while loop. This eliminates the regressions
> compared to trunk, but still does not feel like an improvement over my
> last patch.

Yeah, the switch isn't worth it. If I understand it correctly now,
trying to avoid arraycopy for the tiny copies wasn't a useful idea in
the first place. So the code can be simplified ("version 3"):

int back = pos - dist - 1;
if (back < 0) {
    // The distance wraps around to the end of the cyclic dictionary
    // buffer. We cannot get here if the dictionary isn't full.
    assert full == bufSize;
    back += bufSize;

    // Here we will never copy more than dist + 1 bytes and
    // so the copying won't repeat from its own output.
    // Thus, we can always use arraycopy safely.
    int copySize = Math.min(bufSize - back, left);
    assert copySize <= dist + 1;

    System.arraycopy(buf, back, buf, pos, copySize);
    pos += copySize;
    back = 0;
    left -= copySize;

    if (left == 0)
        return;
}

assert back < pos;
assert left > 0;

do {
    // Determine the number of bytes to copy on this loop iteration:
    // copySize is set so that the source and destination ranges
    // don't overlap. If "left" is large enough, the destination
    // range will start right after the last byte of the source
    // range. This way we don't need to advance "back" which
    // allows the next iteration of this loop to copy (up to)
    // twice the number of bytes.
    int copySize = Math.min(left, pos - back);
    System.arraycopy(buf, back, buf, pos, copySize);
    pos += copySize;
    left -= copySize;
} while (left > 0);

Re: [xz-devel] Re: java LZDecoder small improvement

2021-02-13 Thread Brett Okken
On Thu, Feb 11, 2021 at 12:51 PM Lasse Collin  wrote:
>
> On 2021-02-05 Brett Okken wrote:
> > I worked this out last night. We need to double how much we copy each
> > time by not advancing "back". This actually works even better than
> > Arrays.fill for the single byte case also.
>
> This clearly is a good idea in a Java implementation. :-)
>
> I still worry about short copies. If the file is full of tiny
> matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such files
> aren't typical at all but I don't want to add a corner case where the
> performance drops too much.

Do you have examples of such files, or code on how to generate one?
The case of a 1-byte match/repeat is optimized for in my latest patch,
as that optimization provided a noticeable improvement in the (real)
files I have been testing with. While I did observe some 2- and 3-byte
repeats, they appear not to be common enough to negatively impact
overall throughput. It would be quite helpful to have some
examples so that we can at least quantify the impact.

> I came up with the following. I haven't decided yet if I like it.

On the 3 files I have been testing with, this change is a mixed bag.
Compared to trunk, 1 regresses by ~8%. While the other 2 do improve,
neither is better than my last patch.

jdk 11 64 bit
TRUNK
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.662 ±  0.012  ms/op
     image1.dcm  avgt    3  391.644 ± 13.871  ms/op
      large.xml  avgt    3  225.456 ± 16.265  ms/op

(okken last)
ihe_ovly_pr.dcm  avgt    3    0.607 ±  0.187  ms/op
     image1.dcm  avgt    3  369.419 ± 32.400  ms/op
      large.xml  avgt    3  190.580 ±  7.856  ms/op

(lasse new)
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.628 ±  0.066  ms/op
     image1.dcm  avgt    3  424.159 ± 14.823  ms/op
      large.xml  avgt    3  192.763 ±  6.831  ms/op


I was able to improve this a bit by pulling the handling of small
copies outside of the while loop. This eliminates the regressions
compared to trunk, but still does not feel like an improvement over my
last patch.

(lasse + outer switch)
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.633 ±  0.032  ms/op
     image1.dcm  avgt    3  390.868 ± 40.598  ms/op
      large.xml  avgt    3  190.030 ±  2.619  ms/op


int back = pos - dist - 1;
if (back < 0) {
    // We won't get here if the dictionary isn't full.
    assert full == bufSize;

    // The distance wraps around to the end of the cyclic dictionary
    // buffer. Here we will never copy more than dist + 1 bytes
    // and so the copying won't repeat from its own output. Thus,
    // we can always use arraycopy safely.
    back += bufSize;
    int copySize = Math.min(bufSize - back, left);
    assert copySize <= dist + 1;

    System.arraycopy(buf, back, buf, pos, copySize);
    pos += copySize;
    back = 0;
    left -= copySize;

    if (left == 0)
        return;
}

assert back < pos;
assert left > 0;

int copySize = Math.min(left, pos - back);

switch (copySize) {
case 3:
    buf[pos + 2] = buf[back + 2];
case 2:
    buf[pos + 1] = buf[back + 1];
case 1:
    buf[pos] = buf[back];
    break;
default:
    System.arraycopy(buf, back, buf, pos, copySize);
}
pos += copySize;
left -= copySize;

while (left > 0) {
    copySize = Math.min(left, copySize << 1);
    System.arraycopy(buf, back, buf, pos, copySize);
    pos += copySize;
    left -= copySize;
}



Re: [xz-devel] Re: java LZDecoder small improvement

2021-02-11 Thread Lasse Collin
On 2021-02-05 Brett Okken wrote:
> I worked this out last night. We need to double how much we copy each
> time by not advancing "back". This actually works even better than
> Arrays.fill for the single byte case also.

This clearly is a good idea in a Java implementation. :-)

I still worry about short copies. If the file is full of tiny
matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such files
aren't typical at all but I don't want to add a corner case where the
performance drops too much.

I came up with the following. I haven't decided yet if I like it.

public void repeat(int dist, int len) throws IOException {
    if (dist < 0 || dist >= full)
        throw new CorruptedInputException();

    int left = Math.min(limit - pos, len);
    pendingLen = len - left;
    pendingDist = dist;

    int back = pos - dist - 1;
    if (back < 0) {
        // We won't get here if the dictionary isn't full.
        assert full == bufSize;

        // The distance wraps around to the end of the cyclic dictionary
        // buffer. Here we will never copy more than dist + 1 bytes
        // and so the copying won't repeat from its own output. Thus,
        // we can always use arraycopy safely.
        back += bufSize;
        int copySize = Math.min(bufSize - back, left);
        assert copySize <= dist + 1;

        System.arraycopy(buf, back, buf, pos, copySize);
        pos += copySize;
        back = 0;
        left -= copySize;

        if (left == 0)
            return;
    }

    assert back < pos;
    assert left > 0;

    do {
        // Determine the number of bytes to copy on this loop iteration:
        // copySize is set so that the source and destination ranges
        // don't overlap. If "left" is large enough, the destination
        // range will start right after the last byte of the source
        // range. This way we don't need to advance "back" which
        // allows the next iteration of this loop to copy (up to)
        // twice the number of bytes.
        int copySize = Math.min(left, pos - back);

        // With tiny copy sizes arraycopy is slower than a byte-by-byte
        // loop. With typical files the difference is tiny but with
        // unusual files this can matter more.
        if (copySize < 4) {
            int i = 0;
            do {
                buf[pos + i] = buf[back + i];
            } while (++i < copySize);
        } else {
            System.arraycopy(buf, back, buf, pos, copySize);
        }

        pos += copySize;
        left -= copySize;
    } while (left > 0);

    if (full < pos)
        full = pos;
}
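
As a side note, the doubling is easy to see in isolation. This little
standalone demo (not part of the patch) repeats one byte 100 times the
same way the do-while above does:

public class DoublingCopyDemo {
    public static void main(String[] args) {
        // Demo only: dist == 0, i.e. repeat buf[0] a hundred times.
        byte[] buf = new byte[128];
        buf[0] = 'x';
        int pos = 1;
        int back = 0;
        int left = 100;

        do {
            // The source and destination ranges never overlap, and the
            // overlap-free region doubles on every iteration because
            // "back" is not advanced.
            int copySize = Math.min(left, pos - back);
            System.arraycopy(buf, back, buf, pos, copySize);
            System.out.println("copied " + copySize + " bytes");
            pos += copySize;
            left -= copySize;
        } while (left > 0);

        // Prints 1, 2, 4, 8, 16, 32, 37: seven arraycopy calls instead
        // of 100 single-byte iterations.
    }
}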

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Re: java LZDecoder small improvement

2021-02-06 Thread Brett Okken
Here is a patch with the changes. The benchmark results follow.


diff --git a/src/org/tukaani/xz/lz/LZDecoder.java b/src/org/tukaani/xz/lz/LZDecoder.java
index 85b2ca1..565209a 100644
--- a/src/org/tukaani/xz/lz/LZDecoder.java
+++ b/src/org/tukaani/xz/lz/LZDecoder.java
@@ -12,6 +12,7 @@ package org.tukaani.xz.lz;
 
 import java.io.DataInputStream;
 import java.io.IOException;
+
 import org.tukaani.xz.ArrayCache;
 import org.tukaani.xz.CorruptedInputException;
 
@@ -95,12 +96,39 @@ public final class LZDecoder {
         if (dist >= pos)
             back += bufSize;
 
-        do {
-            buf[pos++] = buf[back++];
+        buf[pos++] = buf[back++];
+        --left;
+        // then handle in bulk if there is work remaining
+        while (left > 0) {
             if (back == bufSize)
                 back = 0;
-        } while (--left > 0);
 
+            // it is possible for the "repeat" to include content which is
+            // going to be generated here so we have to limit ourselves to
+            // how much data is already in the buffer (i.e. pos - back). It
+            // is also possible that "back" can actually be forward in the
+            // buffer from position, in which case that comparison does not
+            // matter
+            int toCopy = Math.min(left, bufSize - back);
+            int tmp = pos - back;
+            if (tmp < toCopy && tmp > 0) {
+                // if the delta between pos and back is smaller than how
+                // much there is to copy, then we can safely repeat it all
+                // the way for what is left
+                do {
+                    System.arraycopy(buf, back, buf, pos, tmp);
+                    pos += tmp;
+                    left -= tmp;
+                    tmp <<= 1;
+                } while (left > tmp);
+            } else {
+                System.arraycopy(buf, back, buf, pos, toCopy);
+                pos += toCopy;
+                back += toCopy;
+                left -= toCopy;
+            }
+        }
+
         if (full < pos)
             full = pos;
     }



Here are benchmarks covering jdk 8, 11, and 15. The baseline is
current TRUNK, which already has the crc64 changes relative to 1.8.
For the "real" content, the improvements range from fairly small for
the image (~5%) to ~15% for the large xml file. The artificial
repeating-byte file improved by ~40%.
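
For reference, the numbers below are JMH average-time results. A
harness along these lines (a sketch; the actual setup and file
handling here are assumptions) reproduces the shape of these runs:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.tukaani.xz.XZInputStream;

// Sketch of the benchmark setup; details are assumptions, not the
// exact harness used for the tables below.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class XZDecodeBenchmark {
    @Param({"ihe_ovly_pr.dcm", "image1.dcm", "large.xml"})
    public String file;

    private byte[] compressed;

    @Setup
    public void setup() throws IOException {
        compressed = Files.readAllBytes(Paths.get(file + ".xz"));
    }

    @Benchmark
    public long decode() throws IOException {
        // Decompress the whole file; LZDecoder.repeat dominates the
        // profile for these inputs.
        try (InputStream in =
                 new XZInputStream(new ByteArrayInputStream(compressed))) {
            byte[] tmp = new byte[8192];
            long total = 0;
            for (int n; (n = in.read(tmp)) > 0; )
                total += n;
            return total;
        }
    }
}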



jdk 8 64 bit
TRUNK
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.589 ±  0.007  ms/op
     image1.dcm  avgt    3  384.959 ±  7.477  ms/op
      large.xml  avgt    3  237.317 ± 14.152  ms/op
     1 byte rep  avgt    3  388.612 ± 12.811  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.525 ±  0.022  ms/op
     image1.dcm  avgt    3  371.327 ± 23.419  ms/op
      large.xml  avgt    3  190.587 ±  9.545  ms/op
     1 byte rep  avgt    3  216.481 ±  3.936  ms/op

jdk 11 64 bit
TRUNK
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.662 ±  0.012  ms/op
     image1.dcm  avgt    3  391.644 ± 13.871  ms/op
      large.xml  avgt    3  225.456 ± 16.265  ms/op
     1 byte rep  avgt    3  387.347 ± 18.811  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.607 ±  0.187  ms/op
     image1.dcm  avgt    3  369.419 ± 32.400  ms/op
      large.xml  avgt    3  190.580 ±  7.856  ms/op
     1 byte rep  avgt    3  220.554 ±  8.812  ms/op

jdk 15 64 bit
TRUNK
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.661 ±  0.034  ms/op
     image1.dcm  avgt    3  375.150 ± 22.059  ms/op
      large.xml  avgt    3  219.960 ±  7.347  ms/op
     1 byte rep  avgt    3  385.606 ± 13.185  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.609 ±  0.139  ms/op
     image1.dcm  avgt    3  367.665 ±  0.555  ms/op
      large.xml  avgt    3  187.471 ± 56.143  ms/op
     1 byte rep  avgt    3  217.445 ± 16.682  ms/op

jdk 8 32 bit (server)
TRUNK
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.627 ±  0.046  ms/op
     image1.dcm  avgt    3  457.876 ± 28.759  ms/op
      large.xml  avgt    3  279.929 ± 16.974  ms/op
     1 byte rep  avgt    3  490.610 ± 67.420  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.561 ±  0.018  ms/op
     image1.dcm  avgt    3  426.993 ± 12.331  ms/op
      large.xml  avgt    3  228.273 ± 13.871  ms/op
     1 byte rep  avgt    3  309.623 ± 94.535  ms/op

jdk 11 32 bit (server)
TRUNK
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.855 ±  0.058  ms/op
     image1.dcm  avgt    3  655.695 ± 14.873  ms/op
      large.xml  avgt    3  411.228 ±  3.691  ms/op
     1 byte rep  avgt    3  696.332 ± 35.806  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.740 ±  0.011  ms/op
     image1.dcm  avgt    3  616.774 ± 25.305  ms/op
      large.xml  avgt    3  322.130 ±  4.882  ms/op
     1 byte rep  avgt    3  422.353 ± 17.333  ms/op



Re: [xz-devel] Re: java LZDecoder small improvement

2021-02-05 Thread Brett Okken
> With a file with a two-byte repeat ("ababababababab"...) it's 50 %
> slower than the baseline. Calling arraycopy in a loop, copying two
> bytes at a time, is not efficient. I didn't try to find out how big the
> copy needs to be to make the overhead of arraycopy smaller than the
> benefit, but clearly it needs to be bigger than two bytes.

I worked this out last night. We need to double how much we copy each
time by not advancing "back". This actually works even better than
Arrays.fill for the single byte case also.

I still need to run through 1.8 and 15 to make sure this holds up, but
this implementation is better for all files on jdk 11:

int back = pos - dist - 1;
if (dist >= pos)
    back += bufSize;

buf[pos++] = buf[back++];
--left;
// then handle in bulk if there is work remaining
while (left > 0) {
    if (back == bufSize)
        back = 0;

    // it is possible for the "repeat" to include content which is going
    // to be generated here so we have to limit ourselves to how much
    // data is already in the buffer (i.e. pos - back). It is also
    // possible that "back" can actually be forward in the buffer from
    // position, in which case that comparison does not matter
    int toCopy = Math.min(left, bufSize - back);
    int tmp = pos - back;
    if (tmp < toCopy && tmp > 0) {
        // if the delta between pos and back is smaller than how much
        // there is to copy, then we can safely repeat it all the way
        // for what is left
        do {
            System.arraycopy(buf, back, buf, pos, tmp);
            pos += tmp;
            left -= tmp;
            tmp <<= 1;
        } while (left > tmp);
    } else {
        System.arraycopy(buf, back, buf, pos, toCopy);
        pos += toCopy;
        back += toCopy;
        left -= toCopy;
    }
}

if (full < pos)
    full = pos;



Re: [xz-devel] Re: java LZDecoder small improvement

2021-02-05 Thread Lasse Collin
On 2021-02-03 Brett Okken wrote:
> On Wed, Feb 3, 2021 at 2:56 PM Lasse Collin
>  wrote:
> > It seems to regress horribly if dist is zero. A file with a very
> > long sequence of the same byte is good for testing.
> 
> Would this be a valid test of what you are describing?
[...]
> The source is effectively 160MB of the same byte value.

Yes, it's fine.

> I found a strange bit of behavior with this case in the compression.
> In LZMAEncoderNormal.calcLongRepPrices, I am seeing a case where
> 
> int len2Limit = Math.min(niceLen, avail - len - 1);
> 
> results in -1 (avail and len are both 8). This results in calling
> LZEncoder.getMatchLen with a lenLimit of -1. Is that expected?

I didn't check in detail now, but I think it's expected. There are two
such places. Because of this detail, a speed optimization in these two
places was left out of liblzma. I finally remembered to add the
optimization in 5.2.5.

On 2021-02-03 Brett Okken wrote:
> I still need to do more testing across jdk 8 and 15, but initial
> returns on this are pretty positive. The repeating byte file is
> meaningfully faster than baseline. One of my test files (image1.dcm)
> does not improve much from baseline, but the other 2 files do.

The repeating byte is indeed much faster than the baseline. With normal
files the speed seems to be about the same as the version I posted, so
a minor improvement over the baseline.

With a file with a two-byte repeat ("ababababababab"...) it's 50 %
slower than the baseline. Calling arraycopy in a loop, copying two
bytes at a time, is not efficient. I didn't try to find out how big the
copy needs to be to make the overhead of arraycopy smaller than the
benefit, but clearly it needs to be bigger than two bytes.
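
For reference, such an "ababab" input can be produced with a small
variation of the repeating-byte generator posted elsewhere in this
thread (a sketch, not tested code; the file name is arbitrary):

// sketch; mirrors the repeating-byte generator, but alternates 'a'/'b'
byte[] block = new byte[16 * 1024];
for (int i = 0; i < block.length; ++i)
    block[i] = (byte) ((i & 1) == 0 ? 'a' : 'b');

try (FileOutputStream fos = new FileOutputStream("abab.xz");
     XZOutputStream xos = new XZOutputStream(fos, new LZMA2Options())) {
    // ~160 MiB of "abab..." so the data is much larger than the
    // dictionary
    for (int i = 0; i < 10240; ++i)
        xos.write(block);
}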

The use of Arrays.fill to optimize the case of one repeating byte looks
useful, especially if it won't hurt performance in other situations.
Still, I'm not sure yet if the LZDecoder optimizations should go in 1.9.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] Re: java LZDecoder small improvement

2021-02-03 Thread Brett Okken
I still need to do more testing across jdk 8 and 15, but initial
returns on this are pretty positive. The repeating byte file is
meaningfully faster than baseline. One of my test files (image1.dcm)
does not improve much from baseline, but the other 2 files do.


diff --git a/src/org/tukaani/xz/lz/LZDecoder.java b/src/org/tukaani/xz/lz/LZDecoder.java
index 85b2ca1..b062a9d 100644
--- a/src/org/tukaani/xz/lz/LZDecoder.java
+++ b/src/org/tukaani/xz/lz/LZDecoder.java
@@ -12,6 +12,8 @@ package org.tukaani.xz.lz;
 
 import java.io.DataInputStream;
 import java.io.IOException;
+import java.util.Arrays;
+
 import org.tukaani.xz.ArrayCache;
 import org.tukaani.xz.CorruptedInputException;
 
@@ -92,14 +94,52 @@ public final class LZDecoder {
         pendingDist = dist;
 
         int back = pos - dist - 1;
-        if (dist >= pos)
+        if (dist >= pos) {
+            // We won't get here if the dictionary isn't full.
+            assert full == bufSize;
+
+            // The distance wraps around to the end of the cyclic dictionary
+            // buffer. Here we will never copy more than dist + 1 bytes
+            // and so the copying won't repeat from its own output. Thus,
+            // we can always use arraycopy safely.
             back += bufSize;
+            int copySize = Math.min(bufSize - back, left);
+            assert copySize <= dist + 1;
+            System.arraycopy(buf, back, buf, pos, copySize);
+            pos += copySize;
+            back = 0;
+            left -= copySize;
+
+            if (left == 0)
+                return;
+        }
+
+        assert left > 0;
+
+        // the difference between pos and back is how much data is in the
+        // source buffer to be repeated
+        final int delta = pos - back;
+        if (delta < left) {
+            // We are copying more than dist + 1 bytes and thus will partly
+            // copy from our own output.
+            if (delta > 1) {
+                // take the size of data to be repeated, and copy it in loop
+                for (int i = 0, j = left / delta; i < j; ++i) {
+                    System.arraycopy(buf, back, buf, pos, delta);
+                    pos += delta;
+                    left -= delta;
+                }
+            } else {
+                // delta == 1 is a single repeating byte
+                Arrays.fill(buf, pos, pos + left, buf[back]);
+                pos += left;
+                left = 0;
+            }
+        }
 
-        do {
-            buf[pos++] = buf[back++];
-            if (back == bufSize)
-                back = 0;
-        } while (--left > 0);
+        System.arraycopy(buf, back, buf, pos, left);
+        pos += left;
 
         if (full < pos)
             full = pos;



Re: [xz-devel] Re: java LZDecoder small improvement

2021-02-03 Thread Brett Okken
On Wed, Feb 3, 2021 at 2:56 PM Lasse Collin  wrote:
>
> On 2021-02-01 Brett Okken wrote:
> > I have played with this quite a bit and have come up with a slightly
> > modified change which does not regress for the smallest of the sample
> > objects and shows a nice improvement for the 2 larger files.
>
> It seems to regress horribly if dist is zero. A file with a very long
> sequence of the same byte is good for testing.
>

Would this be a valid test of what you are describing?

final byte[] bytes = new byte[16 * 1024];
Arrays.fill(bytes, (byte) -75);

final byte[] compressed;
try (final ByteArrayOutputStream baos = new ByteArrayOutputStream();
     final XZOutputStream xos = new XZOutputStream(baos,
                                                   new LZMA2Options()))
{
    for (int i = 0; i < 10240; ++i) {
        xos.write(bytes);
    }
    xos.finish();
    compressed = baos.toByteArray();
}

The source is effectively 160MB of the same byte value.
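
The decompression side, which is what exercises LZDecoder.repeat, can
then be timed with a plain read loop (a sketch; the actual timing
harness is not shown here):

// sketch of a decode loop over the compressed bytes from above
long total = 0;
try (XZInputStream xis =
         new XZInputStream(new ByteArrayInputStream(compressed))) {
    final byte[] tmp = new byte[8192];
    for (int n; (n = xis.read(tmp)) > 0; )
        total += n;
}
// total == 160L * 1024 * 1024, the uncompressed size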


I found a strange bit of behavior with this case in the compression.
In LZMAEncoderNormal.calcLongRepPrices, I am seeing a case where

int len2Limit = Math.min(niceLen, avail - len - 1);

results in -1 (avail and len are both 8). This results in calling
LZEncoder.getMatchLen with a lenLimit of -1. Is that expected?
When I was testing with java 8 and the ArrayUtil changes, this
resulted in an ArrayIndexOutOfBoundsException.



Re: [xz-devel] Re: java LZDecoder small improvement

2021-02-03 Thread Lasse Collin
On 2021-02-01 Brett Okken wrote:
> I have played with this quite a bit and have come up with a slightly
> modified change which does not regress for the smallest of the sample
> objects and shows a nice improvement for the 2 larger files.

It seems to regress horribly if dist is zero. A file with a very long
sequence of the same byte is good for testing.

The problem is that tmp is almost always 1 and then each arraycopy call
will copy exactly one byte. The overhead is very high compared to doing
the copying in a loop like in the original code.

Below is a different version which is a little faster with Java 15 but
worse than the current simple code on Java 8 (tested on the same
computer and OS). The improvement over the current code is like 3-5 %
with Java 15, so not a lot but not insignificant either (such
optimizations add up). However, if the change is neutral or clearly
negative on Java 8, maybe this patch isn't worth the complexity yet.
Java 8 is still supported by its upstream.

Maybe you get different results. Make sure the uncompressed size of the
test files is several times larger than the dictionary size.

With the current knowledge I think this patch will need to wait past XZ
for Java 1.9.

diff --git a/src/org/tukaani/xz/lz/LZDecoder.java b/src/org/tukaani/xz/lz/LZDecoder.java
index 85b2ca1..8b3564c 100644
--- a/src/org/tukaani/xz/lz/LZDecoder.java
+++ b/src/org/tukaani/xz/lz/LZDecoder.java
@@ -92,14 +92,39 @@ public final class LZDecoder {
         pendingDist = dist;
 
         int back = pos - dist - 1;
-        if (dist >= pos)
+        if (dist >= pos) {
+            // We won't get here if the dictionary isn't full.
+            assert full == bufSize;
+
+            // The distance wraps around to the end of the cyclic dictionary
+            // buffer. Here we will never copy more than dist + 1 bytes
+            // and so the copying won't repeat from its own output. Thus,
+            // we can always use arraycopy safely.
             back += bufSize;
+            int copySize = Math.min(bufSize - back, left);
+            assert copySize <= dist + 1;
+
+            System.arraycopy(buf, back, buf, pos, copySize);
+            pos += copySize;
+            back = 0;
+            left -= copySize;
 
-        do {
-            buf[pos++] = buf[back++];
-            if (back == bufSize)
-                back = 0;
-        } while (--left > 0);
+            if (left == 0)
+                return;
+        }
+
+        assert left > 0;
+
+        if (left > dist + 1) {
+            // We are copying more than dist + 1 bytes and thus will partly
+            // copy from our own output.
+            do {
+                buf[pos++] = buf[back++];
+            } while (--left > 0);
+        } else {
+            System.arraycopy(buf, back, buf, pos, left);
+            pos += left;
+        }
 
         if (full < pos)
             full = pos;

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode