Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-03-01 Brett Okken wrote:
> > One thing that confuses me in your version is the special handling
> > of the first byte:
> >
> >     buf[pos++] = buf[back++];
> >     --left;
> >
> > If there are two bytes to copy, then one will be copied above and
> > the other with arraycopy later. If there are more bytes to copy and
> > distance is very small, incrementing "back" above can mean that an
> > extra arraycopy call might be needed in the loop because the first
> > copy will be one byte smaller.
> >
> > I understand that it might help when there is just one byte to
> > repeat because then the while-loop will be skipped. In all other
> > situations it sounds like the special handling of the first byte
> > would in theory be harmful. Note that I don't doubt your test
> > results; I already saw with the CRC64 code that some changes in the
> > code can affect performance in weird ways.
>
> The image1.dcm file is the most impacted by this optimization. Again,
> this file is basically a large greyscale bmp, which results in a
> significant number of single-byte repeats. Optimizing for the single
> byte improves performance on that file by 3-5%, while having smaller
> effects on the other 2 files (ihe_ovly_pr.dcm slightly slower,
> large.xml slightly faster).

OK, that is an interesting test case.

> I agree your approach is more readable. From your version of it, I was
> expecting that simplicity to translate into better performance. This
> latest version actually does appear to do that. The image1.dcm
> performance matches my version and the other 2 are a bit faster.
> Adding the single byte optimization still speeds up image1.dcm
> (~8 ms, ~2%) and large.xml (~3 ms, ~2%), while slowing ihe_ovly_pr.dcm
> (~0.008 ms, ~1%).

[...]

> Version 3 is better for all 3 files.

With these results I now plan to include version 3 in the next release.
It sounds like the single-byte optimization has a fairly small effect,
and omitting it keeps the code a tiny bit simpler.

I have committed the change. I think xz-java.git should now be almost
ready for a release. I just need to add NEWS and bump the version
number.

Thanks for your help!

-- 
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
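For reference, a minimal sketch of how the debated single-byte shortcut
would combine with version 3. This is an illustration only, not the
committed code; it assumes the surrounding repeat() fields (buf, pos,
back, left, full) and, unlike the earlier patch, does not advance
"back", so the arraycopy loop below keeps its full copy sizes:

    // Copy the first byte without arraycopy; a length-1 repeat
    // (common in image1.dcm) then skips the loop entirely.
    buf[pos++] = buf[back];
    if (--left == 0) {
        if (full < pos)
            full = pos;
        return;
    }

    do {
        int copySize = Math.min(left, pos - back);
        System.arraycopy(buf, back, buf, pos, copySize);
        pos += copySize;
        left -= copySize;
    } while (left > 0);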
Re: [xz-devel] Re: java LZDecoder small improvement
> With a quick try I got a feeling that my worry about short repeats was
> wrong. It doesn't matter because decoding each LZMA symbol is much more
> expensive. What matters is avoiding multiple tiny arraycopy calls
> within a single run of the repeat method, and that problem was already
> solved.

My observation is basically the same. Individual tiny arraycopy calls
do not really matter, but tiny arraycopy calls in a tight loop are
quite bad. My /guess/ is that the optimizer is not able to unroll these
loops effectively, possibly because of the arraycopy intrinsic.

> > > I came up with the following. I haven't decided yet if I like it.
> >
> > On the 3 files I have been testing with, this change is a mixed bag.
> > Compared to trunk, 1 regresses by ~8%. While the other 2 do improve,
> > neither are better than my last patch.
>
> OK, thanks. So it isn't great. I wonder which details make the
> difference.

I think some of the problem is too many branches, making
prediction/speculative execution less useful. Another issue is the
absence of the single byte optimization, which I will address more
below.

> One thing that confuses me in your version is the special handling of
> the first byte:
>
>     buf[pos++] = buf[back++];
>     --left;
>
> If there are two bytes to copy, then one will be copied above and the
> other with arraycopy later. If there are more bytes to copy and
> distance is very small, incrementing "back" above can mean that an
> extra arraycopy call might be needed in the loop because the first copy
> will be one byte smaller.
>
> I understand that it might help when there is just one byte to repeat
> because then the while-loop will be skipped. In all other situations it
> sounds like the special handling of the first byte would in theory be
> harmful. Note that I don't doubt your test results; I already saw
> with the CRC64 code that some changes in the code can affect
> performance in weird ways.

The image1.dcm file is the most impacted by this optimization. Again,
this file is basically a large greyscale bmp, which results in a
significant number of single-byte repeats. Optimizing for the single
byte improves performance on that file by 3-5%, while having smaller
effects on the other 2 files (ihe_ovly_pr.dcm slightly slower,
large.xml slightly faster).

> Your code needs
>
>     if (back == bufSize)
>         back = 0;
>
> in the beginning of the while-loop and later checking for tmp > 0. My
> version avoids these branches by handling those cases under "if (back <
> 0)" (which is equivalent to "if (dist >= pos)"). On the other hand,
> under "if (back < 0)" all copies, including tiny copies, are done with
> arraycopy.
>
> Another tiny difference is that your code uses left shift to double the
> copy size in the loop while I used Math.min(pos - back, left).
>
> > I was able to improve this a bit by pulling the handling of small
> > copies outside of the while loop. This eliminates the regressions
> > compared to trunk, but still does not feel like an improvement over
> > my last patch.
>
> Yeah, the switch isn't worth it. If I understand it correctly now,
> trying to avoid arraycopy for the tiny copies wasn't a useful idea in
> the first place. So the code can be simplified ("version 3"):
>
>     int back = pos - dist - 1;
>     if (back < 0) {
>         // The distance wraps around to the end of the cyclic
>         // dictionary buffer. We cannot get here if the dictionary
>         // isn't full.
>         assert full == bufSize;
>         back += bufSize;
>
>         // Here we will never copy more than dist + 1 bytes and
>         // so the copying won't repeat from its own output.
>         // Thus, we can always use arraycopy safely.
>         int copySize = Math.min(bufSize - back, left);
>         assert copySize <= dist + 1;
>
>         System.arraycopy(buf, back, buf, pos, copySize);
>         pos += copySize;
>         back = 0;
>         left -= copySize;
>
>         if (left == 0)
>             return;
>     }
>
>     assert back < pos;
>     assert left > 0;
>
>     do {
>         // Determine the number of bytes to copy on this loop
>         // iteration: copySize is set so that the source and
>         // destination ranges don't overlap. If "left" is large
>         // enough, the destination range will start right after the
>         // last byte of the source range. This way we don't need to
>         // advance "back" which allows the next iteration of this
>         // loop to copy (up to) twice the number of bytes.
>         int copySize = Math.min(left, pos - back);
>         System.arraycopy(buf, back, buf, pos, copySize);
>         pos += copySize;
>         left -= copySize;
>     } while (left > 0);
>
> I know I may sound stubborn for not accepting your version as is but
> the special handling of the first byte and the
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-13 Brett Okken wrote:
> On Thu, Feb 11, 2021 at 12:51 PM Lasse Collin wrote:
> > I still worry about short copies. If the file is full of tiny
> > matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such
> > files aren't typical at all but I don't want to add a corner case
> > where the performance drops too much.
>
> Do you have examples of such files, or code on how to generate one?

Use the patch below and compress with this:

    java -jar build/jar/XZEncDemo.jar 2 < infile > outfile.xz

Adjust LIMIT to get longer matches.

diff --git a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
index f8230ee..cd92ca6 100644
--- a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
+++ b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
@@ -44,6 +44,8 @@ final class LZMAEncoderFast extends LZMAEncoder {
         return smallDist < (bigDist >>> 7);
     }
 
+    private static final int LIMIT = 2;
+
     int getNextSymbol() {
         // Get the matches for the next byte unless readAhead indicates
         // that we already got the new matches during the previous call
@@ -66,11 +68,13 @@ final class LZMAEncoderFast extends LZMAEncoder {
         int bestRepIndex = 0;
         for (int rep = 0; rep < REPS; ++rep) {
             int len = lz.getMatchLen(reps[rep], avail);
+            if (len > LIMIT)
+                len = LIMIT;
             if (len < MATCH_LEN_MIN)
                 continue;
 
             // If it is long enough, return it.
-            if (len >= niceLen) {
+            if (len >= LIMIT) {
                 back = rep;
                 skip(len - 1);
                 return len;
@@ -88,9 +92,11 @@ final class LZMAEncoderFast extends LZMAEncoder {
 
         if (matches.count > 0) {
             mainLen = matches.len[matches.count - 1];
+            if (mainLen > LIMIT)
+                mainLen = LIMIT;
             mainDist = matches.dist[matches.count - 1];
 
-            if (mainLen >= niceLen) {
+            if (mainLen >= LIMIT) {
                 back = mainDist + REPS;
                 skip(mainLen - 1);
                 return mainLen;

With a quick try I got a feeling that my worry about short repeats was
wrong. It doesn't matter because decoding each LZMA symbol is much more
expensive. What matters is avoiding multiple tiny arraycopy calls
within a single run of the repeat method, and that problem was already
solved.

> > I came up with the following. I haven't decided yet if I like it.
>
> On the 3 files I have been testing with, this change is a mixed bag.
> Compared to trunk, 1 regresses by ~8%. While the other 2 do improve,
> neither are better than my last patch.

OK, thanks. So it isn't great. I wonder which details make the
difference.

One thing that confuses me in your version is the special handling of
the first byte:

    buf[pos++] = buf[back++];
    --left;

If there are two bytes to copy, then one will be copied above and the
other with arraycopy later. If there are more bytes to copy and
distance is very small, incrementing "back" above can mean that an
extra arraycopy call might be needed in the loop because the first copy
will be one byte smaller.

I understand that it might help when there is just one byte to repeat
because then the while-loop will be skipped. In all other situations it
sounds like the special handling of the first byte would in theory be
harmful. Note that I don't doubt your test results; I already saw with
the CRC64 code that some changes in the code can affect performance in
weird ways.

Your code needs

    if (back == bufSize)
        back = 0;

in the beginning of the while-loop and later checking for tmp > 0. My
version avoids these branches by handling those cases under "if (back <
0)" (which is equivalent to "if (dist >= pos)"). On the other hand,
under "if (back < 0)" all copies, including tiny copies, are done with
arraycopy.

Another tiny difference is that your code uses left shift to double the
copy size in the loop while I used Math.min(pos - back, left).

> I was able to improve this a bit by pulling the handling of small
> copies outside of the while loop. This eliminates the regressions
> compared to trunk, but still does not feel like an improvement over my
> last patch.

Yeah, the switch isn't worth it. If I understand it correctly now,
trying to avoid arraycopy for the tiny copies wasn't a useful idea in
the first place. So the code can be simplified ("version 3"):

    int back = pos - dist - 1;
    if (back < 0) {
        // The distance wraps around to the end of the cyclic
        // dictionary buffer. We cannot get here if the dictionary
        // isn't full.
        assert full == bufSize;
        back += bufSize;

        // Here we will never copy more than dist + 1 bytes and
        // so the copying won't repeat from its own output.
        // Thus, we can always use arraycopy safely.
        int copySize = Math.min(bufSize - back, left);
        assert copySize <= dist + 1;

        System.arraycopy(buf, back, buf, pos, copySize);
        pos += copySize;
        back = 0;
        left -= copySize;

        if (left == 0)
            return;
    }

    assert back < pos;
    assert left > 0;

    do {
        // Determine the number of bytes to copy on this loop
        // iteration: copySize is set so that the source and
        // destination ranges don't overlap. If "left" is large
        // enough, the destination range will start right after the
        // last byte of the source range. This way we don't need to
        // advance "back" which allows the next iteration of this
        // loop to copy (up to) twice the number of bytes.
        int copySize = Math.min(left, pos - back);
        System.arraycopy(buf, back, buf, pos, copySize);
        pos += copySize;
        left -= copySize;
    } while (left > 0);
Re: [xz-devel] Re: java LZDecoder small improvement
On Thu, Feb 11, 2021 at 12:51 PM Lasse Collin wrote:
>
> On 2021-02-05 Brett Okken wrote:
> > I worked this out last night. We need to double how much we copy each
> > time by not advancing "back". This actually works even better than
> > Arrays.fill for the single byte case also.
>
> This clearly is a good idea in a Java implementation. :-)
>
> I still worry about short copies. If the file is full of tiny
> matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such
> files aren't typical at all but I don't want to add a corner case
> where the performance drops too much.

Do you have examples of such files, or code on how to generate one?

The case of a 1 byte match/repeat is optimized for in my latest patch,
as that optimization did provide a noticeable improvement in the (real)
files I have been testing with. While I did observe some 2 and 3 byte
repeats, they appear not to be common enough to negatively impact
overall throughput. It would be quite helpful to have some examples so
that we can at least quantify the impact.

> I came up with the following. I haven't decided yet if I like it.

On the 3 files I have been testing with, this change is a mixed bag.
Compared to trunk, 1 regresses by ~8%. While the other 2 do improve,
neither are better than my last patch.

jdk 11 64 bit

TRUNK
(file)           Mode  Cnt    Score     Error  Units
ihe_ovly_pr.dcm  avgt    3    0.662 ±   0.012  ms/op
image1.dcm       avgt    3  391.644 ±  13.871  ms/op
large.xml        avgt    3  225.456 ±  16.265  ms/op

(okken last)
ihe_ovly_pr.dcm  avgt    3    0.607 ±   0.187  ms/op
image1.dcm       avgt    3  369.419 ±  32.400  ms/op
large.xml        avgt    3  190.580 ±   7.856  ms/op

(lasse new)
(file)           Mode  Cnt    Score     Error  Units
ihe_ovly_pr.dcm  avgt    3    0.628 ±   0.066  ms/op
image1.dcm       avgt    3  424.159 ±  14.823  ms/op
large.xml        avgt    3  192.763 ±   6.831  ms/op

I was able to improve this a bit by pulling the handling of small
copies outside of the while loop. This eliminates the regressions
compared to trunk, but still does not feel like an improvement over my
last patch.

(lasse + outer switch)
(file)           Mode  Cnt    Score     Error  Units
ihe_ovly_pr.dcm  avgt    3    0.633 ±   0.032  ms/op
image1.dcm       avgt    3  390.868 ±  40.598  ms/op
large.xml        avgt    3  190.030 ±   2.619  ms/op

    int back = pos - dist - 1;
    if (back < 0) {
        // We won't get here if the dictionary isn't full.
        assert full == bufSize;

        // The distance wraps around to the end of the cyclic dictionary
        // buffer. Here we will never copy more than dist + 1 bytes
        // and so the copying won't repeat from its own output. Thus,
        // we can always use arraycopy safely.
        back += bufSize;

        int copySize = Math.min(bufSize - back, left);
        assert copySize <= dist + 1;

        System.arraycopy(buf, back, buf, pos, copySize);
        pos += copySize;
        back = 0;
        left -= copySize;

        if (left == 0)
            return;
    }

    assert back < pos;
    assert left > 0;

    int copySize = Math.min(left, pos - back);
    switch (copySize) {
    case 3:
        buf[pos + 2] = buf[back + 2];
    case 2:
        buf[pos + 1] = buf[back + 1];
    case 1:
        buf[pos] = buf[back];
        break;
    default:
        System.arraycopy(buf, back, buf, pos, copySize);
    }
    pos += copySize;
    left -= copySize;

    while (left > 0) {
        copySize = Math.min(left, copySize << 1);
        System.arraycopy(buf, back, buf, pos, copySize);
        pos += copySize;
        left -= copySize;
    }
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-05 Brett Okken wrote:
> I worked this out last night. We need to double how much we copy each
> time by not advancing "back". This actually works even better than
> Arrays.fill for the single byte case also.

This clearly is a good idea in a Java implementation. :-)

I still worry about short copies. If the file is full of tiny
matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such files
aren't typical at all but I don't want to add a corner case where the
performance drops too much.

I came up with the following. I haven't decided yet if I like it.

    public void repeat(int dist, int len) throws IOException {
        if (dist < 0 || dist >= full)
            throw new CorruptedInputException();

        int left = Math.min(limit - pos, len);
        pendingLen = len - left;
        pendingDist = dist;

        int back = pos - dist - 1;
        if (back < 0) {
            // We won't get here if the dictionary isn't full.
            assert full == bufSize;

            // The distance wraps around to the end of the cyclic
            // dictionary buffer. Here we will never copy more than
            // dist + 1 bytes and so the copying won't repeat from its
            // own output. Thus, we can always use arraycopy safely.
            back += bufSize;

            int copySize = Math.min(bufSize - back, left);
            assert copySize <= dist + 1;

            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            back = 0;
            left -= copySize;

            if (left == 0)
                return;
        }

        assert back < pos;
        assert left > 0;

        do {
            // Determine the number of bytes to copy on this loop
            // iteration: copySize is set so that the source and
            // destination ranges don't overlap. If "left" is large
            // enough, the destination range will start right after the
            // last byte of the source range. This way we don't need to
            // advance "back" which allows the next iteration of this
            // loop to copy (up to) twice the number of bytes.
            int copySize = Math.min(left, pos - back);

            // With tiny copy sizes arraycopy is slower than a
            // byte-by-byte loop. With typical files the difference is
            // tiny but with unusual files this can matter more.
            if (copySize < 4) {
                int i = 0;
                do {
                    buf[pos + i] = buf[back + i];
                } while (++i < copySize);
            } else {
                System.arraycopy(buf, back, buf, pos, copySize);
            }

            pos += copySize;
            left -= copySize;
        } while (left > 0);

        if (full < pos)
            full = pos;
    }

-- 
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
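To make the wrap-around arithmetic in the "back < 0" branch concrete,
here is a tiny standalone trace. The numbers are made up for
illustration and do not come from the thread:

    public class WrapDemo {
        public static void main(String[] args) {
            // Dictionary full: bufSize == full == 8, write position 2.
            int bufSize = 8, pos = 2, dist = 4, left = 6;

            int back = pos - dist - 1;   // -3: match starts before index 0
            if (back < 0)
                back += bufSize;         // 5: 3 bytes before the end

            // The first copy is capped at the end of the cyclic buffer
            // and can never exceed dist + 1, so it cannot overlap its
            // own output.
            int copySize = Math.min(bufSize - back, left);  // 3
            System.out.println("back=" + back + ", first copySize=" + copySize);
        }
    }

After that first copy, back is reset to 0 and the normal loop finishes
the remaining bytes from the start of the buffer.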
Re: [xz-devel] Re: java LZDecoder small improvement
Here is a patch with the changes. The benchmark results follow.

diff --git a/src/org/tukaani/xz/lz/LZDecoder.java b/src/org/tukaani/xz/lz/LZDecoder.java
index 85b2ca1..565209a 100644
--- a/src/org/tukaani/xz/lz/LZDecoder.java
+++ b/src/org/tukaani/xz/lz/LZDecoder.java
@@ -12,6 +12,7 @@ package org.tukaani.xz.lz;
 
 import java.io.DataInputStream;
 import java.io.IOException;
+
 import org.tukaani.xz.ArrayCache;
 import org.tukaani.xz.CorruptedInputException;
 
@@ -95,12 +96,39 @@ public final class LZDecoder {
         if (dist >= pos)
             back += bufSize;
 
-        do {
-            buf[pos++] = buf[back++];
+        buf[pos++] = buf[back++];
+        --left;
+        // then handle in bulk if there is work remaining
+        while (left > 0) {
             if (back == bufSize)
                 back = 0;
-        } while (--left > 0);
+
+            // It is possible for the "repeat" to include content which
+            // is going to be generated here, so we have to limit
+            // ourselves to how much data is already in the buffer
+            // (i.e. pos - back). It is also possible that "back" can
+            // actually be forward in the buffer from pos, in which
+            // case that comparison does not matter.
+            int toCopy = Math.min(left, bufSize - back);
+            int tmp = pos - back;
+            if (tmp < toCopy && tmp > 0) {
+                // If the delta between pos and back is smaller than
+                // how much there is to copy, then we can safely repeat
+                // it all the way for what is left.
+                do {
+                    System.arraycopy(buf, back, buf, pos, tmp);
+                    pos += tmp;
+                    left -= tmp;
+                    tmp <<= 1;
+                } while (left > tmp);
+            } else {
+                System.arraycopy(buf, back, buf, pos, toCopy);
+                pos += toCopy;
+                back += toCopy;
+                left -= toCopy;
+            }
+        }
+
         if (full < pos)
             full = pos;
     }

Here are benchmarks covering jdk 8, 11, and 15. This compares against
current TRUNK, which already has the crc64 changes relative to 1.8.
For the "real" content, the improvements range from fairly small for
the image (~5%) to ~15% for the large xml file. The artificial
repeating byte improved by ~40%.
jdk 8 64 bit

TRUNK
(file)           Mode  Cnt    Score     Error  Units
ihe_ovly_pr.dcm  avgt    3    0.589 ±   0.007  ms/op
image1.dcm       avgt    3  384.959 ±   7.477  ms/op
large.xml        avgt    3  237.317 ±  14.152  ms/op
1 byte rep       avgt    3  388.612 ±  12.811  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.525 ±   0.022  ms/op
image1.dcm       avgt    3  371.327 ±  23.419  ms/op
large.xml        avgt    3  190.587 ±   9.545  ms/op
1 byte rep       avgt    3  216.481 ±   3.936  ms/op

jdk 11 64 bit

TRUNK
(file)           Mode  Cnt    Score     Error  Units
ihe_ovly_pr.dcm  avgt    3    0.662 ±   0.012  ms/op
image1.dcm       avgt    3  391.644 ±  13.871  ms/op
large.xml        avgt    3  225.456 ±  16.265  ms/op
1 byte rep       avgt    3  387.347 ±  18.811  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.607 ±   0.187  ms/op
image1.dcm       avgt    3  369.419 ±  32.400  ms/op
large.xml        avgt    3  190.580 ±   7.856  ms/op
1 byte rep       avgt    3  220.554 ±   8.812  ms/op

jdk 15 64 bit

TRUNK
(file)           Mode  Cnt    Score     Error  Units
ihe_ovly_pr.dcm  avgt    3    0.661 ±   0.034  ms/op
image1.dcm       avgt    3  375.150 ±  22.059  ms/op
large.xml        avgt    3  219.960 ±   7.347  ms/op
1 byte rep       avgt    3  385.606 ±  13.185  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.609 ±   0.139  ms/op
image1.dcm       avgt    3  367.665 ±   0.555  ms/op
large.xml        avgt    3  187.471 ±  56.143  ms/op
1 byte rep       avgt    3  217.445 ±  16.682  ms/op

jdk 8 32 bit (server)

TRUNK
(file)           Mode  Cnt    Score     Error  Units
ihe_ovly_pr.dcm  avgt    3    0.627 ±   0.046  ms/op
image1.dcm       avgt    3  457.876 ±  28.759  ms/op
large.xml        avgt    3  279.929 ±  16.974  ms/op
1 byte rep       avgt    3  490.610 ±  67.420  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.561 ±   0.018  ms/op
image1.dcm       avgt    3  426.993 ±  12.331  ms/op
large.xml        avgt    3  228.273 ±  13.871  ms/op
1 byte rep       avgt    3  309.623 ±  94.535  ms/op

jdk 11 32 bit (server)

TRUNK
(file)           Mode  Cnt    Score     Error  Units
ihe_ovly_pr.dcm  avgt    3    0.855 ±   0.058  ms/op
image1.dcm       avgt    3  655.695 ±  14.873  ms/op
large.xml        avgt    3  411.228 ±   3.691  ms/op
1 byte rep       avgt    3  696.332 ±  35.806  ms/op

After
ihe_ovly_pr.dcm  avgt    3    0.740 ±   0.011  ms/op
image1.dcm       avgt    3  616.774 ±  25.305  ms/op
large.xml        avgt    3  322.130 ±   4.882  ms/op
1 byte rep       avgt    3  422.353 ±  17.333  ms/op
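The tables above are standard JMH average-time output (3 measurements
per benchmark). The benchmark sources were not posted to the list; a
minimal harness of the kind that could produce such numbers might look
like the sketch below, where the class name, test file path, and
buffer size are assumptions:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    import org.tukaani.xz.XZInputStream;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @State(Scope.Benchmark)
    public class XZDecodeBench {
        byte[] compressed;

        @Setup
        public void setup() throws IOException {
            // Hypothetical path; each test file is benchmarked
            // separately by pointing this at a different .xz file.
            compressed = Files.readAllBytes(Paths.get("image1.dcm.xz"));
        }

        @Benchmark
        public long decode() throws IOException {
            // Decompress from memory and discard the output; the score
            // is then dominated by the LZDecoder/LZMA decode loop.
            long total = 0;
            byte[] tmp = new byte[8192];
            try (XZInputStream in =
                     new XZInputStream(new ByteArrayInputStream(compressed))) {
                int n;
                while ((n = in.read(tmp)) != -1)
                    total += n;
            }
            return total;
        }
    }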
Re: [xz-devel] Re: java LZDecoder small improvement
> With a file with a two-byte repeat ("ababababababab"...) it's 50 %
> slower than the baseline. Calling arraycopy in a loop, copying two
> bytes at a time, is not efficient. I didn't try to find out how big
> the copy needs to be to make the overhead of arraycopy smaller than
> the benefit, but clearly it needs to be bigger than two bytes.

I worked this out last night. We need to double how much we copy each
time by not advancing "back". This actually works even better than
Arrays.fill for the single byte case also.

I still need to run through 1.8 and 15 to make sure this holds up, but
this implementation is better for all files on jdk 11:

    int back = pos - dist - 1;
    if (dist >= pos)
        back += bufSize;

    buf[pos++] = buf[back++];
    --left;
    // then handle in bulk if there is work remaining
    while (left > 0) {
        if (back == bufSize)
            back = 0;

        // It is possible for the "repeat" to include content which is
        // going to be generated here, so we have to limit ourselves to
        // how much data is already in the buffer (i.e. pos - back). It
        // is also possible that "back" can actually be forward in the
        // buffer from pos, in which case that comparison does not
        // matter.
        int toCopy = Math.min(left, bufSize - back);
        int tmp = pos - back;
        if (tmp < toCopy && tmp > 0) {
            // If the delta between pos and back is smaller than how
            // much there is to copy, then we can safely repeat it all
            // the way for what is left.
            do {
                System.arraycopy(buf, back, buf, pos, tmp);
                pos += tmp;
                left -= tmp;
                tmp <<= 1;
            } while (left > tmp);
        } else {
            System.arraycopy(buf, back, buf, pos, toCopy);
            pos += toCopy;
            back += toCopy;
            left -= toCopy;
        }
    }

    if (full < pos)
        full = pos;
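To see the doubling idea in isolation, here is a small self-contained
demo (an illustration only, not part of the patch). Because "back" is
never advanced, each arraycopy can reuse the bytes written by the
previous one, so the copied chunk doubles on every pass:

    public class DoublingCopyDemo {
        public static void main(String[] args) {
            byte[] buf = new byte[16];
            buf[0] = 'a';
            buf[1] = 'b';                       // the two-byte match
            int back = 0, pos = 2, left = 14;   // repeat it 7 more times

            do {
                // Chunk sizes: 2, then 4, then 8 bytes.
                int copySize = Math.min(left, pos - back);
                System.arraycopy(buf, back, buf, pos, copySize);
                pos += copySize;
                left -= copySize;
            } while (left > 0);

            System.out.println(new String(buf)); // abababababababab
        }
    }

So a 16-byte run of a two-byte repeat needs only three arraycopy calls
instead of eight.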
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-03 Brett Okken wrote:
> On Wed, Feb 3, 2021 at 2:56 PM Lasse Collin wrote:
> > It seems to regress horribly if dist is zero. A file with a very
> > long sequence of the same byte is good for testing.
>
> Would this be a valid test of what you are describing?
[...]
> The source is effectively 160MB of the same byte value.

Yes, it's fine.

> I found a strange bit of behavior with this case in the compression.
> In LZMAEncoderNormal.calcLongRepPrices, I am seeing a case where
>
>     int len2Limit = Math.min(niceLen, avail - len - 1);
>
> results in -1 (avail and len are both 8). This results in calling
> LZEncoder.getMatchLen with a lenLimit of -1. Is that expected?

I didn't check in detail now, but I think it's expected. There are two
such places. A speed optimization was left out of these two places in
liblzma because of this detail. I finally remembered to add the
optimization in 5.2.5.

On 2021-02-03 Brett Okken wrote:
> I still need to do more testing across jdk 8 and 15, but initial
> returns on this are pretty positive. The repeating byte file is
> meaningfully faster than baseline. One of my test files (image1.dcm)
> does not improve much from baseline, but the other 2 files do.

The repeating byte file is indeed much faster than the baseline. With
normal files the speed seems to be about the same as the version I
posted, so a minor improvement over the baseline.

With a file with a two-byte repeat ("ababababababab"...) it's 50 %
slower than the baseline. Calling arraycopy in a loop, copying two
bytes at a time, is not efficient. I didn't try to find out how big the
copy needs to be to make the overhead of arraycopy smaller than the
benefit, but clearly it needs to be bigger than two bytes.

The use of Arrays.fill to optimize the case of one repeating byte looks
useful, especially if it won't hurt performance in other situations.
Still, I'm not sure yet if the LZDecoder optimizations should go in 1.9.

-- 
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
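For reference, one way to generate such an "abab" test input; the file
name and sizes here are arbitrary choices, not from the thread:

    import java.io.FileOutputStream;
    import java.io.IOException;

    public class AbabGen {
        public static void main(String[] args) throws IOException {
            // A block of alternating 'a' and 'b' bytes...
            byte[] block = new byte[16 * 1024];
            for (int i = 0; i < block.length; ++i)
                block[i] = (byte) (i % 2 == 0 ? 'a' : 'b');

            // ...written out many times, so the uncompressed data is
            // far larger than the dictionary (~160 MiB here).
            try (FileOutputStream out = new FileOutputStream("abab.bin")) {
                for (int i = 0; i < 10240; ++i)
                    out.write(block);
            }
        }
    }

Compressing the result and decompressing it repeatedly exercises the
dist == 1 (two-byte repeat) path that is 50 % slower above.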
Re: [xz-devel] Re: java LZDecoder small improvement
I still need to do more testing across jdk 8 and 15, but initial
returns on this are pretty positive. The repeating byte file is
meaningfully faster than baseline. One of my test files (image1.dcm)
does not improve much from baseline, but the other 2 files do.

diff --git a/src/org/tukaani/xz/lz/LZDecoder.java b/src/org/tukaani/xz/lz/LZDecoder.java
index 85b2ca1..b062a9d 100644
--- a/src/org/tukaani/xz/lz/LZDecoder.java
+++ b/src/org/tukaani/xz/lz/LZDecoder.java
@@ -12,6 +12,8 @@ package org.tukaani.xz.lz;
 
 import java.io.DataInputStream;
 import java.io.IOException;
+import java.util.Arrays;
+
 import org.tukaani.xz.ArrayCache;
 import org.tukaani.xz.CorruptedInputException;
 
@@ -92,14 +94,52 @@ public final class LZDecoder {
         pendingDist = dist;
 
         int back = pos - dist - 1;
-        if (dist >= pos)
+        if (dist >= pos) {
+            // We won't get here if the dictionary isn't full.
+            assert full == bufSize;
+
+            // The distance wraps around to the end of the cyclic
+            // dictionary buffer. Here we will never copy more than
+            // dist + 1 bytes and so the copying won't repeat from its
+            // own output. Thus, we can always use arraycopy safely.
             back += bufSize;
 
+            int copySize = Math.min(bufSize - back, left);
+            assert copySize <= dist + 1;
+
+            System.arraycopy(buf, back, buf, pos, copySize);
+            pos += copySize;
+            back = 0;
+            left -= copySize;
+
+            if (left == 0)
+                return;
+        }
+
+        assert left > 0;
+
+        // the difference between pos and back is how much data is in
+        // the source buffer to be repeated
+        final int delta = pos - back;
+        if (delta < left) {
+            // We are copying more than dist + 1 bytes and thus will
+            // partly copy from our own output.
+            if (delta > 1) {
+                // take the size of data to be repeated, and copy it in loop
+                for (int i = 0, j = left / delta; i < j; ++i) {
+                    System.arraycopy(buf, back, buf, pos, delta);
+                    pos += delta;
+                    left -= delta;
+                }
+            } else {
+                // delta == 1: a single byte repeats, so Arrays.fill is
+                // faster than copying one byte at a time
+                Arrays.fill(buf, pos, pos + left, buf[back]);
+                pos += left;
+                left = 0;
+            }
+        }
+
-        do {
-            buf[pos++] = buf[back++];
-            if (back == bufSize)
-                back = 0;
-        } while (--left > 0);
+        // copy whatever remains (possibly zero bytes)
+        System.arraycopy(buf, back, buf, pos, left);
+        pos += left;
 
         if (full < pos)
             full = pos;
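As a sanity check of the fill shortcut, here is a tiny standalone
comparison (an illustration only, not from the thread) showing that for
dist == 0 (delta == 1) the byte-by-byte loop and Arrays.fill produce
the same result:

    import java.util.Arrays;

    public class FillCheck {
        public static void main(String[] args) {
            byte[] a = new byte[8];
            byte[] b = new byte[8];
            a[0] = 'x';
            b[0] = 'x';

            // dist == 0: each byte copies the byte just before it...
            for (int pos = 1; pos < a.length; ++pos)
                a[pos] = a[pos - 1];

            // ...which is the same as filling with the first byte.
            Arrays.fill(b, 1, b.length, b[0]);

            System.out.println(Arrays.equals(a, b)); // true
        }
    }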
Re: [xz-devel] Re: java LZDecoder small improvement
On Wed, Feb 3, 2021 at 2:56 PM Lasse Collin wrote:
>
> On 2021-02-01 Brett Okken wrote:
> > I have played with this quite a bit and have come up with a slightly
> > modified change which does not regress for the smallest of the
> > sample objects and shows a nice improvement for the 2 larger files.
>
> It seems to regress horribly if dist is zero. A file with a very long
> sequence of the same byte is good for testing.

Would this be a valid test of what you are describing?

    final byte[] bytes = new byte[16 * 1024];
    Arrays.fill(bytes, (byte) -75);
    final byte[] compressed;
    try (final ByteArrayOutputStream baos = new ByteArrayOutputStream();
         final XZOutputStream xos =
                 new XZOutputStream(baos, new LZMA2Options())) {
        for (int i = 0; i < 10240; ++i) {
            xos.write(bytes);
        }
        xos.finish();
        compressed = baos.toByteArray();
    }

The source is effectively 160MB of the same byte value.

I found a strange bit of behavior with this case in the compression.
In LZMAEncoderNormal.calcLongRepPrices, I am seeing a case where

    int len2Limit = Math.min(niceLen, avail - len - 1);

results in -1 (avail and len are both 8). This results in calling
LZEncoder.getMatchLen with a lenLimit of -1. Is that expected?

When I was testing with java 8 and the ArrayUtil changes, this resulted
in an ArrayIndexOutOfBoundsException.
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-01 Brett Okken wrote:
> I have played with this quite a bit and have come up with a slightly
> modified change which does not regress for the smallest of the sample
> objects and shows a nice improvement for the 2 larger files.

It seems to regress horribly if dist is zero. A file with a very long
sequence of the same byte is good for testing. The problem is that tmp
is almost always 1 and then each arraycopy call will copy exactly one
byte. The overhead is very high compared to doing the copying in a loop
like in the original code.

Below is a different version which is a little faster with Java 15 but
worse than the current simple code on Java 8 (tested on the same
computer and OS). The improvement over the current code is about 3-5 %
with Java 15, so not a lot but not insignificant either (such
optimizations add up). However, if the change is neutral or clearly
negative on Java 8, maybe this patch isn't worth the complexity yet.
Java 8 is still supported by its upstream.

Maybe you get different results. Make sure the uncompressed size of the
test files is several times larger than the dictionary size.

With the current knowledge I think this patch will need to wait past XZ
for Java 1.9.

diff --git a/src/org/tukaani/xz/lz/LZDecoder.java b/src/org/tukaani/xz/lz/LZDecoder.java
index 85b2ca1..8b3564c 100644
--- a/src/org/tukaani/xz/lz/LZDecoder.java
+++ b/src/org/tukaani/xz/lz/LZDecoder.java
@@ -92,14 +92,39 @@ public final class LZDecoder {
         pendingDist = dist;
 
         int back = pos - dist - 1;
-        if (dist >= pos)
+        if (dist >= pos) {
+            // We won't get here if the dictionary isn't full.
+            assert full == bufSize;
+
+            // The distance wraps around to the end of the cyclic
+            // dictionary buffer. Here we will never copy more than
+            // dist + 1 bytes and so the copying won't repeat from its
+            // own output. Thus, we can always use arraycopy safely.
             back += bufSize;
 
+            int copySize = Math.min(bufSize - back, left);
+            assert copySize <= dist + 1;
+
+            System.arraycopy(buf, back, buf, pos, copySize);
+            pos += copySize;
+            back = 0;
+            left -= copySize;
-        do {
-            buf[pos++] = buf[back++];
-            if (back == bufSize)
-                back = 0;
-        } while (--left > 0);
+
+            if (left == 0)
+                return;
+        }
+
+        assert left > 0;
+
+        if (left > dist + 1) {
+            // We are copying more than dist + 1 bytes and thus will
+            // partly copy from our own output.
+            do {
+                buf[pos++] = buf[back++];
+            } while (--left > 0);
+        } else {
+            System.arraycopy(buf, back, buf, pos, left);
+            pos += left;
+        }
 
         if (full < pos)
             full = pos;

-- 
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
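One way (an assumption, not code from the thread) to follow the advice
above about dictionary size is to shrink the dictionary when preparing
the test files, so that decompression actually exercises the cyclic
wrap-around ("dist >= pos") path:

    // LZMA2Options is the real xz-java options class; the 1 MiB value
    // is an arbitrary choice for illustration.
    LZMA2Options options = new LZMA2Options();
    options.setDictSize(1 << 20);   // may throw UnsupportedOptionsException

    // ...then compress a file several MiB in size with these options
    // and benchmark its decompression.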