On 2021-02-13 Brett Okken wrote:
> On Thu, Feb 11, 2021 at 12:51 PM Lasse Collin
> <lasse.col...@tukaani.org> wrote:
> > I still worry about short copies. If the file is full of tiny
> > matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such
> > files aren't typical at all but I don't want to add a corner case
> > where the performance drops too much.  
> 
> Do you have examples of such files, or code on how to generate one?

Use the patch below and compress with this:

    java -jar build/jar/XZEncDemo.jar 2 < infile > outfile.xz"

Adjust LIMIT to get longer matches.

diff --git a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java 
b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
index f8230ee..cd92ca6 100644
--- a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
+++ b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
@@ -44,6 +44,8 @@ final class LZMAEncoderFast extends LZMAEncoder {
         return smallDist < (bigDist >>> 7);
     }
 
+    private static final int LIMIT = 2;
+
     int getNextSymbol() {
         // Get the matches for the next byte unless readAhead indicates
         // that we already got the new matches during the previous call
@@ -66,11 +68,13 @@ final class LZMAEncoderFast extends LZMAEncoder {
         int bestRepIndex = 0;
         for (int rep = 0; rep < REPS; ++rep) {
             int len = lz.getMatchLen(reps[rep], avail);
+            if (len > LIMIT)
+                len = LIMIT;
             if (len < MATCH_LEN_MIN)
                 continue;
 
             // If it is long enough, return it.
-            if (len >= niceLen) {
+            if (len >= LIMIT) {
                 back = rep;
                 skip(len - 1);
                 return len;
@@ -88,9 +92,11 @@ final class LZMAEncoderFast extends LZMAEncoder {
 
         if (matches.count > 0) {
             mainLen = matches.len[matches.count - 1];
+            if (mainLen > LIMIT)
+                mainLen = LIMIT;
             mainDist = matches.dist[matches.count - 1];
 
-            if (mainLen >= niceLen) {
+            if (mainLen >= LIMIT) {
                 back = mainDist + REPS;
                 skip(mainLen - 1);
                 return mainLen;

With a quick try I got a feeling that my worry about short repeats was
wrong. It doesn't matter because decoding each LZMA symbol is much more
expensive. What matters is avoiding multiple tiny arraycopy calls
within a single run of the repeat method, and that problem was already
solved.

> > I came up with the following. I haven't decided yet if I like it.  
> 
> On the 3 files I have been testing with, this change is a mixed bag.
> Compared to trunk 1 regresses by ~8%. While the other 2 do improve,
> neither are better than my last patch.

OK, thanks. So it isn't great. I wonder which details make the
difference.

One thing that confuses me in your version is the special handling of
the first byte:

        buf[pos++] = buf[back++];
        --left;

If there are two bytes to copy, then one will be copied above and the
other with arraycopy later. If there are more bytes to copy and
distance is very small, incrementing "back" above can mean that an
extra arraycopy call might be needed in the loop because the first copy
will be one byte smaller.

I understand that it might help when there is just one byte to repeat
because then the while-loop will be skipped. In all other situations it
sounds like that the special handling of the first byte would in theory
be harmful. Note that I don't doubt your test results; I already saw
with the CRC64 code that some changes in the code can affect
performance in weird ways.

Your code needs

            if (back == bufSize)
                back = 0;

in the beginning of the while-loop and later checking for tmp > 0. My
version avoids these branches by handling those cases under "if (back <
0)" (which is equivalent to "if (dist >= pos)"). On the other hand,
under "if (back < 0)" all copies, including tiny copies, are done with
arraycopy.

Another tiny difference is that your code uses left shift to double the
copy size in the loop while I used Math.min(pos - back, left).

> I was able to improve this a bit by pulling the handling of small
> copies outside of the while loop. This eliminates the regressions
> compared to trunk, but still does not feel like an improvement over my
> last patch.

Yeah, the switch isn't worth it. If I understand it correctly now,
trying to avoid arraycopy for the tiny copies wasn't a useful idea in
the first place. So the code can be simplified ("version 3"):

        int back = pos - dist - 1;
        if (back < 0) {
            // The distance wraps around to the end of the cyclic dictionary
            // buffer. We cannot get here if the dictionary isn't full.
            assert full == bufSize;
            back += bufSize;

            // Here we will never copy more than dist + 1 bytes and
            // so the copying won't repeat from its own output.
            // Thus, we can always use arraycopy safely.
            int copySize = Math.min(bufSize - back, left);
            assert copySize <= dist + 1;

            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            back = 0;
            left -= copySize;

            if (left == 0)
                return;
        }

        assert back < pos;
        assert left > 0;

        do {
            // Determine the number of bytes to copy on this loop iteration:
            // copySize is set so that the source and destination ranges
            // don't overlap. If "left" is large enough, the destination
            // range will start right after the last byte of the source
            // range. This way we don't need to advance "back" which
            // allows the next iteration of this loop to copy (up to)
            // twice the number of bytes.
            int copySize = Math.min(left, pos - back);
            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            left -= copySize;
        } while (left > 0);

I know I may sound stubborn for not accepting your version as is but
the special handling of the first byte and the readability of the
while-loop (how easy it is to understand on the first reading) make me
hesitate. For example, for dist = 0, len = 23 and assuming that copying
doesn't wrap:

  1. One byte is copied before the while-loop.
  2. The inner do-while-loop copies 1, 2, 4, and 8 bytes.
  3. The outer while-loop starts from the beginning and
     the latter arraycopy is used to copy the remaining 7 bytes.

In contrast my new simplified version above has just one loop in it
where it copies 1, 2, 4, 8, and 8 bytes. 

Just to play around, here's yet another method which has only one
arraycopy. The size difference is very minimal though after you ignore
the comments and assertions. For readability I prefer "version 3" over
the version below ("version 4").

        int back = pos - dist - 1;
        int backNext = back;

        do {
            int copySize;
            if (back < 0) {
                back += bufSize;
                copySize = bufSize - back;
                backNext = 0;
            } else {
                copySize = pos - back;
            }

            if (copySize > left)
                copySize = left;

            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            left -= copySize;
            back = backNext;
        } while (left > 0);

It would be nice if you could compare the versions again. Thanks!

The encoder patches in the other thread will need to wait a few
months.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

Reply via email to