I have continued to refine the changes around the array comparisons
and think I am pretty well done there.

I did a small benchmark measuring the time to compress 3 different
files using new XZOutputStream(cos, new LZMA2Options()). Where cos was
an OutputStream which simply calculated the crc32 of the content
written.

The ihe_ovly.pr.dcm is a ~66KB binary file.
The image1.dcm file is a ~26MB binary file with some metadata wrapping
a grayscale bmp.
The large.xml file is a ~51MB formatted utf-8 xml file.

1.8
Benchmark                                 (file)  Mode  Cnt     Score
   Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt    5     9.575
±   0.322  ms/op
XZCompressionBenchmark.compress       image1.dcm  avgt    5  6767.376
± 675.373  ms/op
XZCompressionBenchmark.compress        large.xml  avgt    5  9680.569
± 207.521  ms/op

1.9-SNAPSHOT
Benchmark                                 (file)  Mode  Cnt     Score
  Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt    5     7.888
±  0.443  ms/op
XZCompressionBenchmark.compress       image1.dcm  avgt    5  6144.748
± 71.551  ms/op
XZCompressionBenchmark.compress        large.xml  avgt    5  7904.019
± 26.316  ms/op

These results are from 64 bit windows using open jdk 11.0.6.
The results are 9-18% faster across the 3 different file types.

I made a small change to ArrayUtil from what I sent previously. When
everything matches, it now returns length rather than -1.

The remaining changes are:

BT4:
in getMatches()

        if (matches.count > 0) {
            while (lenBest < matchLenLimit && buf[readPos + lenBest - delta2]
                                              == buf[readPos + lenBest])
                ++lenBest;

Changes to:
        if (matches.count > 0) {
            lenBest += ArrayUtil.mismatch(buf, readPos + lenBest -
delta2, buf, readPos + lenBest, matchLenLimit - lenBest);

Further down, inside a while(true) block,

            if (buf[readPos + len - delta] == buf[readPos + len]) {
                while (++len < matchLenLimit)
                    if (buf[readPos + len - delta] != buf[readPos + len])
                        break;

Changes to:
            int mismatch = ArrayUtil.mismatch(buf, readPos + len -
delta, buf, readPos + len, matchLenLimit - len);
            if (mismatch != 0) {
                len += mismatch;

in skip(int, int)

            if (buf[readPos + len - delta] == buf[readPos + len]) {
                // No need to look for longer matches than niceLenLimit
                // because we only are updating the tree, not returning
                // matches found to the caller.
                do {
                    if (++len == niceLenLimit) {
                        tree[ptr1] = tree[pair];
                        tree[ptr0] = tree[pair + 1];
                        return;
                    }
                } while (buf[readPos + len - delta] == buf[readPos + len]);
            }

Changes to:
            // No need to look for longer matches than niceLenLimit
            // because we only are updating the tree, not returning
            // matches found to the caller.
            int mismatch = ArrayUtil.mismatch(buf, readPos + len -
delta, buf, readPos + len, niceLenLimit);
            if (mismatch == niceLenLimit) {
                tree[ptr1] = tree[pair];
                tree[ptr0] = tree[pair + 1];
                return;
            }
            len += mismatch;


In HC4, both changes are in getMatches()

        if (matches.count > 0) {
            while (lenBest < matchLenLimit && buf[readPos + lenBest - delta2]
                                              == buf[readPos + lenBest])
                ++lenBest;

Changes to:
        if (matches.count > 0) {
            lenBest += ArrayUtil.mismatch(buf, readPos + lenBest -
delta2, buf, readPos + lenBest, matchLenLimit - lenBest);

And:

            // Test the first byte and the first new byte that would give us
            // a match that is at least one byte longer than lenBest. This
            // too short matches get quickly skipped.
            if (buf[readPos + lenBest - delta] == buf[readPos + lenBest]
                    && buf[readPos - delta] == buf[readPos]) {
                // Calculate the length of the match.
                int len = 0;
                while (++len < matchLenLimit)
                    if (buf[readPos + len - delta] != buf[readPos + len])
                        break;

                // Use the match if and only if it is better than the longest
                // match found so far.
                if (len > lenBest) {
                    lenBest = len;
                    matches.len[matches.count] = len;
                    matches.dist[matches.count] = delta - 1;
                    ++matches.count;

                    // Return if it is long enough (niceLen or reached the
                    // end of the dictionary).
                    if (len >= niceLenLimit)
                        return matches;
                }
            }

Changes to:
            final int mismatch = ArrayUtil.mismatch(buf, readPos -
delta, buf, readPos, matchLenLimit);
            //use the match iff it is better than the longest match found so far
            if (mismatch > lenBest) {
                lenBest = mismatch;
                matches.len[matches.count] = mismatch;
                matches.dist[matches.count] = delta - 1;
                ++matches.count;

                // Return if it is long enough (niceLen or reached the
                // end of the dictionary).
                if (mismatch >= niceLenLimit)
                    return matches;
            }

Finally, in LZEncoder, the 2 getMatchLen methods were changed to:

    public int getMatchLen(int dist, int lenLimit) {
        final int backPos = readPos - dist - 1;
        return ArrayUtil.mismatch(buf, readPos, buf, backPos, lenLimit);
    }

    public int getMatchLen(int forward, int dist, int lenLimit) {
        final int curPos = readPos + forward;
        final int backPos = curPos - dist - 1;
        return ArrayUtil.mismatch(buf, curPos, buf, backPos, lenLimit);
    }

On Tue, Jan 12, 2021 at 10:17 AM Brett Okken <[email protected]> wrote:
>
> It turns out that reading the longs in native byte order provides
> noticeable improvement.
> I did find that there was cost overhead of ~1 ns/op by using an
> interface/implementation to flex behavior if Unsafe could not be
> loaded. That cost goes away by using java.lang.invoke.MethodHandle.
> So here is an updated jdk 7 compatible ArrayUtil implementation which
> matches current performance if the first byte does not match and is
> faster in every other scenario. The gap in performance grows as more
> bytes actually match. At 2 bytes, it takes roughly half the time. At
> 97 bytes, it takes less than ten percent of the time.
>
> Here are the benchmark results:
>
> Benchmark                                        (length)  Mode  Cnt
> Score   Error  Units
> ArrayMismatchBenchmark.comparerMismatch_nomatch         0  avgt    5
> 4.487 ± 0.059  ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch         1  avgt    5
> 4.515 ± 0.102  ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch         2  avgt    5
> 4.523 ± 0.023  ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch         7  avgt    5
> 5.164 ± 0.098  ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch        13  avgt    5
> 5.748 ± 0.974  ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch        57  avgt    5
> 10.060 ± 1.135  ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch        97  avgt    5
> 11.518 ± 0.418  ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch           0  avgt    5
> 3.259 ± 0.069  ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch           1  avgt    5
> 5.712 ± 0.070  ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch           2  avgt    5
> 6.017 ± 0.300  ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch           7  avgt    5
> 12.949 ± 0.163  ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch          13  avgt    5
> 18.696 ± 0.551  ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch          57  avgt    5
> 43.232 ± 1.015  ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch          97  avgt    5
> 90.599 ± 0.794  ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch           0  avgt    5
> 3.246 ± 0.138  ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch           1  avgt    5
> 3.225 ± 0.042  ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch           2  avgt    5
> 3.242 ± 0.043  ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch           7  avgt    5
> 3.244 ± 0.048  ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch          13  avgt    5
> 3.477 ± 0.028  ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch          57  avgt    5
> 5.968 ± 0.553  ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch          97  avgt    5
> 7.182 ± 0.080  ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch             0  avgt    5
> 3.219 ± 0.044  ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch             1  avgt    5
> 3.217 ± 0.054  ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch             2  avgt    5
> 3.217 ± 0.069  ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch             7  avgt    5
> 3.206 ± 0.047  ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch            13  avgt    5
> 3.509 ± 0.218  ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch            57  avgt    5
> 5.870 ± 0.063  ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch            97  avgt    5
> 7.178 ± 0.267  ns/op
>
> The "comparer" implementation is using interface with different
> implementations based on whether Unsafe could be loaded.
> The "unsafe" implementation is directly using the Unsafe class.
> The "util" implementation is using the ArrayUtil class below.
>
>
> package org.tukaani.xz.common;
>
> import java.lang.invoke.MethodHandle;
> import java.lang.invoke.MethodHandles;
> import java.lang.invoke.MethodType;
> import java.lang.reflect.Constructor;
> import java.nio.ByteOrder;
>
> public final class ArrayUtil {
>
>     /**
>      * MethodHandle to the actual mismatch method to use at runtime.
>      */
>     private static final MethodHandle MISMATCH;
>
>     /**
>      * If {@code sun.misc.Unsafe} can be loaded, this is MethodHandle
> bound to an instance of Unsafe for method {@code long getLong(Object,
> long)}.
>      */
>     private static final MethodHandle UNSAFE_GET_LONG;
>
>     /**
>      * MethodHandle to either {@link Long#numberOfLeadingZeros(long)}
> or {@link Long#numberOfTrailingZeros(long)} depending on {@link
> ByteOrder#nativeOrder()}.
>      */
>     private static final MethodHandle LEADING_ZEROS;
>
>     /**
>      * Populated from reflected read of {@code
> sun.misc.Unsafe.ARRAY_BYTE_BASE_OFFSET}.
>      */
>     private static final long ARRAY_BASE_OFFSET;
>
>     static {
>         //try to create an instance using Unsafe
>         long arrayBaseOffset = 0;
>         MethodHandle unsafeGetLong = null;
>         MethodHandle leadingZeros = null;
>         MethodHandle mismatch = null;
>         final MethodHandles.Lookup lookup = MethodHandles.lookup();
>         final MethodType mismatchType =
> MethodType.methodType(int.class, byte[].class, int.class,
> byte[].class, int.class, int.class);
>         try {
>             Class<?> unsafeClazz = Class.forName("sun.misc.Unsafe", true, 
> null);
>             Constructor<?> unsafeConstructor =
> unsafeClazz.getDeclaredConstructor();
>             unsafeConstructor.setAccessible(true);
>             Object unsafe = unsafeConstructor.newInstance();
>
>             arrayBaseOffset =
> unsafeClazz.getField("ARRAY_BYTE_BASE_OFFSET").getLong(null);
>
>             MethodHandle virtualGetLong =
> lookup.findVirtual(unsafeClazz, "getLong",
> MethodType.methodType(long.class, Object.class, long.class));
>             unsafeGetLong = virtualGetLong.bindTo(unsafe);
>
>             // do a test read to confirm unsafe is actually functioning
>             long val = (long) unsafeGetLong.invokeExact((Object) new
> byte[] { 0, 0, 0, 0, 0, 0, 0, 0 }, arrayBaseOffset + 0L);
>             if (val != 0) {
>                 throw new IllegalStateException("invalid value: " + val);
>             }
>
>             final boolean bigEndian = ByteOrder.BIG_ENDIAN ==
> ByteOrder.nativeOrder();
>
>             //getInt interprets in platform byte order. the concept of
> "leading zeros" being bytes
>             //in encounter order is true for big endian
>             //for little endian platform, the trailing zeros gives the
> encounter order result
>             leadingZeros = lookup.findStatic(Long.class, bigEndian ?
> "numberOfLeadingZeros" : "numberOfTrailingZeros",
> MethodType.methodType(int.class, long.class));
>             mismatch = lookup.findStatic(ArrayUtil.class,
> "unsafeMismatch", mismatchType);
>         } catch (Throwable t) {
>             //TODO: log out?
>             unsafeGetLong = null;
>             leadingZeros = null;
>             try {
>                 mismatch = lookup.findStatic(ArrayUtil.class,
> "legacyMismatch", mismatchType);
>             } catch (Exception e) {
>                 throw new IllegalStateException(e);
>             }
>         }
>
>         UNSAFE_GET_LONG = unsafeGetLong;
>         ARRAY_BASE_OFFSET = arrayBaseOffset;
>         LEADING_ZEROS = leadingZeros;
>         MISMATCH = mismatch;
>     }
>
>     /**
>      * Compares the values in <i>a</i> and <i>b</i> and returns the
> index of the first {@code byte} which differs.
>      * @param a The first {@code byte[]} for comparison.
>      * @param aFromIndex The offset into <i>a</i> to start reading from.
>      * @param b The second {@code byte[]} for comparison.
>      * @param bFromIndex The offset into <i>b</i> to start reading from.
>      * @param length The number of bytes to compare.
>      * @return The offset from the starting indexes of the first byte
> which differs or {@code -1} if all match.
>      */
>     public static int mismatch(byte[] a, int aFromIndex, byte[] b, int
> bFromIndex, int length) {
>        try {
>           return (int) MISMATCH.invokeExact(a, aFromIndex, b,
> bFromIndex, length);
>        } catch (RuntimeException e) {
>            throw e;
>        } catch (Error e) {
>            throw e;
>        } catch (Throwable t) {
>            throw new RuntimeException(t);
>        }
>     }
>
>     /**
>      * Uses {@code UNSAFE_GET_LONG} to compare 8 bytes at a time.
>      */
>     @SuppressWarnings("unused")
>     private static int unsafeMismatch(byte[] a, int aFromIndex, byte[]
> b, int bFromIndex, int length) throws Throwable {
>         //TODO: should these asserts be uncommented and/or should real
> verification be done?
>         //by using Unsafe, we can actually SIGSEGV if length is not
> valid for a or b.
> //        assert a.length >= aFromIndex + length;
> //        assert b.length >= bFromIndex + length;
>         int i=0;
>         for (int j=length - 7; i<j; i+=8) {
>             final long aVal = (long)
> UNSAFE_GET_LONG.invokeExact((Object) a, ARRAY_BASE_OFFSET + aFromIndex
> + i);
>             final long bVal = (long)
> UNSAFE_GET_LONG.invokeExact((Object) b, ARRAY_BASE_OFFSET + bFromIndex
> + i);
>             if (aVal != bVal) {
>                 //this returns a value where bits which match are 0
> and bits which differ are 1
>                 final long diff = aVal ^ bVal;
>                 //the first (in native byte order) bit which differs
> tells us which byte differed
>                 final int leadingZeros = (int) 
> LEADING_ZEROS.invokeExact(diff);
>                 return i + (leadingZeros / Byte.SIZE);
>             }
>         }
>         for ( ; i<length; ++i) {
>             if (a[aFromIndex + i] != b[bFromIndex + i]) {
>                 return i;
>             }
>         }
>         return -1;
>     }
>
>     /**
>      * Simply loops over all of the bytes, comparing one at a time.
>      */
>     @SuppressWarnings("unused")
>     private static int legacyMismatch(byte[] a, int aFromIndex, byte[]
> b, int bFromIndex, int length) {
>         for (int i=0; i<length; ++i) {
>             if (a[aFromIndex + i] != b[bFromIndex + i]) {
>                 return i;
>             }
>         }
>         return -1;
>     }
>
>     private ArrayUtil() {
>     }
> }
>
> On Mon, Jan 11, 2021 at 6:12 PM Brett Okken <[email protected]> wrote:
> >
> > I threw together a quick jmh test, and there is no value in the
> > changes to Hash234.
> >
> > For the array mismatch, the results are kind of interesting. My
> > observation, stepping through some compression uses, is that the
> > comparison length is typically 100-200 bytes in length, but the actual
> > match length is typically fairly short. This is obviously going to be
> > highly dependent on data, and I was using raw image data for
> > observation. Content like xml or json might have longer matches. So I
> > set up a benchmark which is always comparing 128 bytes and the
> > mismatch occurs after various "lengths":
> >
> > Benchmark                                      (length)  Mode  Cnt
> > Score   Error  Units
> > ArrayMismatchBenchmark.legacyMismatch_nomatch         0  avgt    5
> > 3.198 ± 0.168  ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch         1  avgt    5
> > 5.607 ± 0.048  ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch         2  avgt    5
> > 5.852 ± 0.053  ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch         7  avgt    5
> > 12.703 ± 0.350  ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch        13  avgt    5
> > 18.275 ± 0.228  ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch        57  avgt    5
> > 42.313 ± 0.450  ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch        97  avgt    5
> > 89.410 ± 2.927  ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch         0  avgt    5
> > 4.629 ± 0.035  ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch         1  avgt    5
> > 9.515 ± 0.096  ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch         2  avgt    5
> > 9.526 ± 0.132  ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch         7  avgt    5
> > 9.581 ± 0.395  ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch        13  avgt    5
> > 9.781 ± 0.133  ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch        57  avgt    5
> > 9.846 ± 0.182  ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch        97  avgt    5
> > 10.809 ± 0.307  ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch            0  avgt    5
> > 3.417 ± 0.018  ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch            1  avgt    5
> > 3.412 ± 0.011  ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch            2  avgt    5
> > 3.414 ± 0.032  ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch            7  avgt    5
> > 5.401 ± 0.207  ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch           13  avgt    5
> > 8.311 ± 0.070  ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch           57  avgt    5
> > 20.536 ± 0.556  ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch           97  avgt    5
> > 30.969 ± 0.318  ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch           0  avgt    5
> > 4.399 ± 0.082  ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch           1  avgt    5
> > 4.390 ± 0.068  ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch           2  avgt    5
> > 4.398 ± 0.033  ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch           7  avgt    5
> > 4.403 ± 0.110  ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch          13  avgt    5
> > 6.564 ± 0.398  ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch          57  avgt    5
> > 11.548 ± 0.331  ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch          97  avgt    5
> > 16.335 ± 0.119  ns/op
> >
> > I labeled the current behavior as "legacy".
> > The Arrays.mismatch is significantly slower when the mismatch occurs
> > early in the array and significantly faster when the mismatch occurs
> > later.
> > Comparing an int (4 bytes) at a time is a clear winner if the mismatch
> > occurs in those 4 bytes, which appeared to be 90+% of the calls I
> > observed.
> > Comparing a long (8 bytes) at a time is faster than the current
> > behavior unless it is the first byte which does not match, but slower
> > than comparing ints if the mismatch occurs in the first 4 bytes.
> >
> > I wrote this test using jdk 9 VarHandle to read the ints and longs
> > from the byte[], but the same thing can be achieved using
> > sun.misc.Unsafe. I will add that as a case in the benchmark, but it is
> > expected to be similar to VarHandle (maybe slightly faster).
> >
> > Brett
> >
> > On Mon, Jan 11, 2021 at 10:04 AM Lasse Collin <[email protected]> 
> > wrote:
> > >
> > > On 2021-01-09 Brett Okken wrote:
> > > > This would seem to be a potential candidate for a multi-release
> > > > jar[1], if you can figure out a reasonable way to get a build system
> > > > to generate one.
> > >
> > > I suppose it can be done. The build system uses Apache Ant. From some
> > > sources I've understood that there are more modern alternatives but I
> > > haven't had any interest or energy to learn more as Ant seems to still
> > > work OK.
> > >
> > > > The 4 uses I found of comparing byte[] could be refactored to call a
> > > > new utility class to do the comparison. The "regular" implementation
> > > > could be java 7 compatible, and the jdk 9 version would be in the
> > > > META_INF folder.
> > > > Even for the java 7 compatible version, it might be worth exploring
> > > > how much improvement would come from using Unsafe to read int or long
> > > > values from the byte[] and compare those.
> > > >
> > > > For Hash234, I would think the whole class could be handled for the
> > >
> > > All these sound like worth checking out.
> > >
> > > On 2021-01-09 Brett Okken wrote:
> > > > Here is a class which is compatible with jdk 7. It will use a
> > > > MethodHandle to invoke Arrays.mismatch if that is found at runtime. If
> > > > that is not found, it will see if it can find Unsafe to read 4 bytes
> > > > at a time and compare as ints. If that cannot be found/loaded/invoked,
> > > > it falls back to iterating over bytes and comparing one by one.
> > > >
> > > > For jdk 9, the mismatch method could instead be implemented as:
> > > > return Arrays.mismatch(a, aFromIndex, aFromIndex + length, b,
> > > > bFromIndex, bFromIndex + length);
> > >
> > > Thanks! There are several XZ Utils related tasks I hope to get done (of
> > > which not all have been mentioned on xz-devel), so I won't think much
> > > about XZ for Java in the near future, I'm sorry.
> > >
> > > I assume that multi-release method has no performance overhead since
> > > the runtime will load the best .class file and that's it. How the other
> > > methods like using an utility class or looking for available methods at
> > > runtime compare to pure multi-release method in terms of performance?
> > > Perhaps this is a stupid question but I have so little Java experience
> > > that I don't have a clue about this.
> > >
> > > If you have time and interest, it would be valuable to know which
> > > tricks provide the largest performance improvements. However, I repeat
> > > that I cannot spend much time on this in the near future even though I
> > > think it would be good to have such improvements in XZ for Java.
> > >
> > > Thanks!
> > >
> > > --
> > > Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode
> > >

Reply via email to