I have continued to refine the changes around the array comparisons
and think I am pretty well done there.
I did a small benchmark measuring the time to compress 3 different
files using new XZOutputStream(cos, new LZMA2Options()). Where cos was
an OutputStream which simply calculated the crc32 of the content
written.
The ihe_ovly.pr.dcm is a ~66KB binary file.
The image1.dcm file is a ~26MB binary file with some metadata wrapping
a grayscale bmp.
The large.xml file is a ~51MB formatted utf-8 xml file.
1.8
Benchmark (file) Mode Cnt Score
Error Units
XZCompressionBenchmark.compress ihe_ovly_pr.dcm avgt 5 9.575
± 0.322 ms/op
XZCompressionBenchmark.compress image1.dcm avgt 5 6767.376
± 675.373 ms/op
XZCompressionBenchmark.compress large.xml avgt 5 9680.569
± 207.521 ms/op
1.9-SNAPSHOT
Benchmark (file) Mode Cnt Score
Error Units
XZCompressionBenchmark.compress ihe_ovly_pr.dcm avgt 5 7.888
± 0.443 ms/op
XZCompressionBenchmark.compress image1.dcm avgt 5 6144.748
± 71.551 ms/op
XZCompressionBenchmark.compress large.xml avgt 5 7904.019
± 26.316 ms/op
These results are from 64 bit windows using open jdk 11.0.6.
The results are 9-18% faster across the 3 different file types.
I made a small change to ArrayUtil from what I sent previously. When
everything matches, it now returns length rather than -1.
The remaining changes are:
BT4:
in getMatches()
if (matches.count > 0) {
while (lenBest < matchLenLimit && buf[readPos + lenBest - delta2]
== buf[readPos + lenBest])
++lenBest;
Changes to:
if (matches.count > 0) {
lenBest += ArrayUtil.mismatch(buf, readPos + lenBest -
delta2, buf, readPos + lenBest, matchLenLimit - lenBest);
Further down, inside a while(true) block,
if (buf[readPos + len - delta] == buf[readPos + len]) {
while (++len < matchLenLimit)
if (buf[readPos + len - delta] != buf[readPos + len])
break;
Changes to:
int mismatch = ArrayUtil.mismatch(buf, readPos + len -
delta, buf, readPos + len, matchLenLimit - len);
if (mismatch != 0) {
len += mismatch;
in skip(int, int)
if (buf[readPos + len - delta] == buf[readPos + len]) {
// No need to look for longer matches than niceLenLimit
// because we only are updating the tree, not returning
// matches found to the caller.
do {
if (++len == niceLenLimit) {
tree[ptr1] = tree[pair];
tree[ptr0] = tree[pair + 1];
return;
}
} while (buf[readPos + len - delta] == buf[readPos + len]);
}
Changes to:
// No need to look for longer matches than niceLenLimit
// because we only are updating the tree, not returning
// matches found to the caller.
int mismatch = ArrayUtil.mismatch(buf, readPos + len -
delta, buf, readPos + len, niceLenLimit);
if (mismatch == niceLenLimit) {
tree[ptr1] = tree[pair];
tree[ptr0] = tree[pair + 1];
return;
}
len += mismatch;
In HC4, both changes are in getMatches()
if (matches.count > 0) {
while (lenBest < matchLenLimit && buf[readPos + lenBest - delta2]
== buf[readPos + lenBest])
++lenBest;
Changes to:
if (matches.count > 0) {
lenBest += ArrayUtil.mismatch(buf, readPos + lenBest -
delta2, buf, readPos + lenBest, matchLenLimit - lenBest);
And:
// Test the first byte and the first new byte that would give us
// a match that is at least one byte longer than lenBest. This
// too short matches get quickly skipped.
if (buf[readPos + lenBest - delta] == buf[readPos + lenBest]
&& buf[readPos - delta] == buf[readPos]) {
// Calculate the length of the match.
int len = 0;
while (++len < matchLenLimit)
if (buf[readPos + len - delta] != buf[readPos + len])
break;
// Use the match if and only if it is better than the longest
// match found so far.
if (len > lenBest) {
lenBest = len;
matches.len[matches.count] = len;
matches.dist[matches.count] = delta - 1;
++matches.count;
// Return if it is long enough (niceLen or reached the
// end of the dictionary).
if (len >= niceLenLimit)
return matches;
}
}
Changes to:
final int mismatch = ArrayUtil.mismatch(buf, readPos -
delta, buf, readPos, matchLenLimit);
//use the match iff it is better than the longest match found so far
if (mismatch > lenBest) {
lenBest = mismatch;
matches.len[matches.count] = mismatch;
matches.dist[matches.count] = delta - 1;
++matches.count;
// Return if it is long enough (niceLen or reached the
// end of the dictionary).
if (mismatch >= niceLenLimit)
return matches;
}
Finally, in LZEncoder, the 2 getMatchLen methods were changed to:
public int getMatchLen(int dist, int lenLimit) {
final int backPos = readPos - dist - 1;
return ArrayUtil.mismatch(buf, readPos, buf, backPos, lenLimit);
}
public int getMatchLen(int forward, int dist, int lenLimit) {
final int curPos = readPos + forward;
final int backPos = curPos - dist - 1;
return ArrayUtil.mismatch(buf, curPos, buf, backPos, lenLimit);
}
On Tue, Jan 12, 2021 at 10:17 AM Brett Okken <[email protected]> wrote:
>
> It turns out that reading the longs in native byte order provides
> noticeable improvement.
> I did find that there was cost overhead of ~1 ns/op by using an
> interface/implementation to flex behavior if Unsafe could not be
> loaded. That cost goes away by using java.lang.invoke.MethodHandle.
> So here is an updated jdk 7 compatible ArrayUtil implementation which
> matches current performance if the first byte does not match and is
> faster in every other scenario. The gap in performance grows as more
> bytes actually match. At 2 bytes, it takes roughly half the time. At
> 97 bytes, it takes less than ten percent of the time.
>
> Here are the benchmark results:
>
> Benchmark (length) Mode Cnt
> Score Error Units
> ArrayMismatchBenchmark.comparerMismatch_nomatch 0 avgt 5
> 4.487 ± 0.059 ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch 1 avgt 5
> 4.515 ± 0.102 ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch 2 avgt 5
> 4.523 ± 0.023 ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch 7 avgt 5
> 5.164 ± 0.098 ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch 13 avgt 5
> 5.748 ± 0.974 ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch 57 avgt 5
> 10.060 ± 1.135 ns/op
> ArrayMismatchBenchmark.comparerMismatch_nomatch 97 avgt 5
> 11.518 ± 0.418 ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch 0 avgt 5
> 3.259 ± 0.069 ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch 1 avgt 5
> 5.712 ± 0.070 ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch 2 avgt 5
> 6.017 ± 0.300 ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch 7 avgt 5
> 12.949 ± 0.163 ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch 13 avgt 5
> 18.696 ± 0.551 ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch 57 avgt 5
> 43.232 ± 1.015 ns/op
> ArrayMismatchBenchmark.legacyMismatch_nomatch 97 avgt 5
> 90.599 ± 0.794 ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch 0 avgt 5
> 3.246 ± 0.138 ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch 1 avgt 5
> 3.225 ± 0.042 ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch 2 avgt 5
> 3.242 ± 0.043 ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch 7 avgt 5
> 3.244 ± 0.048 ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch 13 avgt 5
> 3.477 ± 0.028 ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch 57 avgt 5
> 5.968 ± 0.553 ns/op
> ArrayMismatchBenchmark.unsafeMisMatch_nomatch 97 avgt 5
> 7.182 ± 0.080 ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch 0 avgt 5
> 3.219 ± 0.044 ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch 1 avgt 5
> 3.217 ± 0.054 ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch 2 avgt 5
> 3.217 ± 0.069 ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch 7 avgt 5
> 3.206 ± 0.047 ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch 13 avgt 5
> 3.509 ± 0.218 ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch 57 avgt 5
> 5.870 ± 0.063 ns/op
> ArrayMismatchBenchmark.utilMismatch_nomatch 97 avgt 5
> 7.178 ± 0.267 ns/op
>
> The "comparer" implementation is using interface with different
> implementations based on whether Unsafe could be loaded.
> The "unsafe" implementation is directly using the Unsafe class.
> The "util" implementation is using the ArrayUtil class below.
>
>
> package org.tukaani.xz.common;
>
> import java.lang.invoke.MethodHandle;
> import java.lang.invoke.MethodHandles;
> import java.lang.invoke.MethodType;
> import java.lang.reflect.Constructor;
> import java.nio.ByteOrder;
>
> public final class ArrayUtil {
>
> /**
> * MethodHandle to the actual mismatch method to use at runtime.
> */
> private static final MethodHandle MISMATCH;
>
> /**
> * If {@code sun.misc.Unsafe} can be loaded, this is MethodHandle
> bound to an instance of Unsafe for method {@code long getLong(Object,
> long)}.
> */
> private static final MethodHandle UNSAFE_GET_LONG;
>
> /**
> * MethodHandle to either {@link Long#numberOfLeadingZeros(long)}
> or {@link Long#numberOfTrailingZeros(long)} depending on {@link
> ByteOrder#nativeOrder()}.
> */
> private static final MethodHandle LEADING_ZEROS;
>
> /**
> * Populated from reflected read of {@code
> sun.misc.Unsafe.ARRAY_BYTE_BASE_OFFSET}.
> */
> private static final long ARRAY_BASE_OFFSET;
>
> static {
> //try to create an instance using Unsafe
> long arrayBaseOffset = 0;
> MethodHandle unsafeGetLong = null;
> MethodHandle leadingZeros = null;
> MethodHandle mismatch = null;
> final MethodHandles.Lookup lookup = MethodHandles.lookup();
> final MethodType mismatchType =
> MethodType.methodType(int.class, byte[].class, int.class,
> byte[].class, int.class, int.class);
> try {
> Class<?> unsafeClazz = Class.forName("sun.misc.Unsafe", true,
> null);
> Constructor<?> unsafeConstructor =
> unsafeClazz.getDeclaredConstructor();
> unsafeConstructor.setAccessible(true);
> Object unsafe = unsafeConstructor.newInstance();
>
> arrayBaseOffset =
> unsafeClazz.getField("ARRAY_BYTE_BASE_OFFSET").getLong(null);
>
> MethodHandle virtualGetLong =
> lookup.findVirtual(unsafeClazz, "getLong",
> MethodType.methodType(long.class, Object.class, long.class));
> unsafeGetLong = virtualGetLong.bindTo(unsafe);
>
> // do a test read to confirm unsafe is actually functioning
> long val = (long) unsafeGetLong.invokeExact((Object) new
> byte[] { 0, 0, 0, 0, 0, 0, 0, 0 }, arrayBaseOffset + 0L);
> if (val != 0) {
> throw new IllegalStateException("invalid value: " + val);
> }
>
> final boolean bigEndian = ByteOrder.BIG_ENDIAN ==
> ByteOrder.nativeOrder();
>
> //getInt interprets in platform byte order. the concept of
> "leading zeros" being bytes
> //in encounter order is true for big endian
> //for little endian platform, the trailing zeros gives the
> encounter order result
> leadingZeros = lookup.findStatic(Long.class, bigEndian ?
> "numberOfLeadingZeros" : "numberOfTrailingZeros",
> MethodType.methodType(int.class, long.class));
> mismatch = lookup.findStatic(ArrayUtil.class,
> "unsafeMismatch", mismatchType);
> } catch (Throwable t) {
> //TODO: log out?
> unsafeGetLong = null;
> leadingZeros = null;
> try {
> mismatch = lookup.findStatic(ArrayUtil.class,
> "legacyMismatch", mismatchType);
> } catch (Exception e) {
> throw new IllegalStateException(e);
> }
> }
>
> UNSAFE_GET_LONG = unsafeGetLong;
> ARRAY_BASE_OFFSET = arrayBaseOffset;
> LEADING_ZEROS = leadingZeros;
> MISMATCH = mismatch;
> }
>
> /**
> * Compares the values in <i>a</i> and <i>b</i> and returns the
> index of the first {@code byte} which differs.
> * @param a The first {@code byte[]} for comparison.
> * @param aFromIndex The offset into <i>a</i> to start reading from.
> * @param b The second {@code byte[]} for comparison.
> * @param bFromIndex The offset into <i>b</i> to start reading from.
> * @param length The number of bytes to compare.
> * @return The offset from the starting indexes of the first byte
> which differs or {@code -1} if all match.
> */
> public static int mismatch(byte[] a, int aFromIndex, byte[] b, int
> bFromIndex, int length) {
> try {
> return (int) MISMATCH.invokeExact(a, aFromIndex, b,
> bFromIndex, length);
> } catch (RuntimeException e) {
> throw e;
> } catch (Error e) {
> throw e;
> } catch (Throwable t) {
> throw new RuntimeException(t);
> }
> }
>
> /**
> * Uses {@code UNSAFE_GET_LONG} to compare 8 bytes at a time.
> */
> @SuppressWarnings("unused")
> private static int unsafeMismatch(byte[] a, int aFromIndex, byte[]
> b, int bFromIndex, int length) throws Throwable {
> //TODO: should these asserts be uncommented and/or should real
> verification be done?
> //by using Unsafe, we can actually SIGSEGV if length is not
> valid for a or b.
> // assert a.length >= aFromIndex + length;
> // assert b.length >= bFromIndex + length;
> int i=0;
> for (int j=length - 7; i<j; i+=8) {
> final long aVal = (long)
> UNSAFE_GET_LONG.invokeExact((Object) a, ARRAY_BASE_OFFSET + aFromIndex
> + i);
> final long bVal = (long)
> UNSAFE_GET_LONG.invokeExact((Object) b, ARRAY_BASE_OFFSET + bFromIndex
> + i);
> if (aVal != bVal) {
> //this returns a value where bits which match are 0
> and bits which differ are 1
> final long diff = aVal ^ bVal;
> //the first (in native byte order) bit which differs
> tells us which byte differed
> final int leadingZeros = (int)
> LEADING_ZEROS.invokeExact(diff);
> return i + (leadingZeros / Byte.SIZE);
> }
> }
> for ( ; i<length; ++i) {
> if (a[aFromIndex + i] != b[bFromIndex + i]) {
> return i;
> }
> }
> return -1;
> }
>
> /**
> * Simply loops over all of the bytes, comparing one at a time.
> */
> @SuppressWarnings("unused")
> private static int legacyMismatch(byte[] a, int aFromIndex, byte[]
> b, int bFromIndex, int length) {
> for (int i=0; i<length; ++i) {
> if (a[aFromIndex + i] != b[bFromIndex + i]) {
> return i;
> }
> }
> return -1;
> }
>
> private ArrayUtil() {
> }
> }
>
> On Mon, Jan 11, 2021 at 6:12 PM Brett Okken <[email protected]> wrote:
> >
> > I threw together a quick jmh test, and there is no value in the
> > changes to Hash234.
> >
> > For the array mismatch, the results are kind of interesting. My
> > observation, stepping through some compression uses, is that the
> > comparison length is typically 100-200 bytes in length, but the actual
> > match length is typically fairly short. This is obviously going to be
> > highly dependent on data, and I was using raw image data for
> > observation. Content like xml or json might have longer matches. So I
> > set up a benchmark which is always comparing 128 bytes and the
> > mismatch occurs after various "lengths":
> >
> > Benchmark (length) Mode Cnt
> > Score Error Units
> > ArrayMismatchBenchmark.legacyMismatch_nomatch 0 avgt 5
> > 3.198 ± 0.168 ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch 1 avgt 5
> > 5.607 ± 0.048 ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch 2 avgt 5
> > 5.852 ± 0.053 ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch 7 avgt 5
> > 12.703 ± 0.350 ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch 13 avgt 5
> > 18.275 ± 0.228 ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch 57 avgt 5
> > 42.313 ± 0.450 ns/op
> > ArrayMismatchBenchmark.legacyMismatch_nomatch 97 avgt 5
> > 89.410 ± 2.927 ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch 0 avgt 5
> > 4.629 ± 0.035 ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch 1 avgt 5
> > 9.515 ± 0.096 ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch 2 avgt 5
> > 9.526 ± 0.132 ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch 7 avgt 5
> > 9.581 ± 0.395 ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch 13 avgt 5
> > 9.781 ± 0.133 ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch 57 avgt 5
> > 9.846 ± 0.182 ns/op
> > ArrayMismatchBenchmark.arraysMismatch_nomatch 97 avgt 5
> > 10.809 ± 0.307 ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch 0 avgt 5
> > 3.417 ± 0.018 ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch 1 avgt 5
> > 3.412 ± 0.011 ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch 2 avgt 5
> > 3.414 ± 0.032 ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch 7 avgt 5
> > 5.401 ± 0.207 ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch 13 avgt 5
> > 8.311 ± 0.070 ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch 57 avgt 5
> > 20.536 ± 0.556 ns/op
> > ArrayMismatchBenchmark.intMismatch_nomatch 97 avgt 5
> > 30.969 ± 0.318 ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch 0 avgt 5
> > 4.399 ± 0.082 ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch 1 avgt 5
> > 4.390 ± 0.068 ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch 2 avgt 5
> > 4.398 ± 0.033 ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch 7 avgt 5
> > 4.403 ± 0.110 ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch 13 avgt 5
> > 6.564 ± 0.398 ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch 57 avgt 5
> > 11.548 ± 0.331 ns/op
> > ArrayMismatchBenchmark.longMismatch_nomatch 97 avgt 5
> > 16.335 ± 0.119 ns/op
> >
> > I labeled the current behavior as "legacy".
> > The Arrays.mismatch is significantly slower when the mismatch occurs
> > early in the array and significantly faster when the mismatch occurs
> > later.
> > Comparing an int (4 bytes) at a time is a clear winner if the mismatch
> > occurs in those 4 bytes, which appeared to be 90+% of the calls I
> > observed.
> > Comparing a long (8 bytes) at a time is faster than the current
> > behavior unless it is the first byte which does not match, but slower
> > than comparing ints if the mismatch occurs in the first 4 bytes.
> >
> > I wrote this test using jdk 9 VarHandle to read the ints and longs
> > from the byte[], but the same thing can be achieved using
> > sun.misc.Unsafe. I will add that as a case in the benchmark, but it is
> > expected to be similar to VarHandle (maybe slightly faster).
> >
> > Brett
> >
> > On Mon, Jan 11, 2021 at 10:04 AM Lasse Collin <[email protected]>
> > wrote:
> > >
> > > On 2021-01-09 Brett Okken wrote:
> > > > This would seem to be a potential candidate for a multi-release
> > > > jar[1], if you can figure out a reasonable way to get a build system
> > > > to generate one.
> > >
> > > I suppose it can be done. The build system uses Apache Ant. From some
> > > sources I've understood that there are more modern alternatives but I
> > > haven't had any interest or energy to learn more as Ant seems to still
> > > work OK.
> > >
> > > > The 4 uses I found of comparing byte[] could be refactored to call a
> > > > new utility class to do the comparison. The "regular" implementation
> > > > could be java 7 compatible, and the jdk 9 version would be in the
> > > > META_INF folder.
> > > > Even for the java 7 compatible version, it might be worth exploring
> > > > how much improvement would come from using Unsafe to read int or long
> > > > values from the byte[] and compare those.
> > > >
> > > > For Hash234, I would think the whole class could be handled for the
> > >
> > > All these sound like worth checking out.
> > >
> > > On 2021-01-09 Brett Okken wrote:
> > > > Here is a class which is compatible with jdk 7. It will use a
> > > > MethodHandle to invoke Arrays.mismatch if that is found at runtime. If
> > > > that is not found, it will see if it can find Unsafe to read 4 bytes
> > > > at a time and compare as ints. If that cannot be found/loaded/invoked,
> > > > it falls back to iterating over bytes and comparing one by one.
> > > >
> > > > For jdk 9, the mismatch method could instead be implemented as:
> > > > return Arrays.mismatch(a, aFromIndex, aFromIndex + length, b,
> > > > bFromIndex, bFromIndex + length);
> > >
> > > Thanks! There are several XZ Utils related tasks I hope to get done (of
> > > which not all have been mentioned on xz-devel), so I won't think much
> > > about XZ for Java in the near future, I'm sorry.
> > >
> > > I assume that multi-release method has no performance overhead since
> > > the runtime will load the best .class file and that's it. How the other
> > > methods like using an utility class or looking for available methods at
> > > runtime compare to pure multi-release method in terms of performance?
> > > Perhaps this is a stupid question but I have so little Java experience
> > > that I don't have a clue about this.
> > >
> > > If you have time and interest, it would be valuable to know which
> > > tricks provide the largest performance improvements. However, I repeat
> > > that I cannot spend much time on this in the near future even though I
> > > think it would be good to have such improvements in XZ for Java.
> > >
> > > Thanks!
> > >
> > > --
> > > Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
> > >