Re: [xz-devel] xz-java and newer java

2024-07-11 Thread Brett Okken
Looks great to me.
I apologize for not getting to the arm stuff myself.

Thanks,
Brett

On Thu, Jul 11, 2024 at 11:47 AM Lasse Collin 
wrote:



Re: [xz-devel] xz-java and newer java

2024-07-11 Thread Lasse Collin
On 2024-03-26 Lasse Collin wrote:
> I suppose the ARM64 speed is still to be determined by you or someone
> else.

I was given results from one ARM64 system with Java 23:
  - Java 8 code: 5.6 s
  - Basic:   3.8 s
  - UnalignedLongLE: 2.4 s

The test command:

time head -c1 /dev/zero \
| java -jar build/jar/XZEncDemo.jar > /dev/null

The result is similar enough to what was seen on x86-64.

Is the bytearrayview branch ready for merging?

-- 
Lasse Collin



Re: [xz-devel] xz-java and newer java

2024-03-26 Thread Lasse Collin
On 2024-03-20 Brett Okken wrote:
> The jdk8 changes show nice improvements over head. My assumption is
> that with less math in the offsets of the while loop, the JVM was able
> to optimize better.

Sounds good, thanks! :-)

> I am surprised by the binary math behind your handling of long
> comparisons here:

I had to refresh my memory as I hadn't commented it in memcmplen.h. Now
it is (based on Agner Fog's microarchitecture.pdf):

  - On some x86-64 processors (Intel Sandy Bridge to Tiger Lake),
sub+jz and sub+jnz can be fused but xor+jz or xor+jnz cannot.
Thus using subtraction has the potential to be a tiny amount faster
since the code checks whether the difference is non-zero.

  - Some processors (Intel Pentium 4) used to have more ALU
resources for add/sub instructions than and/or/xor.

So in the C code it's not a huge thing, and in Java it's probably
next to nothing. But there is no real downside to using subtraction.

I understand why xor seems the more obvious choice. However, when
looking for the lowest differing bit, subtraction will make that bit 1
and the bits below it 0. Only the bits above that 1 can differ between
subtraction and xor, but those bits are irrelevant here.
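To make the equivalence concrete, here is a small standalone demo (the
values are arbitrary examples, not taken from the branch):

```java
// Demonstrates that for a != b, (a - b) and (a ^ b) have the same lowest
// set bit: below the first differing bit everything cancels, at that bit
// subtraction produces a 1, and any borrow only propagates upward.
public class SubVsXor {
    // Index (0-7) of the first differing byte in a little-endian word.
    static int firstDiffByte(long a, long b) {
        // (a - b) works identically here in place of (a ^ b).
        return Long.numberOfTrailingZeros(a ^ b) >>> 3;
    }

    public static void main(String[] args) {
        long a = 0x1122334455667788L;
        long b = 0x1122334455AA7788L;  // differs at byte index 2 (LE order)
        System.out.println(Long.numberOfTrailingZeros(a ^ b)
                == Long.numberOfTrailingZeros(a - b));  // true
        System.out.println(firstDiffByte(a, b));        // 2
    }
}
```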

I created a new branch, bytearrayview, which combines the CRC64 edits
with the encoder speed changes as they share the ByteArrayView class
(formerly ArrayUtil).
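The core of such a view class can be sketched with a VarHandle (the
class and method names below are illustrative guesses, not necessarily
the branch's actual API):

```java
// Sketch of a ByteArrayView-style accessor: a VarHandle view over byte[]
// that reads a little-endian long at an arbitrary (possibly unaligned)
// offset. Plain get on a byte[] view does not require alignment.
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class LongLEView {
    private static final VarHandle LONG_LE =
            MethodHandles.byteArrayViewVarHandle(long[].class,
                                                 ByteOrder.LITTLE_ENDIAN);

    static long getLongLE(byte[] buf, int pos) {
        return (long) LONG_LE.get(buf, pos);
    }
}
```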

> > I still need to check a few of your edits if some of them should be
> > included. :-)  
> 
> I think the changes to LZMAEncoderNormal as part of this PR to avoid
> the negative length comparison would be good to carry forward.

Done, I hope.

> 1. Use an interface with implementation chosen statically to separate
> out the implementation options.

I had an early version that used separate implementation classes but I
must have done something wrong as that version was *clearly* slower. So
I tried it again and it's as you say, no speed difference. :-)

> 2. Allow specifying the implementation to use with a system property.

Done. I hope it's done in a sensible enough way. The Java < 9 code is
completely separate so it cannot be chosen. The property needs to be
documented somewhere too.

I suppose the ARM64 speed is still to be determined by you or someone
else.

-- 
Lasse Collin



Re: [xz-devel] xz-java and newer java

2024-03-20 Thread Brett Okken
> I still need to check a few of your edits if some of them should be
> included. :-)

I think the changes to LZMAEncoderNormal as part of this PR to avoid the
negative length comparison would be good to carry forward. It basically
compares to MATCH_LEN_MIN instead of 0. This can avoid some (short)
calls to getMatchLen whose results would be ignored anyway on the
very next line.
https://github.com/tukaani-project/xz-java/pull/13/commits/544e449446a3d652de2a5f170d197aef695f12ec#diff-e6858bec0a168955b2ad68bbe89af8ab9ca7b9b1ebf2d9b8bc362fb80dab2967
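A hypothetical illustration of the idea (the real LZMAEncoderNormal
code is more involved; all names besides MATCH_LEN_MIN and getMatchLen
are invented here): when the caller discards any match shorter than
MATCH_LEN_MIN, comparing the available length against MATCH_LEN_MIN
instead of 0 skips getMatchLen calls whose result would be thrown away.

```java
// Sketch of guarding getMatchLen with MATCH_LEN_MIN (2 in LZMA) so that
// comparisons which cannot yield a usable match are skipped entirely.
final class MatchLenGuard {
    static final int MATCH_LEN_MIN = 2;

    static int getMatchLen(byte[] buf, int off1, int off2, int max) {
        int len = 0;
        while (len < max && buf[off1 + len] == buf[off2 + len])
            ++len;
        return len;
    }

    // Returns 0 for matches that would be too short to use.
    static int usableMatchLen(byte[] buf, int off1, int off2, int avail) {
        if (avail < MATCH_LEN_MIN)
            return 0; // too short to matter; skip the comparison entirely
        int len = getMatchLen(buf, off1, off2, avail);
        return len < MATCH_LEN_MIN ? 0 : len;
    }
}
```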

>I pushed basic docs for getMatchLen.

Thanks - that is very helpful.

> I can wait for the summary, thanks.

The jdk8 changes show nice improvements over head. My assumption is that
with less math in the offsets of the while loop, the JVM was able to
optimize better.
Benchmark                     (file)           (preset)  Mode  Cnt     Score      Error  Units
XZCompressionBenchmark.head   ihe_ovly_pr.dcm         3  avgt    3     0.617 ±    0.159  ms/op
XZCompressionBenchmark.lasse  ihe_ovly_pr.dcm         3  avgt    3     0.556 ±    0.163  ms/op
XZCompressionBenchmark.head   ihe_ovly_pr.dcm         6  avgt    3     2.908 ±    0.346  ms/op
XZCompressionBenchmark.lasse  ihe_ovly_pr.dcm         6  avgt    3     2.437 ±    0.160  ms/op
XZCompressionBenchmark.head   image1.dcm              3  avgt    3  2106.954 ± 1295.185  ms/op
XZCompressionBenchmark.lasse  image1.dcm              3  avgt    3  1703.705 ±  482.628  ms/op
XZCompressionBenchmark.head   image1.dcm              6  avgt    3  4304.648 ± 1650.114  ms/op
XZCompressionBenchmark.lasse  image1.dcm              6  avgt    3  3430.697 ±  129.481  ms/op
XZCompressionBenchmark.head   large.xml               3  avgt    3   805.220 ± 1094.696  ms/op
XZCompressionBenchmark.lasse  large.xml               3  avgt    3   658.586 ±   31.645  ms/op
XZCompressionBenchmark.head   large.xml               6  avgt    3  6743.478 ± 1634.641  ms/op
XZCompressionBenchmark.lasse  large.xml               6  avgt    3  5880.570 ±  587.226  ms/op


Defining an interface to defer the implementation to has no impact on
performance. Here are the results of LZUtil using an implementation which
matches what you have for jdk 8.

Benchmark                                    (file)           (preset)  Mode  Cnt     Score      Error  Units
XZCompressionBenchmark.compress_legacy_loop  ihe_ovly_pr.dcm         3  avgt    3     0.548 ±    0.056  ms/op
XZCompressionBenchmark.compress_legacy_loop  ihe_ovly_pr.dcm         6  avgt    3     2.493 ±    0.097  ms/op
XZCompressionBenchmark.compress_legacy_loop  image1.dcm              3  avgt    3  1720.038 ±  237.015  ms/op
XZCompressionBenchmark.compress_legacy_loop  image1.dcm              6  avgt    3  3671.539 ± 2016.282  ms/op
XZCompressionBenchmark.compress_legacy_loop  large.xml               3  avgt    3   667.045 ±  108.601  ms/op
XZCompressionBenchmark.compress_legacy_loop  large.xml               6  avgt    3  5842.107 ±  552.634  ms/op

> Thanks. I was already tilted towards not using Unsafe and now I'm even
> more. The speed benefit of Unsafe over VarHandle should be tiny enough.
> It feels better that memory safety isn't ignored on any JDK version.

I have no problem with that. The performance differences between Unsafe
and VarHandle are very minimal (and sometimes reverse when bounds checks
are added to the Unsafe path).


I am surprised by the binary math behind your handling of long
comparisons here:
https://github.com/tukaani-project/xz-java/compare/master...array_compare#diff-1c6fd3fbd64728f8d99a692827015a1bd7341a0dc651cf6205cc72024e90b065R141-R147
Specifically you are using subtraction instead of XOR (like I did here)
https://github.com/tukaani-project/xz-java/pull/13/files#diff-2fc691ea3e96cf4821f4eceac43919cb659e7ae91b4e6919e35fb25f37439d3dR118-R127

Is there an advantage? By not supporting Unsafe, you do not have to deal
with big-endian byte order, so that makes this approach possible. I
personally find XOR to more clearly answer the question being asked
(which byte is the first to differ). My first reaction was that
subtraction would not produce the correct results.


Generally, I really like what you have. I would propose 2 changes:
1. Use an interface with implementation chosen statically to separate out
the implementation options. This makes it much easier to unit test all the
implementations. I also find that it makes the code easier to read/reason
about by being more modular.
2. Allow specifying the implementation to use with a system property. This
would be unlikely to be used outside of benchmarking, but would provide
options for users on unusual hardware.
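A rough sketch of how proposals 1 and 2 could fit together (every name
here, including the property key, is invented for illustration):

```java
// Sketch: an interface with the implementation chosen once at class
// load, overridable via a (hypothetical) system property, e.g. for
// benchmarking or unusual hardware.
interface MatchLength {
    int getMatchLen(byte[] buf, int off1, int off2, int max);
}

final class MatchLengthFinder {
    static final MatchLength INSTANCE = choose();

    private static MatchLength choose() {
        // Property name is made up for this example.
        String prop = System.getProperty("org.example.xz.matchLenImpl", "");
        switch (prop) {
            case "basic":
                return new BasicImpl();
            // case "unalignedLongLE": return new UnalignedLongLEImpl();
            default:
                return new BasicImpl(); // pick the best for the platform
        }
    }

    private static final class BasicImpl implements MatchLength {
        public int getMatchLen(byte[] buf, int off1, int off2, int max) {
            int len = 0;
            while (len < max && buf[off1 + len] == buf[off2 + len])
                ++len;
            return len;
        }
    }
}
```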

Brett







On Tue, Mar 12, 2024 at 12:55 PM Lasse Collin 
wrote:

Re: [xz-devel] xz-java and newer java

2024-03-12 Thread Lasse Collin
On 2024-03-12 Brett Okken wrote:
> I am still working on digesting your branch.

I still need to check a few of your edits if some of them should be
included. :-)

> The difference in method signature is subtle, but I think it is a key
> part of the improvements you are getting. Could you add javadoc to
> more clearly describe how the args are to be interpreted and what the
> return value means?

I pushed basic docs for getMatchLen.

Once crc64_varhandle2 is merged then array_compare should use ArrayUtil
too. It doesn't make a difference in speed.

> I am playing with manually unrolling the java 8 byte-by-byte impl
> along with tests comparing unsafe, var handle, and vector approaches.
> These tests take a long time to run, so it will be a couple days
> before I have complete results. Do you want data as I have it (and it
> is interesting), or wait for summary?

I can wait for the summary, thanks.

> I am not sure when I will get opportunity to test out arm64.

If someone has, for example, a Raspberry Pi, the compress-zeros test is
simple enough to do, and at least on x86-64 it shows a clear enough
difference. It's an over-simplified test, but it's a data point still.

> I do have some things still on jdk 8, but only decompression. Surveys
> seem to indicate quite a bit of jdk 8 still in use, but I have no
> personal need.

Thanks. I was already tilted towards not using Unsafe and now I'm even
more. The speed benefit of Unsafe over VarHandle should be tiny enough.
It feels better that memory safety isn't ignored on any JDK version. If
a bug was found, it's nicer to not wonder if Unsafe had a role in it.
This is better for security too.

In my previous email I wondered if using Unsafe only with Java 8 would
make upgrading to newer JDK look bad if newer JDK used VarHandle
instead of Unsafe. Perhaps that worry was overblown. But the other
reasons and keeping the code simpler make me want to avoid Unsafe.

(C code via JNI wouldn't be memory safe but then the speed benefits
should be much more significant too.)

-- 
Lasse Collin



Re: [xz-devel] xz-java and newer java

2024-03-12 Thread Brett Okken
I am still working on digesting your branch. The difference in method
signature is subtle, but I think it is a key part of the improvements
you are getting. Could you add javadoc to more clearly describe how the
args are to be interpreted and what the return value means?

I am playing with manually unrolling the java 8 byte-by-byte impl along
with tests comparing unsafe, var handle, and vector approaches. These tests
take a long time to run, so it will be a couple days before I have complete
results. Do you want data as I have it (and it is interesting), or wait for
summary? I am not sure when I will get an opportunity to test out arm64.
That could be a while yet.

I do have some things still on jdk 8, but only decompression. Surveys seem
to indicate quite a bit of jdk 8 still in use, but I have no personal need.

Brett

On Sun, Mar 10, 2024 at 2:49 PM Lasse Collin 
wrote:

> On 2024-03-09 Brett Okken wrote:
> > When I tested graviton2 (arm64) previously, Arrays.mismatch was
> > better than comparing longs using a VarHandle.
>
> Sounds promising. :-) However, your array_comparison_performance
> handles the last 1-7 bytes byte-by-byte. My array_compare branch
> reserves extra 7 bytes at the end of the array so that one can safely
> read up to 7 bytes more than one actually needs. This way no bounds
> checks are needed (even with Unsafe). This might affect the comparision
> between Arrays.mismatch and VarHandle if the results were close before.
>
> > I do like Unsafe as an option for jdk 8 users on x86 or arm64.
>
> Unsafe seems very slightly faster than VarHandle. If Java 8 uses Unsafe,
> should newer versions do too? It could be counter-productive if Java 8
> was faster, even if the difference was tiny.
>
> Do you have use cases that are (for now) stuck on Java 8 or is your
> wish a more generic one?
>
> --
> Lasse Collin
>


Re: [xz-devel] xz-java and newer java

2024-03-10 Thread Lasse Collin
On 2024-03-09 Brett Okken wrote:
> When I tested graviton2 (arm64) previously, Arrays.mismatch was
> better than comparing longs using a VarHandle.

Sounds promising. :-) However, your array_comparison_performance
handles the last 1-7 bytes byte-by-byte. My array_compare branch
reserves extra 7 bytes at the end of the array so that one can safely
read up to 7 bytes more than one actually needs. This way no bounds
checks are needed (even with Unsafe). This might affect the comparison
between Arrays.mismatch and VarHandle if the results were close before.
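The over-read idea can be sketched like this (illustrative only; the
names are invented and the branch's actual code differs in details):

```java
// Sketch of the padded-buffer compare: the caller guarantees at least
// 7 spare bytes after off + max, so each iteration may read a full
// 8-byte little-endian word without any per-word bounds handling.
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class PaddedCompare {
    private static final VarHandle LONG_LE =
            MethodHandles.byteArrayViewVarHandle(long[].class,
                                                 ByteOrder.LITTLE_ENDIAN);

    // Precondition: buf.length >= max(off1, off2) + max + 7.
    static int getMatchLen(byte[] buf, int off1, int off2, int max) {
        int len = 0;
        while (len < max) {
            long diff = (long) LONG_LE.get(buf, off1 + len)
                      ^ (long) LONG_LE.get(buf, off2 + len);
            if (diff != 0) {
                // First differing byte within this word; a mismatch in
                // the padding region beyond max is clamped away.
                return Math.min(max,
                        len + (Long.numberOfTrailingZeros(diff) >>> 3));
            }
            len += 8;
        }
        return max;
    }
}
```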

> I do like Unsafe as an option for jdk 8 users on x86 or arm64.

Unsafe seems very slightly faster than VarHandle. If Java 8 uses Unsafe,
should newer versions do too? It could be counter-productive if Java 8
was faster, even if the difference was tiny.

Do you have use cases that are (for now) stuck on Java 8 or is your
wish a more generic one?

-- 
Lasse Collin



Re: [xz-devel] xz-java and newer java

2024-03-09 Thread Brett Okken
When I tested graviton2 (arm64) previously, Arrays.mismatch was better than
comparing longs using a VarHandle.

The benefits are definitely with content that compresses more - because
there are more long matches.

I do like Unsafe as an option for jdk 8 users on x86 or arm64.

On Sat, Mar 9, 2024 at 3:51 PM Lasse Collin 
wrote:



Re: [xz-devel] xz-java and newer java

2024-03-09 Thread Lasse Collin
I created a branch array_compare. It has a simple version for Java <= 8
which seems very slightly faster than the current code in master, at
least when tested with OpenJDK 21. For Java >= 9 there is
Arrays.mismatch for portability and VarHandle for x86-64 and ARM64.
These are clearly faster than the basic version.
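For the portable Java >= 9 path, the core of such a routine presumably
reduces to Arrays.mismatch; a hedged sketch (the method name is
illustrative, not necessarily what the branch uses):

```java
// Sketch: match length via Arrays.mismatch, which returns the index of
// the first differing element, or -1 when the two ranges are equal.
import java.util.Arrays;

final class MismatchCompare {
    static int getMatchLen(byte[] buf, int off1, int off2, int max) {
        int i = Arrays.mismatch(buf, off1, off1 + max,
                                buf, off2, off2 + max);
        return i < 0 ? max : i;
    }
}
```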

sun.misc.Unsafe would be a little faster than VarHandle but I feel it's
not enough to be worth the downsides (non-standard and not memory safe).

I didn't include 32-bit archs, for now at least, since if people want
speed I hope they aren't running 32-bit Java.

Speed differences are very minor when testing with files that don't
compress extremely well. That was the problem I had with my earlier
test results. With files that have compression ratio like 0.05 the
speed differences are clear.

I cannot test on ARM64 so it would be great if someone can, comparing
the three versions. The most extreme difference is when compressing
just zeros:

time head -c1 /dev/zero \
| java -jar build/jar/XZEncDemo.jar > /dev/null

Internal docs should be added to the branch and perhaps there are other
related optimizations to do still. So it's not fully finished yet but
now it's ready for testing and feedback. For example, some tweaks from
your array_comp_incremental could be considered after testing.

-- 
Lasse Collin



Re: [xz-devel] xz-java and newer java

2024-03-07 Thread Lasse Collin
On 2024-02-29 Brett Okken wrote:
> > Thanks! Ideally there would be one commit to add the minimal
> > portable version, then separate commits for each optimized variant.
> 
> Would you like me to remove the Unsafe based impl from
> https://github.com/tukaani-project/xz-java/pull/13?

There are new commits in master now and those might slightly conflict
with your PR (@Override additions). I'm playing around a bit and
learning about the faster methods still. So right now I don't have
wishes for changes; I don't want to request anything when there's a
possibility that some other way might end up looking more preferable.

In general, I would prefer splitting to more commits. Using your PR as
an example:

  1. Adding the changes to lz/*.java and the portable *Array*.java
 code required by those changes.

  2. Adding one advanced implementation that affects only the
 *Array*.java files.

  3. Repeat step 2. until all implementations are added.

When reasonably possible, the line length should be under 80 chars.

> > So far I have given it only a quick try. array_comp_incremental
> > seems faster than xz-java.git master. Compression time was reduced
> > by about 10 %. :-) This is with OpenJDK 21.0.2, only a quick test,
> > and my computer is old so I don't doubt your higher numbers.  
> 
> How are you testing? I am using jmh, so it has a warm up period before
> actually measuring, giving the jvm plenty of opportunity to perform
> optimizations. If you are doing single shot executions to compress a
> file, that could provide pretty different results.

I was simply timing XZEncDemo runs at the default preset (6). I had hoped
that big files (binary and source packages) that take tens of seconds
to compress, repeating each test a few times, would work well enough.
But perhaps the difference is big enough only with certain types of
files.

On 2024-03-05 Brett Okken wrote:
> I have added a comment to the PR with updated benchmark results:
> https://github.com/tukaani-project/xz-java/pull/13#issuecomment-1977705691

Thanks! I'm not sure if I read the results well enough. The "Error"
column seems to have oddly high values on several lines. If the same
test set is run again, are the results in the "Score" column similar
enough between the two runs, retaining the speed order of the
implementations being tested?

If the first file is only ~66KB, I wonder if other factors like
initializing large arrays in the classes take so much time that
differences in array comparison speed become hard to measure.

When each test is repeated by the benchmarking framework, each run has
to allocate the classes again. Perhaps it might trigger garbage
collection. Did you have ArrayCache enabled?

ArrayCache.setDefaultCache(BasicArrayCache.getInstance());

I suppose optimizing only for new JDK version(s) would be fine if it
makes things easier. That is, it could be enough that performance
doesn't get worse on Java 8.

If the indirection adds overhead, would it make sense to have a
preprocessing step that creates .java file variants that directly use
the optimized methods? So LZMAEncoder.getInstance could choose at
runtime if it should use LZMAEncoderNormalPortable or
LZMAEncoderNormalUnsafe or some other implementation. That is, if this
cannot be done with multi-release JAR. It's not a pretty solution but if
it is faster then it could be one option, maybe.

Negative lenLimit currently occurs in two places (at least). Perhaps it
should be handled in those places instead of requiring the array
comparison to support it (the C code in liblzma does it like that).

-- 
Lasse Collin



Re: [xz-devel] xz-java and newer java

2024-03-05 Thread Brett Okken
I have added a comment to the PR with updated benchmark results:
https://github.com/tukaani-project/xz-java/pull/13#issuecomment-1977705691


Re: [xz-devel] xz-java and newer java

2024-03-01 Thread Brett Okken
I found and resolved the difference:
https://github.com/tukaani-project/xz-java/pull/13/commits/1e4550e06d8cbec4079b2b2fba4a2245307cc4e6

It was indeed in BT4 and had to do with searching for the
niceLenLimit. I will update the benchmarks over the weekend, as they
take some time to run.

Brett


Re: [xz-devel] xz-java and newer java

2024-02-29 Thread Brett Okken
> Thanks! Ideally there would be one commit to add the minimal portable
> version, then separate commits for each optimized variant.

Would you like me to remove the Unsafe based impl from
https://github.com/tukaani-project/xz-java/pull/13?

> So far I have given it only a quick try. array_comp_incremental seems
> faster than xz-java.git master. Compression time was reduced by about
> 10 %. :-) This is with OpenJDK 21.0.2, only a quick test, and my
> computer is old so I don't doubt your higher numbers.

How are you testing? I am using jmh, so it has a warm up period before
actually measuring, giving the jvm plenty of opportunity to perform
optimizations. If you are doing single shot executions to compress a
file, that could provide pretty different results.

> With array_comparison_performance the improvement seems to be less,
> maybe 5 %. I didn't test much yet but it still seems clear that
> array_comp_incremental is faster on my computer.

Going back to the previous question, this could be due to the fact that
I collapsed some class hierarchy in the _incremental PR. This could take
the optimizer a bit longer to figure out.

> However, your code produces different output compared to xz-java.git
> master so the speed comparison isn't entirely fair. I assume there was
> no intent to affect the encoder output with these changes so I wonder
> what is going on. Both of your branches produce the same output so it's
> something common between them that makes the difference.

This was definitely not the intent, and I had not noticed this previously.

With the 3 files I test with, none have any difference at preset 3.
The smallest file (ihe_ovly_pr.dcm) also has no difference at preset 6.

With the ~25MB image1.dcm (mostly a greyscale bmp), the PR versions
produce smaller output at preset 6.
1.9 = 4,041,476
PR = 4,004,156

There is a smaller difference with the ~50MB xml file, but strangely,
the PR version is slightly bigger.
1.9 = 1,589,512
PR = 1,589,564

Given that I am only seeing differences at preset 6, I am
guessing the difference must be in BT4.
The result still seems to be valid (at least the java XZInputStream
reads it back correctly).
There is clearly a subtle "defect" somewhere, but I cannot tell if it
is in the current trunk or the PR. My best guess is that there is an
off by 1 error in one or the other.

Brett

On Thu, Feb 29, 2024 at 11:35 AM Lasse Collin  wrote:
>
> On 2024-02-25 Brett Okken wrote:
> > I created https://github.com/tukaani-project/xz-java/pull/13 with the
> > bare bones changes to utilize a utility for array comparisons and an
> > Unsafe implementation.
> > When/if that is reviewed and approved, we can move on through the
> > other implementation options.
>
> Thanks! Ideally there would be one commit to add the minimal portable
> version, then separate commits for each optimized variant.
>
> So far I have given it only a quick try. array_comp_incremental seems
> faster than xz-java.git master. Compression time was reduced by about
> 10 %. :-) This is with OpenJDK 21.0.2, only a quick test, and my
> computer is old so I don't doubt your higher numbers.
>
> With array_comparison_performance the improvement seems to be less,
> maybe 5 %. I didn't test much yet but it still seems clear that
> array_comp_incremental is faster on my computer.
>
> However, your code produces different output compared to xz-java.git
> master so the speed comparison isn't entirely fair. I assume there was
> no intent to affect the encoder output with these changes so I wonder
> what is going on. Both of your branches produce the same output so it's
> something common between them that makes the difference.
>
> I plan to get back to this next week.
>
> > > One thing I wonder is if JNI could help.
> >
> > It would most likely make things faster, but also more complicated. I
> > like the java version for the simplicity. I am not necessarily looking
> > to compete with native performance, but would like to get improvements
> > where they are reasonably available. Here there is some complexity in
> > supporting multiple implementations for different versions and/or
> > architectures, but that complexity does not intrude into the core of
> > the xz code.
>
> I think your thoughts are similar to mine here. Java version is clearly
> slower but it's nicer code to read too. A separate class for buffer
> comparisons indeed doesn't hurt the readability of the core code.
>
> On the other hand, if Java version happened to be used a lot then JNI
> could save both time (up to 50 %) and even electricity. java.util.zip
> uses native zlib for the performance-critical code.
>
> In the long run both faster Java code and JNI might be worth doing.
> There's more than enough pure Java stuff to do for now so any JNI
> thoughts have to wait.
>
> --
> Lasse Collin
>



Re: [xz-devel] xz-java and newer java

2024-02-29 Thread Lasse Collin
On 2024-02-25 Brett Okken wrote:
> I created https://github.com/tukaani-project/xz-java/pull/13 with the
> bare bones changes to utilize a utility for array comparisons and an
> Unsafe implementation.
> When/if that is reviewed and approved, we can move on through the
> other implementation options.

Thanks! Ideally there would be one commit to add the minimal portable
version, then separate commits for each optimized variant.

So far I have given it only a quick try. array_comp_incremental seems
faster than xz-java.git master. Compression time was reduced by about
10 %. :-) This is with OpenJDK 21.0.2, only a quick test, and my
computer is old so I don't doubt your higher numbers.

With array_comparison_performance the improvement seems to be less,
maybe 5 %. I didn't test much yet but it still seems clear that
array_comp_incremental is faster on my computer.

However, your code produces different output compared to xz-java.git
master so the speed comparison isn't entirely fair. I assume there was
no intent to affect the encoder output with these changes so I wonder
what is going on. Both of your branches produce the same output so it's
something common between them that makes the difference.

I plan to get back to this next week.

> > One thing I wonder is if JNI could help.  
> 
> It would most likely make things faster, but also more complicated. I
> like the java version for the simplicity. I am not necessarily looking
> to compete with native performance, but would like to get improvements
> where they are reasonably available. Here there is some complexity in
> supporting multiple implementations for different versions and/or
> architectures, but that complexity does not intrude into the core of
> the xz code.

I think your thoughts are similar to mine here. Java version is clearly
slower but it's nicer code to read too. A separate class for buffer
comparisons indeed doesn't hurt the readability of the core code.

On the other hand, if Java version happened to be used a lot then JNI
could save both time (up to 50 %) and even electricity. java.util.zip
uses native zlib for the performance-critical code.

In the long run both faster Java code and JNI might be worth doing.
There's more than enough pure Java stuff to do for now so any JNI
thoughts have to wait.

-- 
Lasse Collin



Re: [xz-devel] xz-java and newer java

2024-02-25 Thread Brett Okken
> Thanks! It could be good to split into smaller commits to make reviewing
> easier.

I created https://github.com/tukaani-project/xz-java/pull/13 with the
bare bones changes to utilize a utility for array comparisons and an
Unsafe implementation.
When/if that is reviewed and approved, we can move on through the
other implementation options.

> One thing I wonder is if JNI could help.

It would most likely make things faster, but also more complicated. I
like the java version for the simplicity. I am not necessarily looking
to compete with native performance, but would like to get improvements
where they are reasonably available. Here there is some complexity in
supporting multiple implementations for different versions and/or
architectures, but that complexity does not intrude into the core of
the xz code.

Thanks,
Brett

On Mon, Feb 19, 2024 at 1:32 PM Lasse Collin  wrote:
>
> On 2024-02-19 Brett Okken wrote:
> > I have created a pr to the GitHub project.
> >
> > https://github.com/tukaani-project/xz-java/pull/12
>
> Thanks! It could be good to split into smaller commits to make reviewing
> easier.
>
> > It is not clear to me if that is actually seeing active dev on the
> > Java project yet.
>
> I see now that there are quite a few things on GH. I had forgotten to
> turn email notifications on for the xz-java project; clearly those
> aren't on by default. :-( But likely not much would have been done even
> if I had noticed those issues and PRs earlier so the main problem is
> that the silence has been impolite. I'm sorry.
>
> XZ Utils 5.6.0 has to be released this month since there was a wish to
> get it into the next Ubuntu LTS. I'm hoping that next month something
> will finally get done around XZ for Java. We'll see.
>
> One thing I wonder is if JNI could help. Optimizing the Java code can
> help a bit but I suspect that it still won't be very fast. So far it has
> been nice that the Java code is quite readable and I would like keep it
> that way in the future too.
>
> --
> Lasse Collin



Re: [xz-devel] xz-java and newer java

2024-02-19 Thread Lasse Collin
On 2024-02-19 Brett Okken wrote:
> I have created a pr to the GitHub project.
> 
> https://github.com/tukaani-project/xz-java/pull/12

Thanks! It could be good to split into smaller commits to make reviewing
easier.

> It is not clear to me if that is actually seeing active dev on the
> Java project yet.

I see now that there are quite a few things on GH. I had forgotten to
turn email notifications on for the xz-java project; clearly those
aren't on by default. :-( But likely not much would have been done even
if I had noticed those issues and PRs earlier so the main problem is
that the silence has been impolite. I'm sorry.

XZ Utils 5.6.0 has to be released this month since there was a wish to
get it into the next Ubuntu LTS. I'm hoping that next month something
will finally get done around XZ for Java. We'll see.

One thing I wonder is if JNI could help. Optimizing the Java code can
help a bit but I suspect that it still won't be very fast. So far it has
been nice that the Java code is quite readable and I would like keep it
that way in the future too.

-- 
Lasse Collin



Re: [xz-devel] xz-java and newer java

2024-02-19 Thread Brett Okken
I have created a pr to the GitHub project.

https://github.com/tukaani-project/xz-java/pull/12

It is not clear to me if that is actually seeing active dev on the Java
project yet.

Thanks,
Brett

On Sat, Feb 12, 2022 at 11:45 AM Brett Okken 
wrote:

> Can this be taken up again?
>
> On Wed, Mar 24, 2021 at 6:20 AM Brett Okken 
> wrote:
>
>> I grabbed an older version in the last mail. This is the updated
>> version for aarch64.
>>
>


Re: [xz-devel] xz-java and newer java

2022-05-10 Thread Dennis Ens
> Can this be taken up again?
+1

Any updates on this?

--
Dennis Ens



Re: [xz-devel] xz-java and newer java

2022-02-12 Thread Brett Okken
Can this be taken up again?

On Wed, Mar 24, 2021 at 6:20 AM Brett Okken 
wrote:

> I grabbed an older version in the last mail. This is the updated
> version for aarch64.
>


Re: [xz-devel] xz-java and newer java

2021-03-24 Thread Brett Okken
I grabbed an older version in the last mail. This is the updated
version for aarch64.


ArrayUtil.java
Description: Binary data


Re: [xz-devel] xz-java and newer java

2021-03-23 Thread Brett Okken
I was able to test on AWS graviton2 instances (aarch64), but only with
jdk 15. The results show that the vectorized approach appears to be the
best option, though long comparisons are also an improvement over baseline.


Based on this, I made a small change to ArrayUtil to, by default, use
unsafe long comparisons for aarch64 for older JDKs.
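The long-comparison approach being defaulted here reads eight bytes at a time and locates the first differing byte from the XOR of the two words. A minimal sketch of that idea follows; this is illustrative only, not the attached ArrayUtil code. The class and method names are invented, and a JDK 9+ VarHandle stands in for sun.misc.Unsafe:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class LongMismatchSketch {
    // A byteArrayViewVarHandle allows unaligned 8-byte reads from a byte[].
    // A real implementation would use ByteOrder.nativeOrder() and choose
    // numberOfLeadingZeros/numberOfTrailingZeros to match; this sketch
    // hard-codes little-endian for simplicity.
    private static final VarHandle LONG_LE = MethodHandles.byteArrayViewVarHandle(
            long[].class, ByteOrder.LITTLE_ENDIAN);

    // Returns the number of leading bytes that match between
    // buf[aFrom..aFrom+length) and buf[bFrom..bFrom+length).
    static int mismatch(byte[] buf, int aFrom, int bFrom, int length) {
        int i = 0;
        for (; i + 8 <= length; i += 8) {
            long diff = (long) LONG_LE.get(buf, aFrom + i)
                      ^ (long) LONG_LE.get(buf, bFrom + i);
            if (diff != 0)
                // In a little-endian word the lowest set byte is the
                // first differing byte.
                return i + (Long.numberOfTrailingZeros(diff) >>> 3);
        }
        for (; i < length; ++i)     // compare the 0-7 remaining bytes
            if (buf[aFrom + i] != buf[bFrom + i])
                break;
        return i;
    }
}
```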

I attached an image of the summarized data, but here it is raw:

BASELINE
Benchmark                        (file)           (preset)  Mode  Cnt      Score     Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm         3  avgt    3      1.192 ±   0.005  ms/op
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm         6  avgt    3      6.066 ±   0.023  ms/op
XZCompressionBenchmark.compress  image1.dcm              3  avgt    3   4125.464 ± 134.683  ms/op
XZCompressionBenchmark.compress  image1.dcm              6  avgt    3   8193.866 ± 205.916  ms/op
XZCompressionBenchmark.compress  large.xml               3  avgt    3   1438.101 ±   7.357  ms/op
XZCompressionBenchmark.compress  large.xml               6  avgt    3  11961.600 ±  38.130  ms/op

Updated
Benchmark                                    (file)           (preset)  Mode  Cnt      Score     Error  Units
XZCompressionBenchmark.compress_legacy       ihe_ovly_pr.dcm         3  avgt    3      1.434 ±   0.007  ms/op
XZCompressionBenchmark.compress_legacy       ihe_ovly_pr.dcm         6  avgt    3      6.694 ±   0.006  ms/op
XZCompressionBenchmark.compress_legacy       image1.dcm              3  avgt    3   4236.741 ± 176.331  ms/op
XZCompressionBenchmark.compress_legacy       image1.dcm              6  avgt    3   8923.713 ± 715.574  ms/op
XZCompressionBenchmark.compress_legacy       large.xml               3  avgt    3   1399.139 ±   5.955  ms/op
XZCompressionBenchmark.compress_legacy       large.xml               6  avgt    3  15793.829 ± 169.169  ms/op
XZCompressionBenchmark.compress_unsafe_long  ihe_ovly_pr.dcm         3  avgt    3      1.341 ±   0.016  ms/op
XZCompressionBenchmark.compress_unsafe_long  ihe_ovly_pr.dcm         6  avgt    3      4.441 ±   0.012  ms/op
XZCompressionBenchmark.compress_unsafe_long  image1.dcm              3  avgt    3   4172.261 ±  41.783  ms/op
XZCompressionBenchmark.compress_unsafe_long  image1.dcm              6  avgt    3   7414.503 ± 123.315  ms/op
XZCompressionBenchmark.compress_unsafe_long  large.xml               3  avgt    3   1289.451 ±  10.420  ms/op
XZCompressionBenchmark.compress_unsafe_long  large.xml               6  avgt    3  11355.386 ±  68.198  ms/op
XZCompressionBenchmark.compress_vh_int       ihe_ovly_pr.dcm         3  avgt    3      1.343 ±   0.008  ms/op
XZCompressionBenchmark.compress_vh_int       ihe_ovly_pr.dcm         6  avgt    3      5.137 ±   0.016  ms/op
XZCompressionBenchmark.compress_vh_int       image1.dcm              3  avgt    3   4097.739 ± 162.179  ms/op
XZCompressionBenchmark.compress_vh_int       image1.dcm              6  avgt    3   7865.803 ± 127.912  ms/op
XZCompressionBenchmark.compress_vh_int       large.xml               3  avgt    3   1284.516 ±  26.837  ms/op
XZCompressionBenchmark.compress_vh_int       large.xml               6  avgt    3  12066.120 ± 106.382  ms/op
XZCompressionBenchmark.compress_vh_long      ihe_ovly_pr.dcm         3  avgt    3      1.330 ±   0.009  ms/op
XZCompressionBenchmark.compress_vh_long      ihe_ovly_pr.dcm         6  avgt    3      4.569 ±   0.205  ms/op
XZCompressionBenchmark.compress_vh_long      image1.dcm              3  avgt    3   4110.154 ± 182.196  ms/op
XZCompressionBenchmark.compress_vh_long      image1.dcm              6  avgt    3   7420.054 ± 330.954  ms/op
XZCompressionBenchmark.compress_vh_long      large.xml               3  avgt    3   1271.665 ±  10.160  ms/op
XZCompressionBenchmark.compress_vh_long      large.xml               6  avgt    3  11077.733 ±  64.546  ms/op
XZCompressionBenchmark.compress_vh_vector    ihe_ovly_pr.dcm         3  avgt    3      1.326 ±   0.016  ms/op
XZCompressionBenchmark.compress_vh_vector    ihe_ovly_pr.dcm         6  avgt    3      4.551 ±   0.085  ms/op
XZCompressionBenchmark.compress_vh_vector    image1.dcm              3  avgt    3   4084.482 ±  32.445  ms/op
XZCompressionBenchmark.compress_vh_vector    image1.dcm              6  avgt    3   7670.077 ± 343.810  ms/op
XZCompressionBenchmark.compress_vh_vector    large.xml               3  avgt    3   1274.196 ±   2.831  ms/op
XZCompressionBenchmark.compress_vh_vector    large.xml               6  avgt    3  10485.505 ±  43.182  ms/op


ArrayUtil.java
Description: Binary data


Re: [xz-devel] xz-java and newer java

2021-02-19 Thread Brett Okken
I have attached updated patches and ArrayUtil.java.
HC4 needed changes/optimizations in both locations.
I also found a better way to handle BT4 occasionally sending -1 as the length.
diff --git a/src/org/tukaani/xz/lz/BT4.java b/src/org/tukaani/xz/lz/BT4.java
index 6c46feb..7d78aef 100644
--- a/src/org/tukaani/xz/lz/BT4.java
+++ b/src/org/tukaani/xz/lz/BT4.java
@@ -11,6 +11,7 @@
 package org.tukaani.xz.lz;

 import org.tukaani.xz.ArrayCache;
+import org.tukaani.xz.common.ArrayUtil;

 final class BT4 extends LZEncoder {
 private final Hash234 hash;
@@ -46,6 +47,7 @@ final class BT4 extends LZEncoder {
 this.depthLimit = depthLimit > 0 ? depthLimit : 16 + niceLen / 2;
 }

+@Override
 public void putArraysToCache(ArrayCache arrayCache) {
 arrayCache.putArray(tree);
 hash.putArraysToCache(arrayCache);
@@ -70,6 +72,7 @@ final class BT4 extends LZEncoder {
 return avail;
 }

+@Override
 public Matches getMatches() {
 matches.count = 0;

@@ -118,9 +121,10 @@ final class BT4 extends LZEncoder {

 // If a match was found, see how long it is.
 if (matches.count > 0) {
-while (lenBest < matchLenLimit && buf[readPos + lenBest - delta2]
-  == buf[readPos + lenBest])
-++lenBest;
+lenBest += ArrayUtil.mismatch(buf,
+  readPos + lenBest - delta2,
+  readPos + lenBest,
+  matchLenLimit - lenBest);

 matches.len[matches.count - 1] = lenBest;

@@ -160,25 +164,27 @@ final class BT4 extends LZEncoder {
 + (delta > cyclicPos ? cyclicSize : 0)) << 1;
 int len = Math.min(len0, len1);

-if (buf[readPos + len - delta] == buf[readPos + len]) {
-while (++len < matchLenLimit)
-if (buf[readPos + len - delta] != buf[readPos + len])
-break;
-
-if (len > lenBest) {
-lenBest = len;
-matches.len[matches.count] = len;
-matches.dist[matches.count] = delta - 1;
-++matches.count;
-
-if (len >= niceLenLimit) {
-tree[ptr1] = tree[pair];
-tree[ptr0] = tree[pair + 1];
-return matches;
-}
+int mismatch = ArrayUtil.mismatch(buf,
+  readPos + len - delta,
+  readPos + len,
+  matchLenLimit - len);
+
+len += mismatch;
+
+if (len > lenBest) {
+lenBest = len;
+matches.len[matches.count] = len;
+matches.dist[matches.count] = delta - 1;
+++matches.count;
+
+if (len >= niceLenLimit) {
+tree[ptr1] = tree[pair];
+tree[ptr0] = tree[pair + 1];
+return matches;
 }
 }

+
 if ((buf[readPos + len - delta] & 0xFF)
 < (buf[readPos + len] & 0xFF)) {
 tree[ptr1] = currentMatch;
@@ -215,18 +221,19 @@ final class BT4 extends LZEncoder {
 + (delta > cyclicPos ? cyclicSize : 0)) << 1;
 int len = Math.min(len0, len1);

-if (buf[readPos + len - delta] == buf[readPos + len]) {
-// No need to look for longer matches than niceLenLimit
-// because we only are updating the tree, not returning
-// matches found to the caller.
-do {
-if (++len == niceLenLimit) {
-tree[ptr1] = tree[pair];
-tree[ptr0] = tree[pair + 1];
-return;
-}
-} while (buf[readPos + len - delta] == buf[readPos + len]);
+// No need to look for longer matches than niceLenLimit
+// because we only are updating the tree, not returning
+// matches found to the caller.
+int mismatch = ArrayUtil.mismatch(buf,
+  readPos + len - delta,
+  readPos + len,
+  niceLenLimit);
+if (mismatch == niceLenLimit) {
+tree[ptr1] = tree[pair];
+tree[ptr0] = tree[pair + 1];
+return;
 }
+len += mismatch;

 if ((buf[readPos + len - delta] & 0xFF)
 < (buf[readPos + len] & 0xFF)) {
@@ -243,6 +250,7 @@ final class BT4 extends LZEncoder {
 }
 }

+@Override
 public void skip(int len) {
 while (len-- > 0) {
  

Re: [xz-devel] xz-java and newer java

2021-02-16 Thread Brett Okken
On Tue, Feb 16, 2021 at 12:48 PM Lasse Collin  wrote:
>
> I quickly tried these with "XZEncDemo 2". I used the preset 2 because
> that uses LZMAEncoderFast instead of LZMAEncoderNormal where the
> negative lengths result in a crash.

I updated the mismatch method to check for negative lengths upfront:

    return length > 0
            ? (int) MISMATCH.invokeExact(bytes, aFromIndex, bFromIndex, length)
            : 0;

I must not have sent that out.
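That kind of guard can also be illustrated with the JDK 9+ Arrays.mismatch (the basis of the VECTOR variant). The following is a hypothetical sketch of the utility's contract, not the actual ArrayUtil code:

```java
import java.util.Arrays;

final class VectorMismatchSketch {
    // Hypothetical portable variant of the mismatch utility: returns how many
    // leading bytes match between buf[aFrom..] and buf[bFrom..], clamped to
    // length, with the negative-length guard applied up front.
    static int mismatch(byte[] buf, int aFrom, int bFrom, int length) {
        if (length <= 0)
            return 0;   // BT4 can occasionally pass a negative length
        int r = Arrays.mismatch(buf, aFrom, aFrom + length,
                                buf, bFrom, bFrom + length);
        return r < 0 ? length : r;   // -1 means the whole range matched
    }
}
```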

> The performance was about the same
> or worse than the original code. I don't know why. I didn't spend much
> time on this and it's possible that I messed up something.

I have only been testing at the default preset (6). I should add tests
for one of the "fast" presets as well.

> One thing that may be worth checking out is how in HC4.java (and
> BT4.java too) the patch doesn't try to quickly skip too short matches
> like the original code does. I suppose the first set of patches should
> be such that they only replace the byte-by-byte loops with a function
> call to make comparison as fair as possible.

In BT4 (which is what is being tested), skipping the "too short"
matches does not actually seem to provide a benefit, at least for the
content I was testing. This might be different for HC4.

> These patches won't get into XZ for Java 1.9 but might be in a later
> version if I see them being/becoming good. The only remaining patch
> that might get into 1.9 is LZDecoder.repeat improvements.

Sounds good.

> When you post a patch or other code, please make sure that word-wrapping
> is disabled in the email client or use attachments. Thanks!

I will move to attachments. I do not see how to stick with plain text
and keep gmail from wrapping.

Brett



Re: [xz-devel] xz-java and newer java

2021-02-16 Thread Lasse Collin
I quickly tried these with "XZEncDemo 2". I used the preset 2 because
that uses LZMAEncoderFast instead of LZMAEncoderNormal where the
negative lengths result in a crash. The performance was about the same
or worse than the original code. I don't know why. I didn't spend much
time on this and it's possible that I messed up something.

One thing that may be worth checking out is how in HC4.java (and
BT4.java too) the patch doesn't try to quickly skip too short matches
like the original code does. I suppose the first set of patches should
be such that they only replace the byte-by-byte loops with a function
call to make comparison as fair as possible.
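The quick skip being described amounts to two byte reads before a candidate match is measured in full. Roughly, as a hypothetical helper (the real check is inlined in HC4.java, not factored out like this):

```java
final class QuickSkipSketch {
    // Checks the byte that would make the candidate at least one byte longer
    // than the best match so far, plus the first byte. If either differs, a
    // full length scan of this candidate can be skipped after only two reads.
    static boolean worthMeasuring(byte[] buf, int readPos, int delta, int lenBest) {
        return buf[readPos + lenBest - delta] == buf[readPos + lenBest]
                && buf[readPos - delta] == buf[readPos];
    }
}
```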

These patches won't get into XZ for Java 1.9 but might be in a later
version if I see them being/becoming good. The only remaining patch
that might get into 1.9 is LZDecoder.repeat improvements.

When you post a patch or other code, please make sure that word-wrapping
is disabled in the email client or use attachments. Thanks!

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz-java and newer java

2021-01-31 Thread Brett Okken
Replacing while loops with switch statements for the "extra bytes"
also yields a small improvement. Pulling that common logic out into a
utility method negates most of the benefit.
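The switch idea is a fall-through over the remaining byte count. A hypothetical sketch of just the tail handling follows; the word-at-a-time main loop is omitted here, so this assumes fewer than 8 bytes remain:

```java
final class TailSwitchSketch {
    // Compares the final (length - i) < 8 bytes via a fall-through switch
    // instead of a while loop; each case handles one remaining byte and
    // falls through to the next.
    static int mismatch(byte[] buf, int aFrom, int bFrom, int length) {
        int i = 0;  // a word-at-a-time main loop would advance i first
        switch (length - i) {
        case 7: if (buf[aFrom + i] != buf[bFrom + i]) return i; ++i;
        case 6: if (buf[aFrom + i] != buf[bFrom + i]) return i; ++i;
        case 5: if (buf[aFrom + i] != buf[bFrom + i]) return i; ++i;
        case 4: if (buf[aFrom + i] != buf[bFrom + i]) return i; ++i;
        case 3: if (buf[aFrom + i] != buf[bFrom + i]) return i; ++i;
        case 2: if (buf[aFrom + i] != buf[bFrom + i]) return i; ++i;
        case 1: if (buf[aFrom + i] != buf[bFrom + i]) return i; ++i;
        }
        return i;
    }
}
```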
Here is the updated ArrayUtil class.



package org.tukaani.xz.common;

import static java.lang.invoke.MethodType.methodType;

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Utilities for optimized array interactions.
 *
 * 
 * The means of comparing arrays can be controlled by setting the system
 * property {@code org.tukaani.xz.ArrayComparison} to a value from
 * {@link ArrayComparison}.
 * 
 *
 * @author Brett Okken
 */
public final class ArrayUtil {

/**
 * Enumerated options for controlling implementation of how to
 * compare arrays.
 */
public static enum ArrayComparison {
/**
 * Uses {@code VarHandle} for {@code int[]} access.
 * 
 * This is default behavior on jdk9+ for 32 bit x86.
 * 
 */
VH_INT,
/**
 * Uses {@code VarHandle} for {@code int[]} access after attempting
 * to align the reads on 4 byte boundaries.
 */
VH_INT_ALIGN,
/**
 * Uses {@code VarHandle} for {@code long[]} access.
 * 
 * This is default behavior on jdk9+ for 64 bit x86.
 * 
 */
VH_LONG,
/**
 * Uses {@code VarHandle} for {@code long[]} access after attempting
 * to align the reads.
 */
VH_LONG_ALIGN,
/**
 * Uses {@code Arrays.mismatch()} to perform vectorized comparison.
 * 
 * This is default behavior on jdk9+ for non-x86.
 * 
 */
VECTOR,
/**
 * Uses {@code sun.misc.Unsafe.getInt()} for unaligned {@code int[]}
 * access.
 * 
 * This is default behavior on jdk 8 and prior for 32 bit x86.
 * 
 */
UNSAFE_GET_INT,
/**
 * Uses {@code sun.misc.Unsafe.getLong()} for unaligned {@code long[]}
 * access.
 * 
 * This is default behavior on jdk 8 and prior for 64 bit x86.
 * 
 */
UNSAFE_GET_LONG,
/**
 * Performs byte-by-byte comparison.
 */
LEGACY;

static ArrayComparison getFromProperty(String prop) {
if (prop == null || prop.isEmpty()) {
return null;
}
try {
return ArrayComparison.valueOf(prop.toUpperCase(Locale.US));
} catch (Exception e) {
final Logger logger =
Logger.getLogger(ArrayUtil.class.getName());
logger.log(Level.INFO,
           "Invalid ArrayComparison option, using default behavior",
           e);
return null;
}
}
}

/**
 * MethodHandle to the actual mismatch method to use at runtime.
 */
private static final MethodHandle MISMATCH;

/**
 * The method this is bound to at runtime depends on the chosen
 * implementation for {@code byte[]} comparison.
 * 
 * For {@code long} based comparisons, it will be bound to either
 * {@link Long#numberOfLeadingZeros(long)} or
 * {@link Long#numberOfTrailingZeros(long)} depending on
 * {@link ByteOrder#nativeOrder()}.
 * 
 * 
 * For {@code int} based comparisons it will be bound to either
 * {@link Integer#numberOfLeadingZeros(int)} or
 * {@link Integer#numberOfTrailingZeros(int)} depending on
 * {@link ByteOrder#nativeOrder()}.
 * 
 */
private static final MethodHandle LEADING_ZEROS;

/**
 * Populated from reflected read of
 * {@code sun.misc.Unsafe.ARRAY_BYTE_BASE_OFFSET} if one of the unsafe
 * implementations is used.
 */
private static final long ARRAY_BASE_OFFSET;

/**
 * The method this is bound to at runtime depends on the chosen
 * implementation for {@code byte[]} comparison.
 * 
 * For {@link ArrayComparison#VECTOR} and
 * {@link ArrayComparison#LEGACY} this will be {@code null}.
 * 
 * 
 * For {@link ArrayComparison#VH_INT} and
 * {@link ArrayComparison#VH_INT_ALIGN}
 * this will be a jdk 9+ {@code byteArrayViewVarHandle} for {@code int[]}
 * using the {@link ByteOrder#nativeOrder()}. The method signature is
 * {@code int get(byte[], int)}.
 * 
 * 
 * For {@link ArrayComparison#VH_LONG} and
 * {@link ArrayComparison#VH_LONG_ALIGN}
 * this will be a jdk 9+ {@code byteArrayViewVarHandle} for {@code long[]}
 * using the {@link ByteOrder#nativeOrder()}. The method signature is
 * {@code long get(byte[], int)}.
 * 
 

Re: [xz-devel] xz-java and newer java

2021-01-24 Thread Brett Okken
Based on some playing around with unrolling loops as part of the crc64
implementation, I tried unrolling the "legacy" implementation and
found it provided some nice improvements. The improvements were most
pronounced on 32 bit jdk 11:

32 jdk 11 - LEGACY
Benchmark                        (file)           Mode  Cnt      Score     Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt    3     17.812 ±   0.588  ms/op
XZCompressionBenchmark.compress  image1.dcm       avgt    3   8404.259 ± 391.678  ms/op
XZCompressionBenchmark.compress  large.xml        avgt    3  16037.416 ± 467.379  ms/op

Unrolled
Benchmark                        (file)           Mode  Cnt      Score     Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt    3     13.624 ±   0.845  ms/op
XZCompressionBenchmark.compress  image1.dcm       avgt    3   7833.118 ±  28.132  ms/op
XZCompressionBenchmark.compress  large.xml        avgt    3  12838.831 ± 192.884  ms/op

32 jdk 11 - LEGACY (server)
Benchmark                        (file)           Mode  Cnt      Score     Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt    3     14.105 ±   0.081  ms/op
XZCompressionBenchmark.compress  image1.dcm       avgt    3   8474.630 ± 518.903  ms/op
XZCompressionBenchmark.compress  large.xml        avgt    3  16009.553 ± 529.315  ms/op

Unrolled
Benchmark                        (file)           Mode  Cnt      Score     Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt    3     10.513 ±   0.290  ms/op
XZCompressionBenchmark.compress  image1.dcm       avgt    3   7900.578 ± 309.317  ms/op
XZCompressionBenchmark.compress  large.xml        avgt    3  12871.200 ± 570.491  ms/op


/**
 * Simply loops over all of the bytes, comparing one at a time.
 */
@SuppressWarnings("unused")
private static int legacyMismatch(
byte[] a, int aFromIndex, int bFromIndex, int length) {
int i=0;
for (int j=length - 7; i
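The code above is cut off in the archive after the loop header. The following is only a reconstruction of the unrolling idea under that loop shape, not the author's exact code: eight byte comparisons per iteration, then a plain loop for the tail.

```java
final class UnrolledMismatchSketch {
    // Hypothetical unrolled byte-by-byte mismatch: returns the number of
    // leading bytes that match between a[aFromIndex..] and a[bFromIndex..].
    static int mismatch(byte[] a, int aFromIndex, int bFromIndex, int length) {
        int i = 0;
        for (int j = length - 7; i < j; i += 8) {
            if (a[aFromIndex + i]     != a[bFromIndex + i])     return i;
            if (a[aFromIndex + i + 1] != a[bFromIndex + i + 1]) return i + 1;
            if (a[aFromIndex + i + 2] != a[bFromIndex + i + 2]) return i + 2;
            if (a[aFromIndex + i + 3] != a[bFromIndex + i + 3]) return i + 3;
            if (a[aFromIndex + i + 4] != a[bFromIndex + i + 4]) return i + 4;
            if (a[aFromIndex + i + 5] != a[bFromIndex + i + 5]) return i + 5;
            if (a[aFromIndex + i + 6] != a[bFromIndex + i + 6]) return i + 6;
            if (a[aFromIndex + i + 7] != a[bFromIndex + i + 7]) return i + 7;
        }
        for (; i < length; ++i)      // handle the 0-7 remaining bytes
            if (a[aFromIndex + i] != a[bFromIndex + i])
                break;
        return i;
    }
}
```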

Re: [xz-devel] xz-java and newer java

2021-01-22 Thread Brett Okken
package org.tukaani.xz.common;

import static java.lang.invoke.MethodType.methodType;

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Utilities for optimized array interactions.
 *
 * 
 * The means of comparing arrays can be controlled by setting the
 * system property {@code org.tukaani.xz.ArrayComparison} to a value from
 * {@link ArrayComparison}.
 * 
 *
 * @author Brett Okken
 */
public final class ArrayUtil {

/**
 * Enumerated options for controlling implementation of how to
 * compare arrays.
 */
public static enum ArrayComparison {
/**
 * Uses {@code VarHandle} for {@code int[]} access.
 * 
 * This is default behavior on jdk9+ for 32 bit x86.
 * 
 */
VH_INT,
/**
 * Uses {@code VarHandle} for {@code int[]} access after attempting
 * to align the reads on 4 byte boundaries.
 */
VH_INT_ALIGN,
/**
 * Uses {@code VarHandle} for {@code long[]} access.
 * 
 * This is default behavior on jdk9+ for 64 bit x86.
 * 
 */
VH_LONG,
/**
 * Uses {@code VarHandle} for {@code long[]} access after attempting
 * to align the reads.
 */
VH_LONG_ALIGN,
/**
 * Uses {@code Arrays.mismatch()} to perform vectorized comparison.
 * 
 * This is default behavior on jdk9+ for non-x86.
 * 
 */
VECTOR,
/**
 * Uses {@code sun.misc.Unsafe.getInt()} for unaligned {@code int[]}
 * access.
 * 
 * This is default behavior on jdk 8 and prior for 32 bit x86.
 * 
 */
UNSAFE_GET_INT,
/**
 * Uses {@code sun.misc.Unsafe.getLong()} for unaligned {@code long[]}
 * access.
 * 
 * This is default behavior on jdk 8 and prior for 64 bit x86.
 * 
 */
UNSAFE_GET_LONG,
/**
 * Performs byte-by-byte comparison.
 */
LEGACY;

static ArrayComparison getFromProperty(String prop) {
if (prop == null || prop.isEmpty()) {
return null;
}
try {
return ArrayComparison.valueOf(prop.toUpperCase(Locale.US));
} catch (Exception e) {
final Logger logger =
Logger.getLogger(ArrayUtil.class.getName());
logger.log(Level.INFO,
           "Invalid ArrayComparison option, using default behavior",
           e);
return null;
}
}
}

/**
 * MethodHandle to the actual mismatch method to use at runtime.
 */
private static final MethodHandle MISMATCH;

/**
 * The method this is bound to at runtime depends on the chosen
 * implementation for {@code byte[]} comparison.
 * 
 * For {@code long} based comparisons, it will be bound to either
 * {@link Long#numberOfLeadingZeros(long)} or
 * {@link Long#numberOfTrailingZeros(long)} depending on
 * {@link ByteOrder#nativeOrder()}.
 * 
 * 
 * For {@code int} based comparisons it will be bound to either
 * {@link Integer#numberOfLeadingZeros(int)} or
 * {@link Integer#numberOfTrailingZeros(int)} depending on
 * {@link ByteOrder#nativeOrder()}.
 * 
 */
private static final MethodHandle LEADING_ZEROS;

/**
 * Populated from reflected read of
 * {@code sun.misc.Unsafe.ARRAY_BYTE_BASE_OFFSET} if one of the unsafe
 * implementations is used.
 */
private static final long ARRAY_BASE_OFFSET;

/**
 * The method this is bound to at runtime depends on the chosen
 * implementation for {@code byte[]} comparison.
 * 
 * For {@link ArrayComparison#VECTOR} and
 * {@link ArrayComparison#LEGACY} this will be {@code null}.
 * 
 * 
 * For {@link ArrayComparison#VH_INT} and
 * {@link ArrayComparison#VH_INT_ALIGN}
 * this will be a jdk 9+ {@code byteArrayViewVarHandle} for {@code int[]}
 * using the {@link ByteOrder#nativeOrder()}. The method signature is
 * {@code int get(byte[], int)}.
 * 
 * 
 * For {@link ArrayComparison#VH_LONG} and
 * {@link ArrayComparison#VH_LONG_ALIGN}
 * this will be a jdk 9+ {@code byteArrayViewVarHandle} for {@code long[]}
 * using the {@link ByteOrder#nativeOrder()}. The method signature is
 * {@code long get(byte[], int)}.
 * 
 * 
 * For {@link ArrayComparison#UNSAFE_GET_INT} this is bound to
 * {@code sun.misc.Unsafe.getInt(Object, long)}.
 * 
 * 
 * For {@link ArrayComparison#UNSAFE_GET_LONG} this is bound to
 * 

Re: [xz-devel] xz-java and newer java

2021-01-22 Thread Brett Okken
diff --git a/src/org/tukaani/xz/lz/BT4.java b/src/org/tukaani/xz/lz/BT4.java
index 6c46feb..c96c766 100644
--- a/src/org/tukaani/xz/lz/BT4.java
+++ b/src/org/tukaani/xz/lz/BT4.java
@@ -11,6 +11,7 @@
 package org.tukaani.xz.lz;

 import org.tukaani.xz.ArrayCache;
+import org.tukaani.xz.common.ArrayUtil;

 final class BT4 extends LZEncoder {
 private final Hash234 hash;
@@ -118,9 +119,10 @@ final class BT4 extends LZEncoder {

 // If a match was found, see how long it is.
 if (matches.count > 0) {
-while (lenBest < matchLenLimit && buf[readPos + lenBest - delta2]
-  == buf[readPos + lenBest])
-++lenBest;
+lenBest += ArrayUtil.mismatch(buf,
+  readPos + lenBest - delta2,
+  readPos + lenBest,
+  matchLenLimit - lenBest);

 matches.len[matches.count - 1] = lenBest;

@@ -160,10 +162,12 @@ final class BT4 extends LZEncoder {
 + (delta > cyclicPos ? cyclicSize : 0)) << 1;
 int len = Math.min(len0, len1);

-if (buf[readPos + len - delta] == buf[readPos + len]) {
-while (++len < matchLenLimit)
-if (buf[readPos + len - delta] != buf[readPos + len])
-break;
+int mismatch = ArrayUtil.mismatch(buf,
+  readPos + len - delta,
+  readPos + len,
+  matchLenLimit - len);
+if (mismatch != 0) {
+len += mismatch;

 if (len > lenBest) {
 lenBest = len;
@@ -215,18 +219,19 @@ final class BT4 extends LZEncoder {
 + (delta > cyclicPos ? cyclicSize : 0)) << 1;
 int len = Math.min(len0, len1);

-if (buf[readPos + len - delta] == buf[readPos + len]) {
-// No need to look for longer matches than niceLenLimit
-// because we only are updating the tree, not returning
-// matches found to the caller.
-do {
-if (++len == niceLenLimit) {
-tree[ptr1] = tree[pair];
-tree[ptr0] = tree[pair + 1];
-return;
-}
-} while (buf[readPos + len - delta] == buf[readPos + len]);
+// No need to look for longer matches than niceLenLimit
+// because we only are updating the tree, not returning
+// matches found to the caller.
+int mismatch = ArrayUtil.mismatch(buf,
+  readPos + len - delta,
+  readPos + len,
+  niceLenLimit);
+if (mismatch == niceLenLimit) {
+tree[ptr1] = tree[pair];
+tree[ptr0] = tree[pair + 1];
+return;
 }
+len += mismatch;

 if ((buf[readPos + len - delta] & 0xFF)
 < (buf[readPos + len] & 0xFF)) {



diff --git a/src/org/tukaani/xz/lz/HC4.java b/src/org/tukaani/xz/lz/HC4.java
index d2b4e84..623d59d 100644
--- a/src/org/tukaani/xz/lz/HC4.java
+++ b/src/org/tukaani/xz/lz/HC4.java
@@ -11,6 +11,7 @@
 package org.tukaani.xz.lz;

 import org.tukaani.xz.ArrayCache;
+import org.tukaani.xz.common.ArrayUtil;

 final class HC4 extends LZEncoder {
 private final Hash234 hash;
@@ -136,9 +137,10 @@ final class HC4 extends LZEncoder {

 // If a match was found, see how long it is.
 if (matches.count > 0) {
-while (lenBest < matchLenLimit && buf[readPos + lenBest - delta2]
-  == buf[readPos + lenBest])
-++lenBest;
+lenBest += ArrayUtil.mismatch(buf,
+  readPos + lenBest - delta2,
+  readPos + lenBest,
+  matchLenLimit - lenBest);

 matches.len[matches.count - 1] = lenBest;

@@ -167,30 +169,21 @@ final class HC4 extends LZEncoder {
 currentMatch = chain[cyclicPos - delta
  + (delta > cyclicPos ? cyclicSize : 0)];

-// Test the first byte and the first new byte that would give us
-// a match that is at least one byte longer than lenBest. This
-// too short matches get quickly skipped.
-if (buf[readPos + lenBest - delta] == buf[readPos + lenBest]
-&& buf[readPos - delta] == buf[readPos]) {
-// Calculate the length of the match.
-int len = 0;
-while (++len < matchLenLimit)
-if 

Re: [xz-devel] xz-java and newer java

2021-01-21 Thread Brett Okken
> Have you tested with 32-bit Java too? It's quite possible that it's
> better to use ints than longs on 32-bit system. If so, that should be
> detected at runtime too, I guess.

I have now run benchmarks using the 32-bit JRE on a 64-bit Windows
system. That introduces additional interesting effects, because the
32-bit JRE uses the client JVM by default, which does not use the C2
compiler. The longs appear to be a bit faster in client mode (less
optimized) and the ints a bit faster in server mode.

> In XZ Utils the arrays have extra room at the end so that memcmplen.h
> can always read 4/8/16 bytes at a time. Since this is easy to do, I
> think it should be done in XZ for Java too to avoid special handling of
> the last bytes.

Somewhat surprisingly, this actually appears to make things slightly
worse, because it requires an extra check to see whether a detected
difference lies beyond the actual length.

> Since Java in general is memory safe, having bound checks with Unsafe is
> nice as long as it doesn't hurt performance too much. This
>
>if (aFromIndex < 0 || aFromIndex + length > a.length ||
>bFromIndex < 0 || bFromIndex + length > b.length) {
>
> is a bit relaxed though since it doesn't catch integer overflows.
> Something like this would be more strict:
>
>if (length < 0 ||
> aFromIndex < 0 || aFromIndex > a.length - length ||
> bFromIndex < 0 || bFromIndex > b.length - length) {

Nice catch. Arrays approaching 2 GB are not common yet, but seem
likely in the future.
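To make the overflow concrete: with the relaxed check, a hostile length can wrap aFromIndex + length around to a negative int and slip past the comparison, while the rearranged form cannot overflow. A minimal sketch (method and class names here are illustrative, not from the patch):

```java
public class BoundsCheckDemo {
    // Relaxed check: aFromIndex + length can overflow to a negative
    // int, so the "> arrayLength" comparison wrongly passes.
    static boolean relaxedOk(int aFromIndex, int length, int arrayLength) {
        return !(aFromIndex < 0 || aFromIndex + length > arrayLength);
    }

    // Strict check: rearranged so no intermediate sum can overflow.
    static boolean strictOk(int aFromIndex, int length, int arrayLength) {
        return !(length < 0 || aFromIndex < 0
                 || aFromIndex > arrayLength - length);
    }

    public static void main(String[] args) {
        int from = 2;                 // small, valid-looking index
        int len = Integer.MAX_VALUE;  // hostile length
        int arr = 100;                // actual array length
        // from + len overflows to a negative value, so the relaxed
        // check accepts an invalid range; the strict check rejects it.
        System.out.println(relaxedOk(from, len, arr)); // true (the bug)
        System.out.println(strictOk(from, len, arr));  // false
    }
}
```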

> Comparing byte arrays as ints or longs results in unaligned/misaligned
> memory access. MethodHandles.byteArrayViewVarHandle docs say that this
> is OK. A quick web search gave me an impression that it might not be
> safe with Unsafe though. Can you verify how it is with Unsafe? If it
> isn't allowed, dropping support for Unsafe may be fine. It's just the
> older Java versions that would use it anyway.

This is a bit trickier. The closest I can find to documentation is
from the getInt method[1].

The object referred to by o is an array, and the offset is an integer
of the form B+N*S, where N is a valid index into the array, and B and
S are the values obtained by arrayBaseOffset and arrayIndexScale
(respectively) from the array's class. The value referred to is the
Nth element of the array.
...
However, the results are undefined if that variable is not in fact of
the type returned by this method.

Taken in context with what the new
jdk.internal.misc.Unsafe.getLongUnaligned implementation looks like[2]
unaligned access is not safe with Unsafe.getLong.
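Since unaligned Unsafe.getLong is undefined behavior on some architectures, one portable alternative is to assemble the word from single-byte loads, which are always aligned. This is only a sketch of that fallback idea, not code from the patch:

```java
public class SafeLongRead {
    // Assemble a little-endian long from eight single-byte reads.
    // Byte loads are always aligned, so this works at any offset on
    // any architecture (at the cost of more instructions).
    static long getLongLE(byte[] buf, int off) {
        long v = 0;
        for (int i = 7; i >= 0; --i)
            v = (v << 8) | (buf[off + i] & 0xFFL);
        return v;
    }

    public static void main(String[] args) {
        byte[] b = {1, 0, 0, 0, 0, 0, 0, 0, 0x7F};
        System.out.println(getLongLE(b, 0)); // 1
        System.out.println(getLongLE(b, 1)); // unaligned offset, still safe
    }
}
```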

> Do you have a way to check how these methods behave on Android and ARM?
> (I understand that this might be too much work to check. This may be
> skipped.)

I /might/ be able to run some benchmarks on aws graviton2 instances.

I will send out updated code soon, but I currently have
implementations for jdk 9+ to use VarHandle for x86 (ints for 32bit
longs for 64 bit) and non-x86 to use the vectorized Arrays mismatch.
For older jdks, x86 will use Unsafe (again, ints for 32 bit and longs
for 64 bit). There are also implementations for VarHandles which will
make an attempt to compare individual bytes to try and get memory
reads into alignment. The default implementation behavior can be
overridden by setting a system property.

[1] - 
https://github.com/openjdk/jdk11u-dev/blob/master/src/jdk.unsupported/share/classes/sun/misc/Unsafe.java#L127-L139
[2] - 
https://github.com/openjdk/jdk11u-dev/blob/master/src/java.base/share/classes/jdk/internal/misc/Unsafe.java#L3398



Re: [xz-devel] xz-java and newer java

2021-01-18 Thread Lasse Collin
On 2021-01-11 Brett Okken wrote:
> I threw together a quick jmh test, and there is no value in the
> changes to Hash234.

OK, let's forget that then.

On 2021-01-16 Brett Okken wrote:
> I have found a way to use VarHandle byte array access at runtime in
> code which is compile time compatible with jdk 7. So here is an
> updated ArrayUtil class which will use a VarHandle to read long values
> in jdk 9+. If that is not available, it will attempt to use
> sun.misc.Unsafe. If that cannot be found, it falls back to standard
> byte by byte comparison.

Sounds promising. :-) You have already done quite a bit of work in both
writing code and benchmarking. Thank you!

The method you ended up is similar to src/liblzma/common/memcmplen.h
in XZ Utils. There 8-byte version is used on 64-bit systems and 4-byte
version on 32-bit systems. In XZ Utils, SSE2 version (16-byte
comparison) is faster than 4-byte compare on 32-bit x86, but on x86-64
the 8-byte version has similar speed or is faster than the SSE2 version
(it depends on the CPU).

Have you tested with 32-bit Java too? It's quite possible that it's
better to use ints than longs on 32-bit system. If so, that should be
detected at runtime too, I guess.

In XZ Utils the arrays have extra room at the end so that memcmplen.h
can always read 4/8/16 bytes at a time. Since this is easy to do, I
think it should be done in XZ for Java too to avoid special handling of
the last bytes.

> I did add an index bounds check for the unsafe implementation and
> found it had minimal impact on overall performance.

Since Java in general is memory safe, having bound checks with Unsafe is
nice as long as it doesn't hurt performance too much. This

if (aFromIndex < 0 || aFromIndex + length > a.length ||
bFromIndex < 0 || bFromIndex + length > b.length) {

is a bit relaxed though since it doesn't catch integer overflows.
Something like this would be more strict:

if (length < 0 ||
aFromIndex < 0 || aFromIndex > a.length - length ||
bFromIndex < 0 || bFromIndex > b.length - length) {

> Using VarHandle (at least on jdk 11) offers very similar performance
> to Unsafe across all 3 files I used for benchmarking.

OK. I cannot comment the details much because I'm not familiar with
either API for now.

Comparing byte arrays as ints or longs results in unaligned/misaligned
memory access. MethodHandles.byteArrayViewVarHandle docs say that this
is OK. A quick web search gave me an impression that it might not be
safe with Unsafe though. Can you verify how it is with Unsafe? If it
isn't allowed, dropping support for Unsafe may be fine. It's just the
older Java versions that would use it anyway.

It is *essential* that the code works well also on archs that don't
have fast unaligned access. Even if the VarHandle method is safe, it's
not clear how the performance is on archs that don't support fast
unaligned access. It would be bad to add an optimization that is good
on x86-64 but counter-productive on some other archs. One may need
arch-specific code just like there is in XZ Utils, although on the
other hand it would be nice to keep the Java code less complicated.

Do you have a way to check how these methods behave on Android and ARM?
(I understand that this might be too much work to check. This may be
skipped.)

I wish to add module-info.java in the next release. Do these new
methods affect what should be in module-info.java? With the current
code this seems to be enough:

module org.tukaani.xz {
exports org.tukaani.xz;
}

> final int leadingZeros = (int)LEADING_ZEROS.invokeExact(diff);
> return i + (leadingZeros / Byte.SIZE);

Seems that Java might not optimize that division to a right shift. It
could be better to use "leadingZeros >>> 3".
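The two forms are easy to check against each other: for the non-negative values that numberOfLeadingZeros can return, dividing by Byte.SIZE and shifting right by 3 always agree, and the unsigned shift cannot trigger a sign-correcting division sequence. A small sanity check:

```java
public class ShiftDivDemo {
    public static void main(String[] args) {
        // numberOfLeadingZeros/numberOfTrailingZeros return 0..64,
        // and for those values "/ Byte.SIZE" and ">>> 3" are identical.
        for (int zeros = 0; zeros <= 64; zeros++) {
            if (zeros / Byte.SIZE != (zeros >>> 3))
                throw new AssertionError("mismatch at " + zeros);
        }
        System.out.println("ok");
    }
}
```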

> I know you said you were not going to be able to work on xz-java for
> awhile, but given these benchmark results, which really exceeded my
> expectations, could this get some priority to release?

I understood that it's 9-18 % faster. That is significant but it's
still a performance optimization only, not an important bug fix, and to
me the code doesn't feel completely ready yet (for example, the
unaligned access is important to get right).

(Compare to the threaded decompression support that is coming to XZ
Utils. It will speed things up a few hundred percent.)

Can you provide a complete patch to make testing easier (or if not
possible, complete copies of modified files)? Also, please try to wrap
the lines so that they stay within 80 columns (with some long
unbreakable strings this may not be possible, then those lines can be
overlong instead of messing up the indentation).

I think your patch will find its way into XZ for Java in some form
but once again I repeat that it will take some time. These XZ projects
are only a hobby for me and currently I don't even turn on my computer
every day.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] xz-java and newer java

2021-01-16 Thread Brett Okken
Lasse,

I have found a way to use VarHandle byte array access at runtime in
code which is compile time compatible with jdk 7. So here is an
updated ArrayUtil class which will use a VarHandle to read long values
in jdk 9+. If that is not available, it will attempt to use
sun.misc.Unsafe. If that cannot be found, it falls back to standard
byte by byte comparison.
I did add an index bounds check for the unsafe implementation and
found it had minimal impact on overall performance.
Using VarHandle (at least on jdk 11) offers very similar performance
to Unsafe across all 3 files I used for benchmarking.
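The last tier of the fallback chain Brett describes, plain byte-by-byte comparison, can be sketched as follows (illustrative only; the actual ArrayUtil code differs in details such as bounds handling):

```java
public class FallbackMismatch {
    // Returns the number of leading bytes that match, i.e. the index
    // of the first mismatch, or length if the ranges are fully equal
    // (matching the convention that ArrayUtil.mismatch returns length
    // rather than -1 when everything matches).
    static int mismatch(byte[] a, int aFrom, byte[] b, int bFrom, int length) {
        int i = 0;
        while (i < length && a[aFrom + i] == b[bFrom + i])
            ++i;
        return i;
    }

    public static void main(String[] args) {
        byte[] buf = {1, 2, 3, 4, 1, 2, 9, 4};
        // Compare buf[0..3] against buf[4..7]: first two bytes match.
        System.out.println(mismatch(buf, 0, buf, 4, 4)); // 2
    }
}
```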

--Baseline 1.8
Benchmark (file)  Mode  Cnt  Score
Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt4  9.558
±   0.239  ms/op
XZCompressionBenchmark.compress   image1.dcm  avgt4   6553.304
± 112.475  ms/op
XZCompressionBenchmark.compresslarge.xml  avgt4  10592.151
± 291.527  ms/op

--Unsafe
Benchmark (file)  Mode  Cnt Score
   Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt4 7.699
±   0.058  ms/op
XZCompressionBenchmark.compress   image1.dcm  avgt4  6001.170
± 143.814  ms/op
XZCompressionBenchmark.compresslarge.xml  avgt4  7853.963
±  83.753  ms/op

--VarHandle
Benchmark (file)  Mode  Cnt Score
   Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt4 7.630
±   0.542  ms/op
XZCompressionBenchmark.compress   image1.dcm  avgt4  5872.098
±  71.185  ms/op
XZCompressionBenchmark.compresslarge.xml  avgt4  8239.880
± 346.036  ms/op

I know you said you were not going to be able to work on xz-java for
awhile, but given these benchmark results, which really exceeded my
expectations, could this get some priority to release?


package org.tukaani.xz.common;

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.nio.ByteOrder;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Utilities for optimized array interactions.
 *
 * @author Brett Okken
 */
public final class ArrayUtil {

/**
 * MethodHandle to the actual mismatch method to use at runtime.
 */
private static final MethodHandle MISMATCH;

/**
 * If {@code sun.misc.Unsafe} can be loaded, this is a MethodHandle
 * bound to an instance of Unsafe for the method
 * {@code long getLong(Object, long)}.
 */
private static final MethodHandle UNSAFE_GET_LONG;

/**
 * MethodHandle to either {@link Long#numberOfLeadingZeros(long)} or
 * {@link Long#numberOfTrailingZeros(long)} depending on
 * {@link ByteOrder#nativeOrder()}.
 */
private static final MethodHandle LEADING_ZEROS;

/**
 * Populated from a reflected read of
 * {@code sun.misc.Unsafe.ARRAY_BYTE_BASE_OFFSET}.
 */
private static final long ARRAY_BASE_OFFSET;

/**
 * {@code MethodHandle} for a jdk 9+ {@code byteArrayViewVarHandle}
 * for {@code long[]} using the {@link ByteOrder#nativeOrder()}.
 * The method signature is {@code long get(byte[], int)}.
 */
private static final MethodHandle VAR_HANDLE_GET_LONG;

static {
final Logger logger = Logger.getLogger(ArrayUtil.class.getName());
MethodHandle leadingZeros = null;
MethodHandle varHandleGetLong = null;
MethodHandle unsafeGetLong = null;
long arrayBaseOffset = 0;
MethodHandle mismatch = null;
final MethodHandles.Lookup lookup = MethodHandles.lookup();
final MethodType mismatchType = MethodType.methodType(
        int.class, byte[].class, int.class,
        byte[].class, int.class, int.class);
try {
    // getLong interprets bytes in platform byte order. The
    // "leading zeros" of a difference correspond to byte
    // encounter order only on big-endian platforms; on
    // little-endian platforms the trailing zeros give the
    // encounter-order result.
    leadingZeros = lookup.findStatic(Long.class,
            ByteOrder.BIG_ENDIAN == ByteOrder.nativeOrder()
                    ? "numberOfLeadingZeros"
                    : "numberOfTrailingZeros",
            MethodType.methodType(int.class, long.class));

    // first try to load byteArrayViewVarHandle for a long[]
    try {
        final Class varHandleClazz = Class.forName(
                "java.lang.invoke.VarHandle", true, null);
        final Method byteArrayViewHandle =
                MethodHandles.class.getDeclaredMethod(
                        "byteArrayViewVarHandle",
                        new Class[] {Class.class, ByteOrder.class});
        final Object varHandle = byteArrayViewHandle.invoke(
                null, long[].class, ByteOrder.nativeOrder());
        final Class accessModeEnum = Class.forName(
                "java.lang.invoke.VarHandle$AccessMode", true, null);
@SuppressWarnings({ 

Re: [xz-devel] xz-java and newer java

2021-01-13 Thread Brett Okken
I have continued to refine the changes around the array comparisons
and think I am pretty well done there.

I did a small benchmark measuring the time to compress 3 different
files using new XZOutputStream(cos, new LZMA2Options()), where cos was
an OutputStream which simply calculated the CRC32 of the content
written.

The ihe_ovly_pr.dcm file is a ~66KB binary file.
The image1.dcm file is a ~26MB binary file with some metadata wrapping
a grayscale bmp.
The large.xml file is a ~51MB formatted utf-8 xml file.

1.8
Benchmark (file)  Mode  Cnt Score
   Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt5 9.575
±   0.322  ms/op
XZCompressionBenchmark.compress   image1.dcm  avgt5  6767.376
± 675.373  ms/op
XZCompressionBenchmark.compresslarge.xml  avgt5  9680.569
± 207.521  ms/op

1.9-SNAPSHOT
Benchmark (file)  Mode  Cnt Score
  Error  Units
XZCompressionBenchmark.compress  ihe_ovly_pr.dcm  avgt5 7.888
±  0.443  ms/op
XZCompressionBenchmark.compress   image1.dcm  avgt5  6144.748
± 71.551  ms/op
XZCompressionBenchmark.compresslarge.xml  avgt5  7904.019
± 26.316  ms/op

These results are from 64 bit windows using open jdk 11.0.6.
The results are 9-18% faster across the 3 different file types.

I made a small change to ArrayUtil from what I sent previously. When
everything matches, it now returns length rather than -1.

The remaining changes are:

BT4:
in getMatches()

if (matches.count > 0) {
while (lenBest < matchLenLimit && buf[readPos + lenBest - delta2]
  == buf[readPos + lenBest])
++lenBest;

Changes to:
if (matches.count > 0) {
lenBest += ArrayUtil.mismatch(buf, readPos + lenBest -
delta2, buf, readPos + lenBest, matchLenLimit - lenBest);

Further down, inside a while(true) block,

if (buf[readPos + len - delta] == buf[readPos + len]) {
while (++len < matchLenLimit)
if (buf[readPos + len - delta] != buf[readPos + len])
break;

Changes to:
int mismatch = ArrayUtil.mismatch(buf, readPos + len -
delta, buf, readPos + len, matchLenLimit - len);
if (mismatch != 0) {
len += mismatch;

in skip(int, int)

if (buf[readPos + len - delta] == buf[readPos + len]) {
// No need to look for longer matches than niceLenLimit
// because we only are updating the tree, not returning
// matches found to the caller.
do {
if (++len == niceLenLimit) {
tree[ptr1] = tree[pair];
tree[ptr0] = tree[pair + 1];
return;
}
} while (buf[readPos + len - delta] == buf[readPos + len]);
}

Changes to:
// No need to look for longer matches than niceLenLimit
// because we only are updating the tree, not returning
// matches found to the caller.
int mismatch = ArrayUtil.mismatch(buf, readPos + len -
delta, buf, readPos + len, niceLenLimit);
if (mismatch == niceLenLimit) {
tree[ptr1] = tree[pair];
tree[ptr0] = tree[pair + 1];
return;
}
len += mismatch;


In HC4, both changes are in getMatches()

if (matches.count > 0) {
while (lenBest < matchLenLimit && buf[readPos + lenBest - delta2]
  == buf[readPos + lenBest])
++lenBest;

Changes to:
if (matches.count > 0) {
lenBest += ArrayUtil.mismatch(buf, readPos + lenBest -
delta2, buf, readPos + lenBest, matchLenLimit - lenBest);

And:

// Test the first byte and the first new byte that would give us
// a match that is at least one byte longer than lenBest. This
// too short matches get quickly skipped.
if (buf[readPos + lenBest - delta] == buf[readPos + lenBest]
&& buf[readPos - delta] == buf[readPos]) {
// Calculate the length of the match.
int len = 0;
while (++len < matchLenLimit)
if (buf[readPos + len - delta] != buf[readPos + len])
break;

// Use the match if and only if it is better than the longest
// match found so far.
if (len > lenBest) {
lenBest = len;
matches.len[matches.count] = len;
matches.dist[matches.count] = delta - 1;
++matches.count;

// Return if it is long enough (niceLen or reached the
// end of the dictionary).
if (len >= niceLenLimit)
  

Re: [xz-devel] xz-java and newer java

2021-01-12 Thread Brett Okken
It turns out that reading the longs in native byte order provides
noticeable improvement.
I did find a cost overhead of ~1 ns/op from using an
interface/implementation to switch behavior when Unsafe could not be
loaded. That cost goes away by using java.lang.invoke.MethodHandle.
So here is an updated jdk 7 compatible ArrayUtil implementation which
matches current performance if the first byte does not match and is
faster in every other scenario. The gap in performance grows as more
bytes actually match. At 2 bytes, it takes roughly half the time. At
97 bytes, it takes less than ten percent of the time.
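The native-byte-order trick works because, on a little-endian platform, XORing the two words and counting trailing zero bits locates the first differing byte in memory (encounter) order. A sketch under that little-endian assumption:

```java
public class DiffByteDemo {
    // Little-endian assumption: the first byte in memory order is the
    // least significant byte of the long, so trailing zeros of the XOR
    // give the bit offset of the first difference; >>> 3 converts bits
    // to bytes. On big-endian, numberOfLeadingZeros would be used.
    static int firstDiffByte(long a, long b) {
        return Long.numberOfTrailingZeros(a ^ b) >>> 3;
    }

    public static void main(String[] args) {
        long a = 0x1122334455667788L;
        long b = 0x1122FF4455667788L; // differs in byte 5 (LE order)
        System.out.println(firstDiffByte(a, b)); // 5
        // Equal words give 64 trailing zeros -> 8, i.e. "no mismatch
        // within this word".
        System.out.println(firstDiffByte(a, a)); // 8
    }
}
```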

Here are the benchmark results:

Benchmark(length)  Mode  Cnt
Score   Error  Units
ArrayMismatchBenchmark.comparerMismatch_nomatch 0  avgt5
4.487 ± 0.059  ns/op
ArrayMismatchBenchmark.comparerMismatch_nomatch 1  avgt5
4.515 ± 0.102  ns/op
ArrayMismatchBenchmark.comparerMismatch_nomatch 2  avgt5
4.523 ± 0.023  ns/op
ArrayMismatchBenchmark.comparerMismatch_nomatch 7  avgt5
5.164 ± 0.098  ns/op
ArrayMismatchBenchmark.comparerMismatch_nomatch13  avgt5
5.748 ± 0.974  ns/op
ArrayMismatchBenchmark.comparerMismatch_nomatch57  avgt5
10.060 ± 1.135  ns/op
ArrayMismatchBenchmark.comparerMismatch_nomatch97  avgt5
11.518 ± 0.418  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch   0  avgt5
3.259 ± 0.069  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch   1  avgt5
5.712 ± 0.070  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch   2  avgt5
6.017 ± 0.300  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch   7  avgt5
12.949 ± 0.163  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch  13  avgt5
18.696 ± 0.551  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch  57  avgt5
43.232 ± 1.015  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch  97  avgt5
90.599 ± 0.794  ns/op
ArrayMismatchBenchmark.unsafeMisMatch_nomatch   0  avgt5
3.246 ± 0.138  ns/op
ArrayMismatchBenchmark.unsafeMisMatch_nomatch   1  avgt5
3.225 ± 0.042  ns/op
ArrayMismatchBenchmark.unsafeMisMatch_nomatch   2  avgt5
3.242 ± 0.043  ns/op
ArrayMismatchBenchmark.unsafeMisMatch_nomatch   7  avgt5
3.244 ± 0.048  ns/op
ArrayMismatchBenchmark.unsafeMisMatch_nomatch  13  avgt5
3.477 ± 0.028  ns/op
ArrayMismatchBenchmark.unsafeMisMatch_nomatch  57  avgt5
5.968 ± 0.553  ns/op
ArrayMismatchBenchmark.unsafeMisMatch_nomatch  97  avgt5
7.182 ± 0.080  ns/op
ArrayMismatchBenchmark.utilMismatch_nomatch 0  avgt5
3.219 ± 0.044  ns/op
ArrayMismatchBenchmark.utilMismatch_nomatch 1  avgt5
3.217 ± 0.054  ns/op
ArrayMismatchBenchmark.utilMismatch_nomatch 2  avgt5
3.217 ± 0.069  ns/op
ArrayMismatchBenchmark.utilMismatch_nomatch 7  avgt5
3.206 ± 0.047  ns/op
ArrayMismatchBenchmark.utilMismatch_nomatch13  avgt5
3.509 ± 0.218  ns/op
ArrayMismatchBenchmark.utilMismatch_nomatch57  avgt5
5.870 ± 0.063  ns/op
ArrayMismatchBenchmark.utilMismatch_nomatch97  avgt5
7.178 ± 0.267  ns/op

The "comparer" implementation is using interface with different
implementations based on whether Unsafe could be loaded.
The "unsafe" implementation is directly using the Unsafe class.
The "util" implementation is using the ArrayUtil class below.


package org.tukaani.xz.common;

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Constructor;
import java.nio.ByteOrder;

public final class ArrayUtil {

/**
 * MethodHandle to the actual mismatch method to use at runtime.
 */
private static final MethodHandle MISMATCH;

/**
 * If {@code sun.misc.Unsafe} can be loaded, this is a MethodHandle
 * bound to an instance of Unsafe for the method
 * {@code long getLong(Object, long)}.
 */
private static final MethodHandle UNSAFE_GET_LONG;

/**
 * MethodHandle to either {@link Long#numberOfLeadingZeros(long)} or
 * {@link Long#numberOfTrailingZeros(long)} depending on
 * {@link ByteOrder#nativeOrder()}.
 */
private static final MethodHandle LEADING_ZEROS;

/**
 * Populated from a reflected read of
 * {@code sun.misc.Unsafe.ARRAY_BYTE_BASE_OFFSET}.
 */
private static final long ARRAY_BASE_OFFSET;

static {
//try to create an instance using Unsafe
long arrayBaseOffset = 0;
MethodHandle unsafeGetLong = null;
MethodHandle leadingZeros = null;
MethodHandle mismatch = null;
final MethodHandles.Lookup lookup = MethodHandles.lookup();
final MethodType mismatchType = MethodType.methodType(
        int.class, byte[].class, int.class,
        byte[].class, int.class, int.class);
try {
Class unsafeClazz = Class.forName("sun.misc.Unsafe", 

Re: [xz-devel] xz-java and newer java

2021-01-11 Thread Brett Okken
I threw together a quick jmh test, and there is no value in the
changes to Hash234.

For the array mismatch, the results are kind of interesting. My
observation, stepping through some compression uses, is that the
comparison length is typically 100-200 bytes, but the actual match
length is typically fairly short. This is obviously going to be highly
dependent on the data; I was using raw image data for observation, and
content like xml or json might have longer matches. So I set up a
benchmark which always compares 128 bytes, with the mismatch occurring
after various "lengths":

Benchmark  (length)  Mode  Cnt
Score   Error  Units
ArrayMismatchBenchmark.legacyMismatch_nomatch 0  avgt5
3.198 ± 0.168  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch 1  avgt5
5.607 ± 0.048  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch 2  avgt5
5.852 ± 0.053  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch 7  avgt5
12.703 ± 0.350  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch13  avgt5
18.275 ± 0.228  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch57  avgt5
42.313 ± 0.450  ns/op
ArrayMismatchBenchmark.legacyMismatch_nomatch97  avgt5
89.410 ± 2.927  ns/op
ArrayMismatchBenchmark.arraysMismatch_nomatch 0  avgt5
4.629 ± 0.035  ns/op
ArrayMismatchBenchmark.arraysMismatch_nomatch 1  avgt5
9.515 ± 0.096  ns/op
ArrayMismatchBenchmark.arraysMismatch_nomatch 2  avgt5
9.526 ± 0.132  ns/op
ArrayMismatchBenchmark.arraysMismatch_nomatch 7  avgt5
9.581 ± 0.395  ns/op
ArrayMismatchBenchmark.arraysMismatch_nomatch13  avgt5
9.781 ± 0.133  ns/op
ArrayMismatchBenchmark.arraysMismatch_nomatch57  avgt5
9.846 ± 0.182  ns/op
ArrayMismatchBenchmark.arraysMismatch_nomatch97  avgt5
10.809 ± 0.307  ns/op
ArrayMismatchBenchmark.intMismatch_nomatch0  avgt5
3.417 ± 0.018  ns/op
ArrayMismatchBenchmark.intMismatch_nomatch1  avgt5
3.412 ± 0.011  ns/op
ArrayMismatchBenchmark.intMismatch_nomatch2  avgt5
3.414 ± 0.032  ns/op
ArrayMismatchBenchmark.intMismatch_nomatch7  avgt5
5.401 ± 0.207  ns/op
ArrayMismatchBenchmark.intMismatch_nomatch   13  avgt5
8.311 ± 0.070  ns/op
ArrayMismatchBenchmark.intMismatch_nomatch   57  avgt5
20.536 ± 0.556  ns/op
ArrayMismatchBenchmark.intMismatch_nomatch   97  avgt5
30.969 ± 0.318  ns/op
ArrayMismatchBenchmark.longMismatch_nomatch   0  avgt5
4.399 ± 0.082  ns/op
ArrayMismatchBenchmark.longMismatch_nomatch   1  avgt5
4.390 ± 0.068  ns/op
ArrayMismatchBenchmark.longMismatch_nomatch   2  avgt5
4.398 ± 0.033  ns/op
ArrayMismatchBenchmark.longMismatch_nomatch   7  avgt5
4.403 ± 0.110  ns/op
ArrayMismatchBenchmark.longMismatch_nomatch  13  avgt5
6.564 ± 0.398  ns/op
ArrayMismatchBenchmark.longMismatch_nomatch  57  avgt5
11.548 ± 0.331  ns/op
ArrayMismatchBenchmark.longMismatch_nomatch  97  avgt5
16.335 ± 0.119  ns/op

I labeled the current behavior as "legacy".
The Arrays.mismatch is significantly slower when the mismatch occurs
early in the array and significantly faster when the mismatch occurs
later.
Comparing an int (4 bytes) at a time is a clear winner if the mismatch
occurs in those 4 bytes, which appeared to be 90+% of the calls I
observed.
Comparing a long (8 bytes) at a time is faster than the current
behavior unless it is the first byte which does not match, but slower
than comparing ints if the mismatch occurs in the first 4 bytes.

I wrote this test using jdk 9 VarHandle to read the ints and longs
from the byte[], but the same thing can be achieved using
sun.misc.Unsafe. I will add that as a case in the benchmark, but it is
expected to be similar to VarHandle (maybe slightly faster).

Brett

On Mon, Jan 11, 2021 at 10:04 AM Lasse Collin  wrote:
>
> On 2021-01-09 Brett Okken wrote:
> > This would seem to be a potential candidate for a multi-release
> > jar[1], if you can figure out a reasonable way to get a build system
> > to generate one.
>
> I suppose it can be done. The build system uses Apache Ant. From some
> sources I've understood that there are more modern alternatives but I
> haven't had any interest or energy to learn more as Ant seems to still
> work OK.
>
> > The 4 uses I found of comparing byte[] could be refactored to call a
> > new utility class to do the comparison. The "regular" implementation
> > could be java 7 compatible, and the jdk 9 version would be in the
> > META_INF folder.
> > Even for the java 7 compatible version, it might be worth exploring
> > how much improvement would come from using Unsafe to read int or long
> > values from the byte[] and compare those.
> >
> > For Hash234, I would think the whole class could be handled for the
>
> All these sound like worth checking out.
>
> On 

Re: [xz-devel] xz-java and newer java

2021-01-09 Thread Brett Okken
This would seem to be a potential candidate for a multi-release
jar[1], if you can figure out a reasonable way to get a build system
to generate one.

The 4 uses I found of comparing byte[] could be refactored to call a
new utility class to do the comparison. The "regular" implementation
could be java 7 compatible, and the jdk 9 version would be in the
META-INF folder.
Even for the java 7 compatible version, it might be worth exploring
how much improvement would come from using Unsafe to read int or long
values from the byte[] and compare those.

For Hash234, I would think the whole class could be handled for the MR jar.

[1] - https://openjdk.java.net/jeps/238

Thanks,

Brett

On Fri, Jan 8, 2021 at 1:36 PM Lasse Collin  wrote:
>
> On 2021-01-08 Brett Okken wrote:
> > Are there any plans to update xz-java to take advantage of newer
> > features in jdk 9+?
>
> There aren't much plans at all. Adding module-info.java is likely to
> happen in the next release, whenever that will be.
>
> Apache Commons Compress 1.20 requires Java 7. It depends on XZ for
> Java. I think it wouldn't be good to make XZ for Java require a newer
> Java version than Commons Compress but it could be discussed with
> Commons Compress developers. There's a bug with .7z files that requires
> changing both XZ for Java and Commons Compress so I could ask about the
> Java version too.
>
> > For example, Arrays.mismatch[1] leverages vectorized comparisons of 2
> > byte[]. This could be leveraged in the getMatches methods of BT4 and
> > HC4 as well as the 2 getMatchLen methods in LZEncoder.
> >
> > Another example would be to use a VarHandle to read int values out of
> > a byte[][2], which would be useful for the calcHashes method in
> > Hash234.
>
> Thanks! These sound interesting. If they make big enough difference, it
> could be a good reason to require Java 9.
>
> I will need to check out the LZDecoder improvement from the other
> message too, and perhaps a few variations of it. Thanks!
>
> There are multiple things in XZ Utils that I try to look at in the near
> future so it will be a while until I will play with the Java code.
>
> --
> Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode
>



Re: [xz-devel] xz-java and newer java

2021-01-08 Thread Lasse Collin
On 2021-01-08 Brett Okken wrote:
> Are there any plans to update xz-java to take advantage of newer
> features in jdk 9+?

There aren't much plans at all. Adding module-info.java is likely to
happen in the next release, whenever that will be.

Apache Commons Compress 1.20 requires Java 7. It depends on XZ for
Java. I think it wouldn't be good to make XZ for Java require a newer
Java version than Commons Compress but it could be discussed with
Commons Compress developers. There's a bug with .7z files that requires
changing both XZ for Java and Commons Compress so I could ask about the
Java version too.

> For example, Arrays.mismatch[1] leverages vectorized comparisons of 2
> byte[]. This could be leveraged in the getMatches methods of BT4 and
> HC4 as well as the 2 getMatchLen methods in LZEncoder.
> 
> Another example would be to use a VarHandle to read int values out of
> a byte[][2], which would be useful for the calcHashes method in
> Hash234.

Thanks! These sound interesting. If they make big enough difference, it
could be a good reason to require Java 9.
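The VarHandle read Brett suggests for calcHashes looks roughly like this (a sketch, not the actual change; the byteArrayViewVarHandle Javadoc documents that plain get/set support unaligned offsets on heap byte arrays):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class IntViewDemo {
    // A byte[] view VarHandle that reads an int in little-endian
    // order; plain get works at any offset, aligned or not.
    private static final VarHandle INT_LE = MethodHandles
            .byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    static int readIntLE(byte[] buf, int off) {
        return (int) INT_LE.get(buf, off);
    }

    public static void main(String[] args) {
        byte[] b = {1, 0, 0, 0, 2};
        System.out.println(readIntLE(b, 0)); // 1
        System.out.println(readIntLE(b, 1)); // 33554432 (0x02000000)
    }
}
```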

I will need to check out the LZDecoder improvement from the other
message too, and perhaps a few variations of it. Thanks!

There are multiple things in XZ Utils that I try to look at in the near
future so it will be a while until I will play with the Java code.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode