Re: [xz-devel] Question about using Java API for geospatial data

Lasse Collin Sun, 10 Jul 2022 11:00:27 -0700

On 2022-07-09 Gary Lucas wrote:
> I am using the library to compress a public-domain data product called
> ETOPO1. ETOPO1 provides a global-scale grid of 233 million elevation
> and ocean depth samples as integer meters. My implementation
> compresses the data in separate blocks of about 20 thousand values
> each.


So that is about 12 thousand blocks?

> Previously, I used Huffman coding and Deflate to reduce the size
> of the data to about 4.39 bits per value. With your library, LZMA
> reduces that to 4.14 bits per value and XZ to 4.16.

Is the compressed size of each block about ten kilobytes?
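
As a rough check of the two estimates above, the rounded figures quoted
in this thread can be plugged into a few lines of Java. This is only a
sketch: the class and method names are made up for illustration, and
the inputs are the approximate numbers from the message, not
measurements.

```java
// Back-of-the-envelope check of the block count and per-block size.
// Inputs are the rounded figures quoted in the thread.
class BlockEstimate {
    // Number of blocks needed to hold totalValues, rounding up.
    static long blockCount(long totalValues, long valuesPerBlock) {
        return (totalValues + valuesPerBlock - 1) / valuesPerBlock;
    }

    // Average compressed bytes per block at a given bits-per-value rate.
    static double compressedBytesPerBlock(long valuesPerBlock, double bitsPerValue) {
        return valuesPerBlock * bitsPerValue / 8.0;
    }

    public static void main(String[] args) {
        long blocks = blockCount(233_000_000L, 20_000L);
        double lzmaBytes = compressedBytesPerBlock(20_000L, 4.14);
        double xzBytes = compressedBytesPerBlock(20_000L, 4.16);
        System.out.println("blocks: " + blocks);
        System.out.printf("LZMA bytes per block: %.0f%n", lzmaBytes);
        System.out.printf("XZ bytes per block:   %.0f%n", xzBytes);
    }
}
```

With these inputs the count comes out a little under 12 thousand
blocks, each a bit over ten thousand bytes when compressed.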

> The original implementation requires an average of 4.8 seconds to
> decompress the full set of 233 million points.  The LZMA version
> requires 15.2 seconds, and the XZ version requires 18.9 seconds.

The Deflate implementation in java.util.zip uses zlib (native code). XZ
for Java is pure Java. LZMA is significantly slower than Deflate and
being pure Java makes the difference even bigger.

> My understanding is that XZ should perform better than LZMA. Since
> that is not the case, could there be something suboptimal with the way
> my code uses the API?

The core compression code is the same in both: XZ uses LZMA2 which is
LZMA with framing. XZ adds a few features like filters, integrity
checking, and block-based random access reading.

> And here are the Code Snippets:

The XZ examples don't use XZ for Java directly. This is clear from the
"Xz" vs. "XZ" difference in the class names and from the fact that
XZOutputStream has no constructor that takes the input size as an
argument.

Non-performance notes:

  - Section "When uncompressed size is known beforehand" in the
    XZInputStream documentation is worth reading. Basically, add a
    check at the end that "xzIn.read() == -1" is true; that extra read
    is what makes the decoder verify the integrity check. This at
    least used to be true (I haven't tested recently) for
    GZIPInputStream too.

  - When compressing, .finish() is redundant. .close() will do it
    anyway.

  - If XZ data is embedded inside another file format, you may want
    to use SingleXZInputStream instead of XZInputStream. XZInputStream
    supports concatenated streams that are possible on standalone .xz
    files but probably shouldn't occur when embedded inside another
    format. In your case this likely makes no difference in practice.
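
The first note above (read one more byte after the known-size payload)
can be tried out with java.util.zip alone, since the message says
GZIPInputStream behaves, or at least used to behave, the same way. The
following is a minimal sketch under that assumption; the class and
helper names are hypothetical, and the same pattern would apply to an
XZInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class KnownSizeRead {
    // Compress a buffer to an in-memory gzip stream.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // close() (via try-with-resources) finishes the stream, so no
        // separate finish() call is needed.
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Read exactly `size` bytes, then require end of stream.
    static byte[] readKnownSize(InputStream in, int size) throws IOException {
        byte[] buf = new byte[size];
        int off = 0;
        while (off < size) {
            int n = in.read(buf, off, size - off);
            if (n == -1) {
                throw new EOFException("stream shorter than expected");
            }
            off += n;
        }
        // The extra read lets the decoder reach the end of the stream,
        // where the stored check value is compared.
        if (in.read() != -1) {
            throw new IOException("stream longer than expected");
        }
        return buf;
    }

    // True if the stream decodes to `size` bytes and passes the end check.
    static boolean decodesCleanly(byte[] packed, int size) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            readKnownSize(in, size);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] block = new byte[40_000]; // hypothetical block contents
        for (int i = 0; i < block.length; i++) {
            block[i] = (byte) (i * 31);
        }
        byte[] packed = gzip(block);
        System.out.println("intact stream ok:  " + decodesCleanly(packed, block.length));

        byte[] damaged = packed.clone();
        damaged[damaged.length - 8] ^= 0x01; // flip a bit in the stored CRC32
        System.out.println("damaged stream ok: " + decodesCleanly(damaged, block.length));
    }
}
```

Without the final read() call, a reader that stops as soon as it has
the expected number of bytes may never reach the trailer, so damage to
the stored check value could go unnoticed.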

Might affect performance:

  - The default LZMA2 dictionary size is 8 MiB. If the uncompressed
    size is known to be much smaller than this, it's a waste of memory
    to use such a big dictionary. In that case pick a value that is at
    least as big as the largest uncompressed size, possibly rounded up
    to a 2^n value.

  - Compressing or decompressing multiple streams that use identical
    settings means creating many compressor or decompressor instances.
    To reduce garbage collector pressure there is ArrayCache which
    reuses large array allocations. You can enable this globally with
    this:

        ArrayCache.setDefaultCache(BasicArrayCache.getInstance());

    However, setting the default like this might not be desired if
    multiple unrelated things in the application might use XZ for Java.

    Note that ArrayCache can help both LZMA and XZ classes.
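
The dictionary size advice in the first item above (at least the
largest uncompressed size, possibly rounded up to 2^n) can be sketched
as a small helper. The names are hypothetical, the 4 KiB floor is
assumed to match LZMA2Options.DICT_SIZE_MIN, and the 2-byte-per-value
block layout in main() is only a guess at how the data might be stored.
The result would be what gets passed to LZMA2Options.setDictSize().

```java
class DictSize {
    // Smallest LZMA2 dictionary size, assumed here to be 4 KiB
    // (believed to match LZMA2Options.DICT_SIZE_MIN in XZ for Java).
    static final int ASSUMED_DICT_SIZE_MIN = 4096;

    // Round the largest uncompressed block size up to a power of two,
    // never going below the assumed minimum.
    static int suggestDictSize(int maxUncompressedSize) {
        if (maxUncompressedSize < 0 || maxUncompressedSize > (1 << 30)) {
            throw new IllegalArgumentException(
                    "size out of range: " + maxUncompressedSize);
        }
        if (maxUncompressedSize <= ASSUMED_DICT_SIZE_MIN) {
            return ASSUMED_DICT_SIZE_MIN;
        }
        // For n > 1, highestOneBit(n - 1) << 1 is the smallest power
        // of two that is >= n.
        return Integer.highestOneBit(maxUncompressedSize - 1) << 1;
    }

    public static void main(String[] args) {
        // Hypothetical layout: 20,000 values stored as 2-byte integers.
        int blockBytes = 20_000 * 2;
        System.out.println("suggested dictionary size: "
                + suggestDictSize(blockBytes) + " bytes");
    }
}
```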

Likely will affect performance:

  - Since the compression ratio is high, the integrity checking starts
    to become more significant for performance. To test how much
    integrity checking slows XZ down, use the SingleXZInputStream or
    XZInputStream constructor that takes "boolean verifyCheck" and set
    it to false.

    You can also compress to XZ without integrity checking at all
    (using XZ.CHECK_NONE as the third argument in the XZOutputStream
    constructor). Using XZ.CHECK_CRC32 is likely much faster than the
    default XZ.CHECK_CRC64 because CRC32 comes from java.util.zip which
    uses native code from zlib.
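
For reference, the java.util.zip.CRC32 class mentioned above can also
be called directly, for example to time it in isolation on a
block-sized buffer. A minimal sketch follows; the wrapper class and
method are hypothetical, and the input is the usual "123456789" test
string.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class Crc32Demo {
    // Compute the CRC32 of a whole buffer using java.util.zip.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sample = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Long.toHexString(crc32(sample)));
    }
}
```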

It's quite possible that XZ provides no value over raw LZMA in this
application, especially if you don't need integrity checking. Raw LZMA
instead of .lzma will even avoid the 13-byte .lzma header, saving about
150 kilobytes with 12 thousand blocks. If the uncompressed size is
stored in the container headers then a further 4-5 bytes per block can
be saved by telling the size to the raw LZMA encoder and decoder.
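
The savings mentioned above are simple multiplication, sketched here
with hypothetical names. The inputs are the 13-byte header, the 4-5
byte range, and the round figure of 12 thousand blocks from this
message.

```java
class HeaderOverhead {
    static final int LZMA_HEADER_BYTES = 13; // .lzma header size

    // Total bytes saved when every block drops bytesPerBlock bytes.
    static long bytesSaved(long blocks, int bytesPerBlock) {
        return blocks * bytesPerBlock;
    }

    public static void main(String[] args) {
        long blocks = 12_000L; // rounded block count
        System.out.println("header bytes saved: "
                + bytesSaved(blocks, LZMA_HEADER_BYTES));
        System.out.println("known-size bytes saved: "
                + bytesSaved(blocks, 4) + " to " + bytesSaved(blocks, 5));
    }
}
```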

Note that LZMAOutputStream and LZMAInputStream support .lzma and raw
LZMA: the choice between these is made by picking the right
constructors.

Finally, it might be worth playing with the lc/lp/pb parameters in
LZMA/LZMA2. Usually those make only a tiny difference but with some data
types they have a bigger effect. These won't affect performance other
than that the smaller the compressed file the faster it tends to
decompress in case of LZMA/LZMA2.
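
One way to experiment with lc/lp/pb is to list every combination and
compress a few sample blocks with each. The sketch below only builds
the list. It assumes the limits used by LZMA2Options (lc + lp at most
4, pb at most 4); the class and constant names are hypothetical. Each
triple would then be applied with LZMA2Options.setLcLp() and setPb()
before compressing.

```java
import java.util.ArrayList;
import java.util.List;

class LzmaParamGrid {
    // Limits assumed from LZMA2Options (LC_LP_MAX and PB_MAX).
    static final int ASSUMED_LC_LP_MAX = 4;
    static final int ASSUMED_PB_MAX = 4;

    // Every {lc, lp, pb} triple within the assumed limits.
    static List<int[]> candidates() {
        List<int[]> out = new ArrayList<>();
        for (int lc = 0; lc <= ASSUMED_LC_LP_MAX; lc++) {
            for (int lp = 0; lp <= ASSUMED_LC_LP_MAX - lc; lp++) {
                for (int pb = 0; pb <= ASSUMED_PB_MAX; pb++) {
                    out.add(new int[] {lc, lp, pb});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> all = candidates();
        System.out.println(all.size() + " combinations to try");
        for (int[] c : all) {
            System.out.println("lc=" + c[0] + " lp=" + c[1] + " pb=" + c[2]);
        }
    }
}
```

The usual defaults (lc=3, lp=0, pb=2) are part of the list, so they can
serve as the baseline when comparing compressed sizes.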

Other compressors might be worth trying too. Zstandard typically
compresses only slightly worse than XZ/LZMA but it is *a lot* faster to
decompress.

-- 
Lasse Collin
