Re: [xz-devel] Question about using Java API for geospatial data

2022-07-11 Thread Gary Lucas
John,

Thanks for the note.  Map projections are one of my personal
interests... though I admit to approaching the topic with rather more
enthusiasm than finesse. I'm going to try to resist the temptation to
get carried away by a discussion of cartography on this data
compression mailing list (well, I'm going to try).  I will say that my
library supports projected coordinate systems and accepts Well-Known
Text (WKT) specifications as metadata.  But, really, my focus is on
developing lossless data compression tools for all sorts of raster
data, including both integer data and real-valued surfaces (fields).
Geophysical information is a natural topic because so much data is so
readily available.  Incidentally, with the addition of LZMA for
high-resolution SRTM elevation data,  compression rates are running
about 1.9 bits per sample (more or less, depending on the terrain).

Gary

P.S. That was an interesting note about UTM being in use for 70 years.
I had no idea...  The math has been around for a long time (I think
Lambert invented it and Gauss made important refinements), but the
transformation is so complex that I wouldn't have expected it to take
hold until well into the computer era.

P.P.S. The original posting in this thread mentioned the FAQ
document. If you'd like to read more about what the GVRS software is
attempting, it's a good place to start:
https://github.com/gwlucastrig/gridfour/wiki/A-GVRS-FAQ


On Sun, Jul 10, 2022 at 10:07 PM John Reiser  wrote:
>
> On 7/10/22, Gary Lucas wrote:
>
> > The other motivation for the block scheme is that the API provides
> > random-access to data. Typically, if one is looking at data for
> > Finland, one usually doesn't care much about the data from
> > Australia. Thus the file is divided into regional blocks. So the choice
> > of block size also reflects the way in which I anticipate applications
> > would use the data.
>
> I hope you are aware that such a system has been in use for about seventy
> years, exactly for this purpose, namely the Universal Transverse Mercator
> system, which uses 60 rectangular zones (some of them slightly overlapping). See
> https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system 
> .
> Most serious public Earth spatial data systems are based on UTM coordinates.
> If your system does not interoperate with UTMs then nobody will talk to you.
>
> Various package delivery corporations have proprietary systems
> that are quite similar.  United Parcel Service (UPS) in North America,
> DHL (a division of Deutsche Post) for much of the world, and others.
> Amazon had a coordinate system that specified points for delivery
> in the first 48 US states using two 16-bit integers.  Google Maps
> identifies locations on Earth using about 7 characters which encode
> a recursive nested quadrant system.
>
> I'm pleased that you consider altitude.  There are places in
> Grand Canyon National Park (Arizona, US) which have the same
> (x, y) coordinates but are several hundred feet apart.
> It takes a few hours to walk from one instance to another "same"
> (x, y) point.  And if you are delivering ice cream cones to
> someone in a tall skyscraper, then it can take minutes
> to travel from the street entrance to an upper floor.
>
> --
>



Re: [xz-devel] Question about using Java API for geospatial data

2022-07-10 Thread John Reiser

On 7/10/22, Gary Lucas wrote:


The other motivation for the block scheme is that the API provides
random-access to data. Typically, if one is looking at data for
Finland, one usually doesn't care much about the data from
Australia. Thus the file is divided into regional blocks. So the choice
of block size also reflects the way in which I anticipate applications
would use the data.


I hope you are aware that such a system has been in use for about seventy years,
exactly for this purpose, namely the Universal Transverse Mercator system, which
uses 60 rectangular zones (some of them slightly overlapping).  See
https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system .
Most serious public Earth spatial data systems are based on UTM coordinates.
If your system does not interoperate with UTMs then nobody will talk to you.

Various package delivery corporations have proprietary systems
that are quite similar.  United Parcel Service (UPS) in North America,
DHL (a division of Deutsche Post) for much of the world, and others.
Amazon had a coordinate system that specified points for delivery
in the first 48 US states using two 16-bit integers.  Google Maps
identifies locations on Earth using about 7 characters which encode
a recursive nested quadrant system.

I'm pleased that you consider altitude.  There are places in
Grand Canyon National Park (Arizona, US) which have the same
(x, y) coordinates but are several hundred feet apart.
It takes a few hours to walk from one instance to another "same"
(x, y) point.  And if you are delivering ice cream cones to
someone in a tall skyscraper, then it can take minutes
to travel from the street entrance to an upper floor.

--



Re: [xz-devel] Question about using Java API for geospatial data

2022-07-10 Thread Gary Lucas
Lasse,

Thanks for the information.  That is just the kind of thing I was
looking for.  I think it will be very helpful.

One thing I'd like to clarify is that I do not consider a decompressor
that takes 18.9 seconds to read 233 million sample values to be slow.
To me, that's a remarkable accomplishment. My simple-minded Huffman
decoder takes 5.93 seconds to read the same number of points and does
not get nearly as good compression ratios as LZMA and XZ.  And 5.93
seconds is the result of a lot of work trying to optimize the code.

>>> So that is about 12 thousand blocks?

Yes.  That's a fair estimate. There are actually 10800 blocks.  Each
covers 2 degrees of latitude and 3 degrees of longitude.  I arrived at
that size specification through trial-and-error. Naturally,
conventional data compressors work better with larger text sizes. So a
larger block size might have advantages because it would contain a
larger symbol set. But, at the same time, a larger block size would
cover a larger area on the face of the Earth and would lead to more
statistical variation (heteroskedasticity) in the data. So the
increase in the entropy of the uncompressed text might lead to worse
results in terms of data compression.  Before the code uses
conventional data compression, it runs a set of predictors over the
data (similar to what the PNG format does).
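
To give the flavor of the idea, here is a minimal sketch of a predictor
followed by conventional compression. This is only an illustration of the
general technique, not the actual Gridfour code; the previous-value
predictor, the zig-zag mapping, and the 4-byte residual packing are
assumptions made up for the example.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.tukaani.xz.LZMA2Options;
    import org.tukaani.xz.LZMAOutputStream;

    // Illustration only: predict each sample from its predecessor, then
    // compress the residuals with a conventional compressor.
    static byte[] compressBlock(int[] samples) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (LZMAOutputStream lzmaOut =
                new LZMAOutputStream(baos, new LZMA2Options(), samples.length * 4L)) {
            int prev = 0;
            for (int s : samples) {
                int delta = s - prev;                   // previous-value predictor
                prev = s;
                int zz = (delta << 1) ^ (delta >> 31);  // zig-zag keeps small +/- deltas small
                lzmaOut.write(zz & 0xFF);               // 4-byte little-endian residual;
                lzmaOut.write((zz >>> 8) & 0xFF);       //   a real codec would pack tighter
                lzmaOut.write((zz >>> 16) & 0xFF);
                lzmaOut.write((zz >>> 24) & 0xFF);
            }
        }
        return baos.toByteArray();
    }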

The other motivation for the block scheme is that the API provides
random-access to data. Typically, if one is looking at data for
Finland, one usually doesn't care much about the data from
Australia. Thus the file is divided into regional blocks. So the choice
of block size also reflects the way in which I anticipate applications
would use the data.

Thanks again for your help.

Gary



Re: [xz-devel] Question about using Java API for geospatial data

2022-07-10 Thread Lasse Collin
On 2022-07-09 Gary Lucas wrote:
> I am using the library to compress a public-domain data product called
> ETOPO1. ETOPO1 provides a global-scale grid of 233 million elevation
> and ocean depth samples as integer meters. My implementation
> compresses the data in separate blocks of about 20 thousand values
> each.

So that is about 12 thousand blocks?

> Previously, I used Huffman coding and Deflate to reduce the size
> of the data to about 4.39 bits per value. With your library, LZMA
> reduces that to 4.14 bits per value and XZ to 4.16.

Is the compressed size of each block about ten kilobytes?

> The original implementation requires an average of 4.8 seconds to
> decompress the full set of 233 million points.  The LZMA version
> requires 15.2 seconds, and the XZ version requires 18.9 seconds.

The Deflate implementation in java.util.zip uses zlib (native code). XZ
for Java is pure Java. LZMA is significantly slower than Deflate and
being pure Java makes the difference even bigger.

> My understanding is that XZ should perform better than LZMA. Since
> that is not the case, could there be something suboptimal with the way
> my code uses the API?

The core compression code is the same in both: XZ uses LZMA2 which is
LZMA with framing. XZ adds a few features like filters, integrity
checking, and block-based random access reading.

> And here are the Code Snippets:

The XZ examples don't use XZ for Java directly. This is clear from the
"Xz" vs. "XZ" difference in the class names and from the fact that
XZOutputStream has no constructor that takes the input size as an argument.

Non-performance notes:

  - Section "When uncompressed size is known beforehand" in
XZInputStream is worth reading. Basically adding a check
that "xzIn.read() == -1" is true at the end to verify the integrity
check. This at least used to be true (I haven't tested recently)
for GZipInputStream too.

  - When compressing, .finish() is redundant. .close() will do it
anyway.

  - If XZ data is embedded inside another file format, you may want
to use SingleXZInputStream instead of XZInputStream. XZInputStream
supports concatenated streams, which are possible in standalone .xz
files but probably shouldn't occur when embedded inside another
format. In your case this likely makes no difference in practice.
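
As a concrete example, here is a minimal decompression sketch with that
end-of-stream check (the method name and the read loop are just for
illustration):

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import org.tukaani.xz.SingleXZInputStream;

    static byte[] decompressBlock(byte[] compressed, int expectedLength)
            throws IOException {
        ByteArrayInputStream bais = new ByteArrayInputStream(compressed);
        try (SingleXZInputStream xzIn = new SingleXZInputStream(bais)) {
            byte[] output = new byte[expectedLength];
            int off = 0;
            while (off < expectedLength) {      // read() may return short counts
                int n = xzIn.read(output, off, expectedLength - off);
                if (n == -1)
                    throw new IOException("Unexpected end of XZ stream");
                off += n;
            }
            if (xzIn.read() != -1)              // reaching EOF verifies the integrity check
                throw new IOException("More data than expected in XZ stream");
            return output;
        }
    }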

Might affect performance:

  - The default LZMA2 dictionary size is 8 MiB. If the uncompressed
size is known to be much smaller than this, it's a waste of memory to
use such a big dictionary. In that case pick a value that is at least
as big as the largest uncompressed size, possibly rounded up to a 2^n
value; see the sketch after this list.

  - Compressing or decompressing many streams that use identical
settings means creating many compressor or decompressor instances.
To reduce garbage collector pressure there is ArrayCache, which
reuses large array allocations. You can enable it globally like
this:

ArrayCache.setDefaultCache(BasicArrayCache.getInstance());

However, setting the default like this might not be desired if
multiple unrelated parts of the application use XZ for Java.

Note that ArrayCache can help both LZMA and XZ classes.
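
For example, a sketch combining both points (64 KiB is an assumed dictionary
size; pick a value that covers your largest block):

    import org.tukaani.xz.ArrayCache;
    import org.tukaani.xz.BasicArrayCache;
    import org.tukaani.xz.LZMA2Options;

    LZMA2Options options = new LZMA2Options();
    // Blocks of roughly 20 thousand samples are far smaller than the default
    // 8 MiB dictionary; a 64 KiB dictionary (assumed value) wastes less memory.
    options.setDictSize(64 * 1024);

    // Reuse large internal arrays across the roughly 12 thousand blocks.
    ArrayCache.setDefaultCache(BasicArrayCache.getInstance());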

Likely will affect performance:

  - Since the compression ratio is high, the integrity checking starts to
become more significant for performance. To test how much integrity
checking slows XZ down, use the SingleXZInputStream or XZInputStream
constructor that takes "boolean verifyCheck" and set it to false.

You can also compress to XZ without integrity checking at all
(using XZ.CHECK_NONE as the third argument in XZOutputStream
constructor). Using XZ.CHECK_CRC32 is likely much faster than the
default XZ.CHECK_CRC64 because CRC32 comes from java.util.zip which
uses native code from zlib.
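
A sketch of both options (baos and bais stand for the byte-array streams
from your snippets; -1 as the memory limit means no limit):

    import org.tukaani.xz.LZMA2Options;
    import org.tukaani.xz.XZ;
    import org.tukaani.xz.XZInputStream;
    import org.tukaani.xz.XZOutputStream;

    // Compress with the cheaper CRC32 check (or XZ.CHECK_NONE for none at all).
    XZOutputStream xzOut = new XZOutputStream(baos, new LZMA2Options(), XZ.CHECK_CRC32);

    // Decompress without verifying the check, to measure its cost.
    // Arguments: input stream, memory limit in KiB (-1 = no limit), verifyCheck.
    XZInputStream xzIn = new XZInputStream(bais, -1, false);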

It's quite possible that XZ provides no value over raw LZMA in this
application, especially if you don't need integrity checking. Raw LZMA
instead of .lzma will even avoid the 13-byte .lzma header, saving about 150
kilobytes across 12 thousand blocks. If the uncompressed size is stored
in the container headers, then a further 4-5 bytes per block can be saved
by telling the size to the raw LZMA encoder and decoder.

Note that LZMAOutputStream and LZMAInputStream support both .lzma and raw
LZMA: the choice between them is made by picking the right
constructors.
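
A rough sketch (check the LZMAOutputStream and LZMAInputStream Javadoc for
the exact raw-stream constructor signatures; uncompressedSize stands for the
value stored in your container, and lc=3, lp=0, pb=2 are the LZMA defaults):

    import org.tukaani.xz.LZMA2Options;
    import org.tukaani.xz.LZMAInputStream;
    import org.tukaani.xz.LZMAOutputStream;

    LZMA2Options options = new LZMA2Options();

    // Raw LZMA: no .lzma header and no end marker. The container must then
    // record the uncompressed size plus the lc/lp/pb and dictionary settings.
    LZMAOutputStream rawOut = new LZMAOutputStream(baos, options, false);

    // Decoding a raw stream needs the same settings back:
    LZMAInputStream rawIn = new LZMAInputStream(
            bais, uncompressedSize, /*lc*/ 3, /*lp*/ 0, /*pb*/ 2, options.getDictSize());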

Finally, it might be worth playing with the lc/lp/pb parameters in
LZMA/LZMA2. Usually they make only a tiny difference, but with some data
types they have a bigger effect. They won't affect performance except
that the smaller the compressed file, the faster it tends to
decompress with LZMA/LZMA2.
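
For instance (the values below are arbitrary examples, assuming the
setLcLp()/setPb() setters of LZMA2Options; the defaults are lc=3, lp=0,
pb=2, and the best combination has to be found by experiment):

    import org.tukaani.xz.LZMA2Options;

    LZMA2Options options = new LZMA2Options();
    options.setLcLp(0, 2);   // literal context bits = 0, literal position bits = 2
    options.setPb(0);        // position bits = 0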

Other compressors might be worth trying too. Zstandard typically
compresses only slightly worse than XZ/LZMA but it is *a lot* faster to
decompress.

-- 
Lasse Collin



Re: [xz-devel] Question about using Java API for geospatial data

2022-07-10 Thread Gary Lucas
>>> I am not certain which statement you believe is not authoritative

Sorry.  I didn't mean to challenge the Javadoc... I meant that I
was not sure I looked in the right place.

On Sun, Jul 10, 2022 at 9:10 AM Brett Okken  wrote:
>
> > I'm not sure that this is authoritative. The Java API documentation
> > says that it "aims" to provide "Full support for the .xz file format
> > specification version 1.0.4"
>
> I am not certain which statement you believe is not authoritative.
> There are existing constructors (such as[1]) which allow the disabling of 
> checksum verification.
>
> [1] -
> https://tukaani.org/xz/xz-javadoc/org/tukaani/xz/XZInputStream.html#%3Cinit%3E(java.io.InputStream,int,boolean)



Re: [xz-devel] Question about using Java API for geospatial data

2022-07-10 Thread Brett Okken
> I'm not sure that this is authoritative. The Java API documentation
> says that it "aims" to provide "Full support for the .xz file format
> specification version 1.0.4"

I am not certain which statement you believe is not authoritative.
There are existing constructors (such as[1]) which allow the disabling of
checksum verification.

[1] -
https://tukaani.org/xz/xz-javadoc/org/tukaani/xz/XZInputStream.html#%3Cinit%3E(java.io.InputStream,int,boolean)


Re: [xz-devel] Question about using Java API for geospatial data

2022-07-10 Thread Gary Lucas
Hi Brett,

I'm not sure that this is authoritative. The Java API documentation
says that it "aims" to provide "Full support for the .xz file format
specification version 1.0.4"

I am using the latest release of the Java library, version 1.9.

Gary

On Sat, Jul 9, 2022 at 7:58 PM Brett Okken  wrote:
>
> What version of xz are you using?
>
> The differences between xz and lzma are a bit more involved. One such example 
> is that xz is a framed format which includes checksums on each “frame”. I 
> would not expect checksum verification to account for all of that difference, 
> but it can be disabled to confirm.
>
> On Sat, Jul 9, 2022 at 6:31 AM Gary Lucas  wrote:
>>
>> Hi,
>>
>> Would anyone be able to confirm that I am using the Java library
>> xz-java-1.9.zip correctly? If not, could you suggest a better way to
>> use it? Code snippets are included below.
>>
>> I am using the library to compress a public-domain data product called
>> ETOPO1. ETOPO1 provides a global-scale grid of 233 million elevation
>> and ocean depth samples as integer meters. My implementation
>> compresses the data in separate blocks of about 20 thousand values
>> each. Previously, I used Huffman coding and Deflate to reduce the size
>> of the data to about 4.39 bits per value. With your library, LZMA
>> reduces that to 4.14 bits per value and XZ to 4.16. So both techniques
>> represent a substantial improvement in compression compared to the
>> Huffman/Deflate methods. That improvement comes with a reasonable
>> cost. Decompression using LZMA and XZ is slower than Huffman/Deflate.
>> The original implementation requires an average of 4.8 seconds to
>> decompress the full set of 233 million points.  The LZMA version
>> requires 15.2 seconds, and the XZ version requires 18.9 seconds.
>>
>> My understanding is that XZ should perform better than LZMA. Since
>> that is not the case, could there be something suboptimal with the way
>> my code uses the API?
>>
>> If you would like more detail about the implementation, please visit
>>
>> Compression Algorithms for Raster Data:
>> https://gwlucastrig.github.io/GridfourDocs/notes/GridfourDataCompressionAlgorithms.html
>> Compression using Lagrange Multipliers for Optimal Predictors:
>> https://gwlucastrig.github.io/GridfourDocs/notes/CompressionUsingOptimalPredictors.html
>> GVRS Frequently asked Questions (FAQ):
>> https://github.com/gwlucastrig/gridfour/wiki/A-GVRS-FAQ
>>
>> Thank you for your great data compression library.
>>
>> Gary
>>
>> And here are the Code Snippets:
>>
>> The Gridfour Virtual Raster Store (GVRS) is a wrapper format that
>> stores separate blocks of compressed data to provide random-access by
>> application code
>>
>> LZMA --
>> // byte [] input is input data
>> ByteArrayOutputStream baos = new  ByteArrayOutputStream();
>> lzmaOut = new LZMAOutputStream(baos, new LZMA2Options(), 
>> input.length);
>> lzmaOut.write(input, 0, input.length);
>> lzmaOut.finish();
>> lzmaOut.close();
>> return baos.toByteArray();   // return byte[] which is stored to file
>>
>>
>> // reading the compressed data:
>> ByteArrayInputStream bais = new
>> ByteArrayInputStream(compressedInput, 0, compressedInput.length);
>> LZMAInputStream lzmaIn = new LZMAInputStream(bais);
>> byte[] output = new byte[expectedOutputLength];
>> lzmaIn.read(output, 0, output.length);
>>
>>
>> XZ 
>> // byte [] input is input data
>> ByteArrayOutputStream baos = new  ByteArrayOutputStream();
>> xzOut = new XzOutputStream(baos, new LZMA2Options(), input.length);
>> xzOut.write(input, 0, input.length);
>> xzOut.finish();
>> xzOut.close();
>> return baos.toByteArray();   // return byte[] which is stored to file
>>
>>// reading the compressed data:
>>ByteArrayInputStream bais = new
>> ByteArrayInputStream(compressedInput, 0, compressedInput.length);
>> XzInputStream xzIn = new XzInputStream(bais);
>> byte[] output = new byte[expectedOutputLength];
>> xzIn.read(output, 0, output.length);
>>



Re: [xz-devel] Question about using Java API for geospatial data

2022-07-09 Thread Brett Okken
What version of xz are you using?

The differences between xz and lzma are a bit more involved. One such
example is that xz is a framed format which includes checksums on each
“frame”. I would not expect checksum verification to account for all of
that difference, but it can be disabled to confirm.

On Sat, Jul 9, 2022 at 6:31 AM Gary Lucas  wrote:

> Hi,
>
> Would anyone be able to confirm that I am using the Java library
> xz-java-1.9.zip correctly? If not, could you suggest a better way to
> use it? Code snippets are included below.
>
> I am using the library to compress a public-domain data product called
> ETOPO1. ETOPO1 provides a global-scale grid of 233 million elevation
> and ocean depth samples as integer meters. My implementation
> compresses the data in separate blocks of about 20 thousand values
> each. Previously, I used Huffman coding and Deflate to reduce the size
> of the data to about 4.39 bits per value. With your library, LZMA
> reduces that to 4.14 bits per value and XZ to 4.16. So both techniques
> represent a substantial improvement in compression compared to the
> Huffman/Deflate methods. That improvement comes with a reasonable
> cost. Decompression using LZMA and XZ is slower than Huffman/Deflate.
> The original implementation requires an average of 4.8 seconds to
> decompress the full set of 233 million points.  The LZMA version
> requires 15.2 seconds, and the XZ version requires 18.9 seconds.
>
> My understanding is that XZ should perform better than LZMA. Since
> that is not the case, could there be something suboptimal with the way
> my code uses the API?
>
> If you would like more detail about the implementation, please visit
>
> Compression Algorithms for Raster Data:
>
> https://gwlucastrig.github.io/GridfourDocs/notes/GridfourDataCompressionAlgorithms.html
> Compression using Lagrange Multipliers for Optimal Predictors:
>
> https://gwlucastrig.github.io/GridfourDocs/notes/CompressionUsingOptimalPredictors.html
> GVRS Frequently asked Questions (FAQ):
> https://github.com/gwlucastrig/gridfour/wiki/A-GVRS-FAQ
>
> Thank you for your great data compression library.
>
> Gary
>
> And here are the Code Snippets:
>
> The Gridfour Virtual Raster Store (GVRS) is a wrapper format that
> stores separate blocks of compressed data to provide random-access by
> application code
>
> LZMA --
> // byte [] input is input data
> ByteArrayOutputStream baos = new  ByteArrayOutputStream();
> lzmaOut = new LZMAOutputStream(baos, new LZMA2Options(),
> input.length);
> lzmaOut.write(input, 0, input.length);
> lzmaOut.finish();
> lzmaOut.close();
> return baos.toByteArray();   // return byte[] which is stored to
> file
>
>
> // reading the compressed data:
> ByteArrayInputStream bais = new
> ByteArrayInputStream(compressedInput, 0, compressedInput.length);
> LZMAInputStream lzmaIn = new LZMAInputStream(bais);
> byte[] output = new byte[expectedOutputLength];
> lzmaIn.read(output, 0, output.length);
>
>
> XZ 
> // byte [] input is input data
> ByteArrayOutputStream baos = new  ByteArrayOutputStream();
> xzOut = new XzOutputStream(baos, new LZMA2Options(), input.length);
> xzOut.write(input, 0, input.length);
> xzOut.finish();
> xzOut.close();
> return baos.toByteArray();   // return byte[] which is stored to
> file
>
>// reading the compressed data:
>ByteArrayInputStream bais = new
> ByteArrayInputStream(compressedInput, 0, compressedInput.length);
> XzInputStream xzIn = new XzInputStream(bais);
> byte[] output = new byte[expectedOutputLength];
> xzIn.read(output, 0, output.length);
>
>


[xz-devel] Question about using Java API for geospatial data

2022-07-09 Thread Gary Lucas
Hi,

Would anyone be able to confirm that I am using the Java library
xz-java-1.9.zip correctly? If not, could you suggest a better way to
use it? Code snippets are included below.

I am using the library to compress a public-domain data product called
ETOPO1. ETOPO1 provides a global-scale grid of 233 million elevation
and ocean depth samples as integer meters. My implementation
compresses the data in separate blocks of about 20 thousand values
each. Previously, I used Huffman coding and Deflate to reduce the size
of the data to about 4.39 bits per value. With your library, LZMA
reduces that to 4.14 bits per value and XZ to 4.16. So both techniques
represent a substantial improvement in compression compared to the
Huffman/Deflate methods. That improvement comes with a reasonable
cost. Decompression using LZMA and XZ is slower than Huffman/Deflate.
The original implementation requires an average of 4.8 seconds to
decompress the full set of 233 million points.  The LZMA version
requires 15.2 seconds, and the XZ version requires 18.9 seconds.

My understanding is that XZ should perform better than LZMA. Since
that is not the case, could there be something suboptimal with the way
my code uses the API?

If you would like more detail about the implementation, please visit

Compression Algorithms for Raster Data:
https://gwlucastrig.github.io/GridfourDocs/notes/GridfourDataCompressionAlgorithms.html
Compression using Lagrange Multipliers for Optimal Predictors:
https://gwlucastrig.github.io/GridfourDocs/notes/CompressionUsingOptimalPredictors.html
GVRS Frequently asked Questions (FAQ):
https://github.com/gwlucastrig/gridfour/wiki/A-GVRS-FAQ

Thank you for your great data compression library.

Gary

And here are the Code Snippets:

The Gridfour Virtual Raster Store (GVRS) is a wrapper format that
stores separate blocks of compressed data to provide random-access by
application code

LZMA --
// byte [] input is input data
ByteArrayOutputStream baos = new  ByteArrayOutputStream();
lzmaOut = new LZMAOutputStream(baos, new LZMA2Options(), input.length);
lzmaOut.write(input, 0, input.length);
lzmaOut.finish();
lzmaOut.close();
return baos.toByteArray();   // return byte[] which is stored to file


// reading the compressed data:
ByteArrayInputStream bais = new
ByteArrayInputStream(compressedInput, 0, compressedInput.length);
LZMAInputStream lzmaIn = new LZMAInputStream(bais);
byte[] output = new byte[expectedOutputLength];
lzmaIn.read(output, 0, output.length);


XZ 
// byte [] input is input data
ByteArrayOutputStream baos = new  ByteArrayOutputStream();
xzOut = new XzOutputStream(baos, new LZMA2Options(), input.length);
xzOut.write(input, 0, input.length);
xzOut.finish();
xzOut.close();
return baos.toByteArray();   // return byte[] which is stored to file

   // reading the compressed data:
   ByteArrayInputStream bais = new
ByteArrayInputStream(compressedInput, 0, compressedInput.length);
XzInputStream xzIn = new XzInputStream(bais);
byte[] output = new byte[expectedOutputLength];
xzIn.read(output, 0, output.length);