Thanks for the information.  That is just the kind of thing I was
looking for.  I think it will be very helpful.

One thing I'd like to clarify is that I do not consider a decompressor
that takes 18.9 seconds to read 233 million sample values to be slow.
To me, that's a remarkable accomplishment. My simple-minded Huffman
decoder takes 5.93 seconds to read the same number of points and does
not get nearly as good compression ratios as LZMA and XZ.  And 5.93
seconds is the result of a lot of work trying to optimize the code.

>>> So that is about 12 thousand blocks?

Yes.  That's a fair estimate. There are actually 10,800 blocks.  Each
covers 2 degrees of latitude and 3 degrees of longitude.  I arrived at
that size through trial and error. Naturally, conventional data
compressors work better on larger texts, so a larger block size might
have advantages because it would contain a larger symbol set. But, at
the same time, a larger block would cover a larger area on the face of
the Earth and would show more statistical variation
(heteroskedasticity) in the data. So the increase in the entropy of
the uncompressed text might lead to worse results in terms of data
compression.  Before the code applies conventional data compression,
it runs a set of predictors over the data (similar to what the PNG
format does).
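Roughly speaking, the predictor pass works like the PNG filters: each
sample is replaced by its difference from a predicted value, so
spatially correlated data becomes a stream of small residuals that a
general-purpose compressor handles more easily.  A minimal sketch (my
own illustration of the idea, not the actual code, with zlib standing
in for LZMA/XZ):

```python
# Illustrative predictor pass: keep the first sample, replace every
# later sample with its delta from the previous one.  This is a sketch
# of the general technique, not the real predictor set.
import struct
import zlib

def predictor_residuals(samples):
    """First value kept as-is; each later value becomes a delta."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def reconstruct(residuals):
    """Inverse transform: a running sum restores the original samples."""
    out, total = [], 0
    for r in residuals:
        total += r
        out.append(total)
    return out

samples = [1000, 1002, 1003, 1001, 999, 998, 1000, 1004]
res = predictor_residuals(samples)
assert reconstruct(res) == samples  # the transform is lossless

raw = struct.pack(f"<{len(samples)}i", *samples)
pre = struct.pack(f"<{len(res)}i", *res)
print("raw:", len(zlib.compress(raw)), "predicted:", len(zlib.compress(pre)))
```

The residual stream concentrates the values near zero, which is what
gives the downstream compressor its advantage.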

The other motivation for the block scheme is that the API provides
random access to the data. Someone looking at data for Finland usually
doesn't care much about the data from Australia, so the file is
divided into regional blocks. The choice of block size also reflects
the way in which I anticipate applications would use the data.
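For what it's worth, the arithmetic behind the block count: 180
degrees of latitude / 2 = 90 rows, 360 degrees of longitude / 3 = 120
columns, and 90 x 120 = 10,800 blocks.  A sketch of how a coordinate
might be mapped to a block index (the function name and row-major
indexing convention are my own illustration, not the actual API):

```python
# Illustrative block addressing for a 2-degree-by-3-degree tiling of
# the globe: 90 block rows, 120 block columns, 10800 blocks total.
def block_index(lat, lon):
    """Map latitude [-90, 90) and longitude [-180, 180) to a block number."""
    row = int((lat + 90.0) // 2)   # 0..89
    col = int((lon + 180.0) // 3)  # 0..119
    return row * 120 + col

# A request over Finland and one over Australia land in different
# blocks, so only the relevant blocks need to be read and decompressed.
print(block_index(64.0, 26.0))    # somewhere in Finland
print(block_index(-25.0, 134.0))  # somewhere in Australia
```

Only the blocks a query touches ever get decompressed, which is the
whole point of trading a little compression ratio for regional access.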

Thanks again for your help.


Reply via email to