> In the case of GIF versus JPG, which are usually regarded as "lossless"
> versus "lossy", please note that there /is/ no "original", in the sense
> of a stream of bytes. Why not? Because an image is not a stream of
> bytes. Period. What is being compressed here is a rectangular array of
> pixels, and that is what is being restored when the image is "viewed". I
> am not aware of ANY use of the GIF format to compress an arbitrary byte
> stream.
>
> So, by analogy, if the XYZ compression format (I made that up) claims to
> compress a sequence of Unicode glyphs, as opposed to an arbitrary byte
> stream, and can later reconstruct that sequence of glyphs exactly, then
> I argue that it has every right to be called "lossless", in the same
> manner that GIF is called "lossless", because /there is no original byte
> stream to preserve/.
Well there *is* a stream of bytes with GIFs, and they *are* reconstructed perfectly on decompression. Most of the time it only matters that the image isn't altered by the compression. (PNG is perhaps a better analogy: with GIF we might be forced to reduce the colour depth of the image to make it fit the format, whereas with PNG we can get better compression by dropping to 256 or fewer colours, but we don't have to.)

However, it could be an issue if we performed some operation on the underlying data that treated it as bytes (signing the image in BMP format springs to mind as a possibility). While in practice we would generally not have such issues (we would move our signing operation to after the PNG creation), they could arise if we had some concept of signing an image independent of image format, implemented by converting to a canonical format as needed. In such a case PNG, being lossless in its treatment of the byte stream, would be usable; JPEG, being lossy, would not.

With a similar operation on Unicode text data we have a similar problem and a similar solution. If we need to use the underlying bytes (let's say we're signing again), we can either move the operation on the bytes until after the compression or, if that is not permitted by some requirement, we are forced to use a compression scheme that is lossless at the byte level. If our concept of signing is independent of encoding, then we can move between encodings during the compression process and sign on a canonical encoding. XML Signature is an example of this (it treats UTF-8 as a canonical encoding).
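To make the encoding-independence point concrete, here's a rough Python sketch. The `sign_text` helper name is mine, and a bare SHA-256 digest stands in for a real signature: hashing the raw bytes breaks when we transcode, while hashing a canonical encoding (UTF-8, as XML Signature does) survives it.

```python
import hashlib

def sign_text(text: str) -> str:
    """Hash a canonical encoding (UTF-8) of the text, so the
    'signature' survives transcoding between byte encodings.
    (A real signature would use a private key; a plain SHA-256
    digest stands in for one here.)"""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# The same text in two different byte encodings:
text = "naïve café"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# A byte-level hash differs between the encodings, so a byte-level
# signature breaks if compression transcodes the text in transit...
assert hashlib.sha256(utf8_bytes).digest() != hashlib.sha256(utf16_bytes).digest()

# ...but signing on the canonical encoding is stable:
assert sign_text(utf8_bytes.decode("utf-8")) == sign_text(utf16_bytes.decode("utf-16-le"))
```

So a compressor that converts UTF-16 input to UTF-8 internally is "lossy" to the byte-level hash but perfectly lossless to the canonical-encoding signature.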
If our concept of signing considers canonically equivalent sequences to be equivalent, we can move between normalisation forms in the compression process and sign and verify on a specified normalisation form. (Again, XML Signature alludes to this possibility, though it doesn't use it, as it could introduce security issues in some cases; for applications that truly treat canonically equivalent sequences as equivalent, this is a viable pre-processing step to XML Signature.)

So perhaps we should stop talking about "lossy/lossless" and talk about "what is lost" in a given operation. The advantage gained (theoretically, at least; does anyone have data on how significant this is?) comes from removing entropy of a type that the compression algorithm is unlikely to be able to remove itself. The question is whether this is truly entropy, or whether it's actually data. I'd lean towards considering it entropy and removing it, but I'd like to be warned in advance that this was going to happen, and to have other options available.
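The normalisation variant can be sketched the same way (again, `sign_normalised` and the bare digest are illustrative stand-ins, not anything XML Signature actually specifies): by normalising to a chosen form before hashing, canonically equivalent sequences verify as equal even if a compressor shifts the normalisation form in transit.

```python
import hashlib
import unicodedata

def sign_normalised(text: str) -> str:
    """Normalise to a specified form (NFC here) before hashing,
    so canonically equivalent sequences produce the same value.
    (A SHA-256 digest stands in for a real signature.)"""
    canonical = unicodedata.normalize("NFC", text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# "é" as one precomposed code point vs. "e" + combining acute accent:
nfc_form = "caf\u00e9"    # U+00E9
nfd_form = "cafe\u0301"   # U+0065 U+0301

# The sequences differ at the code-point (and byte) level...
assert nfc_form != nfd_form

# ...but they are canonically equivalent, so the signature holds
# even if compression moved the text between normalisation forms:
assert sign_normalised(nfc_form) == sign_normalised(nfd_form)
```

This is exactly the trade the paragraph above describes: the distinction between NFC and NFD is treated as entropy to be discarded, which is only safe for applications that genuinely never distinguish canonically equivalent sequences.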

