> In the case of GIF versus JPG, which are usually regarded as "lossless"
> versus "lossy", please note that there /is/ no "original", in the sense
> of a stream of bytes. Why not? Because an image is not a stream of
> bytes. Period. What is being compressed here is a rectangular array of
> pixels, and that is what is being restored when the image is "viewed". I
> am not aware of ANY use of the GIF format to compress an arbitrary byte
> stream.
>
> So, by analogy, if the XYZ compression format (I made that up) claims to
> compress a sequence of Unicode glyphs, as opposed to an arbitrary byte
> stream, and can later reconstruct that sequence of glyphs exactly, then
> I argue that it has every right to be called "lossless", in the same
> manner that GIF is called "lossless", because /there is no original byte
> stream to preserve/.
Well there *is* a stream of bytes with GIFs, and they *are* reconstructed perfectly on decompression. Most of the time it only matters that the image isn't altered by the compression. (PNG is perhaps a better analogy: with GIF we might be forced to reduce the colour depth of the image to make it fit the format, whereas with PNG we can get better compression by dropping to 256 or fewer colours, but we don't have to.)

However, it could be an issue if we performed some operation on the underlying data that treated it as bytes (signing the image in BMP format springs to mind as a possibility). While in practice we would generally not have such issues (we would move our signing operation to after the PNG creation), they could arise if we had some concept of signing an image independent of image format, implemented by converting to a canonical format as needed. In such a case PNG, being lossless in its treatment of the byte stream, would be usable; JPEG, being lossy, would not.

With a similar operation on Unicode text data we have a similar problem and a similar solution. If we need to use the underlying bytes (let's say we're signing again), we can either move the operation on the bytes until after the compression or, if that is not permitted by some requirement, we are forced to use a compression scheme that is lossless at the byte level. If our concept of signing is independent of encoding, then we can move between encodings during the compression process and sign on a canonical encoding. XML Signature is an example of this (it treats UTF-8 as a canonical encoding).
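To make the encoding-independence point concrete, here's a rough Python sketch. The `sign_text` helper name is mine, and a bare SHA-256 digest stands in for a real signature: hashing the raw bytes breaks when we transcode, while hashing a canonical encoding (UTF-8, as XML Signature does) survives it.

```python
import hashlib

def sign_text(text: str) -> str:
    """Hash a canonical encoding (UTF-8) of the text, so the
    'signature' survives transcoding between byte encodings.
    (A real signature would use a private key; a plain SHA-256
    digest stands in for one here.)"""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# The same text in two different byte encodings:
text = "naïve café"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# A byte-level hash differs between the encodings, so a byte-level
# signature breaks if compression transcodes the text in transit...
assert hashlib.sha256(utf8_bytes).digest() != hashlib.sha256(utf16_bytes).digest()

# ...but signing on the canonical encoding is stable:
assert sign_text(utf8_bytes.decode("utf-8")) == sign_text(utf16_bytes.decode("utf-16-le"))
```

So a compressor that converts UTF-16 input to UTF-8 internally is "lossy" to the byte-level hash but perfectly lossless to the canonical-encoding signature.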
If our concept of signing considers canonically equivalent sequences to be equivalent, we can move between normalisation forms in the compression process and sign and verify on a specified normalisation form. (Again, XML Signature alludes to this possibility, though it doesn't use it, as it could introduce security issues in some cases; for applications that truly treat canonically equivalent sequences as equivalent, this is a viable pre-processing step to XML Signature.)

So perhaps we should stop talking about "lossy/lossless" and talk about "what is lost" in a given operation. The advantage gained (theoretically, at least; does anyone have data on how significant this is?) comes from removing entropy of a type that the compression algorithm is unlikely to be able to remove itself. The question is whether this is truly entropy, or whether it's actually data. I'd lean towards considering it entropy and removing it, but I'd like to be warned in advance that this was going to happen, and to have other options available.
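The normalisation variant can be sketched the same way (again, `sign_normalised` and the bare digest are illustrative stand-ins, not anything XML Signature actually specifies): by normalising to a chosen form before hashing, canonically equivalent sequences verify as equal even if a compressor shifts the normalisation form in transit.

```python
import hashlib
import unicodedata

def sign_normalised(text: str) -> str:
    """Normalise to a specified form (NFC here) before hashing,
    so canonically equivalent sequences produce the same value.
    (A SHA-256 digest stands in for a real signature.)"""
    canonical = unicodedata.normalize("NFC", text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# "é" as one precomposed code point vs. "e" + combining acute accent:
nfc_form = "caf\u00e9"    # U+00E9
nfd_form = "cafe\u0301"   # U+0065 U+0301

# The sequences differ at the code-point (and byte) level...
assert nfc_form != nfd_form

# ...but they are canonically equivalent, so the signature holds
# even if compression moved the text between normalisation forms:
assert sign_normalised(nfc_form) == sign_normalised(nfd_form)
```

This is exactly the trade the paragraph above describes: the distinction between NFC and NFD is treated as entropy to be discarded, which is only safe for applications that genuinely never distinguish canonically equivalent sequences.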

