I'm pretty sure it depends on whether you regard a text document as a sequence of characters, or as a sequence of glyphs. (Er - I mean "default grapheme clusters" of course). Regarded as a sequence of characters, normalisation changes that sequence. But regarded as a sequence of glyphs, normalisation leaves the sequence unchanged. So a compression algorithm could legitimately claim to be "lossless" if it did normalisation but operated at the glyph level.
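To make that concrete, here is a minimal Python sketch (just an illustration, using the standard `unicodedata` module): the NFD and NFC forms of the same word are different codepoint sequences, yet they represent the same sequence of grapheme clusters.

```python
import unicodedata

# "café" with a decomposed accent: 'e' followed by U+0301 COMBINING ACUTE ACCENT
s_nfd = "cafe\u0301"

# NFC normalisation composes 'e' + U+0301 into the single codepoint U+00E9
s_nfc = unicodedata.normalize("NFC", s_nfd)

assert s_nfc == "caf\u00e9"          # precomposed form
assert s_nfd != s_nfc                # the codepoint sequences differ...
assert len(s_nfd) == 5               # 5 codepoints decomposed
assert len(s_nfc) == 4               # 4 codepoints composed
# ...but both render as "café": 4 default grapheme clusters either way
```

So a compressor that normalised its input would be lossy at the codepoint level but lossless at the grapheme level.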
I'm pretty sure you DON'T need to preserve the byte-stream bit for bit. For example, at the byte level, I see no reason to preserve invalid encoding sequences, and at the codepoint level I see no reason to preserve non-character codepoints. So - at the glyph level - we only need to preserve glyphs, no? It all depends on how the compression algorithm describes itself.
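For the invalid-sequence point, a small sketch (again just illustrative): a decoder that replaces an ill-formed UTF-8 sequence with U+FFFD has already discarded byte-level information, and nobody calls that a defect of the text model.

```python
# A byte stream ending in a lone continuation byte 0x80, which is
# not a valid UTF-8 sequence on its own.
data = b"caf\xc3\xa9 \x80"

# Decoding with errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER
# for the invalid byte; the original byte value is not preserved.
text = data.decode("utf-8", errors="replace")

assert text == "caf\u00e9 \ufffd"
```

Once the invalid byte is gone, no byte-for-bit round trip is possible, so "lossless" can only be judged against whatever level (bytes, codepoints, glyphs) the algorithm claims to preserve.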
I think this might go wrong for "tailored grapheme clusters", but I don't know much about them.
Jill

