Kenneth Whistler <kenw at sybase dot com> wrote:

> It is perfectly conformant with the Unicode Standard to assert
> that <U+00E9> "é" and <U+0065, U+0301> "é" are different
> Unicode strings. They *are* different Unicode strings. They
> contain different encoded characters, and they have different
> lengths.
> ...
> What canonical equivalence is about is making non-distinctions
> in the *interpretation* of equivalent sequences. No
> Unicode-conformant process should assume that another process will
> systematically distinguish a meaningful interpretation
> difference between <U+00E9> "é" and <U+0065, U+0301> "é" --
> they both represent the *same* abstract character, namely
> an e-acute.

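
The distinction drawn in the quoted text can be checked with a short sketch. It is only an illustration, not anything from the original thread: it assumes Python 3 and its standard unicodedata module, and the names precomposed, decomposed, and code_points are made up for the example.

```python
import unicodedata

# Hypothetical names, chosen for this illustration only.
precomposed = "\u00e9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "\u0065\u0301"  # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def code_points(s):
    """List the encoded characters of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


# As Unicode strings the two differ: different encoded characters,
# different lengths in code points, different lengths in UTF-8 bytes.
different_strings = precomposed != decomposed
lengths = (len(precomposed), len(decomposed))
utf8_lengths = (len(precomposed.encode("utf-8")),
                len(decomposed.encode("utf-8")))

# Under canonical equivalence they are the same: normalizing both to
# NFC (or both to NFD) yields identical strings.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))
same_nfd = (unicodedata.normalize("NFD", precomposed)
            == unicodedata.normalize("NFD", decomposed))

print(code_points(precomposed), code_points(decomposed))
print(different_strings, lengths, utf8_lengths, same_nfc, same_nfd)
```

The two strings compare unequal and have lengths 1 and 2 (2 and 3 bytes in UTF-8), yet they become identical once both are put into the same normalization form.
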
Just to wrap up the discussion we had last week on compression:

For me at least, this settles it. Compression engines generally
operate at a level where strings of encoded characters, not their
interpretation, are at issue. Differences between strings that are
due to normalization are not relevant for interpretation, but may be
very relevant for other factors, like string length and checksums.

That being the case, it would *not* generally be appropriate for a
compressor to normalize its input text. To do so would be to
introduce differences at a level where there should be none.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

