Philippe Verdy wrote:
> Please don't use the term "normalize" in this context. Normalization in
> Unicode involves transformation of the stream of *code points*, but is
> independent of their encoding form or encoding scheme.
Yes, I believe I shouldn't have used "normalization". I do not know if this word has a general meaning and can be used when describing escaping or TES-es. For now I will assume it can, until someone enlightens me. But yes, since we're discussing Unicode (I still believe we are), it could be confusing. So, let me use "pre-normalization", and please consider that in the post being discussed this is what I meant. None of the references to "normalization" was intended to mean Unicode Normalization.
I did want to point out the similarity, though, and that pre-normalization is typically strongly tied to Unicode Normalization: where Unicode Normalization is desired or required, pre-normalization also applies. There could be cases where only one of the two needs to be used, but I believe the two would usually go together. Specifically, when filenames are specified in a UI, no normalization should be applied unless the filesystem itself only allows normalized filenames. The same goes for any normalization and any store (when selecting an item, not when searching).
For the sake of completeness, let me attempt to describe how a normalization consisting of several sub-normalizations should be done, assuming my escaping technique is used. Let me use W3C as an example. W3C pre-normalization should be done first. It can produce any codepoint, but is not itself affected by Unicode Normalization, nor can it be affected by MUTF-8 pre-normalization (since the latter can produce no codepoints in the U+0000..U+007F range). Next, MUTF-8 pre-normalization is applied, since it can produce codepoints that will need to be Unicode Normalized. Last, Unicode Normalization is applied.
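The ordering described above can be sketched in Python. This is only an illustration under assumptions: the W3C step is taken to mean expanding numeric character references such as &#xE9;, mutf8_prenormalize is a hypothetical placeholder that does nothing (the actual escaping technique is not reproduced here), and NFC is picked arbitrarily as the Unicode Normalization form.

```python
import re
import unicodedata

# Numeric character references: &#xHHHH; or &#DDDD;
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def w3c_prenormalize(text):
    """Assumed meaning of the W3C step: expand numeric character references."""
    def expand(match):
        cp = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        # References beyond U+10FFFF are left untouched.
        return chr(cp) if cp <= 0x10FFFF else match.group(0)
    return _NCR.sub(expand, text)

def mutf8_prenormalize(text):
    """Hypothetical placeholder for collapsing MUTF-8 escape sequences."""
    return text

def normalize_all(text):
    # Order matters: each step may produce input for the steps after it.
    text = w3c_prenormalize(text)
    text = mutf8_prenormalize(text)
    return unicodedata.normalize("NFC", text)
```

For instance, the reference &#x301; expands to a combining acute accent, which only the final Unicode Normalization step can then compose with a preceding "e". Running the steps in the opposite order would leave that sequence uncomposed.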
Now, I assume someone may also want to use a CESU-8 pre-normalization in that process. It needs to be done just before Unicode Normalization, and it is not needed if the data is in UTF-16 form at that point.
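As an illustration of what a CESU-8 pre-normalization amounts to once the data is held as a sequence of code points, here is a minimal Python sketch. The function name cesu8_prenormalize is made up for this example. It joins each high/low surrogate pair into the supplementary code point it represents and leaves everything else, including unpaired surrogates, alone.

```python
def cesu8_prenormalize(text):
    """Join each high+low surrogate pair into one supplementary code point."""
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out.append(chr(cp))
                i += 2
                continue
        # Unpaired surrogates and all other code points pass through unchanged.
        out.append(text[i])
        i += 1
    return "".join(out)
```

In UTF-16 a surrogate pair already is the representation of the supplementary code point, which is why the step has nothing to do there.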
Note that the above process is straightforward and only the order matters. However, that is only true because of some assumptions. Things would get complicated if:
A - MUTF-8 (escaping) used codepoints outside of the BMP. Then CESU-8 pre-normalization could produce a codepoint that would need MUTF-8 pre-normalization. The reverse is already true. In general, one would need to alternate the two pre-normalizations until neither of them changes the data any more.
B - Some other escaping technique were used where I used W3C as an example, and this escaping technique used a non-ASCII codepoint (alone or within the escape sequence). Again, the pre-normalizations would need to be alternated. If this codepoint were non-BMP, then all three would need to be applied repeatedly.
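The alternation needed in cases A and B is a fixed-point iteration. A minimal Python sketch follows; prenormalize_until_stable and its max_rounds guard are names invented for this illustration, and steps stands for whatever pre-normalization functions are in play.

```python
def prenormalize_until_stable(text, steps, max_rounds=100):
    """Apply every step in turn, repeating until a full round changes nothing."""
    for _ in range(max_rounds):
        previous = text
        for step in steps:
            text = step(text)
        if text == previous:
            return text
    # Guard against steps that keep feeding each other forever.
    raise RuntimeError("pre-normalization did not stabilize within max_rounds")
```

With toy steps where one produces input for the other, for example a step rewriting "<1>" to "<2>" and another rewriting "<2>" to "x", a single pass in the wrong order leaves work undone, while the loop keeps going until the text stops changing.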
OK, one more thing. I have tested my conversion with the controversial test file mentioned in the "UTF-8 stress test file?" thread. I was very pleased with the result (which I cannot say for the way Internet Explorer displays the original file: it does not drop invalid sequences, but some invalid sequences 'eat' the characters after them). While examining my output, I noticed that I treated both unpaired and paired (CESU-8) surrogates as invalid sequences. That puzzled me for a moment, since I never had any intention of actively obstructing CESU-8. But I made the right decision when I was implementing the conversion - CESU-8 input would not roundtrip otherwise.
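The roundtrip point can be illustrated with Python's own codecs. This is a sketch using standard library behaviour, not the conversion discussed above: the strict UTF-8 decoder rejects encoded surrogates, and a forgiving path that joins a CESU-8 pair cannot give the original bytes back. The helper names are invented for the example.

```python
# U+10000 written as a CESU-8 surrogate pair (D800 DC00, three bytes each).
CESU8_U10000 = b"\xed\xa0\x80\xed\xb0\x80"

def is_valid_utf8(data):
    """True if the strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def forgiving_roundtrip(data):
    """Decode surrogates leniently, join pairs, then re-encode as UTF-8."""
    text = data.decode("utf-8", "surrogatepass")
    # A trip through UTF-16 joins each valid surrogate pair.
    raw16 = text.encode("utf-16-le", "surrogatepass")
    joined = raw16.decode("utf-16-le", "surrogatepass")
    return joined.encode("utf-8", "surrogatepass")
```

Here is_valid_utf8(CESU8_U10000) is false, and forgiving_roundtrip(CESU8_U10000) yields the four-byte UTF-8 form of U+10000 rather than the six input bytes, so the original byte sequence is lost.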
Another example of what one could do is implement a very forgiving NON-UTF-8 decoder, which would preserve most invalid sequences, normalize CESU-8 and partially pre-normalize MUTF-8 escape sequences. The implementation of such a decoder would be close to mine, with the two checks that guarantee the roundtrip removed, namely escaping the escapes and treating surrogates as invalid sequences. An optional next step would be full MUTF-8 pre-normalization. If the output is not UTF-16, then CESU-8 pre-normalization would be the last step needed. The equivalent of the above is:
1 - CESU-8 pre-normalization
2 - use unmodified MUTF-8 decoder
3 - one step (optionally full) MUTF-8 pre-normalization
4 - CESU-8 pre-normalization
But the same result (especially when full pre-normalization is not required) can be achieved more efficiently by modifying the behavior of the function itself.
Lars

