[EMAIL PROTECTED] wrote:
> There is also a concern that CESU-8 is really just a variation of UTF-8,
> allowing (nay, requiring) sequences that are illegal in UTF-8 but otherwise
> looking just like UTF-8. This could open security holes that the UTC has
> worked hard to close, and is continuing to close in Unicode 3.2.

First, I would like to thank Doug for alerting us to what I think is a dangerous move - the continued promotion of UTF-8S, a.k.a. CESU-8.

For some time now, my focus has been on another UTF-8 variation, namely UTF-8B. What scares me is that the two UTF-8 mutations seem mutually exclusive to me. They both break existing rules by using illegal or irregular sequences, and once one of the two transformations is accepted, the other one is doomed. I have to admit one thing though: UTF-8S has only one possible implementation, while UTF-8B could also be achieved in some other way, probably by reserving 128 code points, hopefully in the BMP (hear me laughing). Unfortunately, the exceptions it would create would then no longer be in the domain of irregular sequences, which was one of the beauties of its original design.

Now, for those who are not familiar with UTF-8B: its intent is to guarantee the roundtrip from 8-bit data to UTF-16 (or UCS-4) and back. I think it addresses a problem that will become more and more evident, even in the near future. This is the so-called 'problem of illegal UTF-8 sequences'.

Why is this important? Well, UTF-16 is simple - except for a little mess with UTF-16 vs. UCS-2, you pretty much know what you have - it's Unicode, and if a program fails to realize that, the results are catastrophic, immediately. Which is good.

On the other hand, 8-bit data is very tricky. It can be UTF-8, or it can be encoded in any SBCS or MBCS codeset there is. From an armchair point of view, it may look pretty trivial - if your editor encounters an illegal sequence in a text file, ask the user whether the file is to be interpreted as codeset based, right? Well, how about searching or indexing? Who can the program ask then? How about presenting a Unix filesystem on the web (in an HTML file marked as UTF-8) or to a UTF-16 based OS, like Windows? And filenames are very tricky indeed, because you have no way of embedding any codeset information.
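
To make the roundtrip concrete, here is a minimal Python sketch. It assumes the variant of UTF-8B in which every byte that cannot be decoded as UTF-8 is mapped to a lone low surrogate in the range U+DC80..U+DCFF. Python's built-in `surrogateescape` error handler behaves this way and is used here purely as an illustration, not as the definition of UTF-8B. The function names `utf8b_decode` and `utf8b_encode` are made up for this example.

```python
def utf8b_decode(raw: bytes) -> str:
    """Decode as UTF-8; each undecodable byte 0xNN becomes U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def utf8b_encode(text: str) -> bytes:
    """Inverse of utf8b_decode: U+DC80..U+DCFF turn back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Two filenames for "café.txt": one in Latin-1, one in UTF-8.
legacy = b"caf\xe9.txt"             # 0xE9 alone is not valid UTF-8
modern = "café.txt".encode("utf-8")

for raw in (legacy, modern):
    text = utf8b_decode(raw)
    # The bytes -> code points -> bytes roundtrip is lossless.
    assert utf8b_encode(text) == raw
    print(raw, [hex(ord(ch)) for ch in text])
```

The Latin-1 name comes out with a single escaped code point (U+DCE9) where the illegal byte was, and the UTF-8 name decodes normally. Note the price: the result can contain unpaired surrogates, so it is not well-formed UTF-16 in the strict sense, which is exactly the kind of irregular sequence discussed above.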

It is my belief that UTF-8 will become more and more popular on Unix. Some day in the near future (ok, 5 or 10 years?) I expect to see 90% of filenames in UTF-8. Everybody will use UTF-8 as their codeset. Can somebody explain to me how the remaining 10% will be treated? By producing errors on open, maybe even on ls/dir?!

The only way that can be avoided is to standardize a transformation that guarantees that any zero-terminated sequence of 8-bit characters can be transformed to Unicode code points and back without any data loss. Any promotion of CESU-8 will take us a step further away from solving this problem, which I believe is far more important than anything CESU-8 is addressing.

Let me just finish by saying that I am not making up this problem or merely foreseeing it - I have an actual requirement to store Unix filenames in a UTF-16 database. Since CESU-8 is not helping me there, I cannot but urge everyone - if there is going to be a mutation of UTF-8, it should be UTF-8B and *not* CESU-8.

Merry Christmas to all,

Lars Kristan

