Kenneth Whistler <kenw at sybase dot com> wrote:

>> UTF-16 does not allow the representation of an unpaired surrogate
>> 0xD800 followed by another, coincidental unpaired surrogate 0xDC00.
>> (It maps the two to U+10000.)  Among the standard UTFs, only UTF-32
>> allows the two to be treated as unpaired surrogates.
>
> Actually, not that, either.
>
>> In fact, before UTF-8 was
>> "tightened up" in 3.2, the only UTF that DID NOT permit these two
>> coincidental unpaired surrogates was UTF-16.
>>
>> UTF-8:  D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
>> UTF-32: D800 DC00 <==> 0000D800 0000DC00
>
> This is ill-formed in UTF-32, and thereby, illegal.
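For readers who want to check the byte sequences in the quoted exchange, here is a minimal sketch using Python 3's built-in codecs. It assumes Python 3.4 or later, where the strict UTF-8, UTF-16, and UTF-32 encoders all reject surrogate code points and the `surrogatepass` error handler reproduces the older, permissive mappings. The helper name `strict_encode_fails` is hypothetical, and the behavior shown is that of Python's codecs, not a statement of what any version of the Unicode Standard requires.

```python
# Illustrative sketch only: relies on Python 3.4+ codec behavior.
HIGH, LOW = 0xD800, 0xDC00
pair = chr(HIGH) + chr(LOW)  # a str holding two separate surrogate code points


def strict_encode_fails(text, codec):
    """Return True if the strict encoder for `codec` rejects `text`."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


# The 'surrogatepass' handler emits the permissive byte sequences
# that appear in the quoted table.
utf8_bytes = pair.encode("utf-8", "surrogatepass")
utf32_bytes = pair.encode("utf-32-be", "surrogatepass")
utf16_bytes = pair.encode("utf-16-be", "surrogatepass")

# In UTF-16 the same two code units cannot be told apart from a
# proper surrogate pair, so a strict decoder yields one code point.
decoded = utf16_bytes.decode("utf-16-be")

print(utf8_bytes.hex(" "))    # ed a0 80 ed b0 80
print(utf32_bytes.hex(" "))   # 00 00 d8 00 00 00 dc 00
print(hex(ord(decoded)))      # 0x10000
```

The last step is the point made in the first quoted paragraph: once the two values are written as UTF-16 code units, they read back as U+10000.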
I'm glad to hear that unpaired surrogates are now also illegal in
UTF-32, and presumably also in UTF-16.  However, I did do my homework
before writing yesterday's post, and that wasn't the impression I got,
so I sense another opportunity to tighten up the definitions before
Unicode 4.0 is released.

In UAX #28, "Unicode 3.2," the section on "Elimination of Irregular
Sequences" starts out talking about "transformation formats such as
UTF-8."  However, the rest of the section deals exclusively with UTF-8;
UTF-16 and UTF-32 are not mentioned.

UAX #19, "UTF-32" (written by Mark) is listed in the header block as
having been updated to Unicode 3.2, but it does not state anywhere that
unpaired surrogates are illegal.  In particular, the following passages
from UAX #19 led me to believe that all code points, from 0x0000
through 0x10FFFF inclusive, are legal in UTF-32:

"UTF-32 is restricted in values to the range 0..10FFFF₁₆, which
precisely matches the range of characters defined in the Unicode
Standard (and other standards such as XML), and those representable by
UTF-8 and UTF-16."

"(b) An illegal UTF-32 code unit sequence is any byte sequence that
would correspond to a numeric value outside of the range 0 to
10FFFF₁₆.

"(c) An irregular UTF-32 code unit sequence is an eight-byte sequence
where the first four bytes correspond to a high surrogate, and the next
four bytes correspond to a low surrogate.  As a consequence of C12,
these irregular UTF-32 sequences shall not be generated by a conformant
process."

I suggest that the Unicode 4.0 text specifically state, in unambiguous
terms, which code points are and are not valid in UTF-8, UTF-16, and
UTF-32.  And if it is true that the surrogate code points 0xD800
through 0xDFFF are illegal in UTF-32, then I suggest that UAX #19 be
revised to state this unambiguously.

-Doug Ewell
 Fullerton, California
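The gap in the quoted UAX #19 wording can be made concrete with a small sketch. It applies only definitions (b) and (c) as quoted above to a list of 32-bit values, and reports separately any surrogate value that neither definition covers. The function name `classify_utf32` and the result labels are hypothetical, chosen for illustration; this is one reading of the quoted text, not the normative rule.

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)  # D800..DBFF
LOW_SURROGATES = range(0xDC00, 0xE000)   # DC00..DFFF
MAX_CODE_POINT = 0x10FFFF


def classify_utf32(units):
    """Classify 32-bit values using only the quoted definitions (b) and (c)."""
    # (b): any value outside 0..10FFFF is illegal.
    if any(u < 0 or u > MAX_CODE_POINT for u in units):
        return "illegal"
    # (c): a high surrogate immediately followed by a low surrogate
    # is an irregular sequence.
    for first, second in zip(units, units[1:]):
        if first in HIGH_SURROGATES and second in LOW_SURROGATES:
            return "irregular"
    # Neither (b) nor (c) mentions any other surrogate value.
    if any(0xD800 <= u <= 0xDFFF for u in units):
        return "surrogate not covered by (b) or (c)"
    return "ok"


print(classify_utf32([0x110000]))        # illegal
print(classify_utf32([0xD800, 0xDC00]))  # irregular
print(classify_utf32([0xD800]))          # surrogate not covered by (b) or (c)
print(classify_utf32([0x41, 0x10000]))   # ok
```

Under this reading, a lone 0000D800 is neither illegal nor irregular, which is the case the revised text would need to address explicitly.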

