I disagree with Ken, but don't have time now to write a lengthy reply. I'll try to get to that soon.
Mark
__________
http://www.macchiato.com
◄ “Eppur si muove” ►

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>
Sent: Tuesday, July 23, 2002 19:44
Subject: Re: Abstract character?

> Kenneth Whistler <kenw at sybase dot com> wrote:
>
> >> UTF-16 does not allow the representation of an unpaired surrogate
> >> 0xD800 followed by another, coincidental unpaired surrogate 0xDC00.
> >> (It maps the two to U+10000.) Among the standard UTFs, only UTF-32
> >> allows the two to be treated as unpaired surrogates.
> >
> > Actually, not that, either.
> >
> >> In fact, before UTF-8 was
> >> "tightened up" in 3.2, the only UTF that DID NOT permit these two
> >> coincidental unpaired surrogates was UTF-16.
> >>
> >> UTF-8: D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
> >> UTF-32: D800 DC00 <==> 0000D800 0000DC00
> >
> > This is ill-formed in UTF-32, and thereby, illegal.
>
> I'm glad to hear that unpaired surrogates are now also illegal in
> UTF-32, and presumably also in UTF-16. However, I did do my homework
> before writing yesterday's post, and that wasn't the impression I got,
> so I sense another opportunity to tighten up the definitions before
> Unicode 4.0 is released.
>
> In UAX #28, "Unicode 3.2," the section on "Elimination of Irregular
> Sequences" starts out talking about "transformation formats such as
> UTF-8." However, the rest of the section deals exclusively with UTF-8;
> UTF-16 and UTF-32 are not mentioned.
>
> UAX #19, "UTF-32" (written by Mark) is listed in the header block as
> having been updated to Unicode 3.2, but it does not state anywhere that
> unpaired surrogates are illegal. In particular, the following passages
> from UAX #19 led me to believe that all code points, from 0x0000 through
> 0x10FFFF inclusive, are legal in UTF-32:
>
> "UTF-32 is restricted in values to the range 0..10FFFF<sub>16</sub>,
> which precisely matches the range of characters defined in the Unicode
> Standard (and other standards such as XML), and those representable by
> UTF-8 and UTF-16."
>
> "(b) An illegal UTF-32 code unit sequence is any byte sequence that
> would correspond to a numeric value outside of the range 0 to
> 10FFFF<sub>16</sub>.
>
> "(c) An irregular UTF-32 code unit sequence is an eight-byte sequence
> where the first four bytes correspond to a high surrogate, and the next
> four bytes correspond to a low surrogate. As a consequence of C12, these
> irregular UTF-32 sequences shall not be generated by a conformant
> process."
>
> I suggest that the Unicode 4.0 text specifically state, in unambiguous
> terms, which code points are and are not valid in UTF-8, UTF-16, and
> UTF-32. And if it is true that the surrogate code points 0xD800 through
> 0xDFFF are illegal in UTF-32, then I suggest that UAX #18 be revised to
> state this unambiguously.
>
> -Doug Ewell
> Fullerton, California
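
For readers who want to poke at the D800 DC00 example from the quoted message, here is a minimal Python 3 sketch. It assumes the stricter reading Ken describes, in which surrogate code points are ill-formed in every UTF, and that happens to be how Python's built-in strict codecs behave. The helper names (`is_surrogate`, `combine_surrogates`, `strict_encode_fails`) are made up for illustration and come from no standard or library; the `"surrogatepass"` error handler is used only to reproduce the older, looser byte sequences quoted above.

```python
# Hypothetical helpers for illustration; not part of any Unicode API.

def is_surrogate(cp: int) -> bool:
    """True if cp lies in the surrogate range D800..DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 high/low surrogate pair to its supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def strict_encode_fails(text: str, encoding: str) -> bool:
    """True if Python's strict encoder rejects the text."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


# Two surrogate code points held as separate items in a Python str.
pair = "\ud800\udc00"

# Read as UTF-16 code units, the pair denotes U+10000.
print(hex(combine_surrogates(0xD800, 0xDC00)))

# Python's strict UTF-8 and UTF-32 encoders both refuse surrogates.
print(strict_encode_fails(pair, "utf-8"), strict_encode_fails(pair, "utf-32-be"))

# "surrogatepass" reproduces the looser byte sequences from the quoted post.
print(pair.encode("utf-8", "surrogatepass").hex(" "))
print(pair.encode("utf-32-be", "surrogatepass").hex(" "))
```

The range check in `is_surrogate` is the whole of the restriction Doug asks to have spelled out: under the strict reading, a value is encodable in any UTF only if it lies in 0..10FFFF and is not in D800..DFFF.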

