On Wed, 24 Aug 2011 17:07:03 -0700 Ken Whistler <[email protected]> wrote:
> > <Snip> The
> > BMP is littered with concessions to the limitations of rendering
> > systems - precomposed characters, Hangul syllables and Arabic
> > presentation forms are the most significant.

> Those are not concessions to "the limitations of rendering systems"
> -- they are concessions to the need to stay compatible with the
> character encodings of legacy systems, which had limitations for
> their rendering systems.

Which earlier coding system supported Welsh? (I'm thinking of 'W WITH
CIRCUMFLEX', U+0174 and U+0175.) How was the use of the canonical
decompositions incompatible with the character encodings of legacy
systems? Latin-1 has the same codes as ISO-8859-1, but that's as far
as having the same codes goes. Was the use of combining jamo
incompatible with legacy Hangul encodings?

> >>> > > I think, however, that <high><high><rare
> >>> > > BMP code><low> offers a legitimate extension mechanism

> >> > One could argue about the description as "legitimate". It is
> >> > clearly not conformant,

> In whichever encoding form you choose to specify, the sequence
> <high><high> is non-conformant. Not merely a possibly new type of
> code unit sequence.
>
> <D800 D800> is non-conformant UTF-16
>
> <0000D800 0000D800> is non-conformant UTF-32
>
> <ED A0 80 ED A0 80> is non-conformant UTF-8

<high><low> is also non-conformant UTF-8 and UTF-32. Obviously
<D800 D800 000E DC00> is non-conformant with current UTF-16.
Remembering that there is a guarantee that there will be no more
surrogate points, an extension form has to be non-conformant with
current UTF-16!

> >> > I see no chance of that happening for either the Unicode
> >> > Standard or 10646.

> > It will only happen when the need becomes obvious, which may be
> > never, or may be 30 years hence. It's even conceivable that UTF-16
> > will drop out of use.
>
> Could happen. It still doesn't matter, because such a proposal also
> breaks UTF-8 and UTF-32.
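For anyone who wants to check the conformance claims above against a real decoder, here is a minimal sketch. It assumes Python 3.4 or later, whose built-in strict codecs reject surrogate code points in UTF-8 and UTF-32 and unpaired surrogates in UTF-16. The helper name is_conformant is made up for this illustration, and "the strict decoder accepts it" is only a stand-in for conformance as the standard defines it.

```python
def is_conformant(data: bytes, encoding: str) -> bool:
    """Return True if Python's strict decoder accepts `data` in `encoding`.

    Hypothetical helper: acceptance by the decoder is used here as a
    proxy for well-formedness, not as a definition of it.
    """
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Byte sequences discussed in the thread, big-endian where byte order matters.
CASES = [
    ("<high><high> UTF-16", "D800 D800", "utf-16-be"),
    ("<high><high> UTF-32", "0000D800 0000D800", "utf-32-be"),
    ("<high><high> UTF-8", "ED A0 80 ED A0 80", "utf-8"),
    ("<high><low> UTF-16", "D800 DC00", "utf-16-be"),
    ("<high><low> UTF-32", "0000D800 0000DC00", "utf-32-be"),
    ("<high><low> UTF-8", "ED A0 80 ED B0 80", "utf-8"),
    ("<high><high><BMP><low> UTF-16", "D800 D800 000E DC00", "utf-16-be"),
]

if __name__ == "__main__":
    for label, hex_bytes, encoding in CASES:
        verdict = is_conformant(bytes.fromhex(hex_bytes), encoding)
        print(f"{label}: accepted={verdict}")
```

Only the plain <high><low> pair in UTF-16 should be accepted; everything else, including the proposed four-unit extension sequence, is rejected by a current decoder, which is the point being argued.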
Everyone should know how to extend UTF-8 and UTF-32 to cover the
31-bit range. Just go back to the old ISO 10646 definitions!

UTF-16 is the problem. Past suggestions have included a new set of
surrogate points, which would restrict the numbers that could
represent characters. One might, for instance, allocate U+B0000 to
U+BFFDD to 'high extended surrogates' and U+C0000 to U+C7FFF to 'low
extended surrogates'. That's a lot of codepoints just so that a 31-bit
number can be expressed in 64 bits, and the scheme could easily be
rendered impossible by a few random assignments. (Using three
surrogates would be more economical in codepoints - one could even do
<high1><high2><low3><low4> with high2 having a restricted range taking
out just 2^11 codepoints from the supplementary planes.)

Andrew reasonably asked whether an extension *could* be done without
creating more surrogates. All the solutions we've thought of affect
searching for a single character - using an ISO 2022 escape code is
probably the worst of them from this point of view.

Richard.
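The 'extended surrogates' arithmetic mentioned above can be sketched as follows. This is only an illustration of the idea, not a proposal: the two ranges are the example allocations from this message, the 15-bit split follows from the size of the low range (U+C0000 to U+C7FFF is 2^15 codepoints), and the function names are invented here. No offset is subtracted from the value, which is a simplifying assumption.

```python
# Hypothetical allocations taken from the example in the message.
HIGH_FIRST, HIGH_LAST = 0xB0000, 0xBFFDD  # 'high extended surrogates'
LOW_FIRST, LOW_LAST = 0xC0000, 0xC7FFF    # 'low extended surrogates'
LOW_BITS = 15                             # the low range holds 2**15 codepoints


def encode_extended(value: int) -> str:
    """Split `value` into a <high extended><low extended> codepoint pair."""
    if value < 0:
        raise ValueError("value must be non-negative")
    high = HIGH_FIRST + (value >> LOW_BITS)
    low = LOW_FIRST + (value & ((1 << LOW_BITS) - 1))
    if high > HIGH_LAST:
        raise ValueError("value does not fit in the assumed high range")
    return chr(high) + chr(low)


def decode_extended(pair: str) -> int:
    """Recombine a pair produced by encode_extended into the original value."""
    high, low = (ord(c) for c in pair)  # raises ValueError unless exactly two
    if not (HIGH_FIRST <= high <= HIGH_LAST and LOW_FIRST <= low <= LOW_LAST):
        raise ValueError("not a <high extended><low extended> pair")
    return ((high - HIGH_FIRST) << LOW_BITS) | (low - LOW_FIRST)


if __name__ == "__main__":
    pair = encode_extended(0x12345678)
    units = pair.encode("utf-16-be")
    print([hex(ord(c)) for c in pair], len(units) * 8, "bits in UTF-16")
    print(hex(decode_extended(pair)))
```

Each half is itself a supplementary codepoint, so in UTF-16 the pair costs four code units, i.e. the 64 bits mentioned above. With the high range stopping at U+BFFDD the largest few values of the 31-bit range do not fit, so 0x7FFFFFFF is rejected by this sketch.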

