Re: Code pages and Unicode

Richard Wordingham Wed, 24 Aug 2011 19:54:39 -0700

On Wed, 24 Aug 2011 17:07:03 -0700
Ken Whistler <[email protected]> wrote:


> > <Snip>  The
> > BMP is littered with concessions to the limitations of rendering
> > systems - precomposed characters, Hangul syllables and Arabic
> > presentation forms are the most significant.
 
> Those are not concessions to "the limitations of rendering systems"
> -- they are concessions to the need to stay compatible with the
> character encodings of legacy systems, which had limitations for
> their rendering systems.

Which earlier coding system supported Welsh?  (I'm thinking of 'W WITH
CIRCUMFLEX', U+0174 and U+0175.)  How was the use of the canonical
decompositions incompatible with the character encodings of legacy
systems?  Unicode has the same codes as ISO-8859-1, but that's as far
as having the same codes goes. Was the use of combining jamo
incompatible with legacy Hangul encodings?

> > > > I think, however, that <high><high><rare BMP code><low>
> > > > offers a legitimate extension mechanism
> > > One could argue about the description as "legitimate". It is
> > > clearly not conformant,

> In whichever encoding form you choose to specify, the sequence
> <high><high> is non-conformant. Not merely a possibly new type of
> code unit sequence.
> 
> <D800 D800> is non-conformant UTF-16
> 
> <0000D800 0000D800> is non-conformant UTF-32
> 
> <ED A0 80 ED A0 80> is non-conformant UTF-8

<high><low> is also non-conformant UTF-8 and UTF-32.

Obviously <D800 D800 000E DC00> is non-conformant with current UTF-16.
Remembering that there is a guarantee that there will be no more
surrogate points, an extension form has to be non-conformant with
current UTF-16!
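
For anyone who wants to check the sequences above against a real
decoder, here is a minimal Python sketch.  It assumes CPython 3.4 or
later, whose strict built-in codecs reject surrogate code points in
UTF-8 and UTF-32.  The helper name is_conformant is mine, and
"conformant" here only means "the strict decoder accepts it".

```python
def is_conformant(hex_units: str, encoding: str) -> bool:
    """Return True if the strict built-in decoder accepts the bytes."""
    try:
        bytes.fromhex(hex_units).decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# <high><high> in each encoding form
high_high = [
    is_conformant("D800 D800", "utf-16-be"),
    is_conformant("0000D800 0000D800", "utf-32-be"),
    is_conformant("ED A0 80 ED A0 80", "utf-8"),
]

# <high><low> is only well-formed as UTF-16 code units
high_low = [
    is_conformant("D800 DC00", "utf-16-be"),
    is_conformant("0000D800 0000DC00", "utf-32-be"),
    is_conformant("ED A0 80 ED B0 80", "utf-8"),
]

# The <high><high><rare BMP code><low> example from this thread
extension = is_conformant("D800 D800 000E DC00", "utf-16-be")
```

Only the UTF-16 <high><low> case should come back as acceptable.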

> > > I see no chance of that happening for either the Unicode
> > > Standard or 10646.
> > It will only happen when the need becomes obvious, which may be
> > never, or may be 30 years hence.  It's even conceivable that UTF-16
> > will drop out of use.
> 
> Could happen. It still doesn't matter, because such a proposal also
> breaks UTF-8 and UTF-32.

Everyone should know how to extend UTF-8 and UTF-32 to cover the 31-bit
range.  Just go back to the old ISO 10646 definitions!  UTF-16 is the
problem.
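
To make the UTF-8 half of that concrete, here is a rough Python sketch
of the old-style encoding, in which the forms ran to six bytes (the
layout RFC 2279 described).  The function name encode_utf8_31bit is
hypothetical, and this is an illustration rather than anything a
current standard sanctions.  For UTF-32 nothing needs changing beyond
lifting the 0x10FFFF ceiling.

```python
def encode_utf8_31bit(n: int) -> bytes:
    """Encode a 31-bit number with the old 1..6 byte UTF-8 forms."""
    if not 0 <= n <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if n < 0x80:
        return bytes([n])
    # (total length, exclusive upper limit, lead-byte marker)
    forms = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for length, limit, lead in forms:
        if n < limit:
            break
    out = []
    # Peel off six payload bits per continuation byte, low bits first.
    for _ in range(length - 1):
        out.append(0x80 | (n & 0x3F))
        n >>= 6
    out.append(lead | n)  # whatever is left goes in the lead byte
    return bytes(reversed(out))
```

Up to U+10FFFF this agrees with today's UTF-8 (surrogates aside, which
this sketch does not reject); beyond that it simply keeps going.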

Past suggestions have included a new set of surrogate points, which
would restrict the numbers that could represent characters.  One might,
for instance, allocate U+B0000 to U+BFFFF to 'high extended surrogates'
and U+C0000 to U+C7FFF to 'low extended surrogates'.  That's a lot of
codepoints to spend so that a 31-bit number can be expressed in 64 bits,
and the scheme could easily be rendered impossible by a few random
assignments in those ranges.
(Using three surrogates would be more economical in codepoints - one
could even do <high1><high2><low3><low4> with high2 having a restricted
range taking out just 2^11 codepoints from the supplementary planes.)
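
The arithmetic of the two-range scheme can be sketched in Python.  I am
assuming the high range carries 16 bits (2^16 values starting at
U+B0000) and the low range 15 bits (2^15 values starting at U+C0000);
the function names are made up for illustration, and no such
allocation exists.

```python
HIGH_BASE = 0xB0000  # assumed: 2**16 'high extended surrogates'
LOW_BASE = 0xC0000   # assumed: 2**15 'low extended surrogates'


def to_extended_pair(n: int) -> tuple:
    """Split a 31-bit number into (high, low) supplementary code points."""
    if not 0 <= n < 2**31:
        raise ValueError("value does not fit in 31 bits")
    return HIGH_BASE + (n >> 15), LOW_BASE + (n & 0x7FFF)


def from_extended_pair(high: int, low: int) -> int:
    """Inverse of to_extended_pair."""
    return ((high - HIGH_BASE) << 15) | (low - LOW_BASE)


def utf16_bits(pair: tuple) -> int:
    """Size of the pair in bits when written as ordinary UTF-16."""
    text = "".join(chr(cp) for cp in pair)
    return 8 * len(text.encode("utf-16-be"))


pair = to_extended_pair(0x7FFFFFFF)
bits = utf16_bits(pair)
```

Each half is itself a supplementary code point, so it costs an
ordinary surrogate pair in UTF-16, which is where the 64 bits come
from.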

Andrew reasonably asked whether an extension *could* be done without
creating more surrogates.  All the solutions we've thought of
affect searching for a single character - using an ISO 2022 escape code
is probably the worst of them from this point of view.

Richard.
