Re: Code pages and Unicode

Richard Wordingham Mon, 22 Aug 2011 15:45:11 -0700

On Mon, 22 Aug 2011 14:06:00 +0100 (BST)
William_J_G Overington <[email protected]> wrote:

> On Monday 22 August 2011, Andrew West <[email protected]> wrote:
>  
> > Can anyone think of a way to extend UTF-16 without adding new
> > surrogates or inventing a new general category?
> > 
> > Andrew
>  
> How about a triple sequence of two high surrogates followed by one
> low surrogate? 

The problem is that a search for the character represented by the code
unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3).
While there is no ambiguity, it does make searching more complicated
to code.  The same issue applies to the suggestion of using
(H1,H2,L3,L4) sequences.
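
To make the searching problem concrete, here is a minimal sketch in Python. It assumes the hypothetical triple scheme above (two high surrogates followed by one low surrogate), models text as a list of 16-bit code units, and uses made-up helper names (naive_find, boundary_aware_find). It is only an illustration, not a proposal for how any real library does or should search.

```python
# Hypothetical scheme: (H1, H2, L3) encodes ONE extended character,
# while (H2, L3) on its own is an ordinary surrogate pair.

def is_high(unit):
    """True if the code unit is a high (leading) surrogate."""
    return 0xD800 <= unit <= 0xDBFF

def naive_find(text, pattern):
    """Plain code-unit substring search: every index where pattern occurs."""
    n = len(pattern)
    return [i for i in range(len(text) - n + 1) if text[i:i + n] == pattern]

def boundary_aware_find(text, pattern):
    """Drop hits where a pattern starting with a high surrogate is
    preceded by another high surrogate, since under the assumed scheme
    that hit lies inside an (H1, H2, L3) triple."""
    return [i for i in naive_find(text, pattern)
            if not (is_high(pattern[0]) and i > 0 and is_high(text[i - 1]))]

H1, H2, L3 = 0xD800, 0xD801, 0xDC37   # arbitrary example values

# 'A', the pair (H2, L3), 'B', then the triple (H1, H2, L3)
text = [0x0041, H2, L3, 0x0042, H1, H2, L3]
pattern = [H2, L3]

naive_hits = naive_find(text, pattern)           # includes a false hit inside the triple
aware_hits = boundary_aware_find(text, pattern)  # only the genuine pair
print(naive_hits, aware_hits)
```

The extra look-behind is exactly the complication: with UTF-16 as it stands, a plain code-unit search for a well-formed pair cannot land inside another character.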

Now, we could use (H1,H2,L3,L4) sequences and never assign the (H2,L3)
combinations.  They would therefore be category Cn, which currently
consists of both the unassigned code points and the noncharacters.
However, I can't help feeling that they'd be almost a sort of
surrogate.  It's slightly more efficient to replace L3 by a single BMP
character.

Practically, I think that if we can change the semantics of the Myanmar
script, our descendants can go back on the guarantee of no more
surrogates.

Richard.
