On Mon, 22 Aug 2011 14:06:00 +0100 (BST) William_J_G Overington <[email protected]> wrote:
> On Monday 22 August 2011, Andrew West <[email protected]> wrote:
>
> > Can anyone think of a way to extend UTF-16 without adding new
> > surrogates or inventing a new general category?
> >
> > Andrew
>
> How about a triple sequence of two high surrogates followed by one
> low surrogate?

The problem is that a search for the character represented by the code
unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3). While
there is no ambiguity, it does make searching more complicated to code.
The same issue applies to the suggestion of using (H1,H2,L3,L4)
sequences.

Now, we could use (H1,H2,L3,L4) sequences and never assign the (H2,L3)
combinations. They would therefore be category Cn, which currently
consists of both the unassigned characters and the non-characters.
However, I can't help feeling that they'd be almost a sort of surrogate.
It's slightly more efficient to replace L3 by a single BMP character.

Practically, I think that if we can change the semantics of the Myanmar
script, our descendants can go back on the guarantee of no more
surrogates.

Richard.
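
The search problem with (H1,H2,L3) can be shown with a minimal sketch in
Python. It assumes a purely hypothetical triple encoding (nothing like it
exists in UTF-16 today) and models text as a list of 16-bit code units.
The helper names naive_find and boundary_aware_find are made up for this
illustration. A plain code-unit search for the pair (H2,L3) also hits the
tail of the triple; rejecting a match that is preceded by another high
surrogate is one possible fix, and it is exactly the extra check that
makes searching more complicated to code.

```python
def is_high(unit):
    """True if the code unit is a high (lead) surrogate."""
    return 0xD800 <= unit <= 0xDBFF


def naive_find(units, needle):
    """Return every index where needle occurs as a raw code-unit run."""
    n = len(needle)
    return [i for i in range(len(units) - n + 1) if units[i:i + n] == needle]


def boundary_aware_find(units, needle):
    """Like naive_find, but skip hits that are the tail of a longer sequence.

    Assumption (hypothetical scheme): a high surrogate directly preceded
    by another high surrogate belongs to an (H1,H2,L3) triple, so a match
    starting there is not a real occurrence of the (H2,L3) character.
    """
    hits = []
    for i in naive_find(units, needle):
        if is_high(needle[0]) and i > 0 and is_high(units[i - 1]):
            continue  # match begins inside a hypothetical triple
        hits.append(i)
    return hits


H1, H2, L3 = 0xD800, 0xD801, 0xDC00

# 'A', the hypothetical triple (H1,H2,L3), 'B', then the ordinary pair (H2,L3)
text = [0x41, H1, H2, L3, 0x42, H2, L3]
needle = [H2, L3]

print(naive_find(text, needle))           # [2, 5]: index 2 is a false hit
print(boundary_aware_find(text, needle))  # [5]
```

The same look-behind would be needed for (H1,H2,L3,L4) sequences, unless
the (H2,L3) combinations are never assigned, in which case no valid
search pattern could collide with them.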

