----- Original Message -----
From: "Jill Ramonsky" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, October 16, 2003 4:35 PM
Subject: UTF-16 Beyond U+10FFFF (was: Java char and Unicode 3.0+)
> Here's an alternative idea.
>
> In UTF-16, as it's currently defined, codepoints in the range U+010000
> to U+10FFFF are represented as some High Surrogate (HS) followed by some
> Low Surrogate (LS). Also, as currently defined, any HS not followed by
> an LS, or an LS not preceded by an HS, is illegal.
>
> So, to create even higher codepoints still, all you have to do is use
> some currently illegal sequences. For example:
>
> HS + LS => 10 bits from HS plus 10 bits from LS (as now)
> [This gives a range of 0x00000 to 0xFFFFF, to which we add 0x10000
> giving an actual range of U+10000 to U+10FFFF]
>
> HS + HS + LS => 10 bits from first HS plus 10 bits from second HS plus
> 10 bits from LS
> [This gives a range of 0x00000000 to 0x3FFFFFFF, to which we can add
> 0x110000 giving an actual range of U+110000 to U+4010FFFF]
>
> HS + HS + HS + LS => 10 bits from first HS plus 10 bits from second HS
> plus 10 bits from third HS plus 10 bits from LS
> [This gives a range of 0x0000000000 to 0xFFFFFFFFFF, to which we can add
> 0x40110000 giving an actual range of U+40110000 to U+1004010FFFF]

I don't like this idea: there's a performance penalty when parsing from random positions if the position points to an HS code unit: one has to scan backward to find the start of the sequence (this is effectively the case with UTF-8, but not with UTF-16, where a single read is enough to tell whether the current code unit starts an encoding sequence). I would frankly prefer a solution based on "hyper-surrogates" allocated outside the BMP, with a pair of existing UTF-16 surrogates encoding each hyper-surrogate (reserved, for example, in the special plane 14).
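For concreteness, the quoted scheme can be sketched as a decoder (the function name and shape are mine; only the bit layout and the per-length offsets come from the proposal above):

```python
def decode_proposed(units):
    """Decode one extended sequence: n-1 high surrogates plus one low
    surrogate, each code unit contributing 10 bits, plus an offset that
    depends on the sequence length (per the quoted proposal)."""
    # High surrogates occupy 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF.
    assert all(0xD800 <= u <= 0xDBFF for u in units[:-1])
    assert 0xDC00 <= units[-1] <= 0xDFFF

    value = 0
    for hs in units[:-1]:
        value = (value << 10) | (hs - 0xD800)   # 10 bits per high surrogate
    value = (value << 10) | (units[-1] - 0xDC00)  # 10 bits from the low surrogate

    # Offsets from the proposal: 0x10000 for an ordinary pair,
    # 0x110000 for HS+HS+LS, 0x40110000 for HS+HS+HS+LS.
    offset = {2: 0x10000, 3: 0x110000, 4: 0x40110000}[len(units)]
    return value + offset

# Ordinary surrogate pair, unchanged from standard UTF-16:
assert decode_proposed([0xD800, 0xDC00]) == 0x10000
assert decode_proposed([0xDBFF, 0xDFFF]) == 0x10FFFF
# Three- and four-unit sequences reach the extended ranges:
assert decode_proposed([0xDBFF, 0xDBFF, 0xDFFF]) == 0x4010FFFF
assert decode_proposed([0xDBFF, 0xDBFF, 0xDBFF, 0xDFFF]) == 0x1004010FFFF
```

The objection is visible here: a lone high surrogate read at a random offset could be the first, second, or third unit of its sequence, so the decoder cannot resynchronize without scanning backward, whereas in standard UTF-16 a single read distinguishes sequence starts from trailing low surrogates.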

