----- Original Message -----
From: "Jill Ramonsky" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, October 16, 2003 4:35 PM
Subject: UTF-16 Beyond U+10FFFF (was: Java char and Unicode 3.0+)
> Here's an alternative idea.
>
> In UTF-16, as it's currently defined, codepoints in the range U+010000
> to U+10FFFF are represented as some High Surrogate (HS) followed by some
> Low Surrogate (LS). Also, as currently defined, any HS not followed by
> an LS, or an LS not preceded by an HS, is illegal.
>
> So, to create even higher codepoints still, all you have to do is use
> some currently illegal sequences. For example:
>
> HS + LS => 10 bits from HS plus 10 bits from LS (as now)
> [This gives a range of 0x00000 to 0xFFFFF, to which we add 0x10000
> giving an actual range of U+10000 to U+10FFFF]
>
> HS + HS + LS => 10 bits from first HS plus 10 bits from second HS plus
> 10 bits from LS
> [This gives a range of 0x00000000 to 0x3FFFFFFF, to which we can add
> 0x110000 giving an actual range of U+110000 to U+4010FFFF]
>
> HS + HS + HS + LS => 10 bits from first HS plus 10 bits from second HS
> plus 10 bits from third HS plus 10 bits from LS
> [This gives a range of 0x0000000000 to 0xFFFFFFFFFF, to which we can add
> 0x40110000 giving an actual range of U+40110000 to U+1004010FFFF]

I don't like this idea: there's a performance penalty when parsing from random positions if the position points to an HS code unit: one has to scan backward to find the start of the sequence (this is effectively the case with UTF-8, but not with UTF-16, where a single read is enough to tell whether the current code unit starts an encoding sequence). I would frankly prefer a solution based on "hyper-surrogates" allocated outside the BMP, with a pair of existing UTF-16 surrogates encoding each hyper-surrogate (reserved, for example, in the special plane 14).
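For concreteness, the quoted scheme can be sketched as a decoder (the function name and shape are mine; only the bit layout and the per-length offsets come from the proposal above):

```python
def decode_proposed(units):
    """Decode one extended sequence: n-1 high surrogates plus one low
    surrogate, each code unit contributing 10 bits, plus an offset that
    depends on the sequence length (per the quoted proposal)."""
    # High surrogates occupy 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF.
    assert all(0xD800 <= u <= 0xDBFF for u in units[:-1])
    assert 0xDC00 <= units[-1] <= 0xDFFF

    value = 0
    for hs in units[:-1]:
        value = (value << 10) | (hs - 0xD800)   # 10 bits per high surrogate
    value = (value << 10) | (units[-1] - 0xDC00)  # 10 bits from the low surrogate

    # Offsets from the proposal: 0x10000 for an ordinary pair,
    # 0x110000 for HS+HS+LS, 0x40110000 for HS+HS+HS+LS.
    offset = {2: 0x10000, 3: 0x110000, 4: 0x40110000}[len(units)]
    return value + offset

# Ordinary surrogate pair, unchanged from standard UTF-16:
assert decode_proposed([0xD800, 0xDC00]) == 0x10000
assert decode_proposed([0xDBFF, 0xDFFF]) == 0x10FFFF
# Three- and four-unit sequences reach the extended ranges:
assert decode_proposed([0xDBFF, 0xDBFF, 0xDFFF]) == 0x4010FFFF
assert decode_proposed([0xDBFF, 0xDBFF, 0xDBFF, 0xDFFF]) == 0x1004010FFFF
```

The objection is visible here: a lone high surrogate read at a random offset could be the first, second, or third unit of its sequence, so the decoder cannot resynchronize without scanning backward, whereas in standard UTF-16 a single read distinguishes sequence starts from trailing low surrogates.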

