Beyond 17 planes, was: Java char and Unicode 3.0+

Peter Kirk Thu, 16 Oct 2003 08:57:55 -0700

On 16/10/2003 06:33, Philippe Verdy wrote:

From: "John Cowan" <[EMAIL PROTECTED]>

Philippe Verdy scripsit:

I am also doubting, but I would not bet on it. After all, when Unicode started, a single plane was considered waaaaaay more than sufficient too.

I not only would bet on it, I actually have a bet on it. Henry Thompson of the W3C's Schema WG bet me that we'd outrun the existing planes within five years; four left to go and no sign of it, even if Michael Everson were to achieve pluripresence and actually get everything accepted into the standard that he knows needs to be done.
Just in case it is ever needed, are you keeping an unassigned range
in the BMP so that extension remains possible while preserving upward
compatibility with UTF-16, which is currently the main reason why
only 17 planes are defined?
(For example, in the range between the Hangul syllables and the existing surrogates.)

...

I would guess not. I can think of much more useful things to do with any remaining space in the BMP. Anyway, the space you mention, if used for additional high-half or low-half surrogates, is only 80 characters wide and so would give just slightly more than one more plane: 80 x 1024 = 81,920 characters, or 1.25 planes. And it is the largest space on the BMP which is not already roadmapped.
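
To make that arithmetic concrete, here is a rough sketch in Python. The names are made up for illustration, and the figure of 80 spare code points is simply the one quoted above, not something the code derives.

```python
# Back-of-the-envelope surrogate arithmetic (illustrative names only).
PLANE_SIZE = 0x10000                 # 65,536 code points per plane
HIGH_SURROGATES = 0xDC00 - 0xD800    # 1024 existing high-half surrogates
LOW_SURROGATES = 0xE000 - 0xDC00     # 1024 existing low-half surrogates
SPARE_BMP_GAP = 80                   # width of the gap, as quoted in the text


def planes_from_pairs(high_count, low_count):
    """Number of planes addressable by high_count x low_count surrogate pairs."""
    return high_count * low_count / PLANE_SIZE


# Today: 1024 x 1024 pairs cover the 16 supplementary planes (17 with the BMP).
current_supplementary = planes_from_pairs(HIGH_SURROGATES, LOW_SURROGATES)

# If the gap became extra high-half surrogates paired with the existing low half.
extra_from_gap = planes_from_pairs(SPARE_BMP_GAP, LOW_SURROGATES)
```

Under these assumptions `current_supplementary` comes to 16 planes and `extra_from_gap` to 1.25 planes, which is why the gap buys so little.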

I suppose that, in the unlikely event that in the foreseeable future it looks as if more than 17 planes might become necessary, and anyone is still trying to use UTF-16 (although by that time memory and bandwidth will probably be so cheap that no one bothers any more with encodings that save them), it will be possible to reserve part of the 17 planes for surrogate pairs representing the remaining planes. So the UTF-16 encoding would be two existing surrogate pairs (four 16-bit code units) forming a higher-level surrogate pair. UTF-32 would of course be more efficient (32 bits rather than 64), but I doubt if anyone will care.
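
A minimal sketch of how such a two-level scheme might look, in Python. Everything here is hypothetical: no such ranges are reserved in Unicode, and `HYPER_HIGH_BASE`, `HYPER_LOW_BASE`, `encode_hyper` and `decode_hyper` are names invented for the illustration. It assumes, purely for the sake of the example, that two whole currently unassigned planes are set aside, one for the upper 16 bits and one for the lower 16 bits of the value.

```python
# Hypothetical two-level ("surrogate of surrogates") UTF-16 scheme.
# The two base values are invented for this sketch; Unicode reserves no
# such ranges.
HYPER_HIGH_BASE = 0xC0000   # assumed plane carrying the upper 16 bits
HYPER_LOW_BASE = 0xD0000    # assumed plane carrying the lower 16 bits


def utf16_units(cp):
    """Ordinary UTF-16 code units for a code point in 0..0x10FFFF."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def encode_hyper(value):
    """Encode a 32-bit value as two supplementary code points (four units)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    high = HYPER_HIGH_BASE + (value >> 16)
    low = HYPER_LOW_BASE + (value & 0xFFFF)
    return utf16_units(high) + utf16_units(low)


def decode_hyper(units):
    """Inverse of encode_hyper; expects exactly four 16-bit code units."""
    def pair(hi, lo):
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

    high = pair(units[0], units[1])
    low = pair(units[2], units[3])
    return ((high - HYPER_HIGH_BASE) << 16) | (low - HYPER_LOW_BASE)
```

Each value costs four 16-bit units, i.e. 64 bits, against 32 bits for a plain fixed-width encoding, which is the inefficiency mentioned above.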

If two whole planes were reserved for such surrogates, this mechanism could cover the whole 32-bit hyperspace (65,536 x 65,536 pairs). Meanwhile UTF-8 can be extended to 6 bytes (byte 1 being 1111110x), as in its original definition, which covers 31 bits of that space. Plenty of room there to encode not just all the scripts of the Galactic Federation but even to squeeze in those of the Klingons and their allies!
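
For the UTF-8 side, here is a small sketch in Python of the longer byte patterns from the original UTF-8 definition, where a 5-byte sequence leads with 111110xx and a 6-byte sequence with 1111110x. The function name `encode_utf8_extended` is made up for this example, and note that the 6-byte form reaches 31 bits, not the full 32. Present-day standard UTF-8 stops at 4 bytes and U+10FFFF, so this is not what any current encoder emits.

```python
def encode_utf8_extended(cp):
    """Encode cp with the original 1-to-6-byte UTF-8 pattern (up to 31 bits)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    for limit, lead, n_cont in (
        (0x800, 0xC0, 1),        # 110xxxxx
        (0x10000, 0xE0, 2),      # 1110xxxx
        (0x200000, 0xF0, 3),     # 11110xxx
        (0x4000000, 0xF8, 4),    # 111110xx  (5-byte form)
        (0x80000000, 0xFC, 5),   # 1111110x  (6-byte form)
    ):
        if cp < limit:
            break
    # Lead byte carries the top bits; each continuation byte carries 6 bits.
    out = [lead | (cp >> (6 * n_cont))]
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)
```

For code points up to U+10FFFF the result matches ordinary UTF-8; beyond that the sketch simply continues the same bit pattern.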

Or perhaps a way can be found to graciously retire UTF-16 in some distant future version of Unicode. That is likely to become viable long before the extra planes are needed.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
