Re: Thoughts on upsizing Unicode

Dominikus Dittes Scherkl via Unicode Fri, 03 Apr 2026 03:50:32 -0700

On 03.04.26 at 10:05, Rebecca Bettencourt via Unicode wrote:

My idea is to reallocate the Private Use High Surrogates to three- and four-surrogate sequences using a UTF-8-like encoding, so that UTF-16 text without PUA characters remains unchanged. Only SPUA-A, SPUA-B, and non-UTF16 characters would use new encoding forms.

[...]

Of course, this is all just for speculation, as we are at least a couple hundred years away from this actually being a problem.

This is all much too complicated. There is a much better solution that wouldn't change any of the existing code points and would also allow for relatively short encodings:

For UTF-8, simply use the now-forbidden sequences starting with 0xFC or 0xFD (maybe not 0xFE and 0xFF, as these are sometimes misused for encoding detection). For ease of description, let's say they introduce a 6-byte sequence: the first byte encodes only one bit (0xFC or 0xFD) and the five follow-up bytes encode 6 bits each, together using up the full 31-bit range of UTF-32.
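A minimal Python sketch of that 6-byte form may make the bit layout concrete. The function names, the range check, and the rejection of values at or below 0x10FFFF are illustrative assumptions, not part of any standard or of the proposal itself.

```python
# Sketch of the 6-byte form described above. All names and the exact
# validation rules are assumptions made for illustration only.
FIRST_EXTENDED = 0x110000    # first code point above the current limit
LAST_EXTENDED = 0x7FFFFFFF   # largest 31-bit value


def encode_utf8_6byte(cp: int) -> bytes:
    """Encode a code point above 0x10FFFF as a lead byte plus five continuation bytes."""
    if not FIRST_EXTENDED <= cp <= LAST_EXTENDED:
        raise ValueError(f"code point out of range for the 6-byte form: {cp:#x}")
    lead = 0xFC | (cp >> 30)  # 0xFC or 0xFD: carries the top bit (bit 30)
    # Five continuation bytes of the form 10xxxxxx, 6 payload bits each.
    tail = [0x80 | ((cp >> shift) & 0x3F) for shift in (24, 18, 12, 6, 0)]
    return bytes([lead, *tail])


def decode_utf8_6byte(data: bytes) -> int:
    """Decode one 6-byte sequence produced by encode_utf8_6byte."""
    if len(data) != 6 or data[0] not in (0xFC, 0xFD):
        raise ValueError("not a 6-byte sequence starting with 0xFC or 0xFD")
    cp = data[0] & 0x01  # the single bit stored in the lead byte
    for byte in data[1:]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("expected a continuation byte (10xxxxxx)")
        cp = (cp << 6) | (byte & 0x3F)
    if cp < FIRST_EXTENDED:
        # Assumption: values that fit the existing encoding are rejected here.
        raise ValueError("value is representable in the existing encoding")
    return cp
```

For example, `encode_utf8_6byte(0x110000).hex()` gives `'fc8084908080'`, and decoding those six bytes returns `0x110000` again.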


For UTF-16, simply use the now-forbidden surrogate sequences as already suggested (3 surrogates, Hi-Hi-Lo or Hi-Lo-Lo, each encoding 10 bits, plus one bit decided by using Hi or Lo in the middle, for 31 bits in total).
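The three-surrogate form can be sketched in Python as well. The description above does not say which bits go where, so the layout used here is purely an assumption: bit 30 selects the kind of the middle unit (0 for Hi, 1 for Lo), and the three units carry bits 29-20, 19-10, and 9-0 in that order. The function names are hypothetical too.

```python
# Sketch of the Hi-Hi-Lo / Hi-Lo-Lo form. The bit assignment is an
# assumption made for illustration; the proposal does not specify it.
HI_BASE = 0xD800   # high surrogates: 0xD800..0xDBFF
LO_BASE = 0xDC00   # low surrogates:  0xDC00..0xDFFF


def encode_utf16_triple(cp: int) -> list[int]:
    """Encode a code point above 0x10FFFF as three 16-bit surrogate code units."""
    if not 0x110000 <= cp <= 0x7FFFFFFF:
        raise ValueError(f"code point out of range for the triple form: {cp:#x}")
    first = HI_BASE | ((cp >> 20) & 0x3FF)           # bits 29..20, always Hi
    middle_base = LO_BASE if (cp >> 30) & 1 else HI_BASE
    middle = middle_base | ((cp >> 10) & 0x3FF)      # bits 19..10, kind = bit 30
    last = LO_BASE | (cp & 0x3FF)                    # bits 9..0, always Lo
    return [first, middle, last]


def decode_utf16_triple(units: list[int]) -> int:
    """Decode three code units produced by encode_utf16_triple."""
    if len(units) != 3:
        raise ValueError("expected exactly three code units")
    first, middle, last = units
    if not 0xD800 <= first <= 0xDBFF:
        raise ValueError("first unit must be a high surrogate")
    if not 0xD800 <= middle <= 0xDFFF:
        raise ValueError("middle unit must be a surrogate")
    if not 0xDC00 <= last <= 0xDFFF:
        raise ValueError("last unit must be a low surrogate")
    top_bit = 1 if middle >= LO_BASE else 0          # Hi or Lo in the middle
    return (
        (top_bit << 30)
        | ((first & 0x3FF) << 20)
        | ((middle & 0x3FF) << 10)
        | (last & 0x3FF)
    )
```

Under this assumed layout, `encode_utf16_triple(0x110000)` yields `[0xD801, 0xD840, 0xDC00]` (Hi-Hi-Lo) and `encode_utf16_triple(0x7FFFFFFF)` yields `[0xDBFF, 0xDFFF, 0xDFFF]` (Hi-Lo-Lo); three 16-bit units come to 6 bytes either way.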


In both cases that's 6 bytes to encode one code point above 0x10FFFF.

--

Dominikus Dittes Scherkl
