On 03.04.26 at 10:05, Rebecca Bettencourt via Unicode wrote:
> My idea is to reallocate the Private Use High Surrogates to three- and
> four-surrogate sequences using a UTF-8-like encoding, so that UTF-16
> text without PUA characters remains unchanged. Only SPUA-A, SPUA-B, and
> non-UTF16 characters would use new encoding forms.
[...]
> Of course, this is all just for speculation, as we are at least a couple
> hundred years away from this actually being a problem.
This is all much too complicated. There is a much better solution that
wouldn't change any of the existing code points and would also allow for
relatively short encodings:
For UTF-8, simply use the currently forbidden sequences starting with
0xFC or 0xFD (maybe not 0xFE and 0xFF, as these are sometimes misused
for encoding detection).
For ease of description, let's say they introduce a 6-byte sequence: the
first byte encodes only one bit (0xFC or 0xFD) and the five follow-up
bytes encode 6 bits each, together covering the full 31-bit range of
UTF-32.
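
As a rough sketch of the 6-byte form described above (Python; the
function names are made up for illustration, and rejecting values up to
0x10FFFF is my own assumption so that existing code points keep their
current encoding):

```python
def encode_ext_utf8(cp: int) -> bytes:
    """Encode a code point above 0x10FFFF as a 6-byte sequence."""
    if not 0x110000 <= cp <= 0x7FFFFFFF:
        raise ValueError("expected a code point in 0x110000..0x7FFFFFFF")
    # Lead byte 1111110x: 0xFC or 0xFD, carrying the top bit (bit 30).
    lead = 0xFC | (cp >> 30)
    # Five continuation bytes 10xxxxxx, 6 bits each (bits 29..0).
    tail = [0x80 | ((cp >> shift) & 0x3F) for shift in (24, 18, 12, 6, 0)]
    return bytes([lead, *tail])


def decode_ext_utf8(data: bytes) -> int:
    """Decode one 6-byte sequence produced by encode_ext_utf8."""
    if len(data) != 6 or data[0] not in (0xFC, 0xFD):
        raise ValueError("not a 6-byte sequence starting with 0xFC/0xFD")
    cp = data[0] & 0x01
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    if cp < 0x110000:
        # Assumption: shorter existing forms must be used below this point.
        raise ValueError("overlong encoding")
    return cp
```

With this layout the first code point beyond the current range, 0x110000,
comes out as FC 80 84 90 80 80.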
For UTF-16, simply use the currently forbidden surrogate sequences, as
already suggested (3 surrogates, Hi-Hi-Lo or Hi-Lo-Lo, each encoding 10
bits, plus one bit given by whether the middle one is Hi or Lo, for 31
bits in total).
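
The three-surrogate form could look roughly like this (Python again; the
names are hypothetical, and the bit assignment is only one possible
choice, since the description above does not fix which of the 31 bits
the Hi/Lo choice in the middle represents; here I assume it is the top
bit):

```python
def encode_ext_utf16(cp: int) -> list[int]:
    """Encode a code point above 0x10FFFF as three 16-bit surrogates."""
    if not 0x110000 <= cp <= 0x7FFFFFFF:
        raise ValueError("expected a code point in 0x110000..0x7FFFFFFF")
    # Assumed layout: bit 30 selects Hi (0) or Lo (1) for the middle unit.
    middle_base = 0xDC00 if cp >> 30 else 0xD800
    return [
        0xD800 | ((cp >> 20) & 0x3FF),       # Hi: bits 29..20
        middle_base | ((cp >> 10) & 0x3FF),  # Hi or Lo: bits 19..10
        0xDC00 | (cp & 0x3FF),               # Lo: bits 9..0
    ]


def decode_ext_utf16(units: list[int]) -> int:
    """Decode one Hi-Hi-Lo or Hi-Lo-Lo triple from encode_ext_utf16."""
    if len(units) != 3:
        raise ValueError("expected exactly three code units")
    first, middle, last = units
    if not 0xD800 <= first <= 0xDBFF:
        raise ValueError("first unit must be a high surrogate")
    if not 0xD800 <= middle <= 0xDFFF:
        raise ValueError("middle unit must be a surrogate")
    if not 0xDC00 <= last <= 0xDFFF:
        raise ValueError("last unit must be a low surrogate")
    top = 1 if middle >= 0xDC00 else 0
    return (
        (top << 30)
        | ((first & 0x3FF) << 20)
        | ((middle & 0x3FF) << 10)
        | (last & 0x3FF)
    )
```

Under this assumed layout, 0x110000 becomes D801 D840 DC00 (Hi-Hi-Lo),
and everything from 0x40000000 upwards uses the Hi-Lo-Lo shape.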
In both cases that's 6 bytes to encode one code point above 0x10FFFF.
--
Dominikus Dittes Scherkl