Re: Thoughts on upsizing Unicode

Sławomir Osipiuk via Unicode Sat, 04 Apr 2026 00:37:38 -0700

On Sat, Apr 4, 2026 at 2:18 AM Rebecca Bettencourt <[email protected]> wrote:
>
> There is no "stealing" going on. The PUA code points still exist, only now 
> they are encoded as U+110000 and above would be. (Just like when converting 
> from Latin-1 to UTF-8, U+0080 to U+00FF still exist, only now they are 
> encoded as U+0100 and above are.) And this solution does only use the 
> existing set of surrogates.


That's invalidating *existing* UTF-16 encodings of PUA characters.
From your previous message:

> 0xDBFB 0xDFFF 0xDFFF => U+07BFFFFF

But 0xDBFB 0xDFFF already encodes U+10EFFF in valid UTF-16.

> 0xDBFD 0xDC00 0xDC00 0xDC00 => U+4000000

But 0xDBFD 0xDC00 already encodes U+10F400 in valid UTF-16.
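
For anyone who wants to check the arithmetic, here is a minimal Python
sketch of the standard UTF-16 surrogate-pair decoding applied to the
two leading pairs above. The function name decode_surrogate_pair is
only an illustrative label of mine, not something from any proposal.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"0x{high:04X} is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"0x{low:04X} is not a low surrogate")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The first two code units of each sequence quoted above.
for high, low in [(0xDBFB, 0xDFFF), (0xDBFD, 0xDC00)]:
    cp = decode_surrogate_pair(high, low)
    print(f"0x{high:04X} 0x{low:04X} => U+{cp:04X}")
```

Both results land in plane 16 (U+100000 to U+10FFFD), the second
supplementary PUA plane, so these pairs are already well-formed UTF-16
today.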

You can't retroactively change the definition of UTF-16, and it's a
must-have that all valid UTF-16 encoded data be valid XTF-16 data with
the same meaning. Just because they're PUA characters does not make it
okay to change their encoding. In fact, if we *must* go with this
strategy, using the high surrogates for yet-unassigned planes would be
better than the PUA planes. PUA characters are meaningful, while
unassigned characters are not (yet). This would still be rather bad
and cause issues once those planes do get assigned, because they'd
have different encodings in UTF-16 and XTF-16, but using PUA planes
would cause that problem immediately.

> If this were an actual call for proposals this would definitely be a major 
> problem.

Respectfully, I think the problems introduced by other schemes would
be major-er.
