Re: Thoughts on upsizing Unicode

Rebecca Bettencourt via Unicode Fri, 03 Apr 2026 23:23:27 -0700

On Fri, Apr 3, 2026 at 8:54 PM Sławomir Osipiuk wrote:


> It's also wrong to "steal" PUA code points to cater to this wonky
> encoding. It's bad enough that people already perceive PUA characters
> as second-class and are reluctant to use them. The solution should be
> restricted to the existing set of surrogates only.
>

There is no "stealing" going on. The PUA code points still exist, only now
they are encoded as U+110000 and above would be. (Just like when converting
from Latin-1 to UTF-8, U+0080 to U+00FF still exist, only now they are
encoded as U+0100 and above are.) And this solution does only use the
existing set of surrogates.
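
To make the Latin-1 analogy concrete, here is a small Python sketch using
only the standard codecs. U+00E9 is an arbitrary example character; it
keeps its identity as a code point, and only its byte encoding changes to
the two-byte form that U+0100 and above also use.

```python
ch = "\u00e9"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE (example choice)

latin1_bytes = ch.encode("latin-1")  # single byte 0xE9
utf8_bytes = ch.encode("utf-8")      # two bytes 0xC3 0xA9

# U+0100 uses the same two-byte UTF-8 form (0xC4 0x80).
u0100_bytes = "\u0100".encode("utf-8")

# Decoding either encoding gives back the same code point.
assert latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8") == ch
```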

Maybe there's some confusion over code units vs code points. Both of our
solutions propose changing how UTF-16 code *units* work; neither of them
proposes changing code *points* in any way (besides allowing code points
beyond U+10FFFF).
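
For anyone who wants to see the distinction, here is a minimal Python
sketch of today's UTF-16. The helper name utf16_code_units is made up for
illustration, and U+1F600 is an arbitrary example: one code point, two
16-bit code units.

```python
def utf16_code_units(s):
    """Return the 16-bit UTF-16 code units of s (hypothetical helper)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

s = "\U0001F600"                    # one code point, outside the BMP
code_points = [ord(c) for c in s]   # [0x1F600]
units = utf16_code_units(s)         # [0xD83D, 0xDE00], a Hi-Lo surrogate pair

print([hex(cp) for cp in code_points], [hex(u) for u in units])
```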

Any solution is going to be wonky, as UTF-16 itself is a wonky encoding
to begin with. Having multiple advantages such as error detection,
substring matching, and self-synchronization is more important than whether
the solution is "simple" versus "wonky."

> If a Hi-Lo-Lo sequence is truncated, it appears as a valid Hi-Lo sequence
> for a different character, and the error is impossible to detect.
>
> I think this is a minor problem, and the most acceptable one.
>

If this were an actual call for proposals this would definitely be a major
problem. It complicates text rendering, string validation, security,
substring matching, collation, and a host of other things that are much
easier to deal with when one valid code sequence can't contain another.
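
Here is a rough Python sketch of why. The validator implements today's
UTF-16 well-formedness rules; the Hi-Lo-Lo sequence is only a stand-in with
arbitrary values, since the exact unit layout of that scheme is an
assumption here. Truncating an ordinary surrogate pair leaves a lone high
surrogate, which is caught. Truncating the Hi-Lo-Lo sequence leaves a
perfectly valid Hi-Lo pair for a different character.

```python
def is_well_formed_utf16(units):
    """Check a list of 16-bit code units against today's UTF-16 rules."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate needs a low one after it
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:  # low surrogate with no high before it
            return False
        else:
            i += 1
    return True

# Today's UTF-16: a truncated pair is detectable.
pair = [0xD83D, 0xDE00]
assert is_well_formed_utf16(pair)
assert not is_well_formed_utf16(pair[:-1])

# Hypothetical Hi-Lo-Lo sequence (values chosen arbitrarily for illustration).
hi_lo_lo = [0xD800, 0xDC00, 0xDC01]
truncated = hi_lo_lo[:-1]

# The truncated form passes validation and decodes as U+10000.
assert is_well_formed_utf16(truncated)
decoded = b"".join(u.to_bytes(2, "big") for u in truncated).decode("utf-16-be")
assert decoded == "\U00010000"
```

The same containment is what hurts substring matching: a search for the
units of U+10000 would also match inside the longer Hi-Lo-Lo character.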


> If we're talking about security, a
> truncated stream should be raising an alarm anyway.
>

That's hard to do when you can't detect it.
