On Fri, Apr 3, 2026 at 8:54 PM Sławomir Osipiuk <[email protected]> wrote:
> It's also wrong to "steal" PUA code points to cater to this wonky
> encoding. It's bad enough that people already perceive PUA characters
> as second-class and are reluctant to use them. The solution should be
> restricted to the existing set of surrogates only.

There is no "stealing" going on. The PUA code points still exist, only now they are encoded as U+110000 and above would be. (Just like when converting from Latin-1 to UTF-8, U+0080 to U+00FF still exist, only now they are encoded as U+0100 and above are.) And this solution does only use the existing set of surrogates.

Maybe there's some confusion over code units vs code points. Both of our solutions are proposing changing how UTF-16 code *units* work; neither of them is proposing changing code *points* in any way (besides allowing code points beyond U+10FFFF).

Any solution is going to be wonky, as UTF-16 itself is a wonky encoding to begin with. Having multiple advantages such as error detection, substring matching, and self-synchronization is more important than whether the solution is "simple" versus "wonky."

> If a Hi-Lo-Lo sequence is truncated, it appears as a valid Hi-Lo sequence
> for a different character, and the error is impossible to detect.
>
> I think this is a minor problem, and the most acceptable one.

If this were an actual call for proposals, this would definitely be a major problem. It complicates text rendering, string validation, security, substring matching, collation, and a host of other things that are much easier to deal with when one valid code sequence can't contain another.

> If we're talking about security, a
> truncated stream should be raising an alarm anyway.

That's hard to do when you can't detect it.
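
To make the truncation point concrete, here is a minimal Python sketch. It only uses Python's existing strict UTF-16 decoder; the Hi-Lo-Lo triple is a hypothetical stand-in with arbitrary values, not the actual mapping from either proposal, and the helper names are made up for illustration.

```python
import struct

def decode_utf16(units):
    """Decode 16-bit code units with Python's strict UTF-16 (big-endian) decoder."""
    data = b"".join(struct.pack(">H", u) for u in units)
    return data.decode("utf-16-be")  # raises UnicodeDecodeError on malformed input

def is_valid_utf16(units):
    """Return True if the code units form well-formed UTF-16 under today's rules."""
    try:
        decode_utf16(units)
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical Hi-Lo-Lo triple (assumption: values chosen arbitrarily).
HI, LO1, LO2 = 0xD83D, 0xDE00, 0xDC01

# Today's UTF-16: cutting a Hi-Lo pair short leaves a lone high surrogate,
# which a strict decoder rejects, so the truncation is detectable.
print(is_valid_utf16([HI, LO1]))       # True
print(is_valid_utf16([HI]))            # False

# A Hi-Lo-Lo triple cut short after the second unit is a well-formed Hi-Lo
# pair, so it silently decodes as a different character.
print(hex(ord(decode_utf16([HI, LO1]))))  # 0x1f600

# The untruncated triple is not well-formed under today's rules either:
# the trailing low surrogate is unpaired.
print(is_valid_utf16([HI, LO1, LO2]))  # False
```

The sketch does not decode the three-unit form itself, since that would require picking one of the proposed mappings. It only shows that the truncated prefix is indistinguishable from an ordinary surrogate pair.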
