Re: Thoughts on upsizing Unicode

stgiga via Unicode Sat, 04 Apr 2026 08:58:56 -0700

On 2026/04/04 00:31, Sławomir Osipiuk via Unicode wrote:
>On Sat, Apr 4, 2026 at 2:18 AM Rebecca Bettencourt wrote:
>>
>> There is no "stealing" going on. The PUA code points still exist, only now 
>> they are encoded as U+110000 and above would be. (Just like when converting 
>> from Latin-1 to UTF-8, U+0080 to U+00FF still exist, only now they are 
>> encoded as U+0100 and above are.) And this solution does only use the 
>> existing set of surrogates.
>
>That's invalidating *existing* UTF-16 encodings of PUA characters.
>From your previous message:
>
>> 0xDBFB 0xDFFF 0xDFFF => U+07BFFFFF
>
>But 0xDBFB 0xDFFF already encodes U+10EFFF in valid UTF-16.
>
>> 0xDBFD 0xDC00 0xDC00 0xDC00 => U+4000000
>
>But 0xDBFD 0xDC00 already encodes U+10F400 in valid UTF-16.
>
>You can't retroactively change the definition of UTF-16, and it's a
>must-have that all valid UTF-16 encoded data be valid XTF-16 data with
>the same meaning. Just because they're PUA characters does not make it
>okay to change their encoding. In fact, if we *must* go with this
>strategy, using the high surrogates for yet-unassigned planes would be
>better than the PUA planes. PUA characters are meaningful, while
>unassigned characters are not (yet). This would still be rather bad
>and cause issues once those planes do get assigned, because they'd
>have different encodings in UTF-16 and XTF-16, but using PUA planes
>would cause that problem immediately.
>
>> If this were an actual call for proposals this would definitely be a major 
>> problem.
>
>Respectfully, I think the problems introduced by other schemes would
>be major-er.
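
For anyone who wants to check the two collisions quoted above, here is a
minimal Python sketch of the standard UTF-16 surrogate pair arithmetic. The
function name decode_surrogate_pair is just an illustrative choice, not
something from any library.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Decode one UTF-16 surrogate pair into a Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two pairs from the quoted message.
print(f"U+{decode_surrogate_pair(0xDBFB, 0xDFFF):04X}")  # U+10EFFF
print(f"U+{decode_surrogate_pair(0xDBFD, 0xDC00):04X}")  # U+10F400
```

Both results land in Plane 16, the second Supplementary Private Use Area,
which is why reusing those lead units changes the meaning of existing data.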

I personally still rely on UTF-16 to make things like BWTC32Key and some OSes
and languages work, so I'm in favor of whatever does not kill it. Plus, if you
kill UTF-16 you inherently break LMBCS's Unicode mode and Punycode, and I'd
rather not break the Web. A decent amount of Asian text is also smallest in
UTF-16, so killing UTF-16 inherently angers Asia, and nobody wants to do that.
There is also a case to be made that if you deny extra characters to one UTF,
one that is still used in quite a few systems in the wild, problems ensue. So
is killing UTF-16 even ethically okay to begin with?

Because of JavaScript, most of the Web and Electron apps such as Discord need
UTF-16. If only HTML gets the characters beyond Plane 16 (above U+10FFFF) but
JavaScript does not, that is when you start to run into major problems, of
which mojibake is the tamest. So extending UTF-16 past 1114112 slots may have
utility.

After all, Bronze Script and Oracle Bone are the next candidates for Plane 3
after Seal, and then we have Mayan and Rongorongo to encode somewhere, and the
former is character-heavy. So I do see a situation where, if we encode ALL the
CJKV ancestors, then all the remaining notable historic scripts, and even some
of the niche ones, we COULD maybe have to do this. And that's not considering
that we still don't know the full truth of the past, nor what symbologies
might qualify for Unicode that have not been proposed yet.

Stuff like Visible Speech has the same qualifications as Sutton SignWriting,
and SignWriting is NOT the only conscript in Unicode. So I daresay that SOME
of the UCSUR and its relatives could qualify for Unicode if proposed in just
the right way (consider the length of the battle it took to get Legacy
Computing into Unicode and how many times the idea seemed doomed), such as the
parts about Visible Speech, the extensions to Braille Patterns that follow
older drafts of Braille and in some regards behave like Tai Xuan Jing in

On the topic of Braille and bits: 21-bit Unicode could be shoved into a 15-dot
cell laid out 3x5 (a "Braille Patterns" cell with one added to each
dimension), followed by a 6-dot cell for characters over U+7FFF. I've also
debated hooking UnifontEX (when done) up to a system that runs Punycode over
Baudot, to bring Unicode to the American TTY/TDD network so that it can
support text that isn't English or minimally Romanized.
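
The 15 + 6 = 21 bit split can be sketched in Python roughly as follows. This
is only an illustration under assumptions: the function names are made up,
and which bit goes to which dot is left undefined. The only idea carried over
is that the low 15 bits fill the 3x5 cell and the remaining 6 bits go into a
second cell, emitted only above U+7FFF.

```python
from typing import Optional, Tuple


def split_code_point(cp: int) -> Tuple[int, Optional[int]]:
    """Split a code point into a 15-bit cell and an optional 6-bit cell."""
    if not 0 <= cp <= 0x1FFFFF:  # 21 bits in total
        raise ValueError(f"does not fit in 21 bits: {cp:#x}")
    low15 = cp & 0x7FFF          # bits for the 15-dot (3x5) cell
    high6 = cp >> 15             # bits for the 6-dot cell
    # The 6-dot cell is only needed for code points over U+7FFF.
    return (low15, high6) if cp > 0x7FFF else (low15, None)


def join_cells(low15: int, high6: Optional[int] = None) -> int:
    """Inverse of split_code_point."""
    return ((high6 or 0) << 15) | low15


for cp in (0x0041, 0x1F600, 0x10FFFF):
    low15, high6 = split_code_point(cp)
    assert join_cells(low15, high6) == cp
    print(f"U+{cp:04X} -> 15-dot {low15:015b}, 6-dot {high6}")
```

Since U+10FFFF only needs the value 33 in the second cell, 6 dots leave
headroom up to U+1FFFFF.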

I know a lot of this post may seem a bit "why", but it IS well-reasoned, 
coming from someone who has been deep in the Unicode rabbit hole since 2014.

-- 
"I'm here. I'm glad you're there."

I use they/them and neopronouns.