On 2026/04/04 00:31, Sławomir Osipiuk via Unicode wrote:
>On Sat, Apr 4, 2026 at 2:18 AM Rebecca Bettencourt <[email protected]>
>wrote:
>>
>> There is no "stealing" going on. The PUA code points still exist, only now
>> they are encoded as U+110000 and above would be. (Just like when converting
>> from Latin-1 to UTF-8, U+0080 to U+00FF still exist, only now they are
>> encoded as U+0100 and above are.) And this solution does only use the
>> existing set of surrogates.
>
>That's invalidating *existing* UTF-16 encodings of PUA characters.
>From your previous message:
>
>> 0xDBFB 0xDFFF 0xDFFF => U+07BFFFFF
>
>But 0xDBFB 0xDFFF already encodes U+10EFFF in valid UTF-16.
>
>> 0xDBFD 0xDC00 0xDC00 0xDC00 => U+4000000
>
>But 0xDBFD 0xDC00 already encodes U+10F400 in valid UTF-16.
>
>You can't retroactively change the definition of UTF-16, and it's a
>must-have that all valid UTF-16 encoded data be valid XTF-16 data with
>the same meaning. Just because they're PUA characters does not make it
>okay to change their encoding. In fact, if we *must* go with this
>strategy, using the high surrogates for yet-unassigned planes would be
>better than the PUA planes. PUA characters are meaningful, while
>unassigned characters are not (yet). This would still be rather bad
>and cause issues once those planes do get assigned, because they'd
>have different encodings in UTF-16 and XTF-16, but using PUA planes
>would cause that problem immediately.
>
>> If this were an actual call for proposals this would definitely be a major
>> problem.
>
>Respectfully, I think the problems introduced by other schemes would
>be major-er.
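For anyone who wants to check the two collisions quoted above, here is a minimal Python sketch of standard UTF-16 surrogate-pair decoding. The function name is my own, not from any library, and it only covers the existing two-unit case, not any of the proposed XTF-16 extensions.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Decode one standard UTF-16 surrogate pair into a code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # 10 payload bits from each unit, offset past the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two pairs from the quoted message.
for high, low in [(0xDBFB, 0xDFFF), (0xDBFD, 0xDC00)]:
    print(f"{high:04X} {low:04X} => U+{decode_surrogate_pair(high, low):06X}")
# DBFB DFFF => U+10EFFF
# DBFD DC00 => U+10F400
```

Both results land in the Plane 16 PUA, which is the collision being described. The largest possible pair, 0xDBFF 0xDFFF, decodes to U+10FFFF, so existing UTF-16 tops out at 0x110000 = 1114112 code points.
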
I personally still rely on UTF-16 to make things like BWTC32Key and some OSes and
languages work, so I'm in favor of whatever does not kill it. Plus, if you kill
UTF-16 you inherently break LMBCS's Unicode mode and Punycode, and I'd rather not
break the Web. Also, a decent amount of Asian text is smallest in UTF-16, so
if you kill UTF-16 you inherently anger Asia, and nobody wants to do that. There
is also a case to be made that if you deny extra characters to one UTF, one that
is still used in quite a few systems in the wild, problems ensue. So is killing
UTF-16 even ethically okay to begin with? Because of JavaScript, most of the Web
and things like Electron apps, Discord included, need UTF-16. If you make it so
that only HTML has the Plane 18+ characters but JavaScript does not, that is when
you start to run into major problems, of which mojibake is the tamest. So
extending UTF-16 past 1,114,112 slots may have
utility. After all, Bronze Script and Oracle Bone are next
candidates for Plane 3 after Seal, plus then we have Mayan and Rongorongo
to encode somewhere, and the former is character-heavy. So I do see a situation
where, if we encode ALL the CJKV ancestors, then all the remaining notable
historic scripts, and even some of the niche ones, we COULD end up having to do
this. And that's not considering that we still don't know the full truth of the
past, nor which symbologies that have not been proposed yet might qualify for
Unicode. Things like Visible Speech have the same qualifications as Sutton
SignWriting, and SignWriting is NOT the only conscript in Unicode. So I daresay
that SOME of the UCSUR and its relatives could qualify for Unicode if proposed
in just the right way (consider how long the battle to get Legacy Computing
into Unicode took, and how many times the idea seemed doomed), such as the
parts about Visible Speech, the extensions to Braille Patterns that
follow older drafts of Braille and in some regards behave like Tai Xuan Jing in
On the topic of Braille and bits: 21-bit Unicode could be shoved into a 15-dot
cell using 3x5 (so a "Braille Patterns" cell with one added to each dimension),
followed by a 6-dot cell for characters over U+7FFF. Also, I've debated hooking
UnifontEX (when done) to a system that uses Punycode over Baudot to use Unicode
on the American TTY/TDD network, in order to make it support stuff that isn't
English or Romanized to the bare minimum.
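
To make the 15 + 6 = 21 bit arithmetic concrete, here is a rough Python sketch of the split. Everything specific in it is an assumption for illustration: that the low 15 bits go in the 3x5 cell and the high 6 bits in the trailing 6-dot cell, that the second cell is omitted for U+0000 to U+7FFF, and that bit i maps to dot i+1. None of that is an established Braille convention, and the function names are made up.

```python
MAX_CODE_POINT = 0x10FFFF


def to_cells(cp: int) -> tuple[int, ...]:
    """Split a code point into a 15-bit cell plus an optional 6-bit cell."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"out of Unicode range: {cp:#x}")
    low15 = cp & 0x7FFF   # fits the hypothetical 3x5 (15-dot) cell
    high6 = cp >> 15      # at most 0x21 for U+10FFFF, so it fits in 6 dots
    return (low15,) if high6 == 0 else (low15, high6)


def from_cells(cells: tuple[int, ...]) -> int:
    """Inverse of to_cells: rebuild the code point from one or two cells."""
    low15 = cells[0]
    high6 = cells[1] if len(cells) > 1 else 0
    return (high6 << 15) | low15


def raised_dots(value: int, dot_count: int) -> list[int]:
    """List raised dots, assuming bit i corresponds to dot i + 1."""
    return [i + 1 for i in range(dot_count) if (value >> i) & 1]


print(to_cells(0x41))                 # (65,)
print(to_cells(0x1F600))              # (30208, 3)
print(raised_dots(0x41, 15))          # [1, 7]
print(hex(from_cells((30208, 3))))    # 0x1f600
```

Since the two cell shapes differ physically, a reader could tell a trailing 6-dot cell from the start of the next character without an extra marker, at least under these assumptions.
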
I know a lot of this post may seem a bit "why", but it IS well-reasoned, coming
from someone who has been deep in the Unicode rabbit hole since 2014.
--
"I'm here. I'm glad you're there."
I use they/them and neopronouns.