On Fri, Apr 3, 2026 at 10:41 PM Rebecca Bettencourt via Unicode
<[email protected]> wrote:
>
> The problem with using a Hi-Lo-Lo sequence like this is that you then cannot 
> tell if a Hi-Lo sequence is the beginning of a Hi-Lo-Lo sequence without 
> looking ahead to the next code unit.

Correct: you effectively need a read buffer of three code units to
work with this encoding. A Hi-Lo-Lo run might be two characters or just
one, and you only know which after you have read the third code unit.
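
A minimal sketch of that lookahead, in Python. It only groups code
units into sequences and does not assign code points, since no bit
layout for the three-unit form has been fixed in this thread. The name
group_units and the rule "a Hi-Lo followed by a second Lo is one
three-unit sequence" are my assumptions for illustration, not a spec.

```python
def is_high(u):
    """True for a UTF-16 high (leading) surrogate."""
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    """True for a UTF-16 low (trailing) surrogate."""
    return 0xDC00 <= u <= 0xDFFF


def group_units(units):
    """Group 16-bit code units under the hypothetical Hi-Lo-Lo scheme.

    Returns a list of tuples: 1 unit (BMP or lone surrogate),
    2 units (ordinary surrogate pair), or 3 units (Hi-Lo-Lo).
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        if is_high(units[i]) and i + 1 < n and is_low(units[i + 1]):
            # The decision needs one unit of lookahead past the pair.
            if i + 2 < n and is_low(units[i + 2]):
                out.append(tuple(units[i:i + 3]))
                i += 3
            else:
                out.append(tuple(units[i:i + 2]))
                i += 2
        else:
            out.append((units[i],))
            i += 1
    return out
```

For example, group_units([0x0041, 0xD800, 0xDC00, 0x0042]) yields three
groups of 1, 2 and 1 units, while group_units([0xD800, 0xDC00, 0xDC01])
yields a single group of 3.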

> Even then there are still issues. If your text consists entirely of code 
> points U+40000000 and above, you lose self-synchronization.

There will always be some unavoidable issue. The question is which
issue we can accept as a "Fact Of Life".

IMO, self-synchronization is a very high priority. Presumably, the
ultra-high code points will make up script blocks and will occur
together in a stream.

It's also wrong to "steal" PUA code points to cater to this wonky
encoding. It's bad enough that people already perceive PUA characters
as second-class and are reluctant to use them. The solution should be
restricted to the existing set of surrogates only.

> If a Hi-Lo-Lo sequence is truncated, it appears as a valid Hi-Lo sequence for 
> a different character, and the error is impossible to detect.

I think this is a minor problem, and the most acceptable one. (Though
it IS a problem.) If your stream is getting truncated, then you're
already losing information, and one wrong character is unlikely to be
the worst of your worries. If we're talking about security, a
truncated stream should be raising an alarm anyway.

I still think using three-unit sequences starting with a
high-surrogate and ending on a low-surrogate is the least bad option,
with the most advantages and fewest disruptions, should we ever find
ourselves in a world that needs a bigger Unicode AND still uses
UTF-16.
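
To make the truncation case above concrete, here is a small Python
sketch. The three code unit values are arbitrary and chosen by me for
illustration; only the two-unit arithmetic is standard UTF-16. Cutting
the hypothetical three-unit sequence after its second unit leaves an
ordinary, well-formed surrogate pair that decodes to an unrelated
character.

```python
def decode_pair(hi, lo):
    """Standard UTF-16 surrogate pair to code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


# Hypothetical Hi-Lo-Lo sequence (illustrative values only).
full = [0xD83D, 0xDE00, 0xDC05]

# A stream cut off after the second code unit.
truncated = full[:2]

# The remainder is indistinguishable from a normal surrogate pair.
cp = decode_pair(*truncated)
print(f"U+{cp:X}")  # U+1F600
```

Nothing in the two remaining units signals that a third was expected,
which is the undetectable error described in the quoted text.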
