On Fri, Apr 3, 2026 at 10:41 PM Rebecca Bettencourt via Unicode <[email protected]> wrote:
>
> The problem with using a Hi-Lo-Lo sequence like this is that you then cannot
> tell if a Hi-Lo sequence is the beginning of a Hi-Lo-Lo sequence without
> looking ahead to the next code unit.

Correct, you effectively need a three-unit read buffer to work with this encoding. A Hi-Lo pair followed by another unit might be two characters or just one, and you only know which you have after you read the third code unit.

> Even then there are still issues. If your text consists entirely of code
> points U+40000000 and above, you lose self-synchronization.

There will always be some unavoidable issue. The question is which issue can be accepted as a "Fact Of Life". IMO, self-synchronization is a very high priority. Presumably, the ultra-high code points will make up script blocks and will occur together in a stream.

It's also wrong to "steal" PUA code points to cater to this wonky encoding. It's bad enough that people already perceive PUA characters as second-class and are reluctant to use them. The solution should be restricted to the existing set of surrogates only.

> If a Hi-Lo-Lo sequence is truncated, it appears as a valid Hi-Lo sequence for
> a different character, and the error is impossible to detect.

I think this is a minor problem, and the most acceptable one. (Though it IS a problem.) If your stream is getting truncated, then you're already losing information, and one wrong character is unlikely to be the worst of your worries. If we're talking about security, a truncated stream should be raising an alarm anyway.

I still think using three-unit sequences that start with a high surrogate and end on a low surrogate is the least bad option, with the most advantages and fewest disruptions, should we ever find ourselves in a world that needs a bigger Unicode AND still uses UTF-16.
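
To make the lookahead and the truncation case concrete, here is a rough Python sketch. It is only an illustration: the names `is_high`, `is_low`, and `segment` are made up, and since no bit layout for the extended code points has been settled, it only splits a stream of code units into sequences and does not compute any code point values.

```python
def is_high(u):
    """True if u is a UTF-16 high (lead) surrogate."""
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    """True if u is a UTF-16 low (trail) surrogate."""
    return 0xDC00 <= u <= 0xDFFF


def segment(units):
    """Split UTF-16 code units into 1-, 2-, or 3-unit sequences.

    Hypothetical scheme: Hi-Lo is an ordinary surrogate pair and
    Hi-Lo-Lo is an extended three-unit sequence. Anything else
    (including a lone surrogate) is passed through as a single unit.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        if is_high(units[i]) and i + 1 < n and is_low(units[i + 1]):
            # The pair alone is ambiguous: look one more unit ahead.
            if i + 2 < n and is_low(units[i + 2]):
                out.append(tuple(units[i:i + 3]))  # Hi-Lo-Lo
                i += 3
            else:
                out.append(tuple(units[i:i + 2]))  # plain Hi-Lo
                i += 2
        else:
            out.append((units[i],))
            i += 1
    return out


full = [0x0041, 0xD800, 0xDC00, 0xDC01, 0x0042]
truncated = full[:3]  # stream cut off after the first low surrogate

print(segment(full))
print(segment(truncated))
```

In the truncated case the last sequence comes out as the two-unit pair (0xD800, 0xDC00), which is a perfectly valid surrogate pair on its own, so the decoder has no way to notice that a unit went missing.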
