On Fri, Apr 3, 2026 at 3:48 AM Dominikus Dittes Scherkl via Unicode <
[email protected]> wrote:

> For UTF-16 simply use the now forbidden surrogate sequences as already
> suggested (3 surrogates, Hi-Hi-Lo or Hi-Lo-Lo, each encoding 10 bits, plus
> one bit decided by using Hi or Lo in the middle, for 31 bits in total).
>

The problem with using a Hi-Lo-Lo sequence like this is that you then
cannot tell if a Hi-Lo sequence is the beginning of a Hi-Lo-Lo sequence
without looking ahead to the next code unit. If a Hi-Lo-Lo sequence is
truncated, it appears as a valid Hi-Lo sequence for a different character,
and the error is impossible to detect.
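
A minimal Python sketch of that failure mode. The Hi-Lo-Lo triple below is
made up for illustration, and decode_pair is just the standard UTF-16
surrogate-pair formula:

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def decode_pair(hi, lo):
    # Standard UTF-16 surrogate pair -> code point.
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# A hypothetical three-unit Hi-Lo-Lo sequence from the proposed scheme.
triple = [0xD800, 0xDC00, 0xDC01]

# Lose the last code unit (for example, the buffer ends early).
truncated = triple[:2]

# What remains is a perfectly well-formed Hi-Lo pair, so a decoder has
# no way to know that a third unit was ever supposed to follow.
looks_valid = is_high(truncated[0]) and is_low(truncated[1])
silently_decoded = decode_pair(*truncated)
print(looks_valid, hex(silently_decoded))
```

The truncated input decodes cleanly as U+00010000, which is a different
character from the one the full triple was meant to encode.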

A small tweak to this system can solve this issue, however. Instead of
Hi-Lo-Lo and Hi-Hi-Lo sequences, use Hi-Hi-Lo and Hi-Hi-Hi sequences. Then
no valid sequence can appear at the beginning of another valid sequence,
and if a three-surrogate sequence is truncated, it appears as a Hi-Hi
sequence, which is invalid and can be detected.

Hi-Lo sequence (2^20 codepoints):
0xD800 0xDC00 => U+00010000
0xDBFF 0xDFFF => U+0010FFFF

Hi-Hi-Lo sequence (2^30 codepoints):
0xD800 0xD800 0xDC00 => invalid (overlong encoding of U+00000000)
0xD801 0xD83F 0xDFFF => invalid (overlong encoding of U+0010FFFF)
0xD801 0xD840 0xDC00 => U+00110000
0xDBFF 0xDBFF 0xDFFF => U+3FFFFFFF

Hi-Hi-Hi sequence (2^30 codepoints):
0xD800 0xD800 0xD800 => U+40000000
0xDBFF 0xDBFF 0xDBFF => U+7FFFFFFF
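
The three forms above can be sketched in Python roughly as follows. This
assumes plain bit packing (each surrogate carries 10 bits, most significant
first, with Hi-Hi-Hi adding 0x40000000), which is what the endpoint values
above imply. The names encode and decode_one are hypothetical, and this is
an illustration rather than a reference implementation:

```python
HI, LO = 0xD800, 0xDC00

def is_hi(u):
    return 0xD800 <= u <= 0xDBFF

def is_lo(u):
    return 0xDC00 <= u <= 0xDFFF

def encode(cp):
    """Code point (up to 31 bits) -> list of 16-bit code units."""
    if not 0 <= cp <= 0x7FFFFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not encodable")
    if cp < 0x10000:
        return [cp]
    if cp <= 0x10FFFF:                      # ordinary Hi-Lo pair
        v = cp - 0x10000
        return [HI + (v >> 10), LO + (v & 0x3FF)]
    a = HI + ((cp >> 20) & 0x3FF)           # bits 29..20
    b = HI + ((cp >> 10) & 0x3FF)           # bits 19..10
    c = cp & 0x3FF                          # bits 9..0
    # Bit 30 is carried by the kind of the third surrogate.
    return [a, b, (LO if cp < 0x40000000 else HI) + c]

def decode_one(units, i=0):
    """Decode one code point at index i; return (code_point, next_index)."""
    u = units[i]
    if is_lo(u):
        raise ValueError("unpaired low surrogate")
    if not is_hi(u):
        return u, i + 1
    if i + 1 >= len(units):
        raise ValueError("truncated after Hi")
    v = units[i + 1]
    if is_lo(v):                            # Hi-Lo
        return 0x10000 + ((u - HI) << 10) + (v - LO), i + 2
    if not is_hi(v):
        raise ValueError("Hi followed by a non-surrogate")
    if i + 2 >= len(units):
        raise ValueError("truncated after Hi-Hi")   # detectable error
    w = units[i + 2]
    top = ((u - HI) << 20) | ((v - HI) << 10)
    if is_lo(w):                            # Hi-Hi-Lo
        cp = top | (w - LO)
        if cp < 0x110000:
            raise ValueError("overlong encoding")
        return cp, i + 3
    if is_hi(w):                            # Hi-Hi-Hi
        return 0x40000000 | top | (w - HI), i + 3
    raise ValueError("Hi-Hi followed by a non-surrogate")
```

Because the decoder commits to a form only after seeing whether the second
and third units are Hi or Lo, a sequence cut off after Hi-Hi raises an error
instead of decoding as some other character.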

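Even then there are still issues. If your text consists entirely of code
points U+40000000 and above, every code unit is a high surrogate, so a
reader starting in the middle of the stream cannot tell where a sequence
begins, and you lose self-synchronization.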
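
A short Python sketch of the synchronization problem, assuming the same
plain bit packing as in the examples above. The stream contents and the
decode_hhh helper are made up for illustration:

```python
def is_hi(u):
    return 0xD800 <= u <= 0xDBFF

def decode_hhh(a, b, c):
    # Hi-Hi-Hi: 0x40000000 plus 10 bits from each surrogate.
    return (0x40000000 | ((a - 0xD800) << 20)
            | ((b - 0xD800) << 10) | (c - 0xD800))

# Part of a stream made up only of Hi-Hi-Hi sequences.
stream = [0xD801, 0xD802, 0xD803, 0xD804, 0xD805, 0xD806, 0xD807]

# Start reading at each of the three possible alignments. Every window
# is three high surrogates, so every alignment decodes without error.
readings = []
for offset in range(3):
    window = stream[offset:offset + 3]
    assert all(is_hi(u) for u in window)
    readings.append(decode_hhh(*window))

print([hex(cp) for cp in readings])
```

All three alignments yield a well-formed but different code point, so
nothing in the code units themselves marks where a sequence starts.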

-- Rebecca Bettencourt
