On Fri, Apr 3, 2026 at 3:48 AM Dominikus Dittes Scherkl via Unicode < [email protected]> wrote:
> For UTF-16 simply use the now forbidden surrogate sequences as already
> suggested (3 surrogates Hi-Hi-Lo or Hi-Lo-Lo each encoding 10 bits + one
> bit decided by using Hi or Lo in the middle for 31bits at all).

The problem with using a Hi-Lo-Lo sequence like this is that you cannot tell whether a Hi-Lo sequence is the beginning of a Hi-Lo-Lo sequence without looking ahead to the next code unit. If a Hi-Lo-Lo sequence is truncated, it appears as a valid Hi-Lo sequence for a different character, and the error is impossible to detect.

A small tweak to this system can solve this issue, however. Instead of Hi-Lo-Lo and Hi-Hi-Lo sequences, use Hi-Hi-Lo and Hi-Hi-Hi sequences. Then no valid sequence can appear at the beginning of another valid sequence, and if a three-surrogate sequence is truncated, it appears as a Hi-Hi sequence, which is invalid and can be detected.

Hi-Lo sequence (2^20 codepoints):
0xD800 0xDC00 => U+00010000
0xDBFF 0xDFFF => U+0010FFFF

Hi-Hi-Lo sequence (2^30 codepoints):
0xD800 0xD800 0xDC00 => invalid (overlong encoding of U+00000000)
0xD801 0xD83F 0xDFFF => invalid (overlong encoding of U+0010FFFF)
0xD801 0xD840 0xDC00 => U+00110000
0xDBFF 0xDBFF 0xDFFF => U+3FFFFFFF

Hi-Hi-Hi sequence (2^30 codepoints):
0xD800 0xD800 0xD800 => U+40000000
0xDBFF 0xDBFF 0xDBFF => U+7FFFFFFF

Even then there are still issues. If your text consists entirely of code points U+40000000 and above, every code unit is a high surrogate, so a reader starting mid-stream cannot find a sequence boundary and you lose self-synchronization.

--
Rebecca Bettencourt
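
P.S. For anyone who wants to experiment, here is a minimal Python sketch of a decoder for the Hi-Lo / Hi-Hi-Lo / Hi-Hi-Hi scheme described above. It is only an illustration, not a proposal for a reference implementation. It assumes the three 10-bit payloads are simply concatenated, most significant first, and that Hi-Hi-Hi sequences are offset by 0x40000000. The function name decode_one and the error messages are made up for this example.

```python
HI_MIN, HI_MAX = 0xD800, 0xDBFF   # high (leading) surrogates
LO_MIN, LO_MAX = 0xDC00, 0xDFFF   # low (trailing) surrogates


def is_hi(unit):
    return HI_MIN <= unit <= HI_MAX


def is_lo(unit):
    return LO_MIN <= unit <= LO_MAX


def decode_one(units, i=0):
    """Decode one code point starting at units[i].

    Returns (code_point, number_of_code_units_consumed).
    Assumed layout: each surrogate carries 10 payload bits,
    concatenated most significant first.
    """
    u0 = units[i]
    if is_lo(u0):
        raise ValueError("unpaired low surrogate")
    if not is_hi(u0):
        return u0, 1                      # ordinary BMP code unit

    if i + 1 >= len(units):
        raise ValueError("truncated sequence after Hi")
    u1 = units[i + 1]
    a = u0 - HI_MIN

    if is_lo(u1):                         # classic Hi-Lo pair
        return 0x10000 + (a << 10) + (u1 - LO_MIN), 2
    if not is_hi(u1):
        raise ValueError("Hi not followed by a surrogate")

    # Hi-Hi seen: a third surrogate is mandatory, so truncation is detectable.
    if i + 2 >= len(units):
        raise ValueError("truncated sequence after Hi-Hi")
    u2 = units[i + 2]
    b = u1 - HI_MIN

    if is_lo(u2):                         # Hi-Hi-Lo: U+110000..U+3FFFFFFF
        cp = (a << 20) | (b << 10) | (u2 - LO_MIN)
        if cp < 0x110000:
            raise ValueError("overlong Hi-Hi-Lo encoding")
        return cp, 3
    if is_hi(u2):                         # Hi-Hi-Hi: U+40000000..U+7FFFFFFF
        return 0x40000000 + ((a << 20) | (b << 10) | (u2 - HI_MIN)), 3
    raise ValueError("Hi-Hi not followed by a surrogate")
```

With these assumptions, decode_one([0xDBFF, 0xDBFF, 0xDFFF]) yields (0x3FFFFFFF, 3), while the truncated input [0xD800, 0xD800] raises an error instead of being mistaken for a shorter valid sequence. Note that the decoder only works from a known sequence boundary; it does nothing about the self-synchronization problem.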
