Fix broken UTF-8 decoding

Crystal Kolipe Sat, 25 Feb 2023 09:32:24 -0800

Currently it is not possible to use Unicode codepoints > 0xFF on the console,
because our UTF-8 decoding logic is badly broken.


The code in question is in wsemul_subr.c, wsemul_getchar().

The problem is that we calculate the number of bytes in a multi-byte
sequence by just looking at the high bits in turn:

                        if (frag & 0x20) {
                                frag &= ~0x20;
                                mbleft++;
                        }
                        if (frag & 0x10) {
                                frag &= ~0x10;
                                mbleft++;
                        }
                        if (frag & 0x08) {
                                frag &= ~0x08;
                                mbleft++;
                        }
                        if (frag & 0x04) {
                                frag &= ~0x04;
                                mbleft++;
                        }

This is wrong, for several reasons.

Firstly, for about 20 years the maximum number of bytes in a UTF-8
sequence has been four, so we shouldn't be checking 0x08 and 0x04 (or rather,
we should only check that 0x08 is 0 when 0x10 is 1, to indicate a four-byte
sequence).

Secondly, the check for 0x10 should only be performed when 0x20 is also set.
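
To illustrate the two points above, here is a minimal sketch of how the
sequence length can be read from the lead byte by comparing it against each
prefix pattern as a whole, rather than testing bits one at a time.  The
function name utf8_seq_len() is hypothetical and does not exist in
wsemul_subr.c.  It classifies by bit pattern only, so it still accepts lead
bytes such as 0xC0 or 0xF5 that a strict decoder would reject.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of wsemul_subr.c: return the total
 * length in bytes of the UTF-8 sequence introduced by lead byte b,
 * or 0 if b cannot start a sequence.
 */
static int
utf8_seq_len(uint8_t b)
{
	if ((b & 0x80) == 0x00)
		return 1;	/* 0xxxxxxx: ASCII */
	if ((b & 0xe0) == 0xc0)
		return 2;	/* 110xxxxx */
	if ((b & 0xf0) == 0xe0)
		return 3;	/* 1110xxxx */
	if ((b & 0xf8) == 0xf0)
		return 4;	/* 11110xxx: 0x08 must be clear */
	return 0;		/* continuation byte, or 0xf8-0xff */
}
```

Each mask includes the zero bit that terminates the prefix, which is what
makes the 0x10 test implicitly conditional on 0x20 being set.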

By chance, the current logic successfully decodes UTF-8 encodings of Unicode
codepoints 0x80 - 0xFF, because these don't touch bits 2-4 of the first byte.
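
As a worked example, the sketch below models the quoted bit tests and the
nested version side by side.  It is a standalone model, not the kernel code:
the struct and function names are made up, and it assumes the 0x80 and 0x40
marker bits have already been stripped from the lead byte before the tests
run.  "extra" counts only the increments made by these tests.

```c
#include <stdint.h>

struct lead {
	int	extra;		/* increments applied by the bit tests */
	uint8_t	payload;	/* codepoint bits left in the lead byte */
};

/* Model of the existing logic: every bit is tested unconditionally. */
static struct lead
lead_old(uint8_t b)
{
	/* Assumption: the caller has already removed the 0xc0 marker. */
	struct lead r = { 0, (uint8_t)(b & 0x3f) };

	if (r.payload & 0x20) { r.payload &= (uint8_t)~0x20; r.extra++; }
	if (r.payload & 0x10) { r.payload &= (uint8_t)~0x10; r.extra++; }
	if (r.payload & 0x08) { r.payload &= (uint8_t)~0x08; r.extra++; }
	if (r.payload & 0x04) { r.payload &= (uint8_t)~0x04; r.extra++; }
	return r;
}

/* Model of the nested logic: 0x10 is only tested when 0x20 was set. */
static struct lead
lead_new(uint8_t b)
{
	struct lead r = { 0, (uint8_t)(b & 0x3f) };

	if (r.payload & 0x20) {
		r.payload &= (uint8_t)~0x20;
		r.extra++;
		if (r.payload & 0x10) {
			r.payload &= (uint8_t)~0x10;
			r.extra++;
		}
	}
	return r;
}
```

U+00FF is encoded as C3 BF, and both versions leave the lead byte 0xC3 with
payload 0x03 and no extra bytes.  U+0100 is encoded as C4 80: the old tests
see 0x04 set in 0xC4, so they expect one continuation byte too many and also
discard that payload bit, whereas the nested version keeps payload 0x04.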

However, to use console fonts with more than 256 characters we need this
fixed.  I created a font with an extra glyph at position 0x100, and was able
to use it once I had applied the patch below.

The UTF-8 decoder still needs more work done on it to reject invalid
sequences such as overlong encodings and the UTF-16 surrogates.
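
For reference, a rough sketch of what such a check could look like once a
complete sequence has been decoded.  This is not part of the patch; the
function utf8_cp_valid() and its arguments are hypothetical.  It takes the
decoded codepoint and the number of bytes it was encoded in, and also
rejects values above 0x10FFFF, which is the upper limit of Unicode.

```c
#include <stdint.h>

/*
 * Hypothetical post-decode check, not part of wsemul_subr.c.
 * cp is the decoded codepoint, len the number of bytes in the
 * sequence it came from.  Returns 1 if acceptable, 0 otherwise.
 */
static int
utf8_cp_valid(uint32_t cp, int len)
{
	/* Smallest codepoint that needs a sequence of len bytes. */
	static const uint32_t min_cp[] = { 0, 0x0, 0x80, 0x800, 0x10000 };

	if (len < 1 || len > 4)
		return 0;	/* five and six byte forms are obsolete */
	if (cp < min_cp[len])
		return 0;	/* overlong encoding */
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0;	/* UTF-16 surrogate */
	if (cp > 0x10ffff)
		return 0;	/* beyond the Unicode range */
	return 1;
}
```

For example, the two-byte sequence C0 AF decodes to 0x2F, which fits in one
byte, so it is rejected as overlong.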

But it would be nice to get at least this fix in as it is trivial and allows
further experimentation with UTF-8 on the console using fonts with more than
256 glyphs.

I'll do a more detailed write-up about this at some point, but I've already
had questions off-list about "why OpenBSD doesn't support more than 256
characters in a font" since I started posting the console patches, so I
thought it would be good to get this patch out there.

--- wsemul_subr.c.dist  Fri Oct 18 19:06:41 2013
+++ wsemul_subr.c       Sat Feb 25 13:58:00 2023
@@ -125,20 +125,11 @@
                        if (frag & 0x20) {
                                frag &= ~0x20;
                                mbleft++;
+                               if (frag & 0x10) {
+                                       frag &= ~0x10;
+                                       mbleft++;
+                               }
                        }
-                       if (frag & 0x10) {
-                               frag &= ~0x10;
-                               mbleft++;
-                       }
-                       if (frag & 0x08) {
-                               frag &= ~0x08;
-                               mbleft++;
-                       }
-                       if (frag & 0x04) {
-                               frag &= ~0x04;
-                               mbleft++;
-                       }
-
                        tmpchar = frag;
                }
        }
