Re: Fix broken UTF-8 decoding

2023-03-06 Thread Miod Vallat
> Currently it is not possible to use unicode codepoints > 0xFF on the console,
> because our UTF-8 decoding logic is badly broken.
> 
> The code in question is in wsemul_subr.c, wsemul_getchar().
> 
> The problem is that we calculate the number of bytes in a multi-byte
> sequence by just looking at the high bits in turn:
> 
>   if (frag & 0x20) {
>           frag &= ~0x20;
>           mbleft++;
>   }
>   if (frag & 0x10) {
>           frag &= ~0x10;
>           mbleft++;
>   }
>   if (frag & 0x08) {
>           frag &= ~0x08;
>           mbleft++;
>   }
>   if (frag & 0x04) {
>           frag &= ~0x04;
>           mbleft++;
>   }
> 
> This is wrong, for several reasons.

Doh! Thanks for noticing this. I have replaced that code with something
much saner now.

Miod



Re: Fix broken UTF-8 decoding

2023-02-25 Thread Crystal Kolipe
On Sat, Feb 25, 2023 at 08:29:54PM +0100, Steffen Nurpmeso wrote:
> Crystal Kolipe wrote:
>  |Currently it is not possible to use unicode codepoints > 0xFF on the \
>  |console,
>  |because our UTF-8 decoding logic is badly broken.
>  |
>  |The code in question is in wsemul_subr.c, wsemul_getchar().
>  |
>  |The problem is that we calculate the number of bytes in a multi-byte
>  |sequence by just looking at the high bits in turn:
>  ...
>  |This is wrong, for several reasons.
> 
> Just to note there are also holes, UTF-8 sequences are not
> necessarily well-formed (per se -- maybe they are when you control
> their generation, of course).  It is actually a real mess

Well, I did allude to further issues in my original post:

Crystal Kolipe wrote:
> The UTF-8 decoder still needs more work done on it to reject invalid
> sequences such as over long encodings and the UTF-16 surrogates.

If people would rather wait to change this until I can fix the other issues in
the UTF-8 decoder, then fine, but with what we've got in the tree at the
moment, it's not possible to use characters beyond 0xFF.  So the only use that
we can make of unicode is to display the existing 'extended ASCII' characters
using UTF-8 sequences instead of single 8-bit characters.

With the patch I provided, it should be possible to add glyphs for non-latin
scripts, mathematical symbols, etc.
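
For example, the following codepoints all lie beyond 0xFF; their UTF-8
encodings are:

   U+0100  LATIN CAPITAL LETTER A WITH MACRON    C4 80
   U+03B1  GREEK SMALL LETTER ALPHA              CE B1
   U+4E2D  CJK UNIFIED IDEOGRAPH-4E2D            E4 B8 AD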

That in turn allows people to test userland applications with other character
sets, and to highlight and fix any issues in those applications.

And note that this doesn't add any bloat to the kernel, because all of the
functionality needed to do this is already in there (apart from the relevant
font, obviously).



Re: Fix broken UTF-8 decoding

2023-02-25 Thread Steffen Nurpmeso
Crystal Kolipe wrote:
 |Currently it is not possible to use unicode codepoints > 0xFF on the \
 |console,
 |because our UTF-8 decoding logic is badly broken.
 |
 |The code in question is in wsemul_subr.c, wsemul_getchar().
 |
 |The problem is that we calculate the number of bytes in a multi-byte
 |sequence by just looking at the high bits in turn:
 ...
 |This is wrong, for several reasons.

Just to note there are also holes: UTF-8 sequences are not
necessarily well-formed (per se -- maybe they are when you control
their generation, of course).  It is actually a real mess:

if(LIKELY(x <= 0x7Fu))
   c = x;
/* 0xF8, but Unicode guarantees maximum of 0x10u -> F4 8F BF BF.
 * Unicode 9.0, 3.9, UTF-8, Table 3-7. Well-Formed UTF-8 Byte Sequences */
else if(LIKELY(x > 0xC0u && x <= 0xF4u)){
   if(LIKELY(x < 0xE0u)){
      if(UNLIKELY(l < 1))
         goto jenobuf;
      --l;

      c = (x &= 0x1Fu);
   }else if(LIKELY(x < 0xF0u)){
      if(UNLIKELY(l < 2))
         goto jenobuf;
      l -= 2;

      x1 = x;
      c = (x &= 0x0Fu);

      /* Second byte constraints */
      x = S(u8,*cp++);
      switch(x1){
      case 0xE0u:
         if(UNLIKELY(x < 0xA0u || x > 0xBFu))
            goto jerr;
         break;
      case 0xEDu:
         if(UNLIKELY(x < 0x80u || x > 0x9Fu))
            goto jerr;
         break;
      default:
         if(UNLIKELY((x & 0xC0u) != 0x80u))
            goto jerr;
         break;
      }
      c <<= 6;
      c |= (x &= 0x3Fu);
   }else{
      if(UNLIKELY(l < 3))
         goto jenobuf;
      l -= 3;

      x1 = x;
      c = (x &= 0x07u);

      /* Third byte constraints */
      x = S(u8,*cp++);
      switch(x1){
      case 0xF0u:
         if(UNLIKELY(x < 0x90u || x > 0xBFu))
            goto jerr;
         break;
      case 0xF4u:
         if(UNLIKELY((x & 0xF0u) != 0x80u)) /* 80..8F */
            goto jerr;
         break;
      default:
         if(UNLIKELY((x & 0xC0u) != 0x80u))
            goto jerr;
         break;
      }
      c <<= 6;
      c |= (x &= 0x3Fu);

      x = S(u8,*cp++);
      if(UNLIKELY((x & 0xC0u) != 0x80u))
         goto jerr;
      c <<= 6;
      c |= (x &= 0x3Fu);
   }

   x = S(u8,*cp++);
   if(UNLIKELY((x & 0xC0u) != 0x80u))
      goto jerr;
   c <<= 6;
   c |= x & 0x3Fu;
}else
   goto jerr;
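
The 0xE0, 0xED, 0xF0 and 0xF4 special cases above implement the byte range
restrictions of Table 3-7 (Well-Formed UTF-8 Byte Sequences) referenced in
the comment near the top; for reference the well-formed sequences are:

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF  80..BF
   U+0800..U+0FFF       E0      A0..BF  80..BF
   U+1000..U+CFFF       E1..EC  80..BF  80..BF
   U+D000..U+D7FF       ED      80..9F  80..BF
   U+E000..U+FFFF       EE..EF  80..BF  80..BF
   U+10000..U+3FFFF     F0      90..BF  80..BF  80..BF
   U+40000..U+FFFFF     F1..F3  80..BF  80..BF  80..BF
   U+100000..U+10FFFF   F4      80..8F  80..BF  80..BF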

--steffen



Fix broken UTF-8 decoding

2023-02-25 Thread Crystal Kolipe
Currently it is not possible to use unicode codepoints > 0xFF on the console,
because our UTF-8 decoding logic is badly broken.

The code in question is in wsemul_subr.c, wsemul_getchar().

The problem is that we calculate the number of bytes in a multi-byte
sequence by just looking at the high bits in turn:

if (frag & 0x20) {
        frag &= ~0x20;
        mbleft++;
}
if (frag & 0x10) {
        frag &= ~0x10;
        mbleft++;
}
if (frag & 0x08) {
        frag &= ~0x08;
        mbleft++;
}
if (frag & 0x04) {
        frag &= ~0x04;
        mbleft++;
}

This is wrong, for several reasons.

Firstly, since about 20 years ago, the maximum number of bytes in a UTF-8
sequence has been four, so we shouldn't be checking 0x08 and 0x04 at all (or
rather, we should only check that 0x08 is 0 when 0x10 is 1, to indicate a
four-byte sequence).
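
In a valid four-byte lead byte the top bits are 11110xxx, i.e. 0x20 and 0x10
are both set and 0x08 is clear.  For example 0xF0, the lead byte for
codepoints 0x10000 - 0x3FFFF, is 1111 0000.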

Secondly, the check for 0x10 should only be performed when 0x20 is also set.

By chance, the current logic successfully decodes UTF-8 encodings of unicode
codepoints 0x80 - 0xFF, because their first byte, 0xC2 or 0xC3, has none of
the tested bits (0x20, 0x10, 0x08 and 0x04) set.
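
For example, codepoint 0x100 is encoded as C4 80, and the first byte 0xC4 is
1100 0100, so the 0x04 test fires: mbleft is incremented when it should not
be, leaving the decoder expecting a continuation byte that never arrives, and
a bit that is really part of the codepoint's payload is cleared.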

However, to use console fonts with more than 256 characters we need this
fixed.  I created a font with an extra glyph at position 0x100, and was able
to use it once I had applied the attached patch.

The UTF-8 decoder still needs more work done on it to reject invalid
sequences such as overlong encodings and the UTF-16 surrogates.
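
For example, a stricter decoder has to reject C0 AF, an overlong encoding of
'/' (U+002F), and ED A0 80, which would decode to the UTF-16 high surrogate
U+D800.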

But it would be nice to get at least this fix in as it is trivial and allows
further experimentation with UTF-8 on the console using fonts with more than
256 glyphs.

I'll do a more detailed write-up about this at some point, but since I started
posting the console patches I've already had questions off-list about "why
OpenBSD doesn't support more than 256 characters in a font", so I thought it
would be good to get this patch out there.

--- wsemul_subr.c.dist  Fri Oct 18 19:06:41 2013
+++ wsemul_subr.c   Sat Feb 25 13:58:00 2023
@@ -125,20 +125,11 @@
 		if (frag & 0x20) {
 			frag &= ~0x20;
 			mbleft++;
+			if (frag & 0x10) {
+				frag &= ~0x10;
+				mbleft++;
+			}
 		}
-		if (frag & 0x10) {
-			frag &= ~0x10;
-			mbleft++;
-		}
-		if (frag & 0x08) {
-			frag &= ~0x08;
-			mbleft++;
-		}
-		if (frag & 0x04) {
-			frag &= ~0x04;
-			mbleft++;
-		}
-
 		tmpchar = frag;
 	}
 }
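
With the 0x10 check nested inside the 0x20 check, mbleft ends up one higher
for a three-byte lead byte and two higher for a four-byte lead byte than for
a two-byte one, which is exactly the number of additional continuation bytes
those sequences carry, and a lead byte such as 0xC4 (U+0100) passes through
with its payload bits intact.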