On Sun, Sep 24, 2023 at 07:06:35AM +0200, Walter Alejandro Iglesias wrote:
> Hi Ingo,
> 
> On Thu, Sep 21, 2023 at 03:04:24PM +0200, Ingo Schwarze wrote:
> > In general, the tool for checking the validity of UTF-8 strings
> > is a simple loop around mblen(3) if you want to report the precise
> > positions of errors found, or simply mbstowcs(3) with a NULL pwcs
> > argument if you are content with a one-bit "valid" or "invalid" answer.
> 
> According to mbstowcs(3):
> ------------------------------------------------------------------------
> RETURN VALUES
>   mbstowcs() returns:
> 
>   0 or positive
>         The value returned is the number of elements stored in the array
>         pointed to by pwcs, except for a terminating null wide character
>         (if any).  If pwcs is not null and the value returned is equal
>         to n, the wide-character string pointed to by pwcs is not null
>         terminated.  If pwcs is a null pointer, the value returned is
>         the number of elements to contain the whole string converted,
>         except for a terminating null wide character.
> 
>   (size_t)-1  The array indirectly pointed to by s contains a byte
>               sequence forming invalid character.  In this case,
>               mbstowcs() sets errno to indicate the error.
> 
> ERRORS
>      mbstowcs() may cause an error in the following cases:
> 
>      [EILSEQ]  s points to the string containing invalid or
>                incomplete multibyte character.
> ------------------------------------------------------------------------
> 
> To understand what mbstowcs(3) does I wrote a little test.c program,
> pasted at bottom.  In the following example [a] is UTF-8 aacute and (a)
> is iso-latin aacute.
> 
> Using setlocale(LC_CTYPE, "en_US.UTF-8");
> 
>   $ cc -g -Wall test.c
>   $ echo -n arbol | a.out
>   ulen: 5
>   $ echo -n [a]rbol | a.out
>   ulen: 5
>   $ echo -n (a)rbol | a.out
>   ulen: 5

In the UTF-8 locale I can trigger an error message with your program
by sending the latin1 code for a-acute to stdin. I suppose your test
command didn't actually send latin1 to stdin for some reason?

  $ perl -e 'printf "\xe1rbol\n"' | ./a.out
  error: Illegal byte sequence

> Using setlocale(LC_CTYPE, "C");
> 
>   $ cc -g -Wall test.c
>   $ echo -n arbol | a.out
>   ulen: 5
>   $ echo -n [a]rbol | a.out
>   ulen: 6
>   $ echo -n (a)rbol | a.out
>   ulen: 7
> 
> And no error message in any case.  I don't understand in which way those
> return values let me know that the third string is invalid UTF-8.  Am I
> doing something wrong?

There is no concept of multibyte sequences in the C locale; bytes are bytes.
It is not possible to detect invalid UTF-8 via libc while running in the
C locale since the citrus code in libc won't even run. However, the various
ctype tests like isascii((unsigned char)c), isprint((unsigned char)c), and so
on can be used to filter or stub out non-ASCII characters, which is what
users running in the C locale would want.
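
A rough sketch of that kind of filter, assuming the default C locale. The
function name stub_non_ascii and the choice of '?' as the replacement are
only illustrative, and keeping newline and tab is my own assumption about
what a caller would want.

```c
#include <ctype.h>

/*
 * Replace every byte that is not printable ASCII with '?', in place.
 * Newline and tab are kept as-is (an assumption for this sketch).
 * isascii(3) is POSIX rather than ISO C.
 * (Hypothetical helper name.)
 */
static void
stub_non_ascii(char *s)
{
	unsigned char c;

	for (; *s != '\0'; s++) {
		c = (unsigned char)*s;
		if (c == '\n' || c == '\t')
			continue;
		if (!isascii(c) || !isprint(c))
			*s = '?';
	}
}
```

Note that a UTF-8 a-acute is two bytes, so it comes out as two '?'
characters, while the latin1 one comes out as a single '?'.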
