On Sun, Sep 24, 2023 at 07:06:35AM +0200, Walter Alejandro Iglesias wrote:
> Hi Ingo,
>
> On Thu, Sep 21, 2023 at 03:04:24PM +0200, Ingo Schwarze wrote:
> > In general, the tool for checking the validity of UTF-8 strings
> > is a simple loop around mblen(3) if you want to report the precise
> > positions of errors found, or simply mbstowcs(3) with a NULL pwcs
> > argument if you are content with a one-bit "valid" or "invalid" answer.
>
> According to mbstowcs(3):
> ------------------------------------------------------------------------
> RETURN VALUES
>      mbstowcs() returns:
>
>      0 or positive
>              The value returned is the number of elements stored in the array
>              pointed to by pwcs, except for a terminating null wide character
>              (if any).  If pwcs is not null and the value returned is equal
>              to n, the wide-character string pointed to by pwcs is not null
>              terminated.  If pwcs is a null pointer, the value returned is
>              the number of elements to contain the whole string converted,
>              except for a terminating null wide character.
>
>      (size_t)-1  The array indirectly pointed to by s contains a byte
>                  sequence forming invalid character.  In this case,
>                  mbstowcs() sets errno to indicate the error.
>
> ERRORS
>      mbstowcs() may cause an error in the following cases:
>
>      [EILSEQ]    s points to the string containing invalid or
>                  incomplete multibyte character.
> ------------------------------------------------------------------------
>
> To understand what mbstowcs(3) does I wrote the little test.c program
> pasted at bottom.  In the following example [a] is UTF-8 aacute and (a)
> iso-latin aacute.
>
> Using setlocale(LC_CTYPE, "en_US.UTF-8");
>
> $ cc -g -Wall test.c
> $ echo -n arbol | a.out
> ulen: 5
> $ echo -n [a]rbol | a.out
> ulen: 5
> $ echo -n (a)rbol | a.out
> ulen: 5
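
For reference, the two approaches mentioned in the quoted text (mbstowcs(3) with a NULL pwcs for a yes/no answer, and a loop around mblen(3) to locate the error) can be sketched roughly as below. This is only an illustration, not the test.c from this thread. The function names set_utf8_locale(), is_valid_mbs() and first_invalid_offset() are made up for the example, and the list of locale names tried is an assumption about what a given system has installed.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: try a few common UTF-8 locale names.
 * Returns 1 if LC_CTYPE could be switched to one of them, 0 otherwise.
 */
int
set_utf8_locale(void)
{
	static const char *names[] = { "en_US.UTF-8", "C.UTF-8", "en_US.utf8" };
	size_t i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (setlocale(LC_CTYPE, names[i]) != NULL)
			return 1;
	return 0;
}

/*
 * One-bit answer: 1 if s converts cleanly in the current LC_CTYPE,
 * 0 if mbstowcs(3) reports an invalid or incomplete sequence.
 */
int
is_valid_mbs(const char *s)
{
	return mbstowcs(NULL, s, 0) != (size_t)-1;
}

/*
 * Byte offset of the first invalid or incomplete sequence in s,
 * or -1 if the whole string is valid in the current LC_CTYPE.
 */
long
first_invalid_offset(const char *s)
{
	size_t len = strlen(s), off = 0;
	int n;

	(void)mblen(NULL, 0);		/* reset the conversion state */
	while (off < len) {
		n = mblen(s + off, len - off);
		if (n == -1) {
			(void)mblen(NULL, 0);	/* leave a clean state behind */
			return (long)off;
		}
		if (n == 0)		/* NUL byte; cannot happen before len */
			break;
		off += (size_t)n;
	}
	return -1;
}
```

Both checks only mean "valid UTF-8" if set_utf8_locale() succeeded first. Called in whatever locale happens to be active, they validate against that locale's encoding instead.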

In the UTF-8 locale I can trigger an error message with your program by
sending the latin1 code for a-acute to stdin.  I suppose your test command
didn't actually send latin1 to stdin for some reason?

$ perl -e 'printf "\xe1rbol\n"' | ./a.out
error: Illegal byte sequence

> Using setlocale(LC_CTYPE, "C");
>
> $ cc -g -Wall test.c
> $ echo -n arbol | a.out
> ulen: 5
> $ echo -n [a]rbol | a.out
> ulen: 6
> $ echo -n (a)rbol | a.out
> ulen: 7
>
> And no error message in any case.  I don't understand in which way those
> return values let me know that the third string is invalid UTF-8.  Am I
> doing something wrong?

There is no concept of byte sequences in the C locale; bytes are bytes.

It is not possible to detect invalid UTF-8 via libc while running in the
C locale, since the citrus code in libc won't even run.

However, the various ctype tests like

	isascii((unsigned char)c);
	isprint((unsigned char)c);

and so on can be used to filter or stub out non-ASCII characters, which
is what users running in the C locale would want.
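
A minimal sketch of that kind of filtering, assuming the program runs in the default C locale: it replaces each byte that is not printable ASCII with '?', leaving newline and tab alone. The function name stub_non_ascii() and the choice of '?' as the replacement are just for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Replace every byte that is not printable ASCII with '?', in place.
 * Newline and tab are kept.  Returns the number of bytes replaced.
 */
size_t
stub_non_ascii(char *s)
{
	size_t replaced = 0;
	unsigned char c;

	for (; *s != '\0'; s++) {
		c = (unsigned char)*s;
		if (c == '\n' || c == '\t')
			continue;
		if (!isascii(c) || !isprint(c)) {
			*s = '?';
			replaced++;
		}
	}
	return replaced;
}
```

The cast to unsigned char matters: passing a plain char with the high bit set to the ctype functions is undefined where char is signed.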