On Thu, Sep 21, 2023 at 02:12:50PM +0200, Stefan Sperling wrote:
> On Thu, Sep 21, 2023 at 01:25:01PM +0200, Walter Alejandro Iglesias wrote:
> > I corrected many of the things you pointed out, but not all.  The
> > function I use to check utf8 is mine, I use it in a pair of little
> > programs which I've *hardly* checked for memory leaks.  I know my
> > function looks BIG :-), but I know for sure that it does the job.
>
> We already have code in libc that does this, see the function
> _citrus_utf8_ctype_mbrtowc in lib/libc/citrus/citrus_utf8.c.
> Please use the libc interface if at all possible, it is best to
> have just one place to fix when a UTF-8 parser bug is found.
>
> There is also utf8_isvalid() in tmux utf8.c though you would
> have to trim tmux UTF-8 code down for your narrow use case.
>
> Your implementation lacks proper bounds checking.  It accesses
> s[i + 3] based purely on the contents of the input string, without
> checking whether len < i + 3.  Entering the while (i != len) loop with
> i == len-1 and a specially crafted input string can be problematic.

Hey Stefan,

I'll give up for now.  Another day I'll investigate and try to
understand what you're asking me, because so far I fail to see how what
you propose could facilitate maintenance or reduce bugs.  Notice that
you saw the issue in my code (bounds checking) at first glance; that's
because my code is neither too complicated (citrus) nor too elegant
(tmux), hence by far easier to read, understand and debug.  Among other
things it deals with UTF-8 without using wchar.h.  I'm sorry you don't
like it.

Anyway, in case someone else can do something with it, here is my last
version with the boundary check.


Index: send.c
===================================================================
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c	8 Mar 2023 04:43:11 -0000	1.26
+++ send.c	21 Sep 2023 14:16:08 -0000
@@ -33,6 +33,10 @@
 #include "rcv.h"
 #include "extern.h"
 
+/* To check charset of the message and add the appropriate MIME headers */
+static char nutf8;
+static int not_utf8(FILE *s, int len);
+
 static volatile sig_atomic_t sendsignal;	/* Interrupted by a signal? */
 
 /*
@@ -341,6 +345,11 @@ mail1(struct header *hp, int printheader
 		else
 			puts("Null message body; hope that's ok");
 	}
+
+	/* Check non valid UTF-8 characters in the message */
+	nutf8 = not_utf8(mtf, fsize(mtf));
+	rewind(mtf);
+
 	/*
 	 * Now, take the user names from the combined
 	 * to and cc lists and do all the alias
@@ -525,6 +534,14 @@ puthead(struct header *hp, FILE *fo, int
 		fmt("To:", hp->h_to, fo, w&GCOMMA), gotcha++;
 	if (hp->h_subject != NULL && w & GSUBJECT)
 		fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+	if (nutf8 == 0)
+		fprintf(fo, "MIME-Version: 1.0\n"
+		    "Content-Type: text/plain; charset=us-ascii\n"
+		    "Content-Transfer-Encoding: 7bit\n"), gotcha++;
+	else if (nutf8 == 1)
+		fprintf(fo, "MIME-Version: 1.0\n"
+		    "Content-Type: text/plain; charset=utf-8\n"
+		    "Content-Transfer-Encoding: 8bit\n"), gotcha++;
 	if (hp->h_cc != NULL && w & GCC)
 		fmt("Cc:", hp->h_cc, fo, w&GCOMMA), gotcha++;
 	if (hp->h_bcc != NULL && w & GBCC)
@@ -609,4 +626,60 @@ sendint(int s)
 {
 
 	sendsignal = s;
+}
+
+/* Search non valid UTF-8 characters in the message */
+static int
+not_utf8(FILE *message, int len)
+{
+	int i, n, nonascii;
+	int c;		/* int, so that EOF is distinguishable from 0xff */
+	unsigned char s[len + 1];
+
+	i = 0;
+	while ((c = getc(message)) != EOF)
+		s[i++] = c;
+
+	s[i] = '\0';
+
+	i = n = nonascii = 0;
+	while (i != len)
+		if (s[i] <= 0x7f) {
+			i++;
+		/* Two bytes case */
+		} else if (i + 1 < len && s[i] >= 0xc2 && s[i] < 0xe0 &&
+		    s[i + 1] >= 0x80 && s[i + 1] <= 0xbf) {
+			i += 2;
+			nonascii++;
+		/* Special three bytes case */
+		} else if ((i + 2 < len && s[i] == 0xe0 &&
+		    s[i + 1] >= 0xa0 && s[i + 1] <= 0xbf &&
+		    s[i + 2] >= 0x80 && s[i + 2] <= 0xbf) ||
+		    /* Three bytes case */
+		    (i + 2 < len && s[i] > 0xe0 && s[i] < 0xf0 &&
+		    s[i + 1] >= 0x80 && s[i + 1] <= 0xbf &&
+		    s[i + 2] >= 0x80 && s[i + 2] <= 0xbf)) {
+			i += 3;
+			nonascii++;
+		/* Special four bytes case */
+		} else if ((i + 3 < len && s[i] == 0xf0 &&
+		    s[i + 1] >= 0x90 && s[i + 1] <= 0xbf &&
+		    s[i + 2] >= 0x80 && s[i + 2] <= 0xbf &&
+		    s[i + 3] >= 0x80 && s[i + 3] <= 0xbf) ||
+		    /* Four bytes case */
+		    (i + 3 < len && s[i] > 0xf0 &&
+		    s[i + 1] >= 0x80 && s[i + 1] <= 0xbf &&
+		    s[i + 2] >= 0x80 && s[i + 2] <= 0xbf &&
+		    s[i + 3] >= 0x80 && s[i + 3] <= 0xbf)) {
+			i += 4;
+			nonascii++;
+		} else {
+			n = i + 1;
+			break;
+		}
+
+	if (nonascii)
+		n++;
+
+	return n;
 }


--
Walter
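
For anyone who wants to experiment with the byte-range approach outside
of mail(1), here is a minimal standalone sketch of the same idea working
on an in-memory buffer.  It is not the patch above and not libc code:
the names text_kind, is_cont and classify_text are made up for this
illustration.  It checks how many bytes remain before reading any
continuation byte, and it is slightly stricter than the patch, since it
also rejects surrogates (ED A0..BF) and lead bytes above F4, assuming
the RFC 3629 definition of UTF-8.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch; not part of mail(1). */
enum text_kind { TEXT_ASCII, TEXT_UTF8, TEXT_INVALID };

/* True if b is a UTF-8 continuation byte (10xxxxxx). */
static int
is_cont(unsigned char b)
{
	return (b & 0xc0) == 0x80;
}

/*
 * Classify len bytes at s as pure ASCII, valid UTF-8 containing at
 * least one multibyte sequence, or invalid.  No byte at or beyond
 * s[len] is read: "left" is compared before any s[i + k] access.
 */
static enum text_kind
classify_text(const unsigned char *s, size_t len)
{
	size_t i = 0;
	int nonascii = 0;

	while (i < len) {
		unsigned char b = s[i];
		size_t left = len - i;	/* bytes available, including b */

		if (b <= 0x7f) {
			i++;
			continue;
		}
		if (b >= 0xc2 && b <= 0xdf) {
			/* Two bytes; C0 and C1 would be overlong. */
			if (left < 2 || !is_cont(s[i + 1]))
				return TEXT_INVALID;
			i += 2;
		} else if (b >= 0xe0 && b <= 0xef) {
			/* Three bytes. */
			if (left < 3 || !is_cont(s[i + 1]) ||
			    !is_cont(s[i + 2]))
				return TEXT_INVALID;
			if (b == 0xe0 && s[i + 1] < 0xa0)
				return TEXT_INVALID;	/* overlong */
			if (b == 0xed && s[i + 1] > 0x9f)
				return TEXT_INVALID;	/* surrogate */
			i += 3;
		} else if (b >= 0xf0 && b <= 0xf4) {
			/* Four bytes. */
			if (left < 4 || !is_cont(s[i + 1]) ||
			    !is_cont(s[i + 2]) || !is_cont(s[i + 3]))
				return TEXT_INVALID;
			if (b == 0xf0 && s[i + 1] < 0x90)
				return TEXT_INVALID;	/* overlong */
			if (b == 0xf4 && s[i + 1] > 0x8f)
				return TEXT_INVALID;	/* above U+10FFFF */
			i += 4;
		} else {
			/* Stray continuation byte, C0, C1 or F5..FF. */
			return TEXT_INVALID;
		}
		nonascii = 1;
	}
	return nonascii ? TEXT_UTF8 : TEXT_ASCII;
}
```

Returning one of three named values instead of a byte offset keeps the
caller's choice of charset header unambiguous: an invalid byte at
offset 0 cannot be confused with "valid UTF-8".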