On 2023/09/21 10:55:47 +0200, Walter Alejandro Iglesias <w...@roquesor.com> wrote: > On Wed, Sep 20, 2023 at 08:36:23PM +0200, Walter Alejandro Iglesias wrote: > > On Wed, Sep 20, 2023 at 07:44:12PM +0200, Walter Alejandro Iglesias wrote: > > > And this new idea simplifies all to this: > > > > In case anyone else is worried. Crystal Kolipe already pointed me out > > that a better UTF-8 checking is needed, I know, I'll get to that > > tomorrow. > > The following version checks for not valid UTF-8 characters. I could > make it fail in this case and send a dead.letter but I imagine that > those who really use mail(1) surely do it mostly in a tty console where, > at least with a non US keyboard, is too easy to type some non valid utf-8 > character, hence this feature would be more a hassle than a help, so I > chose to make it simply skip adding any MIME header in this case (how it > has been used until now and no one complained :-)). If you prefer the > other behavior let me know. > > > Index: send.c > =================================================================== > RCS file: /cvs/src/usr.bin/mail/send.c,v > retrieving revision 1.26 > diff -u -p -r1.26 send.c > --- send.c 8 Mar 2023 04:43:11 -0000 1.26 > +++ send.c 21 Sep 2023 08:40:11 -0000 > @@ -33,6 +33,15 @@ > #include "rcv.h" > #include "extern.h" > > +/* > + * Variables and functions declared here will be useful to check the > + * character set of the message to add the appropiate MIME headers. > + */ > +static char nascii = 0; > +static char nutf8 = 0;
There's no need to explicitly zero static (or global) variables. > +static int not_ascii(struct __sFILE *s); > +static int not_utf8(struct __sFILE *s, int len); I'd use FILE * instead of struct __sFILE > static volatile sig_atomic_t sendsignal; /* Interrupted by a signal? */ > > /* > @@ -341,6 +350,15 @@ mail1(struct header *hp, int printheader > else > puts("Null message body; hope that's ok"); > } > + > + /* Check for non ASCII characters in the message */ > + nascii = not_ascii(mtf); > + rewind(mtf); > + > + /* Check for non valid UTF-8 characters in the message */ > + nutf8 = not_utf8(mtf, fsize(mtf)); > + rewind(mtf); assuming that we care for this two checks, why not doing everything in a single pass? Do we really need the two checks? > /* > * Now, take the user names from the combined > * to and cc lists and do all the alias > @@ -525,6 +543,14 @@ puthead(struct header *hp, FILE *fo, int > fmt("To:", hp->h_to, fo, w&GCOMMA), gotcha++; > if (hp->h_subject != NULL && w & GSUBJECT) > fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++; > + if (!nascii) > + fprintf(fo, "MIME-Version: 1.0\n" > + "Content-Type: text/plain; charset=us-ascii\n" > + "Content-Transfer-Encoding: 7bit\n"), gotcha++; +1 for splitting the string in multiple lines, this is an improvements over previous versions, but please - use four spaces of indentation for continuation lines - although existing code uses ", gotcha++" I'd split that in a separate line for clarity. > + else if (nutf8 == 0) > + fprintf(fo, "MIME-Version: 1.0\n" > + "Content-Type: text/plain; charset=utf-8\n" > + "Content-Transfer-Encoding: 8bit\n"), gotcha++; > if (hp->h_cc != NULL && w & GCC) > fmt("Cc:", hp->h_cc, fo, w&GCOMMA), gotcha++; > if (hp->h_bcc != NULL && w & GBCC) > @@ -609,4 +635,67 @@ sendint(int s) > { > > sendsignal = s; > +} > + > +/* Search non ASCII characters in the message */ > +static int > +not_ascii(struct __sFILE *s) > +{ > + int ch, n; > + n = 0; > + while ((ch = getc(s)) != EOF) There are some spacing issues, both here and below. > + if (ch > 0x7f) > + n = 1; > + > + return n; > +} > + > +/* Search non valid UTF-8 characters in the message */ > +static int > +not_utf8(struct __sFILE *message, int len) > +{ > + int i, nou8; > + char c; > + unsigned char s[len + 1]; Please don't. Variable length arrays (VLA) with a possibly large len are a bad idea. They have more or less the same issues as alloca(3), see the CAVEATS section of it to get an idea. > + > + i = 0; > + while ((c = getc(message)) != EOF) > + s[i++] = c; and even then, fread() is simpler :-) > + s[i] = '\0'; > + > + i = nou8 = 0; > + while (i != len) ...and even then, mbtowc is easier to use. See Ingo' /src/usr/bin/ls/utf8.c for an example usage. > + if (s[i] <= 0x7f) > + ++i; > + /* Two bytes case */ > + else if (s[i] >= 0xc2 && s[i] < 0xe0 && > + s[i + 1] >= 0x80 && s[i + 1] <= 0xbf) > + i += 2; > + /* Special three bytes case */ > + else if ((s[i] == 0xe0 && > + s[i + 1] >= 0xa0 && s[i + 1] <= 0xbf && > + s[i + 2] >= 0x80 && s[i + 2] <= 0xbf) || > [...] > } Admittedly I haven't followed the thread too closely, although I hope something like this will end up in mail(1) sometimes :-) Thanks, Omar Polo