On 2023/09/21 10:55:47 +0200, Walter Alejandro Iglesias <w...@roquesor.com>
wrote:
> On Wed, Sep 20, 2023 at 08:36:23PM +0200, Walter Alejandro Iglesias wrote:
> > On Wed, Sep 20, 2023 at 07:44:12PM +0200, Walter Alejandro Iglesias wrote:
> > > And this new idea simplifies all to this:
> >
> > In case anyone else is worried. Crystal Kolipe already pointed me out
> > that a better UTF-8 checking is needed, I know, I'll get to that
> > tomorrow.
>
> The following version checks for not valid UTF-8 characters. I could
> make it fail in this case and send a dead.letter but I imagine that
> those who really use mail(1) surely do it mostly in a tty console where,
> at least with a non US keyboard, is too easy to type some non valid utf-8
> character, hence this feature would be more a hassle than a help, so I
> chose to make it simply skip adding any MIME header in this case (how it
> has been used until now and no one complained :-)). If you prefer the
> other behavior let me know.
>
>
> Index: send.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mail/send.c,v
> retrieving revision 1.26
> diff -u -p -r1.26 send.c
> --- send.c 8 Mar 2023 04:43:11 -0000 1.26
> +++ send.c 21 Sep 2023 08:40:11 -0000
> @@ -33,6 +33,15 @@
> #include "rcv.h"
> #include "extern.h"
>
> +/*
> + * Variables and functions declared here will be useful to check the
> + * character set of the message to add the appropiate MIME headers.
> + */
> +static char nascii = 0;
> +static char nutf8 = 0;
There's no need to explicitly zero static (or global) variables.
> +static int not_ascii(struct __sFILE *s);
> +static int not_utf8(struct __sFILE *s, int len);
I'd use FILE * instead of struct __sFILE
> static volatile sig_atomic_t sendsignal; /* Interrupted by a signal? */
>
> /*
> @@ -341,6 +350,15 @@ mail1(struct header *hp, int printheader
> else
> puts("Null message body; hope that's ok");
> }
> +
> + /* Check for non ASCII characters in the message */
> + nascii = not_ascii(mtf);
> + rewind(mtf);
> +
> + /* Check for non valid UTF-8 characters in the message */
> + nutf8 = not_utf8(mtf, fsize(mtf));
> + rewind(mtf);
assuming that we care for this two checks, why not doing everything in
a single pass?
Do we really need the two checks?
> /*
> * Now, take the user names from the combined
> * to and cc lists and do all the alias
> @@ -525,6 +543,14 @@ puthead(struct header *hp, FILE *fo, int
> fmt("To:", hp->h_to, fo, w&GCOMMA), gotcha++;
> if (hp->h_subject != NULL && w & GSUBJECT)
> fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
> + if (!nascii)
> + fprintf(fo, "MIME-Version: 1.0\n"
> + "Content-Type: text/plain; charset=us-ascii\n"
> + "Content-Transfer-Encoding: 7bit\n"), gotcha++;
+1 for splitting the string in multiple lines, this is an improvements
over previous versions, but please
- use four spaces of indentation for continuation lines
- although existing code uses ", gotcha++" I'd split that in a
separate line for clarity.
> + else if (nutf8 == 0)
> + fprintf(fo, "MIME-Version: 1.0\n"
> + "Content-Type: text/plain; charset=utf-8\n"
> + "Content-Transfer-Encoding: 8bit\n"), gotcha++;
> if (hp->h_cc != NULL && w & GCC)
> fmt("Cc:", hp->h_cc, fo, w&GCOMMA), gotcha++;
> if (hp->h_bcc != NULL && w & GBCC)
> @@ -609,4 +635,67 @@ sendint(int s)
> {
>
> sendsignal = s;
> +}
> +
> +/* Search non ASCII characters in the message */
> +static int
> +not_ascii(struct __sFILE *s)
> +{
> + int ch, n;
> + n = 0;
> + while ((ch = getc(s)) != EOF)
There are some spacing issues, both here and below.
> + if (ch > 0x7f)
> + n = 1;
> +
> + return n;
> +}
> +
> +/* Search non valid UTF-8 characters in the message */
> +static int
> +not_utf8(struct __sFILE *message, int len)
> +{
> + int i, nou8;
> + char c;
> + unsigned char s[len + 1];
Please don't. Variable length arrays (VLA) with a possibly large len
are a bad idea. They have more or less the same issues as alloca(3),
see the CAVEATS section of it to get an idea.
> +
> + i = 0;
> + while ((c = getc(message)) != EOF)
> + s[i++] = c;
and even then, fread() is simpler :-)
> + s[i] = '\0';
> +
> + i = nou8 = 0;
> + while (i != len)
...and even then, mbtowc is easier to use. See Ingo'
/src/usr/bin/ls/utf8.c for an example usage.
> + if (s[i] <= 0x7f)
> + ++i;
> + /* Two bytes case */
> + else if (s[i] >= 0xc2 && s[i] < 0xe0 &&
> + s[i + 1] >= 0x80 && s[i + 1] <= 0xbf)
> + i += 2;
> + /* Special three bytes case */
> + else if ((s[i] == 0xe0 &&
> + s[i + 1] >= 0xa0 && s[i + 1] <= 0xbf &&
> + s[i + 2] >= 0x80 && s[i + 2] <= 0xbf) ||
> [...]
> }
Admittedly I haven't followed the thread too closely, although I hope
something like this will end up in mail(1) sometimes :-)
Thanks,
Omar Polo