On 2023/09/21 10:55:47 +0200, Walter Alejandro Iglesias <w...@roquesor.com> 
wrote:
> On Wed, Sep 20, 2023 at 08:36:23PM +0200, Walter Alejandro Iglesias wrote:
> > On Wed, Sep 20, 2023 at 07:44:12PM +0200, Walter Alejandro Iglesias wrote:
> > > And this new idea simplifies all to this:
> > 
> > In case anyone else is worried.  Crystal Kolipe already pointed me out
> > that a better UTF-8 checking is needed, I know, I'll get to that
> > tomorrow.
> 
> The following version checks for not valid UTF-8 characters.  I could
> make it fail in this case and send a dead.letter but I imagine that
> those who really use mail(1) surely do it mostly in a tty console where,
> at least with a non US keyboard, is too easy to type some non valid utf-8
> character, hence this feature would be more a hassle than a help, so I
> chose to make it simply skip adding any MIME header in this case (how it
> has been used until now and no one complained :-)).  If you prefer the
> other behavior let me know.
> 
> 
> Index: send.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mail/send.c,v
> retrieving revision 1.26
> diff -u -p -r1.26 send.c
> --- send.c    8 Mar 2023 04:43:11 -0000       1.26
> +++ send.c    21 Sep 2023 08:40:11 -0000
> @@ -33,6 +33,15 @@
>  #include "rcv.h"
>  #include "extern.h"
>  
> +/*
> + * Variables and functions declared here will be useful to check the
> + * character set of the message to add the appropiate MIME headers.
> + */
> +static char nascii = 0;
> +static char nutf8 = 0;

There's no need to explicitly zero static (or global) variables.

> +static int not_ascii(struct __sFILE *s);
> +static int not_utf8(struct __sFILE *s, int len);

I'd use FILE * instead of struct __sFILE

>  static volatile sig_atomic_t sendsignal;     /* Interrupted by a signal? */
>  
>  /*
> @@ -341,6 +350,15 @@ mail1(struct header *hp, int printheader
>               else
>                       puts("Null message body; hope that's ok");
>       }
> +
> +     /* Check for non ASCII characters in the message */
> +     nascii = not_ascii(mtf);
> +     rewind(mtf);
> +
> +     /* Check for non valid UTF-8 characters in the message */
> +     nutf8 = not_utf8(mtf, fsize(mtf));
> +     rewind(mtf);

assuming that we care for this two checks, why not doing everything in
a single pass?

Do we really need the two checks?

>       /*
>        * Now, take the user names from the combined
>        * to and cc lists and do all the alias
> @@ -525,6 +543,14 @@ puthead(struct header *hp, FILE *fo, int
>               fmt("To:", hp->h_to, fo, w&GCOMMA), gotcha++;
>       if (hp->h_subject != NULL && w & GSUBJECT)
>               fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
> +     if (!nascii)
> +             fprintf(fo, "MIME-Version: 1.0\n"
> +                     "Content-Type: text/plain; charset=us-ascii\n"
> +                     "Content-Transfer-Encoding: 7bit\n"), gotcha++;

+1 for splitting the string in multiple lines, this is an improvements
over previous versions, but please

 - use four spaces of indentation for continuation lines

 - although existing code uses ", gotcha++" I'd split that in a
   separate line for clarity.

> +     else if (nutf8 == 0)
> +             fprintf(fo, "MIME-Version: 1.0\n"
> +                     "Content-Type: text/plain; charset=utf-8\n"
> +                     "Content-Transfer-Encoding: 8bit\n"), gotcha++;
>       if (hp->h_cc != NULL && w & GCC)
>               fmt("Cc:", hp->h_cc, fo, w&GCOMMA), gotcha++;
>       if (hp->h_bcc != NULL && w & GBCC)
> @@ -609,4 +635,67 @@ sendint(int s)
>  {
>  
>       sendsignal = s;
> +}
> +
> +/* Search non ASCII characters in the message */
> +static int
> +not_ascii(struct __sFILE *s)
> +{
> +     int ch, n;
> +     n = 0;
> +        while ((ch = getc(s)) != EOF)

There are some spacing issues, both here and below.

> +                if (ch > 0x7f)
> +                     n = 1;
> +
> +     return n;
> +}
> +
> +/* Search non valid UTF-8 characters in the message */
> +static int
> +not_utf8(struct __sFILE *message, int len)
> +{
> +     int i, nou8;
> +     char c;
> +     unsigned char s[len + 1];

Please don't.  Variable length arrays (VLA) with a possibly large len
are a bad idea.  They have more or less the same issues as alloca(3),
see the CAVEATS section of it to get an idea.

> +
> +     i = 0;
> +        while ((c = getc(message)) != EOF)
> +             s[i++] = c;

and even then, fread() is simpler :-)

> +     s[i] = '\0';
> +
> +     i = nou8 = 0;
> +     while (i != len)

...and even then, mbtowc is easier to use.  See Ingo'
/src/usr/bin/ls/utf8.c for an example usage.

> +             if (s[i] <= 0x7f)
> +                     ++i;
> +             /* Two bytes case */
> +             else if (s[i] >= 0xc2 && s[i] < 0xe0 &&
> +                     s[i + 1] >= 0x80 && s[i + 1] <= 0xbf)
> +                     i += 2;
> +             /* Special three bytes case */
> +             else if ((s[i] == 0xe0 &&
> +                     s[i + 1] >= 0xa0 && s[i + 1] <= 0xbf &&
> +                     s[i + 2] >= 0x80 && s[i + 2] <= 0xbf) ||
> [...]
>  }


Admittedly I haven't followed the thread too closely, although I hope
something like this will end up in mail(1) sometimes :-)

Thanks,

Omar Polo

Reply via email to