Re: Send international text with mail(1) - proposal and patches

2023-10-11 Thread Crystal Kolipe
On Thu, Oct 12, 2023 at 02:10:47AM +0200, Steffen Nurpmeso wrote:
> Crystal Kolipe wrote in
>  :
>  |On Thu, Oct 12, 2023 at 12:36:48AM +0200, Steffen Nurpmeso wrote:
>  |> Non-7-bit clean headers need RFC 2047 (and/or RFC 2231) encoding.
>  |
>  |The use of MIME encoded words to encode header content is no longer
>  |considered best practice.  See, for example RFC 6532.
> 
> Yes there is SMTPUTF8, which is a special protocol.
> The /global MIME thing i personally have _never_ seen in practice.

I wasn't suggesting it as the only alternative.  For the foreseeable future we
obviously need to default to using the old style encoding for non-ASCII
headers.



Re: Send international text with mail(1) - proposal and patches

2023-10-11 Thread Steffen Nurpmeso
Crystal Kolipe wrote in
 :
 |On Thu, Oct 12, 2023 at 12:36:48AM +0200, Steffen Nurpmeso wrote:
 |> Non-7-bit clean headers need RFC 2047 (and/or RFC 2231) encoding.
 |
 |The use of MIME encoded words to encode header content is no longer
 |considered best practice.  See, for example RFC 6532.

Yes there is SMTPUTF8, which is a special protocol.
The /global MIME thing i personally have _never_ seen in practice.
I downloaded the RFC on 2012-07-23.

 |But as Omar said, let's get the basics of any new functionality
 |sorted out before jumping ahead.  We don't really want to break
 |mail in some unexpected and non-obvious way.

I cannot comment on that.  I am pretty sure i have never seen
/global yet.  I.e., an archive search here reveals only three mails
where i mention them in the text; the last is from a thread from
nmh-work...@nongnu.org from July this year, and let me shamelessly
quote Ken Hornstein who said on 2023-07-23

 The message/global MIME type (a RFC822 message but with UTF-8
 everywhere) has a suggested file extension of ".u8msg", which
 I have never personally seen "in the wild" anywhere.  ¯\_(ツ)_/¯

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Send international text with mail(1) - proposal and patches

2023-10-11 Thread Crystal Kolipe
On Thu, Oct 12, 2023 at 12:36:48AM +0200, Steffen Nurpmeso wrote:
> Non-7-bit clean headers need RFC 2047 (and/or RFC 2231) encoding.

The use of MIME encoded words to encode header content is no longer
considered best practice.  See, for example RFC 6532.

But as Omar said, let's get the basics of any new functionality
sorted out before jumping ahead.  We don't really want to break
mail in some unexpected and non-obvious way.



Re: Send international text with mail(1) - proposal and patches

2023-10-11 Thread Steffen Nurpmeso
Hello Omar.

Omar Polo wrote in
 <2HJQ4VX5L4J1P.3G4A39B0RA6T7@venera>:
 ...
 |>MUAs always set appropriate MIME headers.  RFC 2046 section 4.1.2
 |>paragraph 8 also "strongly" recommends the explicit inclusion of a
 |>"charset" parameter even for us-ascii.

So that really sent me looking again, and i read

 The default character set, which must be assumed in the absence
 of a charset parameter, is US-ASCII.

I have read the following though.  Still, you know...

  ...
 |>Consequently, i think using 8bit is indeed better for our mail(1)
 |>than quoted-printable or base64.

I have nothing to say besides that, but want to point out that to
the best of my knowledge the 8bit Content-Transfer-Encoding only
refers to MIME part contents, it does _not_ refer to any email headers.
Non-7-bit clean headers need RFC 2047 (and/or RFC 2231) encoding.
So leaving aside any email addresses which possibly would require
IDNA encoding,

 |  if (hp->h_subject != NULL && w & GSUBJECT)
 |- fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
 |+ fprintf(fo, "Subject: %s\n", hp->h_subject);

that is not (again, to the best of my knowledge, i had to read
again all those standards, .. after many years) covered by

  ...
 |+ if (multibyte)
 |+ fprintf(fo, "Content-Transfer-Encoding: 8bit\n"
 |+ "Content-Type: text/plain; charset=utf-8\n");

That is to say: just in case someone thinks this.
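
To illustrate the point about headers, here is a minimal sketch (not part
of any patch in this thread) of wrapping a short UTF-8 Subject in a single
RFC 2047 "B" encoded-word.  The name encode_word_b() is hypothetical, and
the sketch assumes the whole result fits into the 75-character
encoded-word limit; it refuses longer input instead of splitting it.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: encode the UTF-8 string "in" as one RFC 2047
 * "B" (base64) encoded-word.  Returns 0 on success, -1 if "out" is too
 * small or the encoded-word would exceed the 75-character limit.
 */
int
encode_word_b(const char *in, char *out, size_t outsz)
{
	static const char b64[] =
	    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
	const unsigned char *p = (const unsigned char *)in;
	size_t n = strlen(in);
	/* "=?UTF-8?B?" (10 chars) + base64 text + "?=" (2 chars) */
	size_t need = 10 + 4 * ((n + 2) / 3) + 2;
	size_t o = 10;

	if (need > 75 || need + 1 > outsz)
		return -1;
	memcpy(out, "=?UTF-8?B?", 10);

	/* Full groups of three input bytes become four output chars. */
	while (n >= 3) {
		out[o++] = b64[p[0] >> 2];
		out[o++] = b64[((p[0] & 0x03) << 4) | (p[1] >> 4)];
		out[o++] = b64[((p[1] & 0x0f) << 2) | (p[2] >> 6)];
		out[o++] = b64[p[2] & 0x3f];
		p += 3;
		n -= 3;
	}
	/* One or two trailing bytes are padded with '='. */
	if (n > 0) {
		out[o++] = b64[p[0] >> 2];
		if (n == 1) {
			out[o++] = b64[(p[0] & 0x03) << 4];
			out[o++] = '=';
		} else {
			out[o++] = b64[((p[0] & 0x03) << 4) | (p[1] >> 4)];
			out[o++] = b64[(p[1] & 0x0f) << 2];
		}
		out[o++] = '=';
	}
	out[o++] = '?';
	out[o++] = '=';
	out[o] = '\0';
	return 0;
}
```

With the UTF-8 bytes of "arbol" with a-acute ("\xc3\xa1rbol") as input,
this leaves =?UTF-8?B?w6FyYm9s?= in the output buffer.  A real
implementation would also have to split long subjects into several
encoded-words without cutting a multibyte character in half.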

Ciao Omar!

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Send international text with mail(1) - proposal and patches

2023-10-11 Thread Omar Polo
Hello,

Walter: I'm happy that you've been hacking on mail and at least in
principle I think what you're doing makes sense; however, let's try to
get one bit committed at a time.

Let's start with the MIME needed for sending utf-8 messages.

I've been going through the various mails and I think it's here where things
started to go off the rails.  Ingo provided some valuable feedback,
I've updated your diff to address it.  Other additions, such as doing
checks on the content, adding other headers, etc... can be done as a
follow-up after this goes in IMHO.

On 2023/09/20 17:30:08 +0200, Ingo Schwarze  wrote:
> Hi,
> 
> i checked the following points:
> 
>  * Even though RFC 2049 section 2 bullet point 1 only *requires*
>    MIME-conformant MUAs to always write the header "MIME-Version:
>    1.0" - and mail(1) is most certainly not MIME-conformant - RFC 2049
>    section 2 bullet point 8 explicitly *recommends* that even non-MIME
>    MUAs always set appropriate MIME headers.  RFC 2046 section 4.1.2
>    paragraph 8 also "strongly" recommends the explicit inclusion of a
>    "charset" parameter even for us-ascii.
> 
>    Consequently, i believe that when sending a message in US-ASCII,
>    mail(1) should include these headers:
> 
>      MIME-Version: 1.0
>      Content-Transfer-Encoding: 7bit
>      Content-Type: text/plain; charset=us-ascii
> 
>  * Adding a "Content-Transfer-Encoding: ..." header is indeed required
>    for sending UTF-8 messages, see RFC 2049 section 2 bullet point 2.
>    "8bit" is one of the valid values that MUAs must support for
>    receiving messages by default.
>    Using it seems sane because it is most likely to work with receiving
>    MUAs that are not MIME-conformant, like our mail(1) itself.
>    I think nowadays, that's a bigger concern than MTAs that are not
>    8-bit clean, in particular when maintaining a low-level program
>    like our mail(1).
>    Consequently, i think using 8bit is indeed better for our mail(1)
>    than quoted-printable or base64.
> 
>  * Adding "Content-Type: text/plain; charset=utf-8" is required by
>    RFC 2049 section 2 bullet point 4 (for the simplest kind of UTF-8
>    encoded messages).
> 
>  * The Content-Disposition: header is defined in RFC 2183, clearly
>    optional, and not useful in single-part messages.  Consequently,
>    mail(1) should not write it.
> 
> So apart from writing the headers for us-ascii, i think you are
> almost there.
> 
> Given that the charset cannot be inferred from the environment
> and that setting it per-system or per-user in a configuration file
> is also inadequate - it shouldn't be uncommon for users to sometimes
> send US-ASCII and sometimes UTF-8 mail - i think that a new option
> is indeed needed.
> 
> Regarding the naming of the option, compatibility with POSIX
>   https://pubs.opengroup.org/onlinepubs/9699919799/utilities/mailx.html
> is paramount, which kills the tentative idea to use -u for "UTF-8"
> because -u already means "user".
> 
> Compatibility with other mailx(1) implementations is also a
> consideration.  See, for example,
>   https://linux.die.net/man/1/mail
> and -m is indeed among the very few options still available over there.
> I would document it focussing on a "multibyte character encoding"
> mnemonic.  The "mime" mnemonic feels far too broad because MIME can
> be used for lots of other purposes besides specifying a character
> encoding.
> 
> The -m option is also free here:
>   https://man.freebsd.org/cgi/man.cgi?query=mail(1)
>   https://man.netbsd.org/mail.1
>   https://docs.oracle.com/cd/E88353_01/html/E37839/mailx-1.html
>   https://www.ibm.com/docs/en/aix/7.3?topic=m-mail-command-1
> None of those appears to support command line selection of the
> character set for sending mail, so i don't see any immediate
> logic clashes either.
> 
> The -m option does clash with this one:
>   https://www.sdaoden.eu/code-nail.html
> But i think dismissing Steffen Daode Nurpmeso as a lunatic is obviously
> the way to go.  Try to listen to that person and you will never get
> anything done.
> 
> The mailx(1) documented on die.net appears to be the Heirloom one.
> It does not have an option to select sending US-ASCII or UTF-8.
> Instead, it has a "sendcharsets" configuration variable.  That's
> clearly overengineering, but even when hardcoding the equivalent of
> 
>   sendcharsets=utf-8
> 
> which is also the default, that's nasty because it silently switches to
> UTF-8 as soon as a non-ASCII character appears in the input.  I think
> at least in interactive mode, explicit confirmation from the user would
> be required to send UTF-8, instead writing dead.letter if the user
> rejects the request, such that they can clean up the file and try again.
> 
> That would certainly be more complicated than requiring an option
> up front, not only from the implementation perspective, but arguably
> also from the user perspective.  So unless other developers think this
> should be fully automatic with confirmation 

Re: Send international text with mail(1) - proposal and patches

2023-09-25 Thread wai
On Mon, 25 Sep 2023 21:31:08 +0200, Walter wrote:
> Yours are the first technical, functional corrections I got about the
> code.  Thanks!  Let's go back in time, then.  I think that what you're
> telling me can be done by simply replacing "break" for "return" in my
> original function.  Tell me what you think, please.

Yesterday I was so tired that I told you nonsense; there's no difference
between putting break or return there, my original function already did
what you told me.


--- send.c.orig 2023-09-25 21:01:34.780102611 +0200
+++ send.c  2023-09-25 21:17:11.120117761 +0200
@@ -33,6 +33,10 @@
 #include "rcv.h"
 #include "extern.h"
 
+/* To check charset of the message and add the appropriate MIME headers  */
+static char nutf8;
+static int not_utf8(FILE *s, int len);
+
 static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
 
 /*
@@ -341,6 +345,17 @@
else
puts("Null message body; hope that's ok");
}
+
+   /* Check non valid UTF-8 characters in the message */
+   nutf8 = not_utf8(mtf, fsize(mtf));
+   rewind(mtf);
+   if (nutf8 > 1) {
+   savedeadletter(mtf);
+   puts("Invalid or incomplete multibyte or wide character");
+   fputs(". . . message not sent.\n", stderr);
+   exit(1);
+   }
+
/*
 * Now, take the user names from the combined
 * to and cc lists and do all the alias
@@ -369,7 +384,7 @@
}
if ((cp = value("record")) != NULL)
(void)savemail(expand(cp), mtf);
-   
+
/* Setup sendmail arguments. */
 *ap++ = "sendmail";
 *ap++ = "-i";
@@ -525,6 +540,16 @@
fmt("To:", hp->h_to, fo, w), gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   if (nutf8 == 0)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=us-ascii\n"
+   "Content-Transfer-Encoding: 7bit\n"),
+   gotcha++;
+   else if (nutf8 == 1)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=utf-8\n"
+   "Content-Transfer-Encoding: 8bit\n"),
+   gotcha++;
if (hp->h_cc != NULL && w & GCC)
fmt("Cc:", hp->h_cc, fo, w), gotcha++;
if (hp->h_bcc != NULL && w & GBCC)
@@ -610,3 +635,59 @@
 
sendsignal = s;
 }
+
+/* Search non valid UTF-8 characters in the message */
+static int
+not_utf8(FILE *fp, int len)
+{
+   int i, n, nonascii;
+   int c;                      /* int, so that EOF is distinguishable */
+   unsigned char s[len + 1];   /* room for the terminating '\0' */
+
+   i = 0;
+while ((c = getc(fp)) != EOF)
+   s[i++] = c;
+
+   s[i] = '\0';
+
+   i = n = nonascii = 0;
+   while (s[i] != '\0')
+   if (s[i] <= 0x7f) {
+   i++;
+   /* Two bytes case */
+   } else if (s[i] >= 0xc2 && s[i] < 0xe0 &&
+   s[i + 1] >= 0x80 && s[i + 1] <= 0xbf) {
+   i += 2;
+   nonascii++;
+   /* Special three bytes case */
+   } else if ((s[i] == 0xe0 &&
+   s[i + 1] >= 0xa0 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf) ||
+   /* Three bytes case */
+   (s[i] > 0xe0 && s[i] < 0xf0 &&
+   s[i + 1] >= 0x80 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf)) {
+   i += 3;
+   nonascii++;
+   /* Special four bytes case */
+   } else if ((s[i] == 0xf0 &&
+   s[i + 1] >= 0x90 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf &&
+   s[i + 3] >= 0x80 && s[i + 3] <= 0xbf) ||
+   /* Four bytes case */
+   (s[i] > 0xf0 &&
+   s[i + 1] >= 0x80 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf &&
+   s[i + 3] >= 0x80 && s[i + 3] <= 0xbf)) {
+   i += 4;
+   nonascii++;
+   } else {
+   n = i + 1;
+   break;
+   }
+
+   if (nonascii)
+   n++;
+
+   return n;
+}


-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-25 Thread Walter Alejandro Iglesias
On Mon, 25 Sep 2023 19:00:15 +0200, Hiltjo Posthuma wrote:
> On Mon, Sep 25, 2023 at 03:13:03PM +0200, Walter Alejandro Iglesias wrote:
> > This new version, when it detects invalid utf-8 in the body saves a
> > dead.letter, prints the following message and exits.
> > 
> >   $ mail -s hello user < invalid_utf8.txt
> >   Invalid or incomplete multibyte or wide character
> >   . . . message not sent.
> > 
> > 
> > 
> > Index: send.c
> > ===
> > RCS file: /cvs/src/usr.bin/mail/send.c,v
> > retrieving revision 1.26
> > diff -u -p -r1.26 send.c
> > --- send.c  8 Mar 2023 04:43:11 -   1.26
> > +++ send.c  25 Sep 2023 13:07:17 -
> > @@ -32,6 +32,11 @@
> >  
> >  #include "rcv.h"
> >  #include "extern.h"
> > +#include "locale.h"
> > +
> > +/* To check charset of the message and add the appropriate MIME headers  */
> > +static char nutf8;
> > +static int not_utf8(FILE *s, int len);
> >  
> >  static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
> >  
> > @@ -341,6 +346,17 @@ mail1(struct header *hp, int printheader
> > else
> > puts("Null message body; hope that's ok");
> > }
> > +
> > +   /* Check non valid UTF-8 characters in the message */
> > +   nutf8 = not_utf8(mtf, fsize(mtf));
> > +   rewind(mtf);
> > +   if (nutf8 > 1) {
> > +   savedeadletter(mtf);
> > +   puts("Invalid or incomplete multibyte or wide character");
> > +   fputs(". . . message not sent.\n", stderr);
> > +   exit(1);
> > +   }
> > +
> > /*
> >  * Now, take the user names from the combined
> >  * to and cc lists and do all the alias
> > @@ -520,15 +536,30 @@ puthead(struct header *hp, FILE *fo, int
> > gotcha = 0;
> > from = hp->h_from ? hp->h_from : value("from");
> > if (from != NULL)
> > -   fprintf(fo, "From: %s\n", from), gotcha++;
> > +   fprintf(fo, "From: %s\n", from),
> > +   gotcha++;
> > if (hp->h_to != NULL && w & GTO)
> > -   fmt("To:", hp->h_to, fo, w), gotcha++;
> > +   fmt("To:", hp->h_to, fo, w),
> > +   gotcha++;
> > if (hp->h_subject != NULL && w & GSUBJECT)
> > -   fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
> > +   fprintf(fo, "Subject: %s\n", hp->h_subject),
> > +   gotcha++;
> > +   if (nutf8 == 0)
> > +   fprintf(fo, "MIME-Version: 1.0\n"
> > +   "Content-Type: text/plain; charset=us-ascii\n"
> > +   "Content-Transfer-Encoding: 7bit\n"),
> > +   gotcha++;
> > +   else if (nutf8 == 1)
> > +   fprintf(fo, "MIME-Version: 1.0\n"
> > +   "Content-Type: text/plain; charset=utf-8\n"
> > +   "Content-Transfer-Encoding: 8bit\n"),
> > +   gotcha++;
> > if (hp->h_cc != NULL && w & GCC)
> > -   fmt("Cc:", hp->h_cc, fo, w), gotcha++;
> > +   fmt("Cc:", hp->h_cc, fo, w),
> > +   gotcha++;
> > if (hp->h_bcc != NULL && w & GBCC)
> > -   fmt("Bcc:", hp->h_bcc, fo, w), gotcha++;
> > +   fmt("Bcc:", hp->h_bcc, fo, w),
> > +   gotcha++;
> > if (gotcha && w & GNL)
> > (void)putc('\n', fo);
> > return(0);
> > @@ -609,4 +640,44 @@ sendint(int s)
> >  {
> >  
> > sendsignal = s;
> > +}
> > +
> > +/* Search non valid UTF-8 characters in the message */
> > +static int
> > +not_utf8(FILE *message, int len)
> > +{
>
> Nitpick: I would call `message` maybe `fp` or something here.
>
> > +   int c, count, n, ulen;
> > +   size_t i, resize;
> > +   size_t jump = 100;
> > +   unsigned char *s = NULL;
> > +
> > +   setlocale(LC_CTYPE, "en_US.UTF-8");
> > +
>
> Should setlocale() be restored later on?
>
> > +   if (s == NULL && (s = malloc(jump)) == NULL)
> > +   err(1, NULL);
>
> The check if `s` is NULL seems unnecessary here.
>
> > +
> > +   i = count = 0;
> > +   while ((c = getc(message)) != EOF) {
> > +   if (s == NULL || count == jump) {
>
> The check if `s` is NULL seems unnecessary here.
>
> > +   if ((s = realloc(s, i + jump + 1)) == NULL)
> > +   err(1, NULL);
> > +   count = 0;
> > +   }
> > +   s[i++] = c;
> > +   count++;
> > +   }
> > +
> > +   s[i] = '\0';
> > +
> > +   ulen = mbstowcs(NULL, s, 0);
> > +
> > +   if (ulen == len)
> > +   n = 0;
> > +   else if (ulen < 0)
> > +   n = 2; 
> > +   else if (ulen < len)
> > +   n = 1;
> > +   
> > +   free(s);
> > +   return n;
> >  }
> > 
> > 
> > -- 
> > Walter
> > 
>
> Since it assumes UTF-8, maybe mbstowcs() is not needed and it can be done in
> one pass while reading the stream (no need to allocate, slurp the whole file
> and decode). Just read it byte by byte and return on the first invalid sequence.

Yours are the first technical, functional corrections I got about the
code.  Thanks!  Let's go back in time, then. 

Re: Send international text with mail(1) - proposal and patches

2023-09-25 Thread Hiltjo Posthuma
On Mon, Sep 25, 2023 at 03:13:03PM +0200, Walter Alejandro Iglesias wrote:
> This new version, when it detects invalid utf-8 in the body saves a
> dead.letter, prints the following message and exits.
> 
>   $ mail -s hello user < invalid_utf8.txt
>   Invalid or incomplete multibyte or wide character
>   . . . message not sent.
> 
> 
> 
> Index: send.c
> ===
> RCS file: /cvs/src/usr.bin/mail/send.c,v
> retrieving revision 1.26
> diff -u -p -r1.26 send.c
> --- send.c8 Mar 2023 04:43:11 -   1.26
> +++ send.c25 Sep 2023 13:07:17 -
> @@ -32,6 +32,11 @@
>  
>  #include "rcv.h"
>  #include "extern.h"
> +#include "locale.h"
> +
> +/* To check charset of the message and add the appropriate MIME headers  */
> +static char nutf8;
> +static int not_utf8(FILE *s, int len);
>  
>  static volatile sig_atomic_t sendsignal; /* Interrupted by a signal? */
>  
> @@ -341,6 +346,17 @@ mail1(struct header *hp, int printheader
>   else
>   puts("Null message body; hope that's ok");
>   }
> +
> + /* Check non valid UTF-8 characters in the message */
> + nutf8 = not_utf8(mtf, fsize(mtf));
> + rewind(mtf);
> + if (nutf8 > 1) {
> + savedeadletter(mtf);
> + puts("Invalid or incomplete multibyte or wide character");
> + fputs(". . . message not sent.\n", stderr);
> + exit(1);
> + }
> +
>   /*
>* Now, take the user names from the combined
>* to and cc lists and do all the alias
> @@ -520,15 +536,30 @@ puthead(struct header *hp, FILE *fo, int
>   gotcha = 0;
>   from = hp->h_from ? hp->h_from : value("from");
>   if (from != NULL)
> - fprintf(fo, "From: %s\n", from), gotcha++;
> + fprintf(fo, "From: %s\n", from),
> + gotcha++;
>   if (hp->h_to != NULL && w & GTO)
> - fmt("To:", hp->h_to, fo, w), gotcha++;
> + fmt("To:", hp->h_to, fo, w),
> + gotcha++;
>   if (hp->h_subject != NULL && w & GSUBJECT)
> - fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
> + fprintf(fo, "Subject: %s\n", hp->h_subject),
> + gotcha++;
> + if (nutf8 == 0)
> + fprintf(fo, "MIME-Version: 1.0\n"
> + "Content-Type: text/plain; charset=us-ascii\n"
> + "Content-Transfer-Encoding: 7bit\n"),
> + gotcha++;
> + else if (nutf8 == 1)
> + fprintf(fo, "MIME-Version: 1.0\n"
> + "Content-Type: text/plain; charset=utf-8\n"
> + "Content-Transfer-Encoding: 8bit\n"),
> + gotcha++;
>   if (hp->h_cc != NULL && w & GCC)
> - fmt("Cc:", hp->h_cc, fo, w), gotcha++;
> + fmt("Cc:", hp->h_cc, fo, w),
> + gotcha++;
>   if (hp->h_bcc != NULL && w & GBCC)
> - fmt("Bcc:", hp->h_bcc, fo, w), gotcha++;
> + fmt("Bcc:", hp->h_bcc, fo, w),
> + gotcha++;
>   if (gotcha && w & GNL)
>   (void)putc('\n', fo);
>   return(0);
> @@ -609,4 +640,44 @@ sendint(int s)
>  {
>  
>   sendsignal = s;
> +}
> +
> +/* Search non valid UTF-8 characters in the message */
> +static int
> +not_utf8(FILE *message, int len)
> +{

Nitpick: I would call `message` maybe `fp` or something here.

> + int c, count, n, ulen;
> + size_t i, resize;
> + size_t jump = 100;
> + unsigned char *s = NULL;
> +
> + setlocale(LC_CTYPE, "en_US.UTF-8");
> +

Should setlocale() be restored later on?

> + if (s == NULL && (s = malloc(jump)) == NULL)
> + err(1, NULL);

The check if `s` is NULL seems unnecessary here.

> +
> + i = count = 0;
> + while ((c = getc(message)) != EOF) {
> + if (s == NULL || count == jump) {

The check if `s` is NULL seems unnecessary here.

> + if ((s = realloc(s, i + jump + 1)) == NULL)
> + err(1, NULL);
> + count = 0;
> + }
> + s[i++] = c;
> + count++;
> + }
> +
> + s[i] = '\0';
> +
> + ulen = mbstowcs(NULL, s, 0);
> +
> + if (ulen == len)
> + n = 0;
> + else if (ulen < 0)
> + n = 2; 
> + else if (ulen < len)
> + n = 1;
> + 
> + free(s);
> + return n;
>  }
> 
> 
> -- 
> Walter
> 

Since it assumes UTF-8, maybe mbstowcs() is not needed and it can be done in
one pass while reading the stream (no need to allocate, slurp the whole file
and decode). Just read it byte by byte and return on the first invalid sequence.
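
The one-pass idea could be sketched roughly as follows.  This is only an
illustration under assumptions, not code from the thread: the name
classify_utf8() is hypothetical, the return values mirror the 0/1/2
convention that nutf8 uses in the patch, and the byte ranges are taken
from RFC 3629, so surrogates and code points above U+10FFFF are rejected
as well.

```c
#include <stdio.h>

/*
 * Hypothetical one-pass classifier; reads fp until EOF without
 * buffering the message.
 *   0 = pure US-ASCII
 *   1 = valid UTF-8 containing at least one non-ASCII character
 *   2 = invalid or truncated UTF-8
 */
int
classify_utf8(FILE *fp)
{
	int c, need = 0, nonascii = 0;
	int lo = 0x80, hi = 0xbf;	/* allowed range of the next byte */

	while ((c = getc(fp)) != EOF) {
		if (need > 0) {
			/* Inside a sequence: expect a continuation byte. */
			if (c < lo || c > hi)
				return 2;
			lo = 0x80;
			hi = 0xbf;
			need--;
			continue;
		}
		if (c <= 0x7f)
			continue;
		nonascii = 1;
		if (c >= 0xc2 && c <= 0xdf) {
			need = 1;
		} else if (c >= 0xe0 && c <= 0xef) {
			need = 2;
			if (c == 0xe0)
				lo = 0xa0;	/* reject overlong forms */
			else if (c == 0xed)
				hi = 0x9f;	/* reject surrogates */
		} else if (c >= 0xf0 && c <= 0xf4) {
			need = 3;
			if (c == 0xf0)
				lo = 0x90;	/* reject overlong forms */
			else if (c == 0xf4)
				hi = 0x8f;	/* reject > U+10FFFF */
		} else {
			return 2;	/* stray continuation or bad lead byte */
		}
	}
	if (need > 0)
		return 2;		/* sequence cut off by EOF */
	return nonascii;
}
```

The caller would still rewind the stream afterwards, as the patch already
does after calling not_utf8().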

-- 
Kind regards,
Hiltjo



Re: Send international text with mail(1) - proposal and patches

2023-09-25 Thread Walter Alejandro Iglesias
This new version, when it detects invalid utf-8 in the body saves a
dead.letter, prints the following message and exits.

  $ mail -s hello user < invalid_utf8.txt
  Invalid or incomplete multibyte or wide character
  . . . message not sent.



Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  25 Sep 2023 13:07:17 -
@@ -32,6 +32,11 @@
 
 #include "rcv.h"
 #include "extern.h"
+#include "locale.h"
+
+/* To check charset of the message and add the appropriate MIME headers  */
+static char nutf8;
+static int not_utf8(FILE *s, int len);
 
 static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
 
@@ -341,6 +346,17 @@ mail1(struct header *hp, int printheader
else
puts("Null message body; hope that's ok");
}
+
+   /* Check non valid UTF-8 characters in the message */
+   nutf8 = not_utf8(mtf, fsize(mtf));
+   rewind(mtf);
+   if (nutf8 > 1) {
+   savedeadletter(mtf);
+   puts("Invalid or incomplete multibyte or wide character");
+   fputs(". . . message not sent.\n", stderr);
+   exit(1);
+   }
+
/*
 * Now, take the user names from the combined
 * to and cc lists and do all the alias
@@ -520,15 +536,30 @@ puthead(struct header *hp, FILE *fo, int
gotcha = 0;
from = hp->h_from ? hp->h_from : value("from");
if (from != NULL)
-   fprintf(fo, "From: %s\n", from), gotcha++;
+   fprintf(fo, "From: %s\n", from),
+   gotcha++;
if (hp->h_to != NULL && w & GTO)
-   fmt("To:", hp->h_to, fo, w), gotcha++;
+   fmt("To:", hp->h_to, fo, w),
+   gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
-   fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   fprintf(fo, "Subject: %s\n", hp->h_subject),
+   gotcha++;
+   if (nutf8 == 0)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=us-ascii\n"
+   "Content-Transfer-Encoding: 7bit\n"),
+   gotcha++;
+   else if (nutf8 == 1)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=utf-8\n"
+   "Content-Transfer-Encoding: 8bit\n"),
+   gotcha++;
if (hp->h_cc != NULL && w & GCC)
-   fmt("Cc:", hp->h_cc, fo, w), gotcha++;
+   fmt("Cc:", hp->h_cc, fo, w),
+   gotcha++;
if (hp->h_bcc != NULL && w & GBCC)
-   fmt("Bcc:", hp->h_bcc, fo, w), gotcha++;
+   fmt("Bcc:", hp->h_bcc, fo, w),
+   gotcha++;
if (gotcha && w & GNL)
(void)putc('\n', fo);
return(0);
@@ -609,4 +640,44 @@ sendint(int s)
 {
 
sendsignal = s;
+}
+
+/* Search non valid UTF-8 characters in the message */
+static int
+not_utf8(FILE *message, int len)
+{
+   int c, count, n, ulen;
+   size_t i, resize;
+   size_t jump = 100;
+   unsigned char *s = NULL;
+
+   setlocale(LC_CTYPE, "en_US.UTF-8");
+
+   if (s == NULL && (s = malloc(jump)) == NULL)
+   err(1, NULL);
+
+   i = count = 0;
+   while ((c = getc(message)) != EOF) {
+   if (s == NULL || count == jump) {
+   if ((s = realloc(s, i + jump + 1)) == NULL)
+   err(1, NULL);
+   count = 0;
+   }
+   s[i++] = c;
+   count++;
+   }
+
+   s[i] = '\0';
+
+   ulen = mbstowcs(NULL, s, 0);
+
+   if (ulen == len)
+   n = 0;
+   else if (ulen < 0)
+   n = 2; 
+   else if (ulen < len)
+   n = 1;
+   
+   free(s);
+   return n;
 }


-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-24 Thread Walter Alejandro Iglesias
On Sun, Sep 24, 2023 at 11:12:10AM -0300, Crystal Kolipe wrote:
> On Sun, Sep 24, 2023 at 12:37:08PM +0200, Walter Alejandro Iglesias wrote:
> > +static int
> > +not_utf8(FILE *message, int len)
> > +{
> > +   int n, ulen;
> > +   unsigned char s[len];
> 
> Please re-read Omar's advice about large unbounded arrays.

Better?


Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  24 Sep 2023 14:54:25 -
@@ -32,6 +32,11 @@
 
 #include "rcv.h"
 #include "extern.h"
+#include "locale.h"
+
+/* To check charset of the message and add the appropriate MIME headers  */
+static char nutf8;
+static int not_utf8(FILE *s, int len);
 
 static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
 
@@ -341,6 +346,11 @@ mail1(struct header *hp, int printheader
else
puts("Null message body; hope that's ok");
}
+
+   /* Check non valid UTF-8 characters in the message */
+   nutf8 = not_utf8(mtf, fsize(mtf));
+   rewind(mtf);
+
/*
 * Now, take the user names from the combined
 * to and cc lists and do all the alias
@@ -520,15 +530,30 @@ puthead(struct header *hp, FILE *fo, int
gotcha = 0;
from = hp->h_from ? hp->h_from : value("from");
if (from != NULL)
-   fprintf(fo, "From: %s\n", from), gotcha++;
+   fprintf(fo, "From: %s\n", from),
+   gotcha++;
if (hp->h_to != NULL && w & GTO)
-   fmt("To:", hp->h_to, fo, w), gotcha++;
+   fmt("To:", hp->h_to, fo, w),
+   gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
-   fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   fprintf(fo, "Subject: %s\n", hp->h_subject),
+   gotcha++;
+   if (nutf8 == 0)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=us-ascii\n"
+   "Content-Transfer-Encoding: 7bit\n"),
+   gotcha++;
+   else if (nutf8 == 1)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=utf-8\n"
+   "Content-Transfer-Encoding: 8bit\n"),
+   gotcha++;
if (hp->h_cc != NULL && w & GCC)
-   fmt("Cc:", hp->h_cc, fo, w), gotcha++;
+   fmt("Cc:", hp->h_cc, fo, w),
+   gotcha++;
if (hp->h_bcc != NULL && w & GBCC)
-   fmt("Bcc:", hp->h_bcc, fo, w), gotcha++;
+   fmt("Bcc:", hp->h_bcc, fo, w),
+   gotcha++;
if (gotcha && w & GNL)
(void)putc('\n', fo);
return(0);
@@ -609,4 +634,44 @@ sendint(int s)
 {
 
sendsignal = s;
+}
+
+/* Search non valid UTF-8 characters in the message */
+static int
+not_utf8(FILE *message, int len)
+{
+   int c, count, n, ulen;
+   size_t i, resize;
+   size_t jump = 100;
+   unsigned char *s = NULL;
+
+   setlocale(LC_CTYPE, "en_US.UTF-8");
+
+   if (s == NULL && (s = malloc(jump)) == NULL)
+   err(1, NULL);
+
+   i = count = 0;
+   while ((c = getc(message)) != EOF) {
+   if (s == NULL || count == jump) {
+   if ((s = realloc(s, i + jump + 1)) == NULL)
+   err(1, NULL);
+   count = 0;
+   }
+   s[i++] = c;
+   count++;
+   }
+
+   s[i] = '\0';
+
+   ulen = mbstowcs(NULL, s, 0);
+
+   if (ulen == len)
+   n = 0;
+   else if (ulen < 0)
+   n = 2; 
+   else if (ulen < len)
+   n = 1;
+   
+   free(s);
+   return n;
 }


-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-24 Thread Crystal Kolipe
On Sun, Sep 24, 2023 at 12:37:08PM +0200, Walter Alejandro Iglesias wrote:
> +static int
> +not_utf8(FILE *message, int len)
> +{
> + int n, ulen;
> + unsigned char s[len];

Please re-read Omar's advice about large unbounded arrays.



Re: Send international text with mail(1) - proposal and patches

2023-09-24 Thread Walter Alejandro Iglesias
Hi Stefan,

Do you like this?


Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  24 Sep 2023 10:33:11 -
@@ -32,6 +32,11 @@
 
 #include "rcv.h"
 #include "extern.h"
+#include "locale.h"
+
+/* To check charset of the message and add the appropriate MIME headers  */
+static char nutf8;
+static int not_utf8(FILE *s, int len);
 
 static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
 
@@ -341,6 +346,11 @@ mail1(struct header *hp, int printheader
else
puts("Null message body; hope that's ok");
}
+
+   /* Check non valid UTF-8 characters in the message */
+   nutf8 = not_utf8(mtf, fsize(mtf));
+   rewind(mtf);
+
/*
 * Now, take the user names from the combined
 * to and cc lists and do all the alias
@@ -520,15 +530,30 @@ puthead(struct header *hp, FILE *fo, int
gotcha = 0;
from = hp->h_from ? hp->h_from : value("from");
if (from != NULL)
-   fprintf(fo, "From: %s\n", from), gotcha++;
+   fprintf(fo, "From: %s\n", from),
+   gotcha++;
if (hp->h_to != NULL && w & GTO)
-   fmt("To:", hp->h_to, fo, w), gotcha++;
+   fmt("To:", hp->h_to, fo, w),
+   gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
-   fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   fprintf(fo, "Subject: %s\n", hp->h_subject),
+   gotcha++;
+   if (nutf8 == 0)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=us-ascii\n"
+   "Content-Transfer-Encoding: 7bit\n"),
+   gotcha++;
+   else if (nutf8 == 1)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=utf-8\n"
+   "Content-Transfer-Encoding: 8bit\n"),
+   gotcha++;
if (hp->h_cc != NULL && w & GCC)
-   fmt("Cc:", hp->h_cc, fo, w), gotcha++;
+   fmt("Cc:", hp->h_cc, fo, w),
+   gotcha++;
if (hp->h_bcc != NULL && w & GBCC)
-   fmt("Bcc:", hp->h_bcc, fo, w), gotcha++;
+   fmt("Bcc:", hp->h_bcc, fo, w),
+   gotcha++;
if (gotcha && w & GNL)
(void)putc('\n', fo);
return(0);
@@ -609,4 +634,25 @@ sendint(int s)
 {
 
sendsignal = s;
+}
+
+/* Search non valid UTF-8 characters in the message */
+static int
+not_utf8(FILE *message, int len)
+{
+   int n, ulen;
+   unsigned char s[len];
+   setlocale(LC_CTYPE, "en_US.UTF-8");
+
+   fread(s, sizeof(char), len, message);
+   ulen = mbstowcs(NULL, s, 0);
+
+   if (ulen == len)
+   n = 0;
+   else if (ulen < 0)
+   n = 2; 
+   else if (ulen < len)
+   n = 1;
+   
+   return n;
 }


-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-24 Thread Walter Alejandro Iglesias
On Sun, Sep 24, 2023 at 09:47:41AM +0200, Stefan Sperling wrote:
> In the UTF-8 locale I can trigger an error message with your program
> by sending the latin1 code for a-acute to stdin. I suppose your test
> command didn't actually send latin1 to stdin for some reason?
> 
>   $ perl -e 'printf "\xe1rbol\n"' | ./a.out
>   error: Illegal byte sequence
> 

Right, I can trigger the error with your command, and also by directly
typing the characters in wscons (my keyboard is Spanish).  What I was
doing in those commands was copying and pasting those latin characters
with my mouse.  The strange thing is that xterm still showed me the (?)
glyphs.  Besides, I made a test using the mbstowcs function in my mail
patch, and it didn't work.  I'll try again.


Thanks Stefan!


-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-24 Thread Stefan Sperling
On Sun, Sep 24, 2023 at 07:06:35AM +0200, Walter Alejandro Iglesias wrote:
> Hi Ingo,
> 
> On Thu, Sep 21, 2023 at 03:04:24PM +0200, Ingo Schwarze wrote:
> > In general, the tool for checking the validity of UTF-8 strings
> > is a simple loop around mblen(3) if you want to report the precise
> > positions of errors found, or simply mbstowcs(3) with a NULL pwcs
> > argument if you are content with a one-bit "valid" or "invalid" answer.
> 
> According to mbstowcs(3):
> 
> RETURN VALUES
>   mbstowcs() returns:
> 
>   0 or positive
> The value returned is the number of elements stored in the array
> pointed to by pwcs, except for a terminating null wide character
> (if any).  If pwcs is not null and the value returned is equal
> to n, the wide-character string pointed to by pwcs is not null
> terminated.  If pwcs is a null pointer, the value returned is
> the number of elements to contain the whole string converted,
> except for a terminating null wide character.
> 
>   (size_t)-1  The array indirectly pointed to by s contains a byte
>   sequence forming invalid character.  In this case,
>   mbstowcs() sets errno to indicate the error.
> 
> ERRORS
>  mbstowcs() may cause an error in the following cases:
> 
>  [EILSEQ]  s points to the string containing invalid or
>incomplete multibyte character.
> 
> 
> To understand what mbstowcs(3) does I wrote the little test.c program
> pasted at the bottom.  In the following examples [a] is a UTF-8 a-acute and
> (a) an ISO Latin-1 a-acute.
> 
> Using setlocale(LC_CTYPE, "en_US.UTF-8");
> 
>   $ cc -g -Wall test.c
>   $ echo -n arbol | a.out
>   ulen: 5
>   $ echo -n [a]rbol | a.out
>   ulen: 5
>   $ echo -n (a)rbol | a.out
>   ulen: 5

In the UTF-8 locale I can trigger an error message with your program
by sending the latin1 code for a-acute to stdin. I suppose your test
command didn't actually send latin1 to stdin for some reason?

  $ perl -e 'printf "\xe1rbol\n"' | ./a.out
  error: Illegal byte sequence

> Using setlocale(LC_CTYPE, "C");
> 
>   $ cc -g -Wall test.c
>   $ echo -n arbol | a.out
>   ulen: 5
>   $ echo -n [a]rbol | a.out
>   ulen: 6
>   $ echo -n (a)rbol | a.out
>   ulen: 7
> 
> And no error message in any case.  I don't understand in which way those
> return values let me know that the third string is invalid UTF-8.  Am I
> doing something wrong?

There is no concept of byte sequences in the C locale, bytes are bytes.
It is not possible to detect invalid UTF-8 via libc while running in the
C locale since the citrus code in libc won't even run. However, the various
ctype tests like isascii((unsigned char)c), isprint((unsigned char)c), and so
on can be used to filter or stub out non-ASCII characters, which is what
users running in the C locale would want.
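
To make the filtering idea concrete, here is a minimal sketch, not taken from
any patch in this thread.  The helper name stub_non_ascii() is hypothetical,
and it compares against 0x7f directly instead of calling isascii() so that it
stays within ISO C.  It assumes the program is still in the default C locale,
where isprint() only accepts printable ASCII.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of mail(1)): replace every byte that is
 * neither printable ASCII nor a tab or newline with '?'.
 * Returns the number of bytes that were replaced.
 */
size_t
stub_non_ascii(char *buf, size_t len)
{
	size_t i, replaced = 0;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		/* Always go through unsigned char before using ctype. */
		unsigned char c = (unsigned char)buf[i];

		if (c == '\n' || c == '\t')
			continue;
		if (c > 0x7f || !isprint(c)) {
			buf[i] = '?';
			replaced++;
		}
	}
	return replaced;
}
```

Fed the latin1 test string from above, this would turn the leading 0xe1 byte
into '?' and leave "rbol" alone.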



Re: Send international text with mail(1) - proposal and patches

2023-09-23 Thread Walter Alejandro Iglesias
Hi Ingo,

On Thu, Sep 21, 2023 at 03:04:24PM +0200, Ingo Schwarze wrote:
> In general, the tool for checking the validity of UTF-8 strings
> is a simple loop around mblen(3) if you want to report the precise
> positions of errors found, or simply mbstowcs(3) with a NULL pwcs
> argument if you are content with a one-bit "valid" or "invalid" answer.

According to mbstowcs(3):

RETURN VALUES
  mbstowcs() returns:

  0 or positive
The value returned is the number of elements stored in the array
pointed to by pwcs, except for a terminating null wide character
(if any).  If pwcs is not null and the value returned is equal
to n, the wide-character string pointed to by pwcs is not null
terminated.  If pwcs is a null pointer, the value returned is
the number of elements to contain the whole string converted,
except for a terminating null wide character.

  (size_t)-1  The array indirectly pointed to by s contains a byte
  sequence forming invalid character.  In this case,
  mbstowcs() sets errno to indicate the error.

ERRORS
 mbstowcs() may cause an error in the following cases:

 [EILSEQ]  s points to the string containing invalid or
   incomplete multibyte character.


To understand what mbstowcs(3) does I wrote the little test.c program
pasted at the bottom.  In the following examples [a] is a UTF-8 a-acute and
(a) an ISO Latin-1 a-acute.

Using setlocale(LC_CTYPE, "en_US.UTF-8");

  $ cc -g -Wall test.c
  $ echo -n arbol | a.out
  ulen: 5
  $ echo -n [a]rbol | a.out
  ulen: 5
  $ echo -n (a)rbol | a.out
  ulen: 5

Using setlocale(LC_CTYPE, "C");

  $ cc -g -Wall test.c
  $ echo -n arbol | a.out
  ulen: 5
  $ echo -n [a]rbol | a.out
  ulen: 6
  $ echo -n (a)rbol | a.out
  ulen: 7

And no error message in any case.  I don't understand in which way those
return values let me know that the third string is invalid UTF-8.  Am I
doing something wrong?


test.c

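#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	int c;
	size_t i, ulen;
	char s[100];

	/* Read stdin into s, leaving room for the terminating NUL */
	i = 0;
	while (i < sizeof(s) - 1 && (c = getchar()) != EOF)
		s[i++] = c;
	s[i] = '\0';

	setlocale(LC_CTYPE, "en_US.UTF-8");
	//setlocale(LC_CTYPE, "C");

	/* A NULL pwcs only counts; (size_t)-1 signals an invalid sequence */
	if ((ulen = mbstowcs(NULL, s, 0)) == (size_t)-1)
		perror("error");

	printf("ulen: %zu\n", ulen);

	return 0;
}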

-- 
Walter



Re: Send international text with mail(1) proposal and patches]

2023-09-23 Thread Crystal Kolipe
On Sat, Sep 23, 2023 at 12:10:41PM +0200, Walter Alejandro Iglesias wrote:
> > On Thu, Sep 21, 2023 at 02:12:50PM +0200, Stefan Sperling wrote:
> > > Your implementation lacks proper bounds checking. It accesses
> > > s[i + 3] based purely on the contents of the input string, without
> > > checking whether len < i + 3. Entering the while (i != len) loop with
> 
> You surely meant "len > i + 3" (greater than).  The patch below is wrong.
> 
> I know it doesn't matter anymore but I'm still clarifying so that no one
> wastes time trying the patch.

The bounds checks that you added are not the correct way to improve this code.

There are a lot of potentially dangerous coding mistakes, mis-understandings
and bad style going on here.  Maybe we wouldn't have the first two, if it
wasn't for the last one.

The very specific issue with s[i + 3] that Stefan mentioned does not exist,
assuming that the code is compiled with a compliant C compiler.  Refer to
ISO 9899 6.5.13: the && operator creates a sequence point and short-circuits,
so the s[i + 3] that Stefan is worried about will not be evaluated unless the
previous s[i + 2] comparisons hold.  Those in turn will not be evaluated
unless the s[i + 1] comparisons hold, which will not be evaluated
unless s[i] == 0xf0.

Therefore, entering the loop with i == len-1 and a specially crafted input
string should not be exploitable IFF s[len]==0, as this would fail the
s[i+1] >= 0x90 test.

Except that you don't strictly terminate the input at s[len].

You terminate the string at s[i], where i is its value after the while
loop, which terminates when you get EOF.  So it's _probably_ going to be
len, but there is no guarantee because getc can return EOF on a read error.

That would mean you terminate the string early, but still use the full
supplied value of len, (which ultimately came from st_size), when you
loop through the characters checking the validity of the UTF-8 stream.

So overall, you have now added error checking code for the wrong reasons
which either nobody noticed, or nobody bothered to point out possibly due
to the unclear style of coding making review more difficult.

Additionally, you have not addressed the more important issue of using
the value of i upon leaving the first loop as the offset for adding the
terminating byte, and the value of len for the actual termination
condition of the second loop.
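
For illustration only, here is a rough sketch of keeping both loops tied to
one length.  The names slurp() and count_high_bytes() are made up for this
example and are not in any posted patch.  The buffer comes from malloc(3)
rather than a VLA, and the scan is bounded by the byte count that fread(3)
really returned, not by the size reported by stat.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read at most max bytes from fp into a freshly
 * allocated, NUL-terminated buffer.  *nread receives the number of bytes
 * actually read, which is the only length later loops should trust.
 * Returns NULL on allocation failure or read error.
 */
unsigned char *
slurp(FILE *fp, size_t max, size_t *nread)
{
	unsigned char *buf;

	assert(fp != NULL && nread != NULL);
	if ((buf = malloc(max + 1)) == NULL)
		return NULL;
	*nread = fread(buf, 1, max, fp);
	if (ferror(fp)) {
		free(buf);
		return NULL;
	}
	buf[*nread] = '\0';	/* terminate where the data really ends */
	return buf;
}

/* Count bytes above 0x7f, bounded by the count that was really read. */
size_t
count_high_bytes(const unsigned char *buf, size_t nread)
{
	size_t i, n = 0;

	for (i = 0; i < nread; i++)
		if (buf[i] > 0x7f)
			n++;
	return n;
}
```

The point of the sketch is only that the terminator and the loop bound come
from the same variable, so a short read cannot leave them disagreeing.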

That might be a useful technique in the Underhanded C Contest, but not
so much in OpenBSD code :-) :-).

But joking aside, your claim that:

On Thu, Sep 21, 2023 at 09:01:25PM +0200, Walter Alejandro Iglesias wrote:
> Notice that you saw the issue in my code (bounds checking) at a first
> glance, that's because my code is neither too complicated (citrus) nor
> too elegant (tmux), hence by far easier to read, understand and debug.
> Among other things it deals with utf-8 without using wchar.h.

is not valid.  In fact, the opposite is true.  Your code is hard to
review, and that increases the chances that bugs will be missed.

To be fair, I wouldn't recommend learning from the code in utf8_isvalid()
in tmux either.  That's also not a great example, and absolutely not
appropriate for inclusion in mail to validate UTF-8 for the purposes
of adding or not adding a content encoding header.

As a final note, surely the solution here is not to add any parsing or
other intelligence to mail itself, but just to optionally allow mail to
call an external user-configurable program with the proposed mail body
and then have that program return a simple 0 or 1 to indicate whether
to include the UTF-8 content header or not?

That way, the user can choose their own level of parsing, or even opt to
have _all_ mail sent as UTF-8, (which is often fine, since 7-bit ASCII is
a valid UTF-8 stream).
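
A minimal sketch of what such a hook could look like, purely as an
illustration: run_checker() is a hypothetical name, the command is handed to
/bin/sh, and the message file is attached to the command's standard input.
Which exit status means "add the UTF-8 header" is left to the caller here,
since that mapping is not specified above.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical hook: run "cmd" through /bin/sh with the file at "path"
 * on its standard input.  Returns the command's exit status, or -1 if
 * the command could not be started or did not exit normally.
 */
int
run_checker(const char *cmd, const char *path)
{
	pid_t pid;
	int fd, status;

	assert(cmd != NULL && path != NULL);
	if ((pid = fork()) == -1)
		return -1;
	if (pid == 0) {
		/* Child: message body becomes stdin of the checker. */
		if ((fd = open(path, O_RDONLY)) == -1 ||
		    dup2(fd, STDIN_FILENO) == -1)
			_exit(127);
		execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
		_exit(127);	/* only reached if execl failed */
	}
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

With something like this, mail itself never has to look at the bytes; a user
who wants every message labelled as UTF-8 could configure a command that
always exits with the same status.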

I'll try to find time to write up some demo code showing reliable ways to
comprehensively parse UTF-8 streams in C.



Re: Send international text with mail(1) proposal and patches]

2023-09-23 Thread Walter Alejandro Iglesias
> On Thu, Sep 21, 2023 at 02:12:50PM +0200, Stefan Sperling wrote:
> > Your implementation lacks proper bounds checking. It accesses
> > s[i + 3] based purely on the contents of the input string, without
> > checking whether len < i + 3. Entering the while (i != len) loop with

You surely meant "len > i + 3" (greater than).  The patch below is wrong.

I know it doesn't matter anymore but I'm still clarifying so that no one
wastes time trying the patch.

> 
> 
> 
> Index: send.c
> ===
> RCS file: /cvs/src/usr.bin/mail/send.c,v
> retrieving revision 1.26
> diff -u -p -r1.26 send.c
> --- send.c8 Mar 2023 04:43:11 -   1.26
> +++ send.c21 Sep 2023 14:16:08 -
> @@ -33,6 +33,10 @@
>  #include "rcv.h"
>  #include "extern.h"
>  
> +/* To check charset of the message and add the appropiate MIME headers  */
> +static char nutf8;
> +static int not_utf8(FILE *s, int len);
> +
>  static volatile sig_atomic_t sendsignal; /* Interrupted by a signal? */
>  
>  /*
> @@ -341,6 +345,11 @@ mail1(struct header *hp, int printheader
>   else
>   puts("Null message body; hope that's ok");
>   }
> +
> + /* Check non valid UTF-8 characters in the message */
> + nutf8 = not_utf8(mtf, fsize(mtf));
> + rewind(mtf);
> +
>   /*
>* Now, take the user names from the combined
>* to and cc lists and do all the alias
> @@ -525,6 +534,14 @@ puthead(struct header *hp, FILE *fo, int
>   fmt("To:", hp->h_to, fo, w), gotcha++;
>   if (hp->h_subject != NULL && w & GSUBJECT)
>   fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
> + if (nutf8 == 0)
> + fprintf(fo, "MIME-Version: 1.0\n"
> + "Content-Type: text/plain; charset=us-ascii\n"
> + "Content-Transfer-Encoding: 7bit\n"), gotcha++;
> + else if (nutf8 == 1)
> + fprintf(fo, "MIME-Version: 1.0\n"
> + "Content-Type: text/plain; charset=utf-8\n"
> + "Content-Transfer-Encoding: 8bit\n"), gotcha++;
>   if (hp->h_cc != NULL && w & GCC)
>   fmt("Cc:", hp->h_cc, fo, w), gotcha++;
>   if (hp->h_bcc != NULL && w & GBCC)
> @@ -609,4 +626,60 @@ sendint(int s)
>  {
>  
>   sendsignal = s;
> +}
> +
> +/* Search non valid UTF-8 characters in the message */
> +static int
> +not_utf8(FILE *message, int len)
> +{
> + int i, n, nonascii;
> + char c;
> + unsigned char s[len + 1];
> +
> + i = 0;
> +while ((c = getc(message)) != EOF)
> + s[i++] = c;
> +
> + s[i] = '\0';
> +
> + i = n = nonascii = 0;
> + while (i != len)
> + if (s[i] <= 0x7f) {
> + i++;
> + /* Two bytes case */
> + } else if (len < i + 1 && s[i] >= 0xc2 && s[i] < 0xe0 &&
> + s[i + 1] >= 0x80 && s[i + 1] <= 0xbf) {
> + i += 2;
> + nonascii++;
> + /* Special three bytes case */
> + } else if ((len < i + 2 && s[i] == 0xe0 &&
> + s[i + 1] >= 0xa0 && s[i + 1] <= 0xbf &&
> + s[i + 2] >= 0x80 && s[i + 2] <= 0xbf) ||
> + /* Three bytes case */
> + (len < i + 2 && s[i] > 0xe0 && s[i] < 0xf0 &&
> + s[i + 1] >= 0x80 && s[i + 1] <= 0xbf &&
> + s[i + 2] >= 0x80 && s[i + 2] <= 0xbf)) {
> + i += 3;
> + nonascii++;
> + /* Special four bytes case */
> + } else if ((len < i + 3 && s[i] == 0xf0 &&
> + s[i + 1] >= 0x90 && s[i + 1] <= 0xbf &&
> + s[i + 2] >= 0x80 && s[i + 2] <= 0xbf &&
> + s[i + 3] >= 0x80 && s[i + 3] <= 0xbf) ||
> + /* Four bytes case */
> + (len < i + 3 && s[i] > 0xf0 &&
> + s[i + 1] >= 0x80 && s[i + 1] <= 0xbf &&
> + s[i + 2] >= 0x80 && s[i + 2] <= 0xbf &&
> + s[i + 3] >= 0x80 && s[i + 3] <= 0xbf)) {
> + i += 4;
> + nonascii++;
> + } else {
> + n = i + 1;
> + break;
> + }
> +
> + if (nonascii)
> + n++;
> +
> + return n;
>  }
> 
> 
> -- 
> Walter

-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-21 Thread Walter Alejandro Iglesias
On Fri, Sep 22, 2023 at 06:57:24AM +0200, Walter Alejandro Iglesias wrote:
> Below, a version without a UTF-8 parser.  I added an ASCII check for the
> subject.  The day will come when wscons supports UTF-8, right?  In the
> meantime, just by being careful not to type iso-latin characters while
> using mail on wscons this version does its job.

The last version dumped core when sending without a subject.  I fixed
that by adding a NULL check to the loop condition:

while (hp->h_subject != NULL && hp->h_subject[i] != '\0') {



Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  22 Sep 2023 05:47:37 -
@@ -33,6 +33,10 @@
 #include "rcv.h"
 #include "extern.h"
 
+/* This will be used to add MIME headers */
+static char noascii_subject;
+static char noascii_body;
+
 static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
 
 /*
@@ -341,6 +345,22 @@ mail1(struct header *hp, int printheader
else
puts("Null message body; hope that's ok");
}
+
+   /* Check for non ascii characters in the subject */
+   int i, ch;
+   i = 0;
+   while (hp->h_subject != NULL && hp->h_subject[i] != '\0') {
+   if (!isascii(hp->h_subject[i]))
+   noascii_subject = 1;
+   i++;
+   }
+
+   /* Check for non ascii characters in the body */
+   while ((ch = getc(mtf)) != EOF)
+   if (!isascii(ch))
+   noascii_body = 1;
+   rewind(mtf);
+
/*
 * Now, take the user names from the combined
 * to and cc lists and do all the alias
@@ -524,7 +544,18 @@ puthead(struct header *hp, FILE *fo, int
if (hp->h_to != NULL && w & GTO)
fmt("To:", hp->h_to, fo, w), gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
-   fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   fprintf(fo, "Subject: %s\n", hp->h_subject),
+   gotcha++;
+   if (noascii_subject || noascii_body)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=utf-8\n"
+   "Content-Transfer-Encoding: 8bit\n"),
+   gotcha++;
+   else
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=us-ascii\n"
+   "Content-Transfer-Encoding: 7bit\n"),
+   gotcha++;
if (hp->h_cc != NULL && w & GCC)
fmt("Cc:", hp->h_cc, fo, w), gotcha++;
if (hp->h_bcc != NULL && w & GBCC)
@@ -607,6 +638,5 @@ savemail(char *name, FILE *fi)
 void
 sendint(int s)
 {
-
sendsignal = s;
 }

-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-21 Thread Walter Alejandro Iglesias
Hi Ingo,

On Thu, Sep 21, 2023 at 03:04:24PM +0200, Ingo Schwarze wrote:
> As Stefan says, adding a hand-written UTF-8 parser to mail(1) is
> clearly not acceptable.

Below, a version without a UTF-8 parser.  I added an ASCII check for the
subject.  The day will come when wscons supports UTF-8, right?  In the
meantime, just by being careful not to type iso-latin characters while
using mail on wscons this version does its job.

> 
> Yours,
>   Ingo


Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  22 Sep 2023 03:54:37 -
@@ -33,6 +33,10 @@
 #include "rcv.h"
 #include "extern.h"
 
+/* This will be used to add MIME headers */
+static char noascii_subject;
+static char noascii_body;
+
 static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
 
 /*
@@ -341,6 +345,22 @@ mail1(struct header *hp, int printheader
else
puts("Null message body; hope that's ok");
}
+
+   /* Check for non ascii characters in the subject */
+   int i, ch;
+   i = 0;
+   while (hp->h_subject[i] != '\0') {
+   if (!isascii(hp->h_subject[i]))
+   noascii_subject = 1;
+   i++;
+   }
+
+   /* Check for non ascii characters in the body */
+   while ((ch = getc(mtf)) != EOF)
+   if (!isascii(ch))
+   noascii_body = 1;
+   rewind(mtf);
+
/*
 * Now, take the user names from the combined
 * to and cc lists and do all the alias
@@ -524,7 +544,18 @@ puthead(struct header *hp, FILE *fo, int
if (hp->h_to != NULL && w & GTO)
fmt("To:", hp->h_to, fo, w), gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
-   fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   fprintf(fo, "Subject: %s\n", hp->h_subject),
+   gotcha++;
+   if (noascii_subject || noascii_body)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=utf-8\n"
+   "Content-Transfer-Encoding: 8bit\n"),
+   gotcha++;
+   else
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=us-ascii\n"
+   "Content-Transfer-Encoding: 7bit\n"),
+   gotcha++;
if (hp->h_cc != NULL && w & GCC)
fmt("Cc:", hp->h_cc, fo, w), gotcha++;
if (hp->h_bcc != NULL && w & GBCC)
@@ -607,6 +638,5 @@ savemail(char *name, FILE *fi)
 void
 sendint(int s)
 {
-
sendsignal = s;
 }


-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-21 Thread Walter Alejandro Iglesias
On Thu, Sep 21, 2023 at 02:12:50PM +0200, Stefan Sperling wrote:
> On Thu, Sep 21, 2023 at 01:25:01PM +0200, Walter Alejandro Iglesias wrote:
> > I corrected many of the things you pointed me, but not all.  The
> > function I use to check utf8 is mine, I use it in a pair of little
> > programs which I've *hardly* checked for memory leaks.  I know my
> > function looks BIG :-), but I know for sure that it does the job.
> 
> We already have code in libc that does this, see the function
> _citrus_utf8_ctype_mbrtowc in lib/libc/citrus/citrus_utf8.c.
> Please use the libc interface if at all possible, it is best to
> have just one place to fix when a UTF-8 parser bug is found.
> 
> There is also utf8_isvalid() in tmux utf8.c though you would
> have to trim tmux UTF-8 code down for your narrow use case.
> 
> Your implementation lacks proper bounds checking. It accesses
> s[i + 3] based purely on the contents of the input string, without
> checking whether len < i + 3. Entering the while (i != len) loop with
> i == len-1 and a specially crafted input string can be problematic.

Hey Stefan,

I'll give up for now.  Another day I'll investigate and try to understand
what you're asking me, because so far I fail to see how what you propose
could facilitate maintenance or reduce bugs.

Notice that you saw the issue in my code (bounds checking) at a first
glance, that's because my code is neither too complicated (citrus) nor
too elegant (tmux), hence by far easier to read, understand and debug.
Among other things it deals with utf-8 without using wchar.h.

I'm sorry you don't like it.  Anyways, in case someone else can do
something with it, here is my last version with the boundary check.



Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  21 Sep 2023 14:16:08 -
@@ -33,6 +33,10 @@
 #include "rcv.h"
 #include "extern.h"
 
+/* To check charset of the message and add the appropiate MIME headers  */
+static char nutf8;
+static int not_utf8(FILE *s, int len);
+
 static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
 
 /*
@@ -341,6 +345,11 @@ mail1(struct header *hp, int printheader
else
puts("Null message body; hope that's ok");
}
+
+   /* Check non valid UTF-8 characters in the message */
+   nutf8 = not_utf8(mtf, fsize(mtf));
+   rewind(mtf);
+
/*
 * Now, take the user names from the combined
 * to and cc lists and do all the alias
@@ -525,6 +534,14 @@ puthead(struct header *hp, FILE *fo, int
fmt("To:", hp->h_to, fo, w), gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   if (nutf8 == 0)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=us-ascii\n"
+   "Content-Transfer-Encoding: 7bit\n"), gotcha++;
+   else if (nutf8 == 1)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=utf-8\n"
+   "Content-Transfer-Encoding: 8bit\n"), gotcha++;
if (hp->h_cc != NULL && w & GCC)
fmt("Cc:", hp->h_cc, fo, w), gotcha++;
if (hp->h_bcc != NULL && w & GBCC)
@@ -609,4 +626,60 @@ sendint(int s)
 {
 
sendsignal = s;
+}
+
+/* Search non valid UTF-8 characters in the message */
+static int
+not_utf8(FILE *message, int len)
+{
+   int i, n, nonascii;
+   char c;
+   unsigned char s[len + 1];
+
+   i = 0;
+while ((c = getc(message)) != EOF)
+   s[i++] = c;
+
+   s[i] = '\0';
+
+   i = n = nonascii = 0;
+   while (i != len)
+   if (s[i] <= 0x7f) {
+   i++;
+   /* Two bytes case */
+   } else if (len < i + 1 && s[i] >= 0xc2 && s[i] < 0xe0 &&
+   s[i + 1] >= 0x80 && s[i + 1] <= 0xbf) {
+   i += 2;
+   nonascii++;
+   /* Special three bytes case */
+   } else if ((len < i + 2 && s[i] == 0xe0 &&
+   s[i + 1] >= 0xa0 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf) ||
+   /* Three bytes case */
+   (len < i + 2 && s[i] > 0xe0 && s[i] < 0xf0 &&
+   s[i + 1] >= 0x80 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf)) {
+   i += 3;
+   nonascii++;
+   /* Special four bytes case */
+   } else if ((len < i + 3 && s[i] == 0xf0 &&
+   s[i + 1] >= 0x90 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf &&
+   s[i + 3] >= 0x80 && s[i + 3] <= 0xbf) ||
+  

Re: Send international text with mail(1) - proposal and patches

2023-09-21 Thread Ingo Schwarze
Hi,

i fear this is getting a bit out of hand...

Stefan Sperling wrote on Thu, Sep 21, 2023 at 02:12:50PM +0200:
> On Thu, Sep 21, 2023 at 01:25:01PM +0200, Walter Alejandro Iglesias wrote:

>> I corrected many of the things you pointed me, but not all.  The
>> function I use to check utf8 is mine, I use it in a pair of little
>> programs which I've *hardly* checked for memory leaks.  I know my
>> function looks BIG :-), but I know for sure that it does the job.

> We already have code in libc that does this, see the function
> _citrus_utf8_ctype_mbrtowc in lib/libc/citrus/citrus_utf8.c.
> Please use the libc interface if at all possible, it is best to
> have just one place to fix when a UTF-8 parser bug is found.

In general, the tool for checking the validity of UTF-8 strings
is a simple loop around mblen(3) if you want to report the precise
positions of errors found, or simply mbstowcs(3) with a NULL pwcs
argument if you are content with a one-bit "valid" or "invalid" answer.
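
As a rough sketch of those two approaches (the function names mb_valid() and
mb_first_error() are invented for this example): both rely on the current
LC_CTYPE locale, so the caller is assumed to have selected a UTF-8 locale
with setlocale(3) beforehand, and whether a locale such as "en_US.UTF-8" is
installed depends on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * One-bit answer: 1 if s is a valid multibyte string in the current
 * LC_CTYPE encoding, 0 otherwise.  With a NULL pwcs, mbstowcs(3) only
 * counts characters and returns (size_t)-1 on an invalid sequence.
 */
int
mb_valid(const char *s)
{
	assert(s != NULL);
	return mbstowcs(NULL, s, 0) != (size_t)-1;
}

/*
 * Position answer: byte offset of the first invalid or incomplete
 * sequence in s, or -1 if the whole string converts cleanly.
 */
long
mb_first_error(const char *s)
{
	size_t off = 0, len;
	int n;

	assert(s != NULL);
	len = strlen(s);
	(void)mblen(NULL, 0);		/* reset the conversion state */
	while (off < len) {
		n = mblen(s + off, len - off);
		if (n <= 0)		/* -1: invalid or truncated */
			return (long)off;
		off += (size_t)n;
	}
	return -1;
}
```

In the C locale there are no multibyte sequences to validate, so these
helpers only say something about UTF-8 when a UTF-8 locale is active.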

But checking the validity of UTF-8 is probably beyond the scope of a
simple tool like mail(1), i think.  All i suggested was checking the
validity of US-ASCII when that encoding is selected - in a separate
patch to be considered *after* support for the MIME headers has gone in.

As Stefan says, adding a hand-written UTF-8 parser to mail(1) is
clearly not acceptable.

Yours,
  Ingo



Re: Send international text with mail(1) - proposal and patches

2023-09-21 Thread Stefan Sperling
On Thu, Sep 21, 2023 at 01:25:01PM +0200, Walter Alejandro Iglesias wrote:
> I corrected many of the things you pointed me, but not all.  The
> function I use to check utf8 is mine, I use it in a pair of little
> programs which I've *hardly* checked for memory leaks.  I know my
> function looks BIG :-), but I know for sure that it does the job.

We already have code in libc that does this, see the function
_citrus_utf8_ctype_mbrtowc in lib/libc/citrus/citrus_utf8.c.
Please use the libc interface if at all possible, it is best to
have just one place to fix when a UTF-8 parser bug is found.

There is also utf8_isvalid() in tmux utf8.c though you would
have to trim tmux UTF-8 code down for your narrow use case.

Your implementation lacks proper bounds checking. It accesses
s[i + 3] based purely on the contents of the input string, without
checking whether len < i + 3. Entering the while (i != len) loop with
i == len-1 and a specially crafted input string can be problematic.



Re: Send international text with mail(1) - proposal and patches

2023-09-21 Thread Walter Alejandro Iglesias
On Thu, Sep 21, 2023 at 11:26:11AM +0200, Omar Polo wrote:
> On 2023/09/21 10:55:47 +0200, Walter Alejandro Iglesias  
> wrote:
> > On Wed, Sep 20, 2023 at 08:36:23PM +0200, Walter Alejandro Iglesias wrote:
> > > On Wed, Sep 20, 2023 at 07:44:12PM +0200, Walter Alejandro Iglesias wrote:
> > > > And this new idea simplifies all to this:
> > > 
> > > In case anyone else is worried.  Crystal Kolipe already pointed me out
> > > that a better UTF-8 checking is needed, I know, I'll get to that
> > > tomorrow.
> > 
> > The following version checks for not valid UTF-8 characters.  I could
> > make it fail in this case and send a dead.letter but I imagine that
> > those who really use mail(1) surely do it mostly in a tty console where,
> > at least with a non US keyboard, is too easy to type some non valid utf-8
> > character, hence this feature would be more a hassle than a help, so I
> > chose to make it simply skip adding any MIME header in this case (how it
> > has been used until now and no one complained :-)).  If you prefer the
> > other behavior let me know.
> > 
> > 
> > Index: send.c
> > ===
> > RCS file: /cvs/src/usr.bin/mail/send.c,v
> > retrieving revision 1.26
> > diff -u -p -r1.26 send.c
> > --- send.c  8 Mar 2023 04:43:11 -   1.26
> > +++ send.c  21 Sep 2023 08:40:11 -
> > @@ -33,6 +33,15 @@
> >  #include "rcv.h"
> >  #include "extern.h"
> >  
> > +/*
> > + * Variables and functions declared here will be useful to check the
> > + * character set of the message to add the appropiate MIME headers.
> > + */
> > +static char nascii = 0;
> > +static char nutf8 = 0;
> 
> There's no need to explicitly zero static (or global) variables.
> 
> > +static int not_ascii(struct __sFILE *s);
> > +static int not_utf8(struct __sFILE *s, int len);
> 
> I'd use FILE * instead of struct __sFILE
> 
> >  static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
> >  
> >  /*
> > @@ -341,6 +350,15 @@ mail1(struct header *hp, int printheader
> > else
> > puts("Null message body; hope that's ok");
> > }
> > +
> > +   /* Check for non ASCII characters in the message */
> > +   nascii = not_ascii(mtf);
> > +   rewind(mtf);
> > +
> > +   /* Check for non valid UTF-8 characters in the message */
> > +   nutf8 = not_utf8(mtf, fsize(mtf));
> > +   rewind(mtf);
> 
> Assuming that we care about these two checks, why not do everything in
> a single pass?
> 
> Do we really need the two checks?
> 
> > /*
> >  * Now, take the user names from the combined
> >  * to and cc lists and do all the alias
> > @@ -525,6 +543,14 @@ puthead(struct header *hp, FILE *fo, int
> > fmt("To:", hp->h_to, fo, w), gotcha++;
> > if (hp->h_subject != NULL && w & GSUBJECT)
> > fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
> > +   if (!nascii)
> > +   fprintf(fo, "MIME-Version: 1.0\n"
> > +   "Content-Type: text/plain; charset=us-ascii\n"
> > +   "Content-Transfer-Encoding: 7bit\n"), gotcha++;
> 
> +1 for splitting the string across multiple lines; this is an improvement
> over previous versions, but please
> 
>  - use four spaces of indentation for continuation lines
> 
>  - although existing code uses ", gotcha++" I'd split that in a
>separate line for clarity.
> 
> > +   else if (nutf8 == 0)
> > +   fprintf(fo, "MIME-Version: 1.0\n"
> > +   "Content-Type: text/plain; charset=utf-8\n"
> > +   "Content-Transfer-Encoding: 8bit\n"), gotcha++;
> > if (hp->h_cc != NULL && w & GCC)
> > fmt("Cc:", hp->h_cc, fo, w), gotcha++;
> > if (hp->h_bcc != NULL && w & GBCC)
> > @@ -609,4 +635,67 @@ sendint(int s)
> >  {
> >  
> > sendsignal = s;
> > +}
> > +
> > +/* Search non ASCII characters in the message */
> > +static int
> > +not_ascii(struct __sFILE *s)
> > +{
> > +   int ch, n;
> > +   n = 0;
> > +while ((ch = getc(s)) != EOF)
> 
> There are some spacing issues, both here and below.
> 
> > +if (ch > 0x7f)
> > +   n = 1;
> > +
> > +   return n;
> > +}
> > +
> > +/* Search non valid UTF-8 characters in the message */
> > +static int
> > +not_utf8(struct __sFILE *message, int len)
> > +{
> > +   int i, nou8;
> > +   char c;
> > +   unsigned char s[len + 1];
> 
> Please don't.  Variable length arrays (VLA) with a possibly large len
> are a bad idea.  They have more or less the same issues as alloca(3),
> see the CAVEATS section of it to get an idea.
> 
> > +
> > +   i = 0;
> > +while ((c = getc(message)) != EOF)
> > +   s[i++] = c;
> 
> and even then, fread() is simpler :-)
> 
> > +   s[i] = '\0';
> > +
> > +   i = nou8 = 0;
> > +   while (i != len)
> 
> ...and even then, mbtowc is easier to use.  See Ingo's
> /src/usr/bin/ls/utf8.c for an example usage.
> 
> > +   if (s[i] <= 0x7f)
> > +   ++i;
> > +   /* 

Re: Send international text with mail(1) - proposal and patches

2023-09-21 Thread Crystal Kolipe
On Thu, Sep 21, 2023 at 11:26:11AM +0200, Omar Polo wrote:
> Do we really need the two checks?

FWIW, my original suggestion, made off-list, was about checking for 0xfe and
0xff only:

Crystal wrote:
> 0xfe and 0xff are invalid in utf-8.
> 
> It _might_ be worth detecting them and in this case not outputting any mime
> headers at all, since the data would be neither us-ascii nor valid utf-8, and
> therefore possibly some other encoding, (that the user is aware of and
> handling correctly themselves).
> 
> OTOH, if we're not doing a complete check for valid utf-8, maybe such a
> partial check is worse than no check at all.

I _didn't_ advocate putting a whole utf-8 parser in.

The rationale is that seeing 0xfe or 0xff immediately makes it an invalid
utf-8 stream, and in that case the chances of it being a different 8-bit
encoding become much more likely, but we don't know for sure so best do
no further processing of headers.

Also, 0xff can easily turn up in input piped from other broken or exploited
code, so maybe in that case we also don't want to do further processing.

The single loop checking for ascii characters could easily check 0xfe and
0xff with a trivial change.
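
To show how small that change is, here is a sketch of such a single pass.
It works on a memory buffer for simplicity, whereas the patches in this
thread read from a FILE *, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-way result for a single scan of the body. */
enum body_class {
	BODY_ASCII,	/* only bytes <= 0x7f seen */
	BODY_8BIT,	/* high bytes seen, but no 0xfe or 0xff */
	BODY_NOT_UTF8	/* 0xfe or 0xff seen: cannot be UTF-8 */
};

enum body_class
classify_body(const unsigned char *buf, size_t len)
{
	enum body_class cls = BODY_ASCII;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		/* 0xfe and 0xff never occur in a UTF-8 stream. */
		if (buf[i] == 0xfe || buf[i] == 0xff)
			return BODY_NOT_UTF8;
		if (buf[i] > 0x7f)
			cls = BODY_8BIT;
	}
	return cls;
}
```

Note that BODY_8BIT does not claim the data is valid UTF-8; it only means
the cheap check found nothing that rules it out.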



Re: Send international text with mail(1) - proposal and patches

2023-09-21 Thread Omar Polo
On 2023/09/21 10:55:47 +0200, Walter Alejandro Iglesias  
wrote:
> On Wed, Sep 20, 2023 at 08:36:23PM +0200, Walter Alejandro Iglesias wrote:
> > On Wed, Sep 20, 2023 at 07:44:12PM +0200, Walter Alejandro Iglesias wrote:
> > > And this new idea simplifies all to this:
> > 
> > In case anyone else is worried.  Crystal Kolipe already pointed me out
> > that a better UTF-8 checking is needed, I know, I'll get to that
> > tomorrow.
> 
> The following version checks for not valid UTF-8 characters.  I could
> make it fail in this case and send a dead.letter but I imagine that
> those who really use mail(1) surely do it mostly in a tty console where,
> at least with a non US keyboard, is too easy to type some non valid utf-8
> character, hence this feature would be more a hassle than a help, so I
> chose to make it simply skip adding any MIME header in this case (how it
> has been used until now and no one complained :-)).  If you prefer the
> other behavior let me know.
> 
> 
> Index: send.c
> ===
> RCS file: /cvs/src/usr.bin/mail/send.c,v
> retrieving revision 1.26
> diff -u -p -r1.26 send.c
> --- send.c8 Mar 2023 04:43:11 -   1.26
> +++ send.c21 Sep 2023 08:40:11 -
> @@ -33,6 +33,15 @@
>  #include "rcv.h"
>  #include "extern.h"
>  
> +/*
> + * Variables and functions declared here will be useful to check the
> + * character set of the message to add the appropiate MIME headers.
> + */
> +static char nascii = 0;
> +static char nutf8 = 0;

There's no need to explicitly zero static (or global) variables.

> +static int not_ascii(struct __sFILE *s);
> +static int not_utf8(struct __sFILE *s, int len);

I'd use FILE * instead of struct __sFILE

>  static volatile sig_atomic_t sendsignal; /* Interrupted by a signal? */
>  
>  /*
> @@ -341,6 +350,15 @@ mail1(struct header *hp, int printheader
>   else
>   puts("Null message body; hope that's ok");
>   }
> +
> + /* Check for non ASCII characters in the message */
> + nascii = not_ascii(mtf);
> + rewind(mtf);
> +
> + /* Check for non valid UTF-8 characters in the message */
> + nutf8 = not_utf8(mtf, fsize(mtf));
> + rewind(mtf);

Assuming that we care about these two checks, why not do everything in
a single pass?

Do we really need the two checks?
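
For illustration only, here is one way both checks could be folded into a
single pass over the file.  This is a sketch under assumptions, not what the
patch does: classify_body() and the BODY_* names are hypothetical, and the
byte ranges follow the well-formed sequences listed in RFC 3629.  It reads
with getc() only, so it needs no buffer sized after the message.

```c
#include <stdio.h>

/* Hypothetical names; none of them appear in the posted patch. */
enum body_charset { BODY_ASCII, BODY_UTF8, BODY_UNKNOWN };

static enum body_charset
classify_body(FILE *fp)
{
	enum body_charset res = BODY_ASCII;
	int ch;				/* int so that EOF is distinct from 0xff */
	int need = 0;			/* continuation bytes still expected */
	int lo = 0x80, hi = 0xbf;	/* range allowed for the next one */

	while ((ch = getc(fp)) != EOF) {
		if (need > 0) {
			if (ch < lo || ch > hi)
				return BODY_UNKNOWN;
			lo = 0x80;
			hi = 0xbf;
			need--;
			continue;
		}
		if (ch <= 0x7f)
			continue;
		res = BODY_UTF8;
		if (ch >= 0xc2 && ch <= 0xdf) {
			need = 1;
		} else if (ch >= 0xe0 && ch <= 0xef) {
			need = 2;
			if (ch == 0xe0)
				lo = 0xa0;	/* reject overlong forms */
			else if (ch == 0xed)
				hi = 0x9f;	/* reject UTF-16 surrogates */
		} else if (ch >= 0xf0 && ch <= 0xf4) {
			need = 3;
			if (ch == 0xf0)
				lo = 0x90;	/* reject overlong forms */
			else if (ch == 0xf4)
				hi = 0x8f;	/* nothing above U+10FFFF */
		} else {
			return BODY_UNKNOWN;	/* 0x80-0xc1 or 0xf5-0xff */
		}
	}
	/* A sequence cut short by EOF is not valid either. */
	return need == 0 ? res : BODY_UNKNOWN;
}
```

With a result like this, the header code could emit us-ascii/7bit for
BODY_ASCII, utf-8/8bit for BODY_UTF8, and no MIME headers at all for
BODY_UNKNOWN, which matches the behavior described at the top of the patch.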

>   /*
>* Now, take the user names from the combined
>* to and cc lists and do all the alias
> @@ -525,6 +543,14 @@ puthead(struct header *hp, FILE *fo, int
>   fmt("To:", hp->h_to, fo, w), gotcha++;
>   if (hp->h_subject != NULL && w & GSUBJECT)
>   fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
> + if (!nascii)
> + fprintf(fo, "MIME-Version: 1.0\n"
> + "Content-Type: text/plain; charset=us-ascii\n"
> + "Content-Transfer-Encoding: 7bit\n"), gotcha++;

+1 for splitting the string across multiple lines; this is an improvement
over previous versions, but please

 - use four spaces of indentation for continuation lines

 - although existing code uses ", gotcha++" I'd split that in a
   separate line for clarity.

> + else if (nutf8 == 0)
> + fprintf(fo, "MIME-Version: 1.0\n"
> + "Content-Type: text/plain; charset=utf-8\n"
> + "Content-Transfer-Encoding: 8bit\n"), gotcha++;
>   if (hp->h_cc != NULL && w & GCC)
>   fmt("Cc:", hp->h_cc, fo, w), gotcha++;
>   if (hp->h_bcc != NULL && w & GBCC)
> @@ -609,4 +635,67 @@ sendint(int s)
>  {
>  
>   sendsignal = s;
> +}
> +
> +/* Search non ASCII characters in the message */
> +static int
> +not_ascii(struct __sFILE *s)
> +{
> + int ch, n;
> + n = 0;
> +while ((ch = getc(s)) != EOF)

There are some spacing issues, both here and below.

> +if (ch > 0x7f)
> + n = 1;
> +
> + return n;
> +}
> +
> +/* Search non valid UTF-8 characters in the message */
> +static int
> +not_utf8(struct __sFILE *message, int len)
> +{
> + int i, nou8;
> + char c;
> + unsigned char s[len + 1];

Please don't.  Variable length arrays (VLA) with a possibly large len
are a bad idea.  They have more or less the same issues as alloca(3);
see the CAVEATS section of that manual page to get an idea.

> +
> + i = 0;
> +while ((c = getc(message)) != EOF)
> + s[i++] = c;

and even then, fread() is simpler :-)

> + s[i] = '\0';
> +
> + i = nou8 = 0;
> + while (i != len)

...and even then, mbtowc is easier to use.  See Ingo's
/src/usr/bin/ls/utf8.c for an example usage.
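
A rough sketch of the mbtowc(3) approach, written from the manual page and not
copied from ls.  decodes_cleanly() is a made-up name, and the answer only says
something about UTF-8 if setlocale(LC_CTYPE, ...) has selected a UTF-8 locale
beforehand; in any other locale it checks that locale's encoding instead.

```c
#include <locale.h>	/* the caller is expected to call setlocale(3) */
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return 1 if the len bytes at s decode without
 * error in the current LC_CTYPE, 0 otherwise.
 */
static int
decodes_cleanly(const char *s, size_t len)
{
	wchar_t wc;
	size_t i = 0;
	int n;

	(void)mbtowc(NULL, NULL, 0);	/* reset the conversion state */
	while (i < len) {
		n = mbtowc(&wc, s + i, len - i);
		if (n == -1) {
			(void)mbtowc(NULL, NULL, 0);
			return 0;	/* invalid or truncated sequence */
		}
		/* mbtowc() returns 0 for an embedded NUL; step over it. */
		i += n > 0 ? (size_t)n : 1;
	}
	return 1;
}
```

The design trade-off is that the C library does the range checks, at the cost
of making the result depend on the locale the program runs in.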

> + if (s[i] <= 0x7f)
> + ++i;
> + /* Two bytes case */
> + else if (s[i] >= 0xc2 && s[i] < 0xe0 &&
> + s[i + 1] >= 0x80 && s[i + 1] <= 0xbf)
> + i += 2;
> + /* Special three bytes case */
> + else if ((s[i] == 0xe0 &&
> + 

Re: Send international text with mail(1) - proposal and patches

2023-09-21 Thread Walter Alejandro Iglesias
On Wed, Sep 20, 2023 at 08:36:23PM +0200, Walter Alejandro Iglesias wrote:
> On Wed, Sep 20, 2023 at 07:44:12PM +0200, Walter Alejandro Iglesias wrote:
> > And this new idea simplifies all to this:
> 
> In case anyone else is worried: Crystal Kolipe already pointed out to me
> that better UTF-8 checking is needed.  I know, and I'll get to that
> tomorrow.

The following version checks for invalid UTF-8 characters.  I could
make it fail in this case and save a dead.letter, but I imagine that
those who really use mail(1) mostly do so in a tty console where,
at least with a non-US keyboard, it is too easy to type an invalid UTF-8
character.  Such a feature would be more of a hassle than a help, so I
chose to make it simply skip adding any MIME header in this case (which
is how it has been used until now and no one complained :-)).  If you
prefer the other behavior let me know.


Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  21 Sep 2023 08:40:11 -
@@ -33,6 +33,15 @@
 #include "rcv.h"
 #include "extern.h"
 
+/*
+ * Variables and functions declared here will be useful to check the
+ * character set of the message to add the appropiate MIME headers.
+ */
+static char nascii = 0;
+static char nutf8 = 0;
+static int not_ascii(struct __sFILE *s);
+static int not_utf8(struct __sFILE *s, int len);
+
 static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
 
 /*
@@ -341,6 +350,15 @@ mail1(struct header *hp, int printheader
else
puts("Null message body; hope that's ok");
}
+
+   /* Check for non ASCII characters in the message */
+   nascii = not_ascii(mtf);
+   rewind(mtf);
+
+   /* Check for non valid UTF-8 characters in the message */
+   nutf8 = not_utf8(mtf, fsize(mtf));
+   rewind(mtf);
+
/*
 * Now, take the user names from the combined
 * to and cc lists and do all the alias
@@ -525,6 +543,14 @@ puthead(struct header *hp, FILE *fo, int
fmt("To:", hp->h_to, fo, w), gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   if (!nascii)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=us-ascii\n"
+   "Content-Transfer-Encoding: 7bit\n"), gotcha++;
+   else if (nutf8 == 0)
+   fprintf(fo, "MIME-Version: 1.0\n"
+   "Content-Type: text/plain; charset=utf-8\n"
+   "Content-Transfer-Encoding: 8bit\n"), gotcha++;
if (hp->h_cc != NULL && w & GCC)
fmt("Cc:", hp->h_cc, fo, w), gotcha++;
if (hp->h_bcc != NULL && w & GBCC)
@@ -609,4 +635,67 @@ sendint(int s)
 {
 
sendsignal = s;
+}
+
+/* Search non ASCII characters in the message */
+static int
+not_ascii(struct __sFILE *s)
+{
+   int ch, n;
+   n = 0;
+while ((ch = getc(s)) != EOF)
+if (ch > 0x7f)
+   n = 1;
+
+   return n;
+}
+
+/* Search non valid UTF-8 characters in the message */
+static int
+not_utf8(struct __sFILE *message, int len)
+{
+   int i, nou8;
+   char c;
+   unsigned char s[len + 1];
+
+   i = 0;
+while ((c = getc(message)) != EOF)
+   s[i++] = c;
+
+   s[i] = '\0';
+
+   i = nou8 = 0;
+   while (i != len)
+   if (s[i] <= 0x7f)
+   ++i;
+   /* Two bytes case */
+   else if (s[i] >= 0xc2 && s[i] < 0xe0 &&
+   s[i + 1] >= 0x80 && s[i + 1] <= 0xbf)
+   i += 2;
+   /* Special three bytes case */
+   else if ((s[i] == 0xe0 &&
+   s[i + 1] >= 0xa0 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf) ||
+   /* Three bytes case */
+   (s[i] > 0xe0 && s[i] < 0xf0 &&
+   s[i + 1] >= 0x80 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf))
+   i += 3;
+   /* Special four bytes case */
+   else if ((s[i] == 0xf0 &&
+   s[i + 1] >= 0x90 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf &&
+   s[i + 3] >= 0x80 && s[i + 3] <= 0xbf) ||
+   /* Four bytes case */
+   (s[i] > 0xf0 &&
+   s[i + 1] >= 0x80 && s[i + 1] <= 0xbf &&
+   s[i + 2] >= 0x80 && s[i + 2] <= 0xbf &&
+   s[i + 3] >= 0x80 && s[i + 3] <= 0xbf))
+   i += 4;
+   else {
+   nou8 = i + 1;
+   break;
+   

Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Steffen Nurpmeso
Steffen Nurpmeso wrote in
 <20230920214009.w5mrf%stef...@sdaoden.eu>:
 |Ingo Schwarze wrote in
 | :
 | ...
 ||I just checked - even though i'm using the higher-level mutt(1) MUA
 ||most of the time and even though the shell i'm starting mutt(1) from
 ||has LC_CTYPE=C.UTF-8 set on that particular machine, the last sixteen
 ||mails i sent all contained the explicit header
 ||
 ||  Content-Type: text/plain; charset=us-ascii
 ||
 ||and intentionally so.  Yes, i do occasionally send UTF-8 mail on
 |
 |To be a hundred percent correct: MIME is not needed at all in that

That is to say, to be correct myself: like RFC 2045 says, "MIME
defines a number of new RFC 822 header fields that are used to
describe the content of a MIME entity".  Yet if there is no MIME
entity but only a plain RFC 822/2822/5322 internet message format,
there is nothing to describe.

   [.]there are still circumstances in which it might be desirable
   for a mail-processing agent to know whether a message was
   composed with the new standard in mind.
   Therefore, this document defines a new header field, "MIME-Version",
   which is to be used to declare the version of the Internet message
   body format standard in use.

   Messages composed in accordance with this document MUST include such
   a header field, with the following verbatim text:

But normally OpenBSD Mail does not, so no "MIME-Version: 1.0",
because no

   The presence of this header field is an assertion that the
   message has been composed in compliance with this document.

 |case, unless a transfer-encoding had to be used (you do not show
 |that header), maybe because of overlong lines to-be-folded, or for
 |whatever reason.  (But it is swallowed by consumers of course.)

That would at least be my point of view.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Steffen Nurpmeso
Ingo Schwarze wrote in
 :
 ...
 |I just checked - even though i'm using the higher-level mutt(1) MUA
 |most of the time and even though the shell i'm starting mutt(1) from
 |has LC_CTYPE=C.UTF-8 set on that particular machine, the last sixteen
 |mails i sent all contained the explicit header
 |
 |  Content-Type: text/plain; charset=us-ascii
 |
 |and intentionally so.  Yes, i do occasionally send UTF-8 mail on

To be a hundred percent correct: MIME is not needed at all in that
case, unless a transfer-encoding had to be used (you do not show
that header), maybe because of overlong lines to-be-folded, or for
whatever reason.  (But it is swallowed by consumers of course.)

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Kirill Miazine

Hi, Ingo

• Ingo Schwarze [2023-09-20 13:55]:
> Hi Kirill,
>
> Kirill Miazine wrote on Wed, Sep 20, 2023 at 12:52:52PM +0200:
>
> > you may not even need -m, and instead inspect LC_CTYPE environment
> > variable and add appropriate headers for UTF-8. according to locale(1),
> > LC_CTYPE may be set to indicate UTF-8:
> >
> > If the value of LC_CTYPE ends in ‘.UTF-8’, programs in the OpenBSD base
> > system ignore the beginning of it, treating for example zh_CN.UTF-8
> > exactly like en_US.UTF-8.
>
> This is definitely very bad advice


I am sorry! I was thinking that a user who sets appropriate LC_CTYPE is 
thus instructing programs that input and output is UTF-8 and such 
instruction could be used instead of a flag, as per locale(1) presence 
of .UTF-8 in LC_CTYPE is an instruction to treat input and output as 
UTF-8 encoded text:


"The character encoding locale LC_CTYPE instructs programs which 
character encoding to assume for text input and to use for text output."


After all, LC_CTYPE=en_US.UTF-8 has to be set by a user and thus signals 
a preference to programs, and thus it wouldn't be unexpected or 
surprising to treat text as UTF-8 and also set appropriate MIME-headers. 
After all, by setting LC_CTYPE=en_US.UTF-8 -- according to locale(1) -- 
user says that input text is UTF-8, and then mail(1) would have to 
figure out how to make sure that text is transmitted properly.



> Whether the user uses a UTF-8 locale for their shell and terminal
> has nothing to do with whether they want to send UTF-8 encoded
> mail with MIME headers. For example, i'm using LC_CTYPE=en_US.UTF-8
> for my shells and terminals most of the time, but i do not want the
> low-level mail(1) MUA to suddenly start sending UTF-8 mail without
> being specifically asked to.


My understanding of the purpose of LC_CTYPE was that by setting it, the user
specifically asks to treat input as UTF-8, and then the programs have to 
handle encoding appropriately. So I wouldn't be surprised if mail(1) 
started sending UTF-8 mail with LC_CTYPE=en_US.UTF-8. In fact, I'd be 
happy if it did so.



> I just checked - even though i'm using the higher-level mutt(1) MUA
> most of the time and even though the shell i'm starting mutt(1) from
> has LC_CTYPE=C.UTF-8 set on that particular machine, the last sixteen
> mails i sent all contained the explicit header
>
>    Content-Type: text/plain; charset=us-ascii
>
> and intentionally so.  Yes, i do occasionally send UTF-8 mail on
> purpose, mostly in highly technical messages that need to display
> particular Unicode characters in addition to mentioning their
> codepoints in the U+[XX] form, and rarely, sending UTF-8 happens
> inadvertently because mutt(1) contains some weird autodetection logic -
> but what you set your terminal to and what you use for sending mail
> are clearly completely unrelated topics.


Mutt has indeed a logic to see which character set a text can be 
converted into: it tries US-ASCII, then ISO-8859-1 and then UTF-8.
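
To make that fallback order concrete, here is a small sketch.  It is not
Mutt's code: smallest_charset() is a hypothetical name, and the function
assumes its input has already been checked to be valid UTF-8.  It only picks
the first label in the order us-ascii, iso-8859-1, utf-8 that could represent
the text; actually converting to ISO-8859-1 would be a separate step.

```c
#include <stddef.h>

/*
 * Hypothetical helper.  Assumes s[0..len) is already known to be valid
 * UTF-8 and returns the first charset label, in the order us-ascii,
 * iso-8859-1, utf-8, that can represent every character in it.
 */
static const char *
smallest_charset(const unsigned char *s, size_t len)
{
	const char *res = "us-ascii";
	size_t i;

	for (i = 0; i < len; i++) {
		if (s[i] <= 0x7f)
			continue;
		/* U+0080..U+00FF are encoded as 0xc2 or 0xc3 plus one byte. */
		if ((s[i] == 0xc2 || s[i] == 0xc3) && i + 1 < len) {
			res = "iso-8859-1";
			i++;		/* skip the continuation byte */
			continue;
		}
		return "utf-8";		/* anything beyond U+00FF */
	}
	return res;
}
```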



> Yours,
>   Ingo





Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Walter Alejandro Iglesias
On Wed, Sep 20, 2023 at 07:44:12PM +0200, Walter Alejandro Iglesias wrote:
> And this new idea simplifies all to this:

In case anyone else is worried: Crystal Kolipe already pointed out to me
that better UTF-8 checking is needed.  I know, and I'll get to that
tomorrow.



Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Walter Alejandro Iglesias
On Wed, Sep 20, 2023 at 06:13:10PM +0200, Walter Alejandro Iglesias wrote:
> Now I was investigating exactly that :-) (like Mutt also does): to make
> mail(1) automatically set the appropriate MIME headers when it detects
> any utf8 characters in the body text.  So, you don't like this idea?
> 

And this new idea simplifies all to this:


Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  20 Sep 2023 17:40:22 -
@@ -33,6 +33,8 @@
 #include "rcv.h"
 #include "extern.h"
 
+char utf8 = 0;
+
 static volatile sig_atomic_t sendsignal;   /* Interrupted by a signal? */
 
 /*
@@ -341,6 +343,13 @@ mail1(struct header *hp, int printheader
else
puts("Null message body; hope that's ok");
}
+   /* Check for non ascii characters */
+   int ch;
+while ((ch = getc(mtf)) != EOF)
+if (ch > 0x7f)
+   utf8 = 1;
+   rewind(mtf);
+
/*
 * Now, take the user names from the combined
 * to and cc lists and do all the alias
@@ -525,6 +534,10 @@ puthead(struct header *hp, FILE *fo, int
fmt("To:", hp->h_to, fo, w), gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   if (utf8)
+   fprintf(fo, "MIME-Version: 1.0\nContent-Type: text/plain; charset=utf-8\nContent-Transfer-Encoding: 8bit\n"), gotcha++;
+   else
+   fprintf(fo, "MIME-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\n"), gotcha++;
if (hp->h_cc != NULL && w & GCC)
fmt("Cc:", hp->h_cc, fo, w), gotcha++;
if (hp->h_bcc != NULL && w & GBCC)


-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Walter Alejandro Iglesias
On Wed, Sep 20, 2023 at 05:30:08PM +0200, Ingo Schwarze wrote:
> Hi,
> 
> i checked the following points:
> 
>  * Even though RFC 2049 section 2 bullet point 1 only *requires*
>    MIME-conformant MUAs to always write the header "MIME-Version:
>    1.0" - and mail(1) is most certainly not MIME-conformant - RFC 2049
>    section 2 bullet point 8 explicitly *recommends* that even non-MIME
>    MUAs always set appropriate MIME headers.  RFC 2046 section 4.1.2
>    paragraph 8 also "strongly" recommends the explicit inclusion of a
>    "charset" parameter even for us-ascii.
> 
>    Consequently, i believe that when sending a message in US-ASCII,
>    mail(1) should include these headers:
> 
>    MIME-Version: 1.0
>    Content-Transfer-Encoding: 7bit
>    Content-Type: text/plain; charset=us-ascii

I already thought about adding this; it's what Mutt does by default.  But
I thought Ingo would scold me for complicating things. :-)

> 
>  * Adding a "Content-Transfer-Encoding: ..." header is indeed required
>    for sending UTF-8 messages, see RFC 2049 section 2 bullet point 2.
>    "8bit" is one of the valid values that MUAs must support for
>    receiving messages by default.
>    Using it seems sane because it is most likely to work with receiving
>    MUAs that are not MIME-conformant, like our mail(1) itself.
>    I think nowadays, that's a bigger concern than MTAs that are not
>    8-bit clean, in particular when maintaining a low-level program
>    like our mail(1).
>    Consequently, i think using 8bit is indeed better for our mail(1)
>    than quoted-printable or base64.

Well, this also saves you the conversion, especially with the subject,
which is tricky.

> 
>  * Adding "Content-Type: text/plain; charset=utf-8" is required by
>    RFC 2049 section 2 bullet point 4 (for the simplest kind of UTF-8
>    encoded messages).
> 
>  * The Content-Disposition: header is defined in RFC 2183, clearly
>    optional, and not useful in single-part messages.  Consequently,
>    mail(1) should not write it.

Yeah, I read that, that's why I didn't add that header.


> 
> So apart from writing the headers for us-ascii, i think you are
> almost there.
> 
> Given that the charset cannot be inferred from the environment
> and that setting it per-system or per-user in a configuration file
> is also inadequate - it shouldn't be uncommon for users to sometimes
> send US-ASCII and sometimes UTF-8 mail - i think that a new option
> is indeed needed.
> 
> Regarding the naming of the option, compatibility with POSIX
>   https://pubs.opengroup.org/onlinepubs/9699919799/utilities/mailx.html
> is paramount, which kills the tentative idea to use -u for "UTF-8"
> because -u already means "user".
> 
> Compatibility with other mailx(1) implementations is also a
> consideration.  See, for example,
>   https://linux.die.net/man/1/mail
> and -m is indeed among the very few options still available over there.
> I would document it focussing on a "multibyte character encoding"
> mnemonic.  The "mime" mnemonic feels far too broad because MIME can
> be used for lots of other purposes besides specifying a character
> encoding.
> 
> The -m option is also free here:
>   https://man.freebsd.org/cgi/man.cgi?query=mail(1)
>   https://man.netbsd.org/mail.1
>   https://docs.oracle.com/cd/E88353_01/html/E37839/mailx-1.html
>   https://www.ibm.com/docs/en/aix/7.3?topic=m-mail-command-1
> None of those appears to support command line selection of the
> character set for sending mail, so i don't see any immediate
> logic clashes either.
> 
> The -m option does clash with this one:
>   https://www.sdaoden.eu/code-nail.html
> But i think dismissing Steffen Daode Nurpmeso as a lunatic is obviously
> the way to go.  Try to listen to that person and you will never get
> anything done.
> 
> The mailx(1) documented on die.net appears to be the Heirloom one.
> It does not have an option to select sending US-ASCII or UTF-8.
> Instead, it has a "sendcharsets" configuration variable.  That's
> clearly overengineering, but even when hardcoding the equivalent of
> 
>   sendcharsets=utf-8
> 
> which is also the default, that's nasty because it silently switches to
> UTF-8 as soon as a non-ASCII character appears in the input.  I think
> at least in interactive mode, explicit confirmation from the user would
> be required to send UTF-8, instead writing dead.letter if the user
> rejects the request, such that they can clean up the file and try again.
> 
> That would certainly be more complicated than requiring an option
> up front, not only from the implementation perspective, but arguably
> also from the user perspective.  So unless other developers think this
> should be fully automatic with confirmation rather than controlled
> by an option, i suggest staying with Walter's idea of using an option.

Now I was investigating exactly that :-) (like Mutt also does): to make
mail(1) automatically set the appropriate MIME headers when it detects
any utf8 characters 

Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Ingo Schwarze
Hi,

i checked the following points:

 * Even though RFC 2049 section 2 bullet point 1 only *requires*
   MIME-conformant MUAs to always write the header "MIME-Version:
   1.0" - and mail(1) is most certainly not MIME-conformant - RFC 2049
   section 2 bullet point 8 explicitly *recommends* that even non-MIME
   MUAs always set appropriate MIME headers.  RFC 2046 section 4.1.2
   paragraph 8 also "strongly" recommends the explicit inclusion of a
   "charset" parameter even for us-ascii.

   Consequently, i believe that when sending a message in US-ASCII,
   mail(1) should include these headers:

   MIME-Version: 1.0
   Content-Transfer-Encoding: 7bit
   Content-Type: text/plain; charset=us-ascii

 * Adding a "Content-Transfer-Encoding: ..." header is indeed required
   for sending UTF-8 messages, see  RFC 2049 section 2 bullet point 2.
   "8bit" is one of the valid values that MUAs must support for
   receiving messages by default.
   Using it seems sane because it is most likely to work with receiving
   MUAs that are not MIME-conformant, like our mail(1) itself.
   I think nowadays, that's a bigger concern than MTAs that are not
   8-bit clean, in particular when maintaining a low-level program
   like our mail(1).
   Consequently, i think using 8bit is indeed better for our mail(1)
   than quoted-printable or base64.

 * Adding "Content-Type: text/plain; charset=utf-8" is required by
   RFC 2049 section 2 bullet point 4 (for the simplest kind of UTF-8
   encoded messages).

 * The Content-Disposition: header is defined in RFC 2183, clearly
   optional, and not useful in single-part messages.  Consequently,
   mail(1) should not write it.
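
As a sketch of what the points above amount to in code, assuming a helper
called put_mime_headers() that is not part of any posted patch: the three
headers are always written, and only the charset and transfer encoding differ
between the US-ASCII and UTF-8 cases.  No Content-Disposition header is
written.

```c
#include <stdio.h>

/*
 * Hypothetical helper.  Writes the three MIME headers discussed above;
 * eightbit selects utf-8/8bit, otherwise us-ascii/7bit is used.
 * Returns the fprintf(3) result, negative on error.
 */
static int
put_mime_headers(FILE *fo, int eightbit)
{
	return fprintf(fo,
	    "MIME-Version: 1.0\n"
	    "Content-Type: text/plain; charset=%s\n"
	    "Content-Transfer-Encoding: %s\n",
	    eightbit ? "utf-8" : "us-ascii",
	    eightbit ? "8bit" : "7bit");
}
```

How the eightbit flag gets set, by an option or otherwise, is exactly the open
question in the rest of this message.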

So apart from writing the headers for us-ascii, i think you are
almost there.

Given that the charset cannot be inferred from the environment
and that setting it per-system or per-user in a configuration file
is also inadequate - it shouldn't be uncommon for users to sometimes
send US-ASCII and sometimes UTF-8 mail - i think that a new option
is indeed needed.

Regarding the naming of the option, compatibility with POSIX
  https://pubs.opengroup.org/onlinepubs/9699919799/utilities/mailx.html
is paramount, which kills the tentative idea to use -u for "UTF-8"
because -u already means "user".

Compatibility with other mailx(1) implementations is also a
consideration.  See, for example,
  https://linux.die.net/man/1/mail
and -m is indeed among the very few options still available over there.
I would document it focussing on a "multibyte character encoding"
mnemonic.  The "mime" mnemonic feels far too broad because MIME can
be used for lots of other purposes besides specifying a character
encoding.

The -m option is also free here:
  https://man.freebsd.org/cgi/man.cgi?query=mail(1)
  https://man.netbsd.org/mail.1
  https://docs.oracle.com/cd/E88353_01/html/E37839/mailx-1.html
  https://www.ibm.com/docs/en/aix/7.3?topic=m-mail-command-1
None of those appears to support command line selection of the
character set for sending mail, so i don't see any immediate
logic clashes either.

The -m option does clash with this one:
  https://www.sdaoden.eu/code-nail.html
But i think dismissing Steffen Daode Nurpmeso as a lunatic is obviously
the way to go.  Try to listen to that person and you will never get
anything done.

The mailx(1) documented on die.net appears to be the Heirloom one.
It does not have an option to select sending US-ASCII or UTF-8.
Instead, it has a "sendcharsets" configuration variable.  That's
clearly overengineering, but even when hardcoding the equivalent of

  sendcharsets=utf-8

which is also the default, that's nasty because it silently switches to
UTF-8 as soon as a non-ASCII character appears in the input.  I think
at least in interactive mode, explicit confirmation from the user would
be required to send UTF-8, instead writing dead.letter if the user
rejects the request, such that they can clean up the file and try again.

That would certainly be more complicated than requiring an option
up front, not only from the implementation perspective, but arguably
also from the user perspective.  So unless other developers think this
should be fully automatic with confirmation rather than controlled
by an option, i suggest staying with Walter's idea of using an option.


> Index: extern.h
> ===
> RCS file: /cvs/src/usr.bin/mail/extern.h,v
> retrieving revision 1.29
> diff -u -p -r1.29 extern.h
> --- extern.h  16 Sep 2018 02:38:57 -  1.29
> +++ extern.h  20 Sep 2023 10:44:41 -
> @@ -261,3 +261,4 @@ intwriteback(FILE *);
>  extern char *__progname;
>  extern char *tmpdir;
>  extern const struct cmd *com; /* command we are running */
> +extern char mime; /* Add MIME headers */

Likely not best mnemonic naming.

> Index: mail.1
> ===
> RCS file: /cvs/src/usr.bin/mail/mail.1,v
> retrieving revision 1.83
> diff -u -p 

Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Ingo Schwarze
Hi Kirill,

Kirill Miazine wrote on Wed, Sep 20, 2023 at 12:52:52PM +0200:

> you may not even need -m, and instead inspect LC_CTYPE environment 
> variable and add appropriate headers for UTF-8. according to locale(1), 
> LC_CTYPE may be set to indicate UTF-8:
> 
> If the value of LC_CTYPE ends in ‘.UTF-8’, programs in the OpenBSD base 
> system ignore the beginning of it, treating for example zh_CN.UTF-8 
> exactly like en_US.UTF-8.

This is definitely very bad advice.  Whether the user uses a UTF-8
locale for their shell and terminal has nothing to do with whether
they want to send UTF-8 encoded mail with MIME headers.
For example, i'm using LC_CTYPE=en_US.UTF-8 for my shells and
terminals most of the time, but i do not want the low-level mail(1)
MUA to suddenly start sending UTF-8 mail without being specifically
asked to.

I just checked - even though i'm using the higher-level mutt(1) MUA
most of the time and even though the shell i'm starting mutt(1) from
has LC_CTYPE=C.UTF-8 set on that particular machine, the last sixteen
mails i sent all contained the explicit header

  Content-Type: text/plain; charset=us-ascii

and intentionally so.  Yes, i do occasionally send UTF-8 mail on
purpose, mostly in highly technical messages that need to display
particular Unicode characters in addition to mentioning their
codepoints in the U+[XX] form, and rarely, sending UTF-8 happens
inadvertently because mutt(1) contains some weird autodetection logic -
but what you set your terminal to and what you use for sending mail
are clearly completely unrelated topics.

Yours,
  Ingo



Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Walter Alejandro Iglesias
On Wed, Sep 20, 2023 at 10:30:31AM +, Klemens Nanni wrote:
> Except for mandoc(1) and other manuals where "utf8" is a literal keyword,
> our manuals consistently use upper-case UTF-8 for what is an abbreviation,
> so this should do as well.
> 
> >  .It Fl n
> >  Inhibits reading
> >  .Pa /etc/mail.rc
> 
> You forgot SYNOPSIS:
>   $ man -h mail
>   mail [-dEIinv] [-b list] [-c list] [-r from-addr] [-s subject] to-addr ...
>   mail [-dEIiNnv] -f [file]
>   mail [-dEIiNnv] [-u user]
> 
> Otherwise looks sane.
> 

Thank you!


Index: extern.h
===
RCS file: /cvs/src/usr.bin/mail/extern.h,v
retrieving revision 1.29
diff -u -p -r1.29 extern.h
--- extern.h16 Sep 2018 02:38:57 -  1.29
+++ extern.h20 Sep 2023 10:44:41 -
@@ -261,3 +261,4 @@ int  writeback(FILE *);
 extern char *__progname;
 extern char *tmpdir;
 extern const struct cmd *com; /* command we are running */
+extern char mime; /* Add MIME headers */
Index: mail.1
===
RCS file: /cvs/src/usr.bin/mail/mail.1,v
retrieving revision 1.83
diff -u -p -r1.83 mail.1
--- mail.1  31 Mar 2022 17:27:25 -  1.83
+++ mail.1  20 Sep 2023 10:44:41 -
@@ -40,7 +40,7 @@
 .Sh SYNOPSIS
 .Nm mail
 .Bk -words
-.Op Fl dEIinv
+.Op Fl dEIimnv
 .Op Fl b Ar list
 .Op Fl c Ar list
 .Op Fl r Ar from-addr
@@ -106,6 +106,8 @@ on noisy phone lines.
 .It Fl N
 Inhibits initial display of message headers
 when reading mail or editing a mail folder.
+.It Fl m
+Add MIME headers to send UTF-8 encoded messages.
 .It Fl n
 Inhibits reading
 .Pa /etc/mail.rc
Index: main.c
===
RCS file: /cvs/src/usr.bin/mail/main.c,v
retrieving revision 1.35
diff -u -p -r1.35 main.c
--- main.c  26 Jan 2021 18:21:47 -  1.35
+++ main.c  20 Sep 2023 10:44:41 -
@@ -79,6 +79,8 @@ int   realscreenheight;   /* the real scree
 intuflag;  /* Are we in -u mode? */
 sigset_t intset;   /* Signal set that is just SIGINT */
 
+char mime = 0; /* Add MIME headers */
+
 /*
  * The pointers for the string allocation routines,
  * there are NSPACE independent areas.
@@ -136,7 +138,7 @@ main(int argc, char **argv)
smopts = NULL;
fromaddr = NULL;
subject = NULL;
-   while ((i = getopt(argc, argv, "EINb:c:dfinr:s:u:v")) != -1) {
+   while ((i = getopt(argc, argv, "EINb:c:dfimnr:s:u:v")) != -1) {
switch (i) {
case 'u':
/*
@@ -171,6 +173,10 @@ main(int argc, char **argv)
 */
subject = optarg;
break;
+   case 'm':
+   /* Add MIME headers */
+   mime = 1;
+   break;
case 'f':
/*
 * User is specifying file to "edit" with Mail,
@@ -337,7 +343,7 @@ __dead void
 usage(void)
 {
 
-   fprintf(stderr, "usage: %s [-dEIinv] [-b list] [-c list] "
+   fprintf(stderr, "usage: %s [-dEIimnv] [-b list] [-c list] "
"[-r from-addr] [-s subject] to-addr ...\n", __progname);
fprintf(stderr, "   %s [-dEIiNnv] -f [file]\n", __progname);
fprintf(stderr, "   %s [-dEIiNnv] [-u user]\n", __progname);
Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  20 Sep 2023 10:44:41 -
@@ -525,6 +525,8 @@ puthead(struct header *hp, FILE *fo, int
fmt("To:", hp->h_to, fo, w), gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   if (mime)
+   fprintf(fo, "MIME-Version: 1.0\nContent-Type: text/plain; charset=utf-8\nContent-Transfer-Encoding: 8bit\n"), gotcha++;
if (hp->h_cc != NULL && w & GCC)
fmt("Cc:", hp->h_cc, fo, w), gotcha++;
if (hp->h_bcc != NULL && w & GBCC)


-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Kirill Miazine

• Walter Alejandro Iglesias [2023-09-20 12:21]:

> Hi Ingo,
>
> I did what you suggested: I investigated a bit, and you were right
> that the MIME-Version header was necessary.  This new set of patches
> adds the following headers (hardcoded, as you suggested):
>
>    MIME-Version: 1.0
>    Content-Type: text/plain; charset=utf-8
>    Content-Transfer-Encoding: 8bit
>
> I modified the code as little as possible, just a '-m' option:
>
>    $ mail -m -s Hello d...@ext.net < body_message_in_utf8


you may not even need -m, and instead inspect LC_CTYPE environment 
variable and add appropriate headers for UTF-8. according to locale(1), 
LC_CTYPE may be set to indicate UTF-8:


If the value of LC_CTYPE ends in ‘.UTF-8’, programs in the OpenBSD base 
system ignore the beginning of it, treating for example zh_CN.UTF-8 
exactly like en_US.UTF-8.
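
For illustration, the suffix test that locale(1) describes could look roughly
like this.  Both function names are made up, and the lookup order LC_ALL, then
LC_CTYPE, then LANG is the usual locale precedence, assumed here and not taken
from mail(1).  The sketch only answers what the environment says; it cannot
tell whether the user wants to send UTF-8 mail.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: does a locale value end in ".UTF-8"? */
static int
ends_in_utf8(const char *value)
{
	static const char suffix[] = ".UTF-8";
	size_t vlen, slen = sizeof(suffix) - 1;

	if (value == NULL)
		return 0;
	vlen = strlen(value);
	return vlen >= slen && strcmp(value + vlen - slen, suffix) == 0;
}

/* Hypothetical helper: consult the environment in the usual order. */
static int
env_wants_utf8(void)
{
	const char *v;

	if ((v = getenv("LC_ALL")) == NULL || *v == '\0')
		if ((v = getenv("LC_CTYPE")) == NULL || *v == '\0')
			v = getenv("LANG");
	return ends_in_utf8(v);
}
```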



> Although, to tell the truth, I'm not really convinced that this change is
> worth it.  Feel free to ignore it.



Index: extern.h
===
RCS file: /cvs/src/usr.bin/mail/extern.h,v
retrieving revision 1.29
diff -u -p -r1.29 extern.h
--- extern.h16 Sep 2018 02:38:57 -  1.29
+++ extern.h20 Sep 2023 09:55:06 -
@@ -261,3 +261,4 @@ int  writeback(FILE *);
  extern char *__progname;
  extern char *tmpdir;
  extern const struct cmd *com; /* command we are running */
+extern char mime; /* Add MIME headers */
Index: mail.1
===
RCS file: /cvs/src/usr.bin/mail/mail.1,v
retrieving revision 1.83
diff -u -p -r1.83 mail.1
--- mail.1  31 Mar 2022 17:27:25 -  1.83
+++ mail.1  20 Sep 2023 09:55:06 -
@@ -106,6 +106,8 @@ on noisy phone lines.
  .It Fl N
  Inhibits initial display of message headers
  when reading mail or editing a mail folder.
+.It Fl m
+Add MIME headers to send utf-8 encoded messages.
  .It Fl n
  Inhibits reading
  .Pa /etc/mail.rc
Index: main.c
===
RCS file: /cvs/src/usr.bin/mail/main.c,v
retrieving revision 1.35
diff -u -p -r1.35 main.c
--- main.c  26 Jan 2021 18:21:47 -  1.35
+++ main.c  20 Sep 2023 09:55:06 -
@@ -79,6 +79,8 @@ int   realscreenheight;   /* the real scree
  int   uflag;  /* Are we in -u mode? */
  sigset_t intset;  /* Signal set that is just SIGINT */
  
+char mime = 0; /* Add MIME headers */
+
  /*
   * The pointers for the string allocation routines,
   * there are NSPACE independent areas.
@@ -136,7 +138,7 @@ main(int argc, char **argv)
smopts = NULL;
fromaddr = NULL;
subject = NULL;
-   while ((i = getopt(argc, argv, "EINb:c:dfinr:s:u:v")) != -1) {
+   while ((i = getopt(argc, argv, "EINb:c:dfimnr:s:u:v")) != -1) {
switch (i) {
case 'u':
/*
@@ -171,6 +173,10 @@ main(int argc, char **argv)
 */
subject = optarg;
break;
+   case 'm':
+   /* Add MIME headers */
+   mime = 1;
+   break;
case 'f':
/*
 * User is specifying file to "edit" with Mail,
@@ -337,7 +343,7 @@ __dead void
  usage(void)
  {
  
-   fprintf(stderr, "usage: %s [-dEIinv] [-b list] [-c list] "
+   fprintf(stderr, "usage: %s [-dEIimnv] [-b list] [-c list] "
"[-r from-addr] [-s subject] to-addr ...\n", __progname);
fprintf(stderr, "   %s [-dEIiNnv] -f [file]\n", __progname);
fprintf(stderr, "   %s [-dEIiNnv] [-u user]\n", __progname);
Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  20 Sep 2023 09:55:06 -
@@ -525,6 +525,8 @@ puthead(struct header *hp, FILE *fo, int
fmt("To:", hp->h_to, fo, w), gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   if (mime)
+   fprintf(fo, "MIME-Version: 1.0\nContent-Type: text/plain; charset=utf-8\nContent-Transfer-Encoding: 8bit\n"), gotcha++;
if (hp->h_cc != NULL && w & GCC)
fmt("Cc:", hp->h_cc, fo, w), gotcha++;
if (hp->h_bcc != NULL && w & GBCC)







Re: Send international text with mail(1) - proposal and patches

2023-09-20 Thread Walter Alejandro Iglesias
Hi Ingo,

I did what you suggested: I investigated a bit, and you were right
that the MIME-Version header was necessary.  This new set of patches
adds the following headers (hardcoded, as you suggested):

  MIME-Version: 1.0
  Content-Type: text/plain; charset=utf-8
  Content-Transfer-Encoding: 8bit

I modified the code as little as possible, adding just a '-m' option:

  $ mail -m -s Hello d...@ext.net < body_message_in_utf8


Although, to tell the truth, I'm not really convinced that this change is
worth it.  Feel free to ignore it.



Index: extern.h
===
RCS file: /cvs/src/usr.bin/mail/extern.h,v
retrieving revision 1.29
diff -u -p -r1.29 extern.h
--- extern.h    16 Sep 2018 02:38:57 -  1.29
+++ extern.h    20 Sep 2023 09:55:06 -
@@ -261,3 +261,4 @@ int  writeback(FILE *);
 extern char *__progname;
 extern char *tmpdir;
 extern const struct cmd *com; /* command we are running */
+extern char mime; /* Add MIME headers */
Index: mail.1
===
RCS file: /cvs/src/usr.bin/mail/mail.1,v
retrieving revision 1.83
diff -u -p -r1.83 mail.1
--- mail.1  31 Mar 2022 17:27:25 -  1.83
+++ mail.1  20 Sep 2023 09:55:06 -
@@ -106,6 +106,8 @@ on noisy phone lines.
 .It Fl N
 Inhibits initial display of message headers
 when reading mail or editing a mail folder.
+.It Fl m
+Add MIME headers to send utf-8 encoded messages.
 .It Fl n
 Inhibits reading
 .Pa /etc/mail.rc
Index: main.c
===
RCS file: /cvs/src/usr.bin/mail/main.c,v
retrieving revision 1.35
diff -u -p -r1.35 main.c
--- main.c  26 Jan 2021 18:21:47 -  1.35
+++ main.c  20 Sep 2023 09:55:06 -
@@ -79,6 +79,8 @@ int   realscreenheight;   /* the real scree
 int   uflag;  /* Are we in -u mode? */
 sigset_t intset;   /* Signal set that is just SIGINT */
 
+char mime = 0; /* Add MIME headers */
+
 /*
  * The pointers for the string allocation routines,
  * there are NSPACE independent areas.
@@ -136,7 +138,7 @@ main(int argc, char **argv)
smopts = NULL;
fromaddr = NULL;
subject = NULL;
-   while ((i = getopt(argc, argv, "EINb:c:dfinr:s:u:v")) != -1) {
+   while ((i = getopt(argc, argv, "EINb:c:dfimnr:s:u:v")) != -1) {
switch (i) {
case 'u':
/*
@@ -171,6 +173,10 @@ main(int argc, char **argv)
 */
subject = optarg;
break;
+   case 'm':
+   /* Add MIME headers */
+   mime = 1;
+   break;
case 'f':
/*
 * User is specifying file to "edit" with Mail,
@@ -337,7 +343,7 @@ __dead void
 usage(void)
 {
 
-   fprintf(stderr, "usage: %s [-dEIinv] [-b list] [-c list] "
+   fprintf(stderr, "usage: %s [-dEIimnv] [-b list] [-c list] "
"[-r from-addr] [-s subject] to-addr ...\n", __progname);
fprintf(stderr, "   %s [-dEIiNnv] -f [file]\n", __progname);
fprintf(stderr, "   %s [-dEIiNnv] [-u user]\n", __progname);
Index: send.c
===
RCS file: /cvs/src/usr.bin/mail/send.c,v
retrieving revision 1.26
diff -u -p -r1.26 send.c
--- send.c  8 Mar 2023 04:43:11 -   1.26
+++ send.c  20 Sep 2023 09:55:06 -
@@ -525,6 +525,8 @@ puthead(struct header *hp, FILE *fo, int
fmt("To:", hp->h_to, fo, w), gotcha++;
if (hp->h_subject != NULL && w & GSUBJECT)
fprintf(fo, "Subject: %s\n", hp->h_subject), gotcha++;
+   if (mime)
+   fprintf(fo, "MIME-Version: 1.0\nContent-Type: text/plain; charset=utf-8\nContent-Transfer-Encoding: 8bit\n"), gotcha++;
if (hp->h_cc != NULL && w & GCC)
fmt("Cc:", hp->h_cc, fo, w), gotcha++;
if (hp->h_bcc != NULL && w & GBCC)



-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-19 Thread Walter Alejandro Iglesias
On Tue, Sep 19, 2023 at 05:48:01PM +0200, Ingo Schwarze wrote:
> Hi Walter,
> 
> i did not look closely at the patch yet, and i did not dig for standards
> documents, which one should almost certainly do before committing such
> a patch unless one knows all the relevant standards by heart (which i
> do not), so i'm not saying this must be done differently, but instead
> i am merely asking questions.

Today I came from having a biopsy of a tumor that appeared in my leg in
"February" of this year and thanks to the bureaucracy and the fact that
nowadays nobody takes anything seriously, I still don't know if the
tumor is malignant.  The apathy and irresponsibility of the people
(especially here in Spain) is such that I am thinking of buying a
scalpel and operating my tumor myself.  I explain this because, you
can't imagine, dear Ingo, how happy it would make me if at least 10% of
the people in this world were half as responsible as you are. :-)

> 
> 1. Are you really sure that a header like
>  MIME-Version: 1.0
>is not needed when you add Content-*: headers?
> 
> 2. Are you really sure that a header like
>  Content-Disposition: inline
>is not needed?

Thanks for the info. :-)

> 
> 3. What is the reason for not simply hardcoding
>  Content-Transfer-Encoding: 8bit
>when sending UTF-8 mail?

Yeah, I thought about it.

>Are there really still MTAs that choke on that in 2023?
>quoted-printable is definitely a PITA no matter the context,
>so in my book, if it can be avoided, avoiding it would be a plus.

I always try to choose what, from my ignorance, I suspect will cause the
least problems.  In this case I take into account that when sending a
file to the Internet its health no longer depends only on what *my
system* supports or not, out there it'll have to survive different
environments.  Many people still send messages in iso-latin and use
MSWin which still doesn't use utf-8: I send a message utf-8 encoded and
I get the response in iso-latin.  So, from my ignorance, I feel that
ASCII has more chances of surviving.
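
To make the trade-off concrete, here is a minimal sketch of what quoted-printable does to a UTF-8 body: every byte outside printable ASCII (and '=' itself) becomes "=XX", so the message is 7-bit clean but harder to read in raw form.  The function name qp_encode() is made up for this illustration and is not part of the patches; the sketch also leaves out the soft line breaks at 76 columns and the escaping of trailing whitespace that a complete encoder needs.

```c
#include <stddef.h>

/*
 * Hypothetical helper: encode len bytes of src as quoted-printable
 * into dst, which has room for dstsize bytes including the final NUL.
 * Returns the encoded length, or -1 if dst is too small.
 * Simplified: no soft line breaks, no trailing-whitespace handling.
 */
int
qp_encode(const unsigned char *src, size_t len, char *dst, size_t dstsize)
{
	static const char hex[] = "0123456789ABCDEF";
	size_t i, o = 0;

	if (dstsize == 0)
		return -1;
	for (i = 0; i < len; i++) {
		unsigned char c = src[i];
		/* Printable ASCII except '=' is copied, as are space, tab, newline. */
		int literal = (c >= 33 && c <= 126 && c != '=') ||
		    c == ' ' || c == '\t' || c == '\n';

		/* Keep one byte free for the terminating NUL. */
		if (o + (literal ? 1 : 3) >= dstsize)
			return -1;
		if (literal) {
			dst[o++] = (char)c;
		} else {
			dst[o++] = '=';
			dst[o++] = hex[c >> 4];
			dst[o++] = hex[c & 0x0F];
		}
	}
	dst[o] = '\0';
	return (int)o;
}
```

For instance, the two UTF-8 bytes 0xC3 0xB1 (the letter ñ) come out as the six ASCII characters "=C3=B1", whereas with "Content-Transfer-Encoding: 8bit" the body is sent unchanged.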

> 
> 4. What's the motivation for the -y flag taking an argument
>and not simply hardcoding "text/plain;charset=utf-8"?

I also thought about that.

>OpenBSD does not support any other charset and does not plan to
>change that in the future.
>I hope your next patch isn't going to be support for text/html.  =:-S

Believe me, I try to do everything in my life in the simplest way, as far as
others allow me.  But as with everything, you have to be careful not to
overdo it.  For example, in the case that concerns us here, if you notice
that every time you need some job done you have to install and use the
bloated version of the tools, you should ask yourself if you haven't
gone too far with your simplifications.  I'm more in favor of the
traditional "Keep it simple..." and "If it ain't broken..." rather than
"simplifying".  Simplifying is dangerous: amputating a leg simplifies
your body as a system, but not your life.


> 
> 5. What's the motivation for supporting -y without -e
>and for supporting -e without -y ?

Right, that's an inconsistency.

> 
> In general, we want as few options as possible and as little
> configurability as possible.  If there is a sane use case for something -
> in this case, sending UTF-8 mail - *one* option is possibly warranted.
> But adding more than one option would need a very robust justification,
> and so would adding an option that takes an argument.
> 
> Note that mail(1) is not mail/swaks.  Its purpose is reading and
> sending mail in a *simple* way, not low-level testing or protocol
> debugging.
> 
> I'll postpone code review and testing, maybe you can simplify this
> first?

Well, as you have done with me on many occasions, your intention is to
kindly educate me; on this occasion you're making me notice that by
publishing "sketches" instead of finished work I'm wasting the
developers' time.  Thanks Ingo!  What saddens me is that I'm too old to
hope that one day I will win your approval in something. :-)

> 
> Yours,
>   Ingo


-- 
Walter



Re: Send international text with mail(1) - proposal and patches

2023-09-19 Thread Ingo Schwarze
Hi Walter,

i did not look closely at the patch yet, and i did not dig for standards
documents, which one should almost certainly do before committing such
a patch unless one knows all the relevant standards by heart (which i
do not), so i'm not saying this must be done differently, but instead
i am merely asking questions.

1. Are you really sure that a header like
 MIME-Version: 1.0
   is not needed when you add Content-*: headers?

2. Are you really sure that a header like
 Content-Disposition: inline
   is not needed?

3. What is the reason for not simply hardcoding
 Content-Transfer-Encoding: 8bit
   when sending UTF-8 mail?
   Are there really still MTAs that choke on that in 2023?
   quoted-printable is definitely a PITA no matter the context,
   so in my book, if it can be avoided, avoiding it would be a plus.

4. What's the motivation for the -y flag taking an argument
   and not simply hardcoding "text/plain;charset=utf-8"?
   OpenBSD does not support any other charset and does not plan to
   change that in the future.
   I hope your next patch isn't going to be support for text/html.  =:-S

5. What's the motivation for supporting -y without -e
   and for supporting -e without -y ?

In general, we want as few options as possible and as little
configurability as possible.  If there is a sane use case for something -
in this case, sending UTF-8 mail - *one* option is possibly warranted.
But adding more than one option would need a very robust justification,
and so would adding an option that takes an argument.

Note that mail(1) is not mail/swaks.  Its purpose is reading and
sending mail in a *simple* way, not low-level testing or protocol
debugging.

I'll postpone code review and testing, maybe you can simplify this
first?

Yours,
  Ingo



Re: Send international text with mail(1) - proposal and patches

2023-09-19 Thread Walter Alejandro Iglesias
I'd forgotten that a "charset" specification is also needed in the
Content-Type header.  In the *new* set of patches below, besides
correcting some other errors, I added a '-y' option to specify the utf-8
character set:

  $ mail -s Hello -e quoted-printable -y "text/plain;charset=utf-8" \
recipi...@example.com < message.txt


Index: collect.c
===
RCS file: /cvs/src/usr.bin/mail/collect.c,v
retrieving revision 1.34
diff -u -p -r1.34 collect.c
--- collect.c   17 Jan 2014 18:42:30 -  1.34
+++ collect.c   19 Sep 2023 13:30:14 -
@@ -87,7 +87,7 @@ collect(struct header *hp, int printhead
 * refrain from printing a newline after
 * the headers (since some people mind).
 */
-   t = GTO|GSUBJECT|GCC|GNL;
+   t = GTO|GSUBJECT|GENCODING|GTYPE|GCC|GNL;
getsub = 0;
if (hp->h_subject == NULL && value("interactive") != NULL &&
(value("ask") != NULL || value("asksub") != NULL))
@@ -208,7 +208,7 @@ cont:
/*
 * Grab a bunch of headers.
 */
-   grabh(hp, GTO|GSUBJECT|GCC|GBCC);
+   grabh(hp, GTO|GSUBJECT|GENCODING|GTYPE|GCC|GBCC);
goto cont;
case 't':
/*
@@ -328,7 +328,7 @@ cont:
 */
rewind(collf);
puts("---\nMessage contains:");
-   puthead(hp, stdout, GTO|GSUBJECT|GCC|GBCC|GNL);
+   puthead(hp, stdout, GTO|GSUBJECT|GENCODING|GTYPE|GCC|GBCC|GNL);
while ((t = getc(collf)) != EOF)
(void)putchar(t);
goto cont;
Index: def.h
===
RCS file: /cvs/src/usr.bin/mail/def.h,v
retrieving revision 1.17
diff -u -p -r1.17 def.h
--- def.h   28 Jan 2022 06:18:41 -  1.17
+++ def.h   19 Sep 2023 13:30:14 -
@@ -158,12 +158,14 @@ struct headline {
 #define GSUBJECT 2  /* Likewise, Subject: line */
 #define GCC 4   /* And the Cc: line */
 #define GBCC 8   /* And also the Bcc: line */
-#define GMASK   (GTO|GSUBJECT|GCC|GBCC)
+#define GMASK   (GTO|GSUBJECT|GENCODING|GTYPE|GCC|GBCC)
 /* Mask of places from whence */
 
 #define GNL 16  /* Print blank line after */
 #define GDEL 32  /* Entity removed from list */
 #define GCOMMA  64  /* detract puts in commas */
+#define GENCODING 128   /* Content-Transfer-Encoding: line */
+#define GTYPE   256 /* Content-Type: line */
 
 /*
  * Structure used to pass about the current
@@ -173,6 +175,8 @@ struct header {
struct name *h_to;  /* Dynamic "To:" string */
char *h_from;   /* User-specified "From:" string */
char *h_subject;/* Subject string */
+   char *h_encoding;   /* Content-Transfer-Encoding string */
+   char *h_type;   /* Content-Type string */
struct name *h_cc;  /* Carbon copies string */
struct name *h_bcc; /* Blind carbon copies */
struct name *h_smopts;  /* Sendmail options */
Index: extern.h
===
RCS file: /cvs/src/usr.bin/mail/extern.h,v
retrieving revision 1.29
diff -u -p -r1.29 extern.h
--- extern.h    16 Sep 2018 02:38:57 -  1.29
+++ extern.h    19 Sep 2023 13:30:14 -
@@ -163,7 +163,7 @@ void load(char *);
 struct var *
 lookup(char *);
 int mail(struct name *, struct name *, struct name *, struct name *,
-  char *, char *);
+  char *, char *, char *, char *);
 void mail1(struct header *, int);
 void makemessage(FILE *, int);
 void mark(int);
Index: mail.1
===
RCS file: /cvs/src/usr.bin/mail/mail.1,v
retrieving revision 1.83
diff -u -p -r1.83 mail.1
--- mail.1  31 Mar 2022 17:27:25 -  1.83
+++ mail.1  19 Sep 2023 13:30:15 -
@@ -45,6 +45,8 @@
 .Op Fl c Ar list
 .Op Fl r Ar from-addr
 .Op Fl s Ar subject
+.Op Fl e Ar transfer-encoding
+.Op Fl y Ar content-type
 .Ar to-addr ...
 .Ek
 .Nm mail
@@ -77,6 +79,8 @@ Causes
 .Nm mail
 to output all sorts of information useful for debugging
 .Nm mail .
+.It Fl e Ar encoding
+Add a Content-Transfer-Encoding header on command line.
 .It Fl E
 Don't send messages with an empty body.
 .It Fl f
@@ -133,6 +137,8 @@ except that locking is done.
 Verbose mode.
 The details of
 delivery are displayed on the user's terminal.
+.It Fl y Ar content-type
+Add a Content-Type header on command line.
 .El