Re: Legal values for a message-id, and references header

2016-11-25 Thread Kevin J. McCarthy
On Tue, Nov 22, 2016 at 03:38:57PM +0100, Vincent Lefevre wrote:
> On 2016-11-22 20:43:12 +1100, Cameron Simpson wrote:
> > On 22Nov2016 03:27, vincent lefevre  wrote:
> > > That's clearly illegal. MIME has been designed so that the main
> > > features should still work with software that doesn't support MIME
> > > (said otherwise, software that only needs to work with addresses
> > > and Message-ID's does not need to support MIME).
> > 
> > Yeah. And reading the text of RFC2047 suggests that it is intended for
> > Subject: lines etc and explicitly _not_ for structured fields used by mail
> > transport agents, to avoid any need to rewrite the entire mail
> > infrastructure.
> 
> It can be used in structured fields like "From:" or "To:", but only
> in place of a comment, so that it doesn't interfere with the addresses.

So it could even be used in the "phrase" part of obs-references, but
RFC 5322 explicitly says those parts should be ignored for the purposes
of interpretation.

After taking the time myself to also review 5322 and 2047, I think mutt
is already doing the right thing in this case.  It should not attempt to
somehow encode the utf-8 character in the left-side of the references
msg-id when sending.

On the other side, a received illegally rfc2047-encoded references
msg-id should not be decoded before attempting to interpret it.

Given that, I'm going to close #3898 wont-fix.

Thank you everyone for your helpful feedback!

-- 
Kevin J. McCarthy
GPG Fingerprint: 8975 A9B3 3AA3 7910 385C  5308 ADEF 7684 8031 6BDA




Re: Legal values for a message-id, and references header

2016-11-22 Thread Vincent Lefevre
On 2016-11-22 20:43:12 +1100, Cameron Simpson wrote:
> On 22Nov2016 03:27, vincent lefevre  wrote:
> > That's clearly illegal. MIME has been designed so that the main
> > features should still work with software that doesn't support MIME
> > (said otherwise, software that only needs to work with addresses
> > and Message-ID's does not need to support MIME).
> 
> Yeah. And reading the text of RFC2047 suggests that it is intended for
> Subject: lines etc and explicitly _not_ for structured fields used by mail
> transport agents, to avoid any need to rewrite the entire mail
> infrastructure.

It can be used in structured fields like "From:" or "To:", but only
in place of a comment, so that it doesn't interfere with the addresses.

> Regarding non-ASCII, we decode to bytes. Valid message-ids only contain
> ASCII anyway. If mutt ingests and decodes this rubbish, it can write out
> clean _not_ encoded message-ids for those which are syntactically clean,
> making for a healthier ecosystem.

This is ambiguous. Do you keep the encoding used by the sender
or convert into UTF-8? And if you convert into UTF-8, do you
use NFC, NFD, etc.? And for parts already in UTF-8, do you apply
a normalization?
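
To make the normalization question concrete: the same visible character can be two different code-point sequences, so two decoded ids could look identical and still compare unequal when threading. A minimal Python sketch, assuming a made-up id (nothing here is taken from mutt or from a real message):

```python
import unicodedata

# Hypothetical ids containing "é"; invented for illustration only.
composed = "caf\u00e9@example.com"      # U+00E9, precomposed
decomposed = "cafe\u0301@example.com"   # "e" followed by U+0301 combining acute

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(composed == decomposed)   # raw strings: are they equal?
print(nfc == composed)          # after normalizing both to NFC
print(nfd == decomposed)        # after normalizing both to NFD
# The UTF-8 byte lengths differ as well, so a byte comparison also fails.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

Whichever form is picked, it would have to be applied to every id before comparison, including those that arrived already in UTF-8.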

-- 
Vincent Lefèvre  - Web: 
100% accessible validated (X)HTML - Blog: 
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


Re: Legal values for a message-id, and references header

2016-11-22 Thread Arnt Gulbrandsen

Cameron Simpson writes:
> Regarding non-ASCII, we decode to bytes. Valid message-ids only
> contain ASCII anyway.


Actually message-ids can contain more than just ASCII, see RFC 6530 and 
friends. But never 2047. (And even 6530-cognisant senders do well to keep 
message-id to just ASCII.)


My suggestion is to drop any message-id that contains 2047 on the floor.
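
A rough sketch of what "drop on the floor" could look like, in Python. The pattern only approximates the shape of an RFC 2047 encoded-word, and the tokens and the helper name `drop_encoded` are hypothetical:

```python
import re

# Approximate shape of an encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")

def drop_encoded(tokens):
    """Keep only the tokens that contain no encoded-word."""
    return [t for t in tokens if not ENCODED_WORD.search(t)]

# Hypothetical whitespace-separated tokens from a References field.
tokens = [
    "<1479410777-6702-1-git-send-email@example.org>",
    "=?utf-8?Q?=3Cabc=40example=2Ecom=3E?=",
]
print(drop_encoded(tokens))
```

One caveat: every character of an encoded-word is also legal atext, so this would also drop a well-formed id whose left side merely happens to look like one.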

Arnt



Re: Legal values for a message-id, and references header

2016-11-22 Thread Cameron Simpson

On 22Nov2016 03:27, vincent lefevre  wrote:

On 2016-11-20 18:51:25 +1100, Cameron Simpson wrote:

On 19Nov2016 19:58, Kevin J. McCarthy  wrote:
>  References: =?utf-8?Q?=3C201611170549=2EQ3WT?=
>    =?utf-8?Q?foMB=C3=83=C2=BEngguang=2Ewu=40i?=
>    =?utf-8?Q?ntel=2Ecom=3E=20=3C1479410?= =?utf-8?Q?777-6702-1-git-sen?=
>    =?utf-8?Q?d-email-manuel=2Esch?= =?utf-8?Q?oelling=40gmx=2Ede=3E?=
>
>  
>
> If this is legal, then mutt needs to be decoding the References before
> trying to parse out the ids, because I believe it will just choke on
> this.

Wow. I would have thought that was illegal.


That's clearly illegal. MIME has been designed so that the main
features should still work with software that doesn't support MIME
(said otherwise, software that only needs to work with addresses
and Message-ID's does not need to support MIME).


Yeah. And reading the text of RFC2047 suggests that it is intended for Subject: 
lines etc and explicitly _not_ for structured fields used by mail transport 
agents, to avoid any need to rewrite the entire mail infrastructure.



Regarding the discussion below, the TL;DR is that I think that if it is
feasible mutt should decode these, but write _unencoded_ versions of these
headers and any headers derived from them. In particular, is it easy to make
mutt's header ingestion code go "strict parse, but if that fails decode with
RFC2047 and try a second time"? Probably on a specific header basis.


I don't think that Mutt should even try to support these. This could
lead to various problems. First, what to do with characters that are
illegal (e.g. non-ASCII) in a Message-ID...


I think mutt should cope with them to the extent of decoding them and including
them in the message thread stuff. I think any generated headers such as
in-reply-to and references in new messages should not contain these. I'd argue
(as one not actually writing the code:-) that the ingest should learn to decode
these for threading (thread on the decoded strings, should that be needed) and
when writing message-ids to new headers mutt should syntax check those strings
and discard invalid ones (for added points, shunt them to an
x-discarded-invalid-message-ids header:-)
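
The "syntax check and shunt" idea could be sketched like this in Python. It is an illustration only: the regex covers just the dot-atom form of an RFC 5322 msg-id, the ids are invented, and `split_ids` is a hypothetical helper, not mutt code:

```python
import re

# RFC 5322 atext, and a msg-id restricted to the dot-atom form
# (no-fold-literal and the obsolete forms are deliberately left out).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
MSG_ID = re.compile(rf"<{DOT_ATOM}@{DOT_ATOM}>")

def split_ids(ids):
    """Partition ids into (valid, discarded) by a strict syntax check."""
    valid = [i for i in ids if MSG_ID.fullmatch(i)]
    discarded = [i for i in ids if not MSG_ID.fullmatch(i)]
    return valid, discarded

# Hypothetical ids; the second contains a raw U+00FE.
ids = ["<1479410777-6702-1@example.org>", "<abc\u00feng@example.com>"]
valid, discarded = split_ids(ids)

headers = {"References": " ".join(valid)}
if discarded:
    # Raw non-ASCII must not go into a header, so escape it first.
    headers["X-Discarded-Invalid-Message-IDs"] = " ".join(
        d.encode("ascii", "backslashreplace").decode("ascii") for d in discarded
    )
print(headers)
```

The escaping step is one arbitrary choice among several; the point is only that the invalid id never reaches References: or In-Reply-To: in its raw form.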


Regarding non-ASCII, we decode to bytes. Valid message-ids only contain ASCII 
anyway. If mutt ingests and decodes this rubbish, it can write out clean _not_ 
encoded message-ids for those which are syntactically clean, making for a 
healthier ecosystem.


Cheers,
Cameron Simpson 

Television is an invention that permits you to be entertained in your living
room by people you wouldn't have in your home.  - David Frost


Re: Legal values for a message-id, and references header

2016-11-21 Thread Vincent Lefevre
On 2016-11-20 18:51:25 +1100, Cameron Simpson wrote:
> On 19Nov2016 19:58, Kevin J. McCarthy  wrote:
> > On Sun, Nov 20, 2016 at 10:08:13AM +1100, Cameron Simpson wrote:
> > > On 19Nov2016 13:13, Kevin J. McCarthy  wrote:
> > > > Should mutt be rfc-2047 encoding/decoding the references
> > > > header?
> > > 
> > > No. RFC2047 tokens need to be whitespace delimited from the surrounding
> > > text.  No whitespace is permitted inside the "<" and ">" markers which
> > > enclose a message-id:
> > 
> > Thank you for your detailed analysis, Cameron.  I will take a deeper
> > look at this soon.  Another piece of information is that they sent a
> > reply through the Fastmail web interface, which sent this:
> > 
> >  References: =?utf-8?Q?=3C201611170549=2EQ3WT?=
> >    =?utf-8?Q?foMB=C3=83=C2=BEngguang=2Ewu=40i?=
> >    =?utf-8?Q?ntel=2Ecom=3E=20=3C1479410?= =?utf-8?Q?777-6702-1-git-sen?=
> >    =?utf-8?Q?d-email-manuel=2Esch?= =?utf-8?Q?oelling=40gmx=2Ede=3E?=
> > 
> >  
> > 
> > If this is legal, then mutt needs to be decoding the References before
> > trying to parse out the ids, because I believe it will just choke on
> > this.
> 
> Wow. I would have thought that was illegal.

That's clearly illegal. MIME has been designed so that the main
features should still work with software that doesn't support MIME
(said otherwise, software that only needs to work with addresses
and Message-ID's does not need to support MIME).

> Regarding the discussion below, the TL;DR is that I think that if it is
> feasible mutt should decode these, but write _unencoded_ versions of these
> headers and any headers derived from them. In particular, is it easy to make
> mutt's header ingestion code go "strict parse, but if that fails decode with
> RFC2047 and try a second time"? Probably on a specific header basis.

I don't think that Mutt should even try to support these. This could
lead to various problems. First, what to do with characters that are
illegal (e.g. non-ASCII) in a Message-ID...

-- 
Vincent Lefèvre  - Web: 
100% accessible validated (X)HTML - Blog: 
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


Re: Legal values for a message-id, and references header

2016-11-19 Thread Cameron Simpson

On 19Nov2016 19:58, Kevin J. McCarthy  wrote:

On Sun, Nov 20, 2016 at 10:08:13AM +1100, Cameron Simpson wrote:

On 19Nov2016 13:13, Kevin J. McCarthy  wrote:
> Should mutt be rfc-2047 encoding/decoding the references
> header?

No. RFC2047 tokens need to be whitespace delimited from the surrounding
text.  No whitespace is permitted inside the "<" and ">" markers which
enclose a message-id:


Thank you for your detailed analysis, Cameron.  I will take a deeper
look at this soon.  Another piece of information is that they sent a
reply through the Fastmail web interface, which sent this:

 References: =?utf-8?Q?=3C201611170549=2EQ3WT?=
   =?utf-8?Q?foMB=C3=83=C2=BEngguang=2Ewu=40i?=
   =?utf-8?Q?ntel=2Ecom=3E=20=3C1479410?= =?utf-8?Q?777-6702-1-git-sen?=
   =?utf-8?Q?d-email-manuel=2Esch?= =?utf-8?Q?oelling=40gmx=2Ede=3E?=

 

If this is legal, then mutt needs to be decoding the References before
trying to parse out the ids, because I believe it will just choke on
this.


Wow. I would have thought that was illegal.

Regarding the discussion below, the TL;DR is that I think that if it is 
feasible mutt should decode these, but write _unencoded_ versions of these 
headers and any headers derived from them. In particular, is it easy to make 
mutt's header ingestion code go "strict parse, but if that fails decode with 
RFC2047 and try a second time"? Probably on a specific header basis.
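
As a sketch of that "strict parse, then decode and retry" flow, here is how it might look in Python using the standard library's `email.header.decode_header`. The function names and the sample header are hypothetical, the msg-id regex covers only the dot-atom form, and unknown charsets are not handled:

```python
import re
from email.header import decode_header

ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
MSG_ID = re.compile(rf"<{DOT_ATOM}@{DOT_ATOM}>")

def decode_2047(raw):
    """Decode encoded-words; undecodable bytes become U+FFFD."""
    out = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        out.append(chunk)
    return "".join(out)

def parse_references(raw):
    """Return (ids, was_decoded): strict parse first, decode only on failure."""
    ids = MSG_ID.findall(raw)
    if ids:
        return ids, False
    return MSG_ID.findall(decode_2047(raw)), True

# Hypothetical header in the same style as the one quoted above.
raw = "=?utf-8?Q?=3Cabc=40example=2Ecom=3E?= =?utf-8?Q?=20=3Cdef=40example=2Eorg=3E?="
print(parse_references(raw))
print(parse_references("<a@b.example> <c@d.example>"))
```

Because the second pass still uses the strict pattern, an id that decodes to something containing non-ASCII is skipped rather than accepted.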


Regarding the standards:

RFC2047 doesn't actually enumerate specific headers, but section 5 has a list of 
permitted and forbidden places for "encoded-words" (which the above are). I'm 
going to quote the bits I think are pertinent but please read it to see if I'm 
missing anything:


An 'encoded-word' may appear in a message header or body part header according 
to the following rules:


(1) An 'encoded-word' may replace a 'text' token (as defined by RFC 822) in 
any Subject or Comments header field, any extension message header field, or 
any MIME body part field for which the field body is defined as '*text'.  An 
'encoded-word' may also appear in any user-defined ("X-") message or body part 
header field.


Message-IDs are not "text" in RFC822 and its modern form RFC5322. So I'd say 
(1) does not permit this. A 'text' token is defined as:


  text    =   %d1-9 /     ; Characters excluding CR
              %d11 /      ;  and LF
              %d12 /
              %d14-127

(1) _does_ say "any MIME body part field for which the field body is defined as 
'*text'". But '*text' means zero or more 'text' tokens, and Message-ID: et al 
are not MIME fields.


(2) An 'encoded-word' may appear within a 'comment' delimited by "(" and ")", 
i.e., wherever a 'ctext' is allowed.  More precisely, the RFC 822 ABNF 
definition for 'comment' is amended as follows:


 comment = "(" *(ctext / quoted-pair / comment / encoded-word) ")"

This doesn't cover Message-IDs.

(3) As a replacement for a 'word' entity within a 'phrase', for example, one 
that precedes an address in a From, To, or Cc header.  The ABNF definition for 
'phrase' from RFC 822 thus becomes:


 phrase = 1*( encoded-word / word )

But a 'phrase' is just one or more 'word's and Message-IDs are not 'word's.
RFC2047 goes on to say that _any_ other use is forbidden, and tries to be
really clear about that:


These are the ONLY locations where an 'encoded-word' may appear.  In 
particular:


+ An 'encoded-word' MUST NOT appear in any portion of an 'addr-spec'.

+ An 'encoded-word' MUST NOT appear within a 'quoted-string'.

+ An 'encoded-word' MUST NOT be used in a Received header field.

+ An 'encoded-word' MUST NOT be used in parameter of a MIME Content-Type or 
Content-Disposition field, or in any structured field body except within a 
'comment' or 'phrase'.


So I think fastmail are playing fast and loose, and while mutt should try to 
cope, it sure as hell should never _emit_ this nonsense!


Cheers,
Cameron Simpson 


Re: Legal values for a message-id, and references header

2016-11-19 Thread Kevin J. McCarthy
On Sun, Nov 20, 2016 at 10:08:13AM +1100, Cameron Simpson wrote:
> On 19Nov2016 13:13, Kevin J. McCarthy  wrote:
> > Should mutt be rfc-2047 encoding/decoding the references
> > header?
> 
> No. RFC2047 tokens need to be whitespace delimited from the surrounding
> text.  No whitespace is permitted inside the "<" and ">" markers which
> enclose a message-id:

Thank you for your detailed analysis, Cameron.  I will take a deeper
look at this soon.  Another piece of information is that they sent a
reply through the Fastmail web interface, which sent this:

  References: =?utf-8?Q?=3C201611170549=2EQ3WT?=
    =?utf-8?Q?foMB=C3=83=C2=BEngguang=2Ewu=40i?=
    =?utf-8?Q?ntel=2Ecom=3E=20=3C1479410?= =?utf-8?Q?777-6702-1-git-sen?=
    =?utf-8?Q?d-email-manuel=2Esch?= =?utf-8?Q?oelling=40gmx=2Ede=3E?=

  

If this is legal, then mutt needs to be decoding the References before
trying to parse out the ids, because I believe it will just choke on
this.
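
For anyone who wants to see what is inside that header, here is a small Python sketch that decodes it with the standard library's `email.header.decode_header`. It only inspects the value; it does not reflect how mutt parses headers:

```python
from email.header import decode_header

# The References value quoted above, unfolded onto one line.
raw = (
    "=?utf-8?Q?=3C201611170549=2EQ3WT?= "
    "=?utf-8?Q?foMB=C3=83=C2=BEngguang=2Ewu=40i?= "
    "=?utf-8?Q?ntel=2Ecom=3E=20=3C1479410?= =?utf-8?Q?777-6702-1-git-sen?= "
    "=?utf-8?Q?d-email-manuel=2Esch?= =?utf-8?Q?oelling=40gmx=2Ede=3E?="
)

decoded = "".join(
    chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
    for chunk, charset in decode_header(raw)
)
print(decoded)

# The bytes C3 83 C2 BE decode to U+00C3 U+00BE, not to U+00FE.
# Re-encoding that pair as Latin-1 and decoding it as UTF-8 yields U+00FE,
# which suggests the character was UTF-8 encoded twice along the way.
print("\u00c3\u00be".encode("latin-1").decode("utf-8"))
```

Note also that the encoded-words split both message-ids at arbitrary points, so the ids only become visible after all words are decoded and joined.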

-- 
Kevin J. McCarthy
GPG Fingerprint: 8975 A9B3 3AA3 7910 385C  5308 ADEF 7684 8031 6BDA




Re: Legal values for a message-id, and references header

2016-11-19 Thread Cameron Simpson

On 19Nov2016 13:13, Kevin J. McCarthy  wrote:

On #mutt, andrey_utkin_ reported getting a bounce trying to reply to a
linux-kernel mailing list email.  When he replied, vger.kernel.org
bounced it because of raw utf-8 in a header.

He posted a gist at


I don't know how long those hang around, but the problem is in the
References header: <201611170549.Q3WTfoMBþngguang...@intel.com>
contains the utf-8 character "þ".

Are any of you familiar with the rules for Message-ID and References
headers?


Somewhat. Reviewing RFC 5322 right now to see how dated my knowledge is ...

 https://tools.ietf.org/rfcmarkup/5322


Should mutt be rfc-2047 encoding/decoding the references
header?


No. RFC2047 tokens need to be whitespace delimited from the surrounding text.  
No whitespace is permitted inside the "<" and ">" markers which enclose a 
message-id:


 https://tools.ietf.org/rfcmarkup/5322#section-3.6.4

The whitespace padding requirement is discussed in RFC2047 section 5:

 https://tools.ietf.org/rfcmarkup?doc=2047#section-5

The RFC5322 message-id syntax prevents using RFC2047.
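
A minimal Python sketch of the RFC 5322 section 3.6.4 grammar may help here. It covers only the current dot-atom form (no-fold-literal and the obsolete syntax are left out), and all sample ids are invented:

```python
import re

# msg-id = "<" id-left "@" id-right ">"
# id-left and id-right are both treated as dot-atom-text in this sketch.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
MSG_ID = re.compile(rf"<{DOT_ATOM}@{DOT_ATOM}>")

def is_msg_id(token):
    """True if the whole token is a syntactically valid msg-id."""
    return MSG_ID.fullmatch(token) is not None

samples = [
    "<1479410777-6702-1-git-send-email@example.org>",  # plain ASCII id
    "<abc def@example.com>",                           # whitespace inside
    "<abc\u00fedef@example.com>",                      # raw non-ASCII
    "=?utf-8?Q?=3Cabc=40example=2Ecom=3E?=",           # encoded-word
]
for s in samples:
    print(is_msg_id(s), ascii(s))
```

The encoded-word fails because the literal "<", "@" and ">" that the grammar requires are hidden inside the encoding.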

I think the cited message-id is simply illegal and unfixable. Mutt should 
perhaps support it for stitching threads together, but arguably _not_ release 
such a thing into the wild in new References: or In-Reply-To: headers.



What about the domain part - should we be idn encoding that
part if $idn_encode is set?


Perhaps, if required. It looks like RFC3490's encoding is legal dot-text for 
RFC5322 (based on my reading of the Wikipedia article). The RFC is here:


 https://tools.ietf.org/rfcmarkup?doc=3490

and the article I've consulted has the relevant section here:

 https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode
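
As a quick check of that reading, Python's built-in "idna" codec (which implements RFC 3490) can show what ToASCII produces. The domain below is a made-up example, and this says nothing about how mutt's $idn_encode is implemented:

```python
import re

domain = "b\u00fccher.example"   # hypothetical internationalized domain

# ToASCII: each non-ASCII label becomes an "xn--" Punycode label.
ascii_domain = domain.encode("idna").decode("ascii")
print(ascii_domain)

# The result uses only letters, digits, hyphens and dots, all of which
# are allowed in RFC 5322 dot-atom-text.
print(re.fullmatch(r"[A-Za-z0-9.-]+", ascii_domain) is not None)

# ToUnicode: the conversion round-trips.
print(ascii_domain.encode("ascii").decode("idna") == domain)
```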

Cheers,
Cameron Simpson 


Legal values for a message-id, and references header

2016-11-19 Thread Kevin J. McCarthy
On #mutt, andrey_utkin_ reported getting a bounce trying to reply to a
linux-kernel mailing list email.  When he replied, vger.kernel.org
bounced it because of raw utf-8 in a header.

He posted a gist at


I don't know how long those hang around, but the problem is in the
References header: <201611170549.Q3WTfoMBþngguang...@intel.com>
contains the utf-8 character "þ".

Are any of you familiar with the rules for Message-ID and References
headers?  Should mutt be rfc-2047 encoding/decoding the references
header?  What about the domain part - should we be idn encoding that
part if $idn_encode is set?

-- 
Kevin J. McCarthy
GPG Fingerprint: 8975 A9B3 3AA3 7910 385C  5308 ADEF 7684 8031 6BDA

