[issue45066] email parser fails to decode quoted-printable rfc822 message attachemnt

2021-08-31 Thread Diego Ramirez


Change by Diego Ramirez :


--
nosy: +DiddiLeija

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45066] email parser fails to decode quoted-printable rfc822 message attachemnt

2021-08-31 Thread anarcat


anarcat  added the comment:

looking at email.feedparser.FeedParser._parsegen(), it looks like this is 
going to be really hard to fix, because the parser just happily recurses into 
the sub-part without ever checking the CTE (content-transfer-encoding). that's 
typically only done on "get_payload()", which is obviously not called there 
because we're streaming the email in.
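
A minimal sketch of that difference, assuming a single text/plain part rather than the attachment from the report (the sample message is made up for illustration): the parser stores the body still encoded, and the CTE is only applied when get_payload(decode=True) is called.

```python
import email

# Made-up sample: one leaf part, quoted-printable encoded ("=3D" encodes "=").
raw = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "a=3Db\n"
)

msg = email.message_from_string(raw)

# The parser did not touch the transfer encoding: the payload is stored as-is.
stored = msg.get_payload()

# Decoding only happens on demand, when the payload is requested decoded.
decoded = msg.get_payload(decode=True)

print(repr(stored))
print(repr(decoded))
```

For a message/rfc822 part there is no such on-demand step, since the parser has already recursed into the still-encoded text by the time the part exists.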

in general, support for quoted-printable as a CTE (defined in 
https://datatracker.ietf.org/doc/html/rfc2045#section-6.7) seems to be spotty 
at best. multipart/ parts will flag the (undocumented) defect 
InvalidMultipartContentTransferEncodingDefect if they encounter it, for example:

https://github.com/python/cpython/blob/3.9/Lib/email/feedparser.py#L322

so I'm not sure how to handle this. it's not clear to me either how to 
work around this problem at all... is there a way to keep the parser from 
recursing like this?

--




[issue45066] email parser fails to decode quoted-printable rfc822 message attachemnt

2021-08-31 Thread anarcat


New submission from anarcat :

If an email message has a message/rfc822 part *and* that part is
quoted-printable encoded, Python freaks out.

Consider this code:

import email.parser
import email.policy

# python 3.9.2 cannot decode this message, it fails with
# "email.errors.StartBoundaryNotFoundDefect"

mail = """Mime-Version: 1.0
Content-Type: multipart/report;
 boundary=aa
Content-Transfer-Encoding: 7bit


--aa
Content-Type: message/rfc822
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary=3D"=3Dbb"


--=3Dbb
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=3Dutf-8

x=
x

--=3Dbb--

--aa--
"""

msg_abuse = email.parser.Parser(
    policy=email.policy.default + email.policy.strict
).parsestr(mail)

That crashes with: email.errors.StartBoundaryNotFoundDefect

This should normally work: the sub-message is valid, assuming you
decode the content. But if the parser does not decode it, you end up
in this bizarre situation: the multipart boundary is probably
considered to be something like `3D"=3Dbb"`, no line in the body
matches it, and the above code crashes with the above exception.
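
A rough sketch of that claim, assuming the transfer encoding is undone by hand with the standard quopri module before parsing (this is not something the email parser does itself): once decoded, the inner message from the example goes through the strict parser.

```python
import email.parser
import email.policy
import quopri

# Body of the message/rfc822 part from the report, still quoted-printable encoded.
encoded_inner = b"""MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary=3D"=3Dbb"


--=3Dbb
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=3Dutf-8

x=
x

--=3Dbb--
"""

# Undo the transfer encoding by hand: "=3D" becomes "=", and the trailing "="
# soft line break joins "x" and "x" back into one line.
decoded_inner = quopri.decodestring(encoded_inner).decode("utf-8")

# Same strict policy as in the report: any defect raises instead of being recorded.
strict = email.policy.default + email.policy.strict
inner = email.parser.Parser(policy=strict).parsestr(decoded_inner)

print(inner.get_boundary())
print([part.get_content_type() for part in inner.iter_parts()])
```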

If you remove the quoted-printable part from the equation, the parser actually 
behaves:

import email.parser
import email.policy

# python 3.9.2 parses this message without error: the inner
# message is not quoted-printable encoded

mail = """Mime-Version: 1.0
Content-Type: multipart/report;
 boundary=aa
Content-Transfer-Encoding: 7bit


--aa
Content-Type: message/rfc822
Content-Disposition: inline

MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="=bb"


--=bb
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8

xx

--=bb--

--aa--
"""

msg_abuse = email.parser.Parser(
    policy=email.policy.default + email.policy.strict
).parsestr(mail)

The above correctly parses the message.

This problem causes all sorts of weird issues. In one real-world
example, it would just stop parsing headers inside the email because
long header lines (typical of Received headers) would get
broken up... So it would not actually fail completely. Or, to be more
accurate, by *default* (i.e. if you do not use strict), it does not
crash and instead produces invalid data (e.g. a message without a
Message-ID or From).

On most messages that are encoded this way, the strict mode will
actually fail with: email.errors.MissingHeaderBodySeparatorDefect
because it will stumble upon a header line that should be a
continuation but instead is treated like a full header line, so it's
missing a colon (":").
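
To see what the default (non-strict) mode recorded instead of raising, one option is to walk the parsed tree and collect each part's defects list. A rough sketch, where collect_defects is a hypothetical helper and the sample message (a multipart whose boundary never appears) is made up for illustration rather than taken from the real-world case:

```python
import email.errors
import email.parser
import email.policy


def collect_defects(msg):
    """Return (content_type, defect) pairs for every part in the tree."""
    found = []
    for part in msg.walk():
        for defect in part.defects:
            found.append((part.get_content_type(), defect))
    return found


# Made-up sample: declares a boundary that never appears in the body.
broken = """Mime-Version: 1.0
Content-Type: multipart/mixed; boundary=aa

no boundary line here
"""

# Default policy: defects are recorded on the message objects, not raised.
msg = email.parser.Parser(policy=email.policy.default).parsestr(broken)
for content_type, defect in collect_defects(msg):
    print(content_type, type(defect).__name__)
```

Checking this list after a non-strict parse is one way to notice that a message came out damaged even though no exception was raised.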

--
components: email
messages: 400764
nosy: anarcat, barry, r.david.murray
priority: normal
severity: normal
status: open
title: email parser fails to decode quoted-printable rfc822 message attachemnt
type: crash
versions: Python 3.9
