New submission from Marc Villain <marc.vill...@epita.fr>:

I am parsing an email whose subject header contains an encoded Unicode character that is cut in half, with its bytes split across two encoded words. When a second encoded Unicode character is encountered, we get the following error:
> 'utf-8' codec can't encode characters in position 1-2: surrogates not allowed

This error can be reproduced using the following:

>>> from email.message import EmailMessage
>>> msg = EmailMessage()
>>> msg.add_header('subject', '=?UTF-8?Q?a=C3?= =?UTF-8?Q?=B1o_a=C3=B1o?=')
>>> print(str(msg)) # This will succeed
>>> print(msg.as_bytes()) # This will fail
>>> print(msg.as_string()) # This will fail

After a bit of investigation, it appears that the library is at some point trying to concatenate 'a\udcc3\udcb1o ' and 'cómo'. It then tries to call _ew.encode in email._header_value_parser._fold_as_ew on the result. This fails because '\udcc3\udcb1o' contains lone surrogates, which cannot be encoded as utf-8, whereas 'cómo' can.

More tests:

[OK] '=?UTF-8?Q?a=C3?= =?UTF-8?Q?=B1o_a=C3=B1o?='
   > b' subject: =?utf-8?q?a=C3=B1o_c=C3=B3mo?=\n\n'
[OK] '=?UTF-8?Q?a=C3?= =?UTF-8?Q?=B1o_cmo?='
   > b' subject: =?unknown-8bit?q?a=C3=B1o?= cmo\n\n'
[OK] '=?UTF-8?Q?a=C3?= =?UTF-8?Q?=B1o?= =?UTF-8?Q?a=C3?= =?UTF-8?Q?=B1o?='
   > b' subject: =?unknown-8bit?q?a=C3=B1oa=C3=B1o?=\n\n'
[KO] '=?UTF-8?Q?a=C3?= =?UTF-8?Q?=B1o_a=C3=B1o?='
   > 'utf-8' codec can't encode characters in position 1-2: surrogates not allowed

I am not sure what the best way to fix this is.

----------
components: Library (Lib)
messages: 407379
nosy: marc.villain
priority: normal
severity: normal
status: open
title: EmailMessage as_bytes
type: crash
versions: Python 3.10

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45938>
_______________________________________
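
For illustration only, the encoding failure reported above can be shown at the codec level, without the email package. This is a minimal sketch, not the library's internal code path: it assumes the two halves of the split character were each decoded with the surrogateescape error handler, which is one way to end up with the string 'a\udcc3\udcb1o ' quoted in the report. The variable names are hypothetical.

```python
# Assumption: each half of the split UTF-8 sequence is decoded on its own
# with surrogateescape, so the undecodable bytes become lone surrogates.
first = b"a\xc3".decode("utf-8", errors="surrogateescape")    # 'a\udcc3'
second = b"\xb1o ".decode("utf-8", errors="surrogateescape")  # '\udcb1o '

# Concatenate with ordinary, valid text ('c\u00f3mo' is 'cómo').
joined = first + second + "c\u00f3mo"

# A strict utf-8 encode rejects the lone surrogates.
try:
    joined.encode("utf-8")
    strict_ok = True
    strict_error = ""
except UnicodeEncodeError as exc:
    strict_ok = False
    strict_error = str(exc)

# Encoding with surrogateescape restores the original raw bytes instead.
roundtrip = joined.encode("utf-8", errors="surrogateescape")

print(strict_ok)
print(strict_error)
print(roundtrip)
```

The strict encode fails on the surrogates at positions 1 and 2, while encoding with surrogateescape recovers the raw bytes, which then decode cleanly as UTF-8 once the two halves sit next to each other.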