New submission from era <era+pyt...@iki.fi>:
The email.charset module should contain common informal character-set identifiers even if they are not formally specified in an IANA RFC.

From a quick grep of a pile of recent email, I find the following:

    46 "cp-850"
     6 "windows-874"

For scale, the same collection contained around 10,000 messages with "utf-8" and 2,000 with "iso-8859-1". Still, the fact that there are multiple occurrences in a spool of recent messages indicates that they are fairly common.

Currently, the email module raises an exception (and prints a traceback) if you attempt to parse a message whose character set is not known to Python. This is not possible to prevent in the general case, but making the module more robust with encodings that are reasonably prevalent in the wild would definitely be desirable.

For what it's worth, "cp-850" is apparently an alias for IBM code page 850, which is defined with the name "cp850" in RFC 1345. "windows-874" is an official designation, detailed in https://www.iana.org/assignments/charset-reg/windows-874, which is apparently equivalent to the Python codec "cp874".

----------
components: email
messages: 323870
nosy: barry, era, r.david.murray
priority: normal
severity: normal
status: open
title: email.charset: common IANA labels missing
versions: Python 3.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34460>
_______________________________________
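
Until labels such as "cp-850" and "windows-874" are recognized by the standard library, an application can map them to Python codec names itself before decoding. The following is only a minimal sketch of such a workaround, not a patch to email.charset: the names INFORMAL_ALIASES, resolve_codec and decode_payload are hypothetical, and the mapping assumes the equivalences described in this report ("cp-850" to "cp850", "windows-874" to "cp874").

```python
import codecs

# Hypothetical application-side table: informal labels seen in the wild,
# mapped to the Python codec names this report says they correspond to.
INFORMAL_ALIASES = {
    "cp-850": "cp850",
    "windows-874": "cp874",
}


def resolve_codec(label, default="ascii"):
    """Return a Python codec name for a charset label, or `default` if unknown."""
    key = label.strip().lower()
    name = INFORMAL_ALIASES.get(key, key)
    try:
        # codecs.lookup raises LookupError for names Python does not know.
        return codecs.lookup(name).name
    except LookupError:
        return default


def decode_payload(data, label):
    """Decode raw payload bytes, replacing undecodable bytes instead of raising."""
    return data.decode(resolve_codec(label), errors="replace")
```

Falling back to "ascii" with errors="replace" is a design choice that trades fidelity for robustness: an unknown label yields readable ASCII text with replacement characters rather than a traceback.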