New submission from era <era+pyt...@iki.fi>:

The email.charset module should contain common informal character-set 
identifiers even if they are not formally specified in an IANA RFC.

From a quick grep of a pile of recent email, I find the following:

   46 "cp-850"
    6 "windows-874"

For scale, the same collection contained around 10,000 messages with "utf-8" 
and 2,000 with "iso-8859-1".  Still, the fact that there are multiple 
occurrences in a spool of recent messages indicates that they are fairly common.

Currently, the email module raises an exception if you attempt to parse a 
message whose character set is not known to Python. This cannot be prevented 
in the general case, but making the module more robust for encodings which 
are reasonably prevalent in the wild would definitely be desirable.
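
To illustrate the kind of robustness meant here, the following is a minimal 
sketch of a caller-side workaround, not a proposed patch. The names 
INFORMAL_LABELS, resolve_codec and decode_body are hypothetical, and the 
fallback table contains only the two labels from this report. It first asks 
Python's codec registry and only falls back to the table if that lookup fails.

```python
import codecs
from email import message_from_bytes

# Hypothetical fallback table: informal label -> Python codec name.
# Only the two labels mentioned in this report are listed.
INFORMAL_LABELS = {"cp-850": "cp850", "windows-874": "cp874"}


def resolve_codec(label):
    """Return a Python codec name for a charset label, or raise LookupError."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        fallback = INFORMAL_LABELS.get(label.lower())
        if fallback is None:
            raise  # still unknown: let the caller decide what to do
        return codecs.lookup(fallback).name


def decode_body(raw_message):
    """Decode the body of a single-part message using resolve_codec()."""
    msg = message_from_bytes(raw_message)
    label = msg.get_content_charset() or "ascii"
    payload = msg.get_payload(decode=True)  # bytes after transfer decoding
    return payload.decode(resolve_codec(label), errors="replace")


# Sample message with the informal label; =82 is e-acute in code page 850.
RAW = (
    b'Content-Type: text/plain; charset="cp-850"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"caf=82\n"
)

if __name__ == "__main__":
    print(decode_body(RAW))
```

A label that is in neither the codec registry nor the table still raises 
LookupError, so the general case remains the caller's problem.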

For what it's worth, "cp-850" is apparently an alias for IBM code page 850 
which is defined with the name "cp850" in RFC1345.  "windows-874" is an 
official designation which is detailed in 
https://www.iana.org/assignments/charset-reg/windows-874 which is apparently 
equivalent to the Python codec "cp874".
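
Until such labels ship with the module, an application can register them 
itself. The sketch below uses the existing email.charset.add_alias() function 
and assumes the intended Python codecs are "cp850" and "cp874". It only 
affects email.charset.Charset objects; it does not change what the codecs 
module itself accepts, so other code paths in the email package may still 
fail on these labels.

```python
from email import charset

# Map the informal labels onto names that Python's codec registry knows.
# The targets "cp850" and "cp874" are assumptions based on this report.
charset.add_alias("cp-850", "cp850")
charset.add_alias("windows-874", "cp874")

cs = charset.Charset("cp-850")
print(cs.input_charset)  # the canonical name after alias resolution
print(cs.input_codec)    # the codec name used to decode input
```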

----------
components: email
messages: 323870
nosy: barry, era, r.david.murray
priority: normal
severity: normal
status: open
title: email.charset: common IANA labels missing
versions: Python 3.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34460>
_______________________________________