[issue5902] Stricter codec names
Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
> What is the status of this? Status=open and Resolution=rejected contradict each other.

Sorry, forgot to close the ticket.

___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue5902 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue5902] Stricter codec names
Changes by Marc-Andre Lemburg m...@egenix.com:

-- status: open -> closed
[issue5902] Stricter codec names
Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
>> Accepting all common forms for encoding names means that you can usually give Python an encoding name from, e.g., an HTML page, or any other file or system that specifies an encoding.
>
> I don't buy this argument. Running the attached script on http://www.iana.org/assignments/character-sets shows that there are hundreds of registered charsets that are not accepted by Python:
>
> $ ./python.exe iana.py | wc -l
> 413
>
> Any serious HTML or XML processing software should be based on the IANA character-sets file rather than on the ad-hoc list of aliases that made it into encodings/aliases.py.

Let's do a reality check: how often do you see requests for additions to the aliases we have in Python? Perhaps one every year, if at all.

We take great care not to add aliases that are not in common use or that do not have a proven track record of really being compatible with the codec in question. If you think we are missing some aliases, please open tickets for them, indicating why they should be added.

If you really want complete IANA coverage, I suggest you create a normalization module which maps the IANA names to our names and upload it to PyPI.
[issue5902] Stricter codec names
Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
> Ezio and I discussed the implementation of alias lookup on IRC, and neither of us was able to point to the function that strips non-alphanumeric characters from encoding names.

I think you are misunderstanding the way the codec registry works. You register codec search functions with it, which then have to try to map a given encoding name to a codec module. The stdlib ships with one such function (defined in encodings/__init__.py), which is registered with the codec registry by default. The codec search function takes care of any normalization and of the conversion to the module name used by the codecs in that codec package.

> It turns out that there are three normalize functions that are successively applied to the encoding name during evaluation of str.encode/str.decode.
>
> 1. normalize_encoding() in unicodeobject.c

This was added so that the few shortcuts we have in the C code for commonly used codecs match more encoding aliases. The shortcuts completely bypass the codec registry and also avoid the function call overhead incurred by codecs run via the codec registry.

> 2. normalizestring() in codecs.c

This is the normalization applied by the codec registry. See PEP 100 for details: "Search functions are expected to take one argument, the encoding name in all lower case letters and with hyphens and spaces converted to underscores, ..."

> 3. normalize_encoding() in encodings/__init__.py

This is part of the stdlib encodings package's codec search function.

> Each performs a slightly different transformation and only the last one strips non-alphanumeric characters. The complexity of codec lookup is comparable with that of the import mechanism!

It's flexible, but not really complex. I hope the above clarifies the reasons for the three normalization functions.
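The stages described above can be poked at from Python itself. A minimal sketch; note that encodings.normalize_encoding is an internal helper, not a documented API, so its exact output may vary across versions:

```python
import codecs
import encodings

# Stage 3: the stdlib search function's helper collapses runs of
# punctuation/whitespace in an encoding name into single underscores.
print(encodings.normalize_encoding('utf-8'))       # utf_8
print(encodings.normalize_encoding('ISO 8859-1'))  # ISO_8859_1

# Net effect through the codec registry: many spellings resolve to
# the same codec, whose canonical name is exposed as CodecInfo.name.
for spelling in ('UTF-8', 'utf_8', 'utf8', 'u8'):
    print(spelling, '->', codecs.lookup(spelling).name)
```

Each spelling in the loop prints 'utf-8' as the canonical name, since all of them normalize to the same codec module.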
[issue5902] Stricter codec names
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

What is the status of this? Status=open and Resolution=rejected contradict each other.

This discussion is relevant to issue11303. Currently, alias lookup incurs a huge performance penalty in some cases.

-- nosy: +belopolsky
[issue5902] Stricter codec names
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

> Accepting all common forms for encoding names means that you can usually give Python an encoding name from, e.g., an HTML page, or any other file or system that specifies an encoding.

I don't buy this argument. Running the attached script on http://www.iana.org/assignments/character-sets shows that there are hundreds of registered charsets that are not accepted by Python:

$ ./python.exe iana.py | wc -l
413

Any serious HTML or XML processing software should be based on the IANA character-sets file rather than on the ad-hoc list of aliases that made it into encodings/aliases.py.

-- Added file: http://bugs.python.org/file20873/iana.py
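The attached iana.py is not reproduced in the archive; a hypothetical reconstruction of such a check (the registry URL and the Name/Alias field layout follow the IANA text format, everything else is assumed) might look like:

```python
import codecs
import re

IANA_URL = 'http://www.iana.org/assignments/character-sets'

def unknown_charsets(registry_text):
    """Yield registered charset names that codecs.lookup() rejects."""
    names = set(re.findall(r'^(?:Name|Alias):\s*(\S+)',
                           registry_text, re.MULTILINE))
    names.discard('None')  # the registry uses 'None' for empty aliases
    for name in sorted(names):
        try:
            codecs.lookup(name)
        except LookupError:
            yield name

# To run against the live registry (network access required):
#   import urllib.request
#   text = urllib.request.urlopen(IANA_URL).read().decode('ascii', 'replace')
#   print('\n'.join(unknown_charsets(text)))
```

Piped through `wc -l`, the output count would correspond to the 413 rejected names mentioned above (for the registry as it stood at the time).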
[issue5902] Stricter codec names
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Ezio and I discussed the implementation of alias lookup on IRC, and neither of us was able to point to the function that strips non-alphanumeric characters from encoding names. It turns out that there are three normalize functions that are successively applied to the encoding name during evaluation of str.encode/str.decode:

1. normalize_encoding() in unicodeobject.c
2. normalizestring() in codecs.c
3. normalize_encoding() in encodings/__init__.py

Each performs a slightly different transformation, and only the last one strips non-alphanumeric characters. The complexity of codec lookup is comparable with that of the import mechanism!
[issue5902] Stricter codec names
Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-05-04 19:04, Georg Brandl wrote:
> So, do you also think utf and latin should stay?

For Python 3.x, I think those can be removed. For 2.x it's better to keep them.

Note that UTF-8 was the first official Unicode transfer encoding, which is why it's sometimes referred to as UTF. The situation is similar for Latin-1: it was the first of a series of encodings defined by ECMA, which was later published by ISO under the name ISO-8859, long after the name Latin-1 became popular, which is why it's the default name in Python.
[issue5902] Stricter codec names
Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-05-02 11:20, Georg Brandl wrote:
> I don't think this is a good idea. Accepting all common forms for encoding names means that you can usually give Python an encoding name from, e.g., an HTML page, or any other file or system that specifies an encoding. If we only supported, e.g., UTF-8 and no other spelling, that would make life much more difficult.
>
> If you look into encodings/__init__.py, you can see that throwing out all non-alphanumerics is a conscious design choice in encoding name normalization. The only thing I don't know is why utf is an alias for utf-8.
>
> Assigning to Marc-Andre, who implemented most of codecs.

-1 on making codec names strict. The reason why we have so many aliases is to enhance compatibility with other software and data, not to encourage use of these aliases in Python itself.
[issue5902] Stricter codec names
Georg Brandl ge...@python.org added the comment:

So, do you also think utf and latin should stay?
[issue5902] Stricter codec names
Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

Well, there are multiple UTF encodings, so no to utf. Are there multiple Latin encodings? Not in Python 2.6.2 under those names. I'd probably insist on names that are strict-ish, i.e. correct give or take a '-' or '_'.
[issue5902] Stricter codec names
Ezio Melotti ezio.melo...@gmail.com added the comment:

Actually, I'd like to have some kind of convention mainly for when the user writes the encoding as a string, e.g. s.encode('utf-8'). Indeed, if the encoding comes from a webpage or somewhere else, it makes sense to have some flexibility.

I think that 'utf-8' is the most widely used name for the UTF-8 codec, and it's not even mentioned in the table of the standard encodings. So someone will use 'utf-8', someone else 'utf_8', and some users could even pick one of the aliases, like 'U8'. It is probably enough to document 'utf-8', 'iso-8859-1' and similar as the preferred forms, and to explain why and how codec names are normalized and which aliases are valid.

Regarding the ambiguity of 'UTF', it is not the only one; there is also 'LATIN' among the aliases of ISO-8859-1.
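The alias table under discussion is importable, so the spellings that reach a given codec can be listed directly (the exact set of aliases may differ between Python versions):

```python
from encodings.aliases import aliases

# aliases maps a normalized alias to the canonical codec module name,
# e.g. 'u8' -> 'utf_8'.  Invert it to list the spellings per codec.
for target in ('utf_8', 'latin_1'):
    spellings = sorted(a for a, t in aliases.items() if t == target)
    print(target, '<-', ', '.join(spellings))
```

This makes it easy to see at a glance which aliases (like 'u8' or 'latin') a stricter policy would have to deprecate.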
[issue5902] Stricter codec names
New submission from Ezio Melotti ezio.melo...@gmail.com:

I noticed that codec names[1]:
1) can contain random/unnecessary spaces and punctuation;
2) have several aliases that could probably be removed.

A few examples of valid codec names (done with Python 3):

>>> s = 'xxx'
>>> s.encode('utf')
b'xxx'
>>> s.encode('utf-')
b'xxx'
>>> s.encode('}Utf~-8-~siG{ ;)')
b'\xef\xbb\xbfxxx'

'utf' is an alias for UTF-8, and it doesn't quite make sense to me that 'utf' alone refers to UTF-8. 'utf-' could be a mistyped 'utf-8', 'utf-7' or even 'utf-16'; I'd like it to raise an error instead. The third example is probably not something that can be found in the real world (I hope), but it shows how permissive the parsing of the names is. Apparently the whitespace is removed, the punctuation is used to split the name into several parts, and then the check is performed.

About the aliases: in the documentation the official name for the UTF-8 codec is 'utf_8', and there are 3 more aliases: U8, UTF, utf8. For ISO-8859-1, the official name is 'latin_1', and there are 7 more aliases: iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1.

The Zen says "There should be one-- and preferably only one --obvious way to do it.", so I suggest to:
1) disallow random punctuation and spaces within the name (only allow leading and trailing spaces);
2) change the default names to, for example, 'utf-8' and 'iso-8859-1' instead of 'utf_8' and 'iso8859_1' (the names are case-insensitive);
3) remove the unnecessary aliases, for example 'UTF' and 'U8' for UTF-8, and 'iso8859-1', '8859', 'latin' and 'L1' for ISO-8859-1.

This last point could break some code and may need a DeprecationWarning. If there are good reasons to keep these aliases around, only the other two issues can be addressed. If the name of the codec has to be a valid variable name (that is, without '-'), only the documentation could be changed to have 'utf-8', 'iso-8859-1', etc. as the preferred names.
[1]: http://docs.python.org/library/codecs.html#standard-encodings http://docs.python.org/3.0/library/codecs.html#standard-encodings

-- assignee: georg.brandl components: Documentation, Library (Lib) messages: 86933 nosy: ezio.melotti, georg.brandl severity: normal status: open title: Stricter codec names type: behavior versions: Python 2.6, Python 2.7, Python 3.0, Python 3.1
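The permissive behaviour reported in this issue can be checked through the codec registry; a small sketch (results are from one Python 3 version, and since the alias set has been debated, they may change):

```python
import codecs

# Each spelling below reaches some codec; CodecInfo.name shows which.
# Lookup lowercases the name, so 'U8' and 'u8' behave identically.
for spelling in ('utf', 'U8', 'latin', 'L1', '8859'):
    print(spelling, '->', codecs.lookup(spelling).name)
```

All five spellings succeed: the first two resolve to the codec named 'utf-8' and the last three to 'iso8859-1', which is the kind of looseness the report objects to.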
[issue5902] Stricter codec names
Georg Brandl ge...@python.org added the comment:

I don't think this is a good idea. Accepting all common forms for encoding names means that you can usually give Python an encoding name from, e.g., an HTML page, or any other file or system that specifies an encoding. If we only supported, e.g., UTF-8 and no other spelling, that would make life much more difficult.

If you look into encodings/__init__.py, you can see that throwing out all non-alphanumerics is a conscious design choice in encoding name normalization. The only thing I don't know is why utf is an alias for utf-8.

Assigning to Marc-Andre, who implemented most of codecs.

-- assignee: georg.brandl -> lemburg nosy: +lemburg resolution: -> rejected status: open -> pending
[issue5902] Stricter codec names
Antoine Pitrou pit...@free.fr added the comment:

Is there any reason for allowing utf as an alias for utf-8? It sounds much too ambiguous. The other silly variants (those with lots of spurious punctuation characters) could be forbidden too.

-- nosy: +pitrou status: pending -> open
[issue5902] Stricter codec names
Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

How about a 'full' form and a 'key' form generated by the function:

def codec_key(name):
    return name.lower().replace("-", "").replace("_", "")

The key form would be the key to an available codec, and the key generated by a user-supplied codec name would have to match one of those keys. For example:

Full: UTF-8, key: utf8.
Full: ISO-8859-1, key: iso88591.

-- nosy: +mrabarnett
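A self-contained sketch of this proposal (the table of full names is illustrative, not the actual codec list; resolve() is a hypothetical helper name):

```python
def codec_key(name):
    """Reduce a codec name to its key form: lowercase, '-'/'_' removed."""
    return name.lower().replace("-", "").replace("_", "")

# Illustrative subset of 'full' names, indexed by their key form.
FULL_NAMES = {codec_key(n): n for n in ('UTF-8', 'UTF-16', 'ISO-8859-1')}

def resolve(user_name):
    """Map a user-supplied spelling to its full name, strictly."""
    try:
        return FULL_NAMES[codec_key(user_name)]
    except KeyError:
        raise LookupError('unknown encoding: %r' % user_name)

print(resolve('utf_8'))       # UTF-8
print(resolve('Iso-8859-1'))  # ISO-8859-1
# resolve('utf') raises LookupError, as the stricter scheme intends.
```

Under this scheme, spellings that differ only in case, '-' and '_' still work, while truncated or decorated names like 'utf' or '}Utf~-8' are rejected.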