[issue5902] Stricter codec names

2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
 
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> What is the status of this?  Status=open and Resolution=rejected contradict
> each other.

Sorry, forgot to close the ticket.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5902
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue5902] Stricter codec names

2011-02-24 Thread Marc-Andre Lemburg

Changes by Marc-Andre Lemburg m...@egenix.com:


--
status: open -> closed




[issue5902] Stricter codec names

2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
 
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
>> Accepting all common forms for
>> encoding names means that you can usually give Python an encoding name
>> from, e.g. a HTML page, or any other file or system that specifies an
>> encoding.
>
> I don't buy this argument.  Running attached script on
> http://www.iana.org/assignments/character-sets shows that there are hundreds
> of registered charsets that are not accepted by python:
>
> $ ./python.exe iana.py| wc -l
>  413
>
> Any serious HTML or XML processing software should be based on the IANA
> character-sets file rather than on the ad-hoc list of aliases that made it
> into encodings/aliases.py.

Let's do a reality check:

How often do you see requests for additions to the aliases we
have in Python? Perhaps one every year, if at all.

We take great care not to add aliases that are not in common
use or that do not have a proven track record of really being
compatible with the codec in question.

If you think we are missing some aliases, please open tickets
for them, indicating why these should be added.

If you really want complete IANA coverage, I suggest you create
a normalization module which maps the IANA names to our names
and upload it to PyPI.
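
A minimal sketch of what such a normalization module might look like.
The names IANA_TO_PYTHON and iana_lookup are hypothetical, and the two
table entries (IANA spellings of EUC-JP) are only illustrative; a real
module would presumably be generated from the IANA character-sets file,
with each mapping checked for actual compatibility.

```python
import codecs

# Hypothetical table: lowercased IANA names/aliases -> Python codec names.
# Only two illustrative entries are shown here.
IANA_TO_PYTHON = {
    "extended_unix_code_packed_format_for_japanese": "euc_jp",
    "cseucpkdfmtjapanese": "euc_jp",
}


def iana_lookup(name):
    """Return a CodecInfo for name, trying Python's own lookup first."""
    try:
        return codecs.lookup(name)
    except LookupError:
        pass
    # Fall back to the (assumed) IANA mapping table.
    python_name = IANA_TO_PYTHON.get(name.strip().lower())
    if python_name is None:
        raise LookupError("unknown encoding: %s" % name)
    return codecs.lookup(python_name)
```

Trying codecs.lookup() first keeps the table small: it only needs the
IANA names that Python does not already accept.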




[issue5902] Stricter codec names

2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
 
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> Ezio and I discussed on IRC the implementation of alias lookup and neither of
> us was able to point to the function that strips non-alphanumeric
> characters from encoding names.

I think you are misunderstanding the way the codec registry works.

You register codec search functions with it which then have to try
to map a given encoding name to a codec module.

The stdlib ships with one such function (defined in encodings/__init__.py).
This is registered with the codec registry by default.

The codec search function takes care of any normalization and conversion
to the module name used by the codecs from that codec package.
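
As a rough illustration of the registry mechanism described above, here
is a sketch that registers an additional search function with
codecs.register(). The function name demo_search and the alias
"demoutf8alias" are made up for this example; the stdlib search function
stays registered and is consulted first.

```python
import codecs


def demo_search(name):
    """Hypothetical search function that resolves one made-up alias."""
    # The registry lowercases the name before calling search functions.
    if name == "demoutf8alias":
        return codecs.lookup("utf-8")
    return None  # not ours; let other search functions try


codecs.register(demo_search)

# The made-up alias now resolves to the UTF-8 codec.
encoded = "caf\u00e9".encode("DemoUtf8Alias")
```

A search function returns a CodecInfo object when it recognizes the name
and None otherwise, so several search functions can coexist.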

> It turns out that there are three normalize functions that are successively
> applied to the encoding name during evaluation of str.encode/str.decode.
>
> 1. normalize_encoding() in unicodeobject.c

This was added to have the few shortcuts we have in the C code
for commonly used codecs match more encoding aliases.

The shortcuts completely bypass the codec registry and also
bypass the function call overhead incurred by codecs
run via the codec registry.

> 2. normalizestring() in codecs.c

This is the normalization applied by the codec registry. See PEP 100
for details:


    Search functions are expected to take one argument, the encoding
    name in all lower case letters and with hyphens and spaces
    converted to underscores, ...


> 3. normalize_encoding() in encodings/__init__.py

This is part of the stdlib encodings package's codec search
function.

> Each performs a slightly different transformation and only the last one
> strips non-alphanumeric characters.
>
> The complexity of codec lookup is comparable with that of the import
> mechanism!

It's flexible, but not really complex.

I hope the above clarifies the reasons for the three normalization
functions.
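
The Python-level part of this can be observed directly. The sketch below
only touches the third function, normalize_encoding() from the encodings
package, plus the public codecs.lookup(); the two C-level functions are
not exposed to Python code, so they are not shown.

```python
import codecs
import encodings

# Normalization applied by the encodings package's search function:
# the hyphen becomes an underscore, giving the codec module name.
module_name = encodings.normalize_encoding("utf-8")

# Different spellings end up at the same codec after normalization.
same_codec = codecs.lookup("UTF-8").name == codecs.lookup("utf_8").name
```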




[issue5902] Stricter codec names

2011-02-23 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

What is the status of this?  Status=open and Resolution=rejected contradict 
each other.

This discussion is relevant for issue11303.  Currently alias lookup incurs a
huge performance penalty in some cases.

--
nosy: +belopolsky




[issue5902] Stricter codec names

2011-02-23 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

> Accepting all common forms for
> encoding names means that you can usually give Python an encoding name
> from, e.g. a HTML page, or any other file or system that specifies an
> encoding.

I don't buy this argument.  Running attached script on 
http://www.iana.org/assignments/character-sets shows that there are hundreds of 
registered charsets that are not accepted by python:

$ ./python.exe iana.py| wc -l
 413

Any serious HTML or XML processing software should be based on the IANA 
character-sets file rather than on the ad-hoc list of aliases that made it into 
encodings/aliases.py.
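
The attached iana.py is not reproduced in this thread. As a rough sketch
of the kind of check it describes (not the attached script itself), a
list of charset names can be tested against codecs.lookup(); the function
name unsupported_names is made up here, and parsing the IANA file into
names is left out.

```python
import codecs


def unsupported_names(names):
    """Return the charset names that codecs.lookup() rejects."""
    rejected = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            rejected.append(name)
    return rejected
```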

--
Added file: http://bugs.python.org/file20873/iana.py




[issue5902] Stricter codec names

2011-02-23 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Ezio and I discussed on IRC the implementation of alias lookup and neither of 
us was able to point to the function that strips non-alphanumeric 
characters from encoding names.

It turns out that there are three normalize functions that are successively 
applied to the encoding name during evaluation of str.encode/str.decode.

1. normalize_encoding() in unicodeobject.c
2. normalizestring() in codecs.c
3. normalize_encoding() in encodings/__init__.py

Each performs a slightly different transformation and only the last one strips 
non-alphanumeric characters.

The complexity of codec lookup is comparable with that of the import mechanism!




[issue5902] Stricter codec names

2009-05-05 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-05-04 19:04, Georg Brandl wrote:
> Georg Brandl ge...@python.org added the comment:
>
> So, do you also think utf and latin should stay?

For Python 3.x, I think those can be removed. For 2.x it's better to
keep them.

Note that UTF-8 was the first official Unicode transfer encoding,
that's why it's sometimes referred to as UTF.

The situation is similar for Latin-1. It was the first of a series of
encodings defined by ECMA, which were later published by ISO under the
name ISO-8859, long after the name Latin-1 had become popular. That is
why it's the default name in Python.




[issue5902] Stricter codec names

2009-05-04 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-05-02 11:20, Georg Brandl wrote:
> Georg Brandl ge...@python.org added the comment:
>
> I don't think this is a good idea.  Accepting all common forms for
> encoding names means that you can usually give Python an encoding name
> from, e.g. a HTML page, or any other file or system that specifies an
> encoding.  If we only supported, e.g., UTF-8 and no other spelling,
> that would make life much more difficult.  If you look into
> encodings/__init__.py, you can see that throwing out all
> non-alphanumerics is a conscious design choice in encoding name
> normalization.
>
> The only thing I don't know is why utf is an alias for utf-8.
>
> Assigning to Marc-Andre, who implemented most of codecs.

-1 on making codec names strict.

The reason why we have so many aliases is to enhance compatibility
with other software and data, not to encourage use of these aliases
in Python itself.




[issue5902] Stricter codec names

2009-05-04 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

So, do you also think utf and latin should stay?




[issue5902] Stricter codec names

2009-05-04 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

Well, there are multiple UTF encodings, so no to utf.

Are there multiple Latin encodings? Not in Python 2.6.2 under those names.

I'd probably insist on names that are strictish(?), i.e. correct, give or
take a '-' or '_'.




[issue5902] Stricter codec names

2009-05-03 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Actually I'd like to have some kind of convention mainly when the user
writes the encoding as a string, e.g. s.encode('utf-8'). Indeed, if the
encoding comes from a webpage or somewhere else it makes sense to have
some flexibility.

I think that 'utf-8' is the most widely used name for the UTF-8 codec
and it's not even mentioned in the table of the standard encodings. So
someone will use 'utf-8', someone else 'utf_8' and some users could even
pick one of the aliases, like 'U8'.

It is probably enough to add 'utf-8', 'iso-8859-1' and similar as the
preferred forms, and to explain why and how the codec names are
normalized and what the valid aliases are.

Regarding the ambiguity of 'UTF', it is not the only one, there's also
'LATIN' among the aliases of ISO-8859-1.




[issue5902] Stricter codec names

2009-05-02 Thread Ezio Melotti

New submission from Ezio Melotti ezio.melo...@gmail.com:

I noticed that codec names[1]:
1) can contain random/unnecessary spaces and punctuation;
2) have several aliases that could probably be removed;

A few examples of valid codec names (done with Python 3):
>>> s = 'xxx'
>>> s.encode('utf')
b'xxx'
>>> s.encode('utf-')
b'xxx'
>>> s.encode('}Utf~-8-~siG{ ;)')
b'\xef\xbb\xbfxxx'

'utf' is an alias for UTF-8, and it doesn't quite make sense to me that
'utf' alone refers to UTF-8.
'utf-' could be a mistyped 'utf-8', 'utf-7' or even 'utf-16'; I'd like
it to raise an error instead.
The third example is probably not something that can be found in the
real world (I hope), but it shows how permissive the parsing of the names is.

Apparently the whitespace is removed and the punctuation is used to
split the name into several parts, and then the check is performed.


About the aliases: in the documentation the official name for the
UTF-8 codec is 'utf_8' and there are 3 more aliases: U8, UTF, utf8. For
ISO-8859-1, the official name is 'latin_1' and there are 7 more
aliases: iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1.
The Zen says "There should be one-- and preferably only one --obvious way to
do it.", so I suggest to:
1) disallow random punctuation and spaces within the name (only allow
leading and trailing spaces);
2) change the default names to, for example, 'utf-8' and 'iso-8859-1'
instead of 'utf_8' and 'latin_1'. The names are case-insensitive;
3) remove the unnecessary aliases, for example: 'UTF', 'U8' for UTF-8
and 'iso8859-1', '8859', 'latin', 'L1' for ISO-8859-1.

This last point could break some code and may need a
DeprecationWarning. If there are good reasons to keep these aliases
around, only the other two issues need to be addressed.
If the name of the codec has to be a valid variable name (that is,
without '-'), only the documentation could be changed to have 'utf-8',
'iso-8859-1', etc. as preferred name.
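
For reference, the alias table behind those documented names can be
inspected from Python. This is a small sketch using the stdlib's
encodings.aliases.aliases dictionary; the helper name aliases_for is
made up for the example. Note that the keys in that dictionary are
stored in normalized form (lowercase, underscores), so they do not match
the spellings in the documentation table character for character.

```python
from encodings.aliases import aliases


def aliases_for(codec_module):
    """Return the sorted alias keys that map to codec_module."""
    return sorted(alias for alias, target in aliases.items()
                  if target == codec_module)


# Normalized alias keys for the two codecs discussed above.
utf8_aliases = aliases_for("utf_8")
latin1_aliases = aliases_for("latin_1")
```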

[1]: http://docs.python.org/library/codecs.html#standard-encodings
 http://docs.python.org/3.0/library/codecs.html#standard-encodings

--
assignee: georg.brandl
components: Documentation, Library (Lib)
messages: 86933
nosy: ezio.melotti, georg.brandl
severity: normal
status: open
title: Stricter codec names
type: behavior
versions: Python 2.6, Python 2.7, Python 3.0, Python 3.1




[issue5902] Stricter codec names

2009-05-02 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

I don't think this is a good idea.  Accepting all common forms for
encoding names means that you can usually give Python an encoding name
from, e.g. a HTML page, or any other file or system that specifies an
encoding.  If we only supported, e.g., UTF-8 and no other spelling,
that would make life much more difficult.  If you look into
encodings/__init__.py, you can see that throwing out all
non-alphanumerics is a conscious design choice in encoding name
normalization.

The only thing I don't know is why utf is an alias for utf-8.

Assigning to Marc-Andre, who implemented most of codecs.

--
assignee: georg.brandl -> lemburg
nosy: +lemburg
resolution:  -> rejected
status: open -> pending




[issue5902] Stricter codec names

2009-05-02 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Is there any reason for allowing utf as an alias to utf-8? It sounds
much too ambiguous. The other silly variants (those with lots of
spurious punctuation characters) could be forbidden too.

--
nosy: +pitrou
status: pending -> open




[issue5902] Stricter codec names

2009-05-02 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

How about a 'full' form and a 'key' form generated by the function:

def codec_key(name):
    # Lowercase the name and drop hyphens and underscores.
    return name.lower().replace("-", "").replace("_", "")

The key form would be the key to an available codec, and the key
generated by a user-supplied codec name would have to match one of those
keys.

For example:

Full: UTF-8, key: utf8.

Full: ISO-8859-1, key: iso88591.
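
One way this proposal could be sketched, assuming a small hypothetical
table of 'full' names (FULL_NAMES, KEYS and strict_lookup are names
invented for this example and are not part of any existing API):

```python
def codec_key(name):
    # Lowercase the name and drop hyphens and underscores.
    return name.lower().replace("-", "").replace("_", "")


# Hypothetical list of 'full' names; the keys are derived from them.
FULL_NAMES = ["UTF-8", "UTF-16", "ISO-8859-1"]
KEYS = {codec_key(full): full for full in FULL_NAMES}


def strict_lookup(name):
    """Return the full name whose key matches, else raise LookupError."""
    try:
        return KEYS[codec_key(name)]
    except KeyError:
        raise LookupError("unknown codec name: %r" % name) from None
```

Under this scheme 'utf_8' and 'Utf8' would both resolve to UTF-8, while
'utf' and 'utf-' would be rejected because their key 'utf' matches no
entry.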

--
nosy: +mrabarnett
