Marc-Andre Lemburg m...@egenix.com added the comment:
M.-A. Lemburg wrote:
Raymond Hettinger wrote:
Raymond Hettinger rhettin...@users.sourceforge.net added the comment:
If you agree, Raymond, I'll backport the patch.
Yes. That will address Antoine's legitimate concern about making other
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
On Fri, Feb 25, 2011 at 03:43:06PM +, Marc-Andre Lemburg wrote:
Marc-Andre Lemburg m...@egenix.com added the comment:
r88586: Normalized the encoding names for Latin-1 and UTF-8 to
'latin-1' and 'utf-8' in the stdlib.
Marc-Andre Lemburg m...@egenix.com added the comment:
Raymond Hettinger wrote:
Raymond Hettinger rhettin...@users.sourceforge.net added the comment:
If you agree, Raymond, I'll backport the patch.
Yes. That will address Antoine's legitimate concern about making other
backports
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
(Not issue related)
Ezio and Alexander: after reading your posts and looking back on my code:
you're absolutely right. Doing resize(31) is pointless: it doesn't save space
(mempool serves [8],16,24,32 there; and: dynamic,
Marc-Andre Lemburg m...@egenix.com added the comment:
r88586: Normalized the encoding names for Latin-1 and UTF-8 to
'latin-1' and 'utf-8' in the stdlib.
--
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11303
Marc-Andre Lemburg m...@egenix.com added the comment:
I think we should reset this whole discussion and just go with Alexander's
original patch issue11303.diff.
I don't know who changed the encodings package's normalize_encoding() function
(wasn't me), but it's a really slow implementation.
Marc-Andre Lemburg m...@egenix.com added the comment:
Marc-Andre Lemburg wrote:
I don't know who changed the encodings package's normalize_encoding() function
(wasn't me), but it's a really slow implementation.
The original version used the .translate() method which is a lot faster.
I
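A minimal sketch of what a .translate()-based normalization could look like, for readers who want to compare it with a character-by-character loop. This is not the code from either version of the encodings package; the names `_TABLE` and `normalize_with_translate` are hypothetical, and the rule used (keep ASCII letters, digits and '.', collapse everything else into single underscores) is an assumption chosen for illustration.

```python
import string

# Characters that survive normalization unchanged (assumed rule).
_KEEP = string.ascii_letters + string.digits + "."

# Map every other ASCII code point to a space so that split() can
# collapse runs of punctuation and whitespace in one pass.
_TABLE = {code: " " for code in range(128) if chr(code) not in _KEEP}


def normalize_with_translate(name: str) -> str:
    """Collapse runs of ASCII punctuation/whitespace into one underscore.

    Non-ASCII characters are left untouched in this sketch.
    """
    return "_".join(name.translate(_TABLE).split())


if __name__ == "__main__":
    for spelling in ("latin-1", "Latin 1", "utf(=)-8", "  UTF-8 "):
        print(repr(spelling), "->", repr(normalize_with_translate(spelling)))
```

The whole transformation is one `translate()` call plus a `split()`/`join()`, so the per-character work is done by built-in methods rather than by a Python-level loop.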
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
Committed issue11303.diff and doc change in revision 88602.
I think the remaining ideas are best addressed in issue11322.
Given that we are starting to have a whole set of such aliases
in the C code, I wonder whether it
STINNER Victor victor.stin...@haypocalc.com added the comment:
r88586: Normalized the encoding names for Latin-1 and UTF-8 to
'latin-1' and 'utf-8' in the stdlib.
Why did you do that? We are trying to find a solution together, and you changed
the code directly without any review. Your commit
Raymond Hettinger rhettin...@users.sourceforge.net added the comment:
What's wrong with Marc's commit? He's using the standard names.
--
nosy: +rhettinger
Marc-Andre Lemburg m...@egenix.com added the comment:
STINNER Victor wrote:
STINNER Victor victor.stin...@haypocalc.com added the comment:
r88586: Normalized the encoding names for Latin-1 and UTF-8 to
'latin-1' and 'utf-8' in the stdlib.
Why did you do that? We are trying to find a
Antoine Pitrou pit...@free.fr added the comment:
What's wrong with Marc's commit? He's using the standard names.
That's a pretty useless commit and it will make applying patches and backports
more tedious, for no obvious benefit.
Of course that concern will be removed if Marc-André also
Marc-Andre Lemburg m...@egenix.com added the comment:
I guess you could regard the use of the wrong encoding name as a bug - it
slows down several stdlib modules for no apparent reason.
If you agree, Raymond, I'll backport the patch.
--
title: b'x'.decode('latin1') is much slower than
Ezio Melotti ezio.melo...@gmail.com added the comment:
+1 on the backport.
Marc-Andre Lemburg m...@egenix.com added the comment:
Marc-Andre Lemburg wrote:
Marc-Andre Lemburg m...@egenix.com added the comment:
I guess you could regard the use of the wrong encoding name as a bug - it
slows down several stdlib modules for no apparent reason.
If you agree, Raymond, I'll
Raymond Hettinger rhettin...@users.sourceforge.net added the comment:
If you agree, Raymond, I'll backport the patch.
Yes. That will address Antoine's legitimate concern about making other
backports harder, and it will get all the Pythons to use the canonical
spelling.
For other spellings
Éric Araujo mer...@netwok.org added the comment:
Such warnings about performance seem to me to be the domain of code analysis or
lint tools, not the interpreter.
Antoine Pitrou pit...@free.fr added the comment:
For other spellings like utf8 or latin1, I wonder if it would be
useful to emit a warning/suggestion to use the standard spelling.
No, it would be a useless annoyance.
STINNER Victor victor.stin...@haypocalc.com added the comment:
For other spellings like utf8 or latin1, I wonder
if it would be useful to emit a warning/suggestion to use
the standard spelling.
Why do you want to emit a warning? utf8 is now as fast as utf-8.
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
For other spellings like utf8 or latin1, I wonder if it would be
useful to emit a warning/suggestion to use the standard spelling.
I would prefer to see the note added by Alexander in the doc mention *only*
the preferred spellings (i.e.
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
On Fri, Feb 25, 2011 at 8:29 PM, Antoine Pitrou rep...@bugs.python.org wrote:
..
For other spellings like utf8 or latin1, I wonder if it would be
useful to emit a warning/suggestion to use the standard spelling.
No, it
Antoine Pitrou pit...@free.fr added the comment:
If we ever decide to get rid of codec aliases in the core
If.
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
On Fri, Feb 25, 2011 at 8:39 PM, Ezio Melotti rep...@bugs.python.org wrote:
..
I would prefer to see the note added by Alexander in the doc mention *only*
the preferred spellings
(i.e. 'utf-8' and 'iso-8859-1') rather
Marc-Andre Lemburg m...@egenix.com added the comment:
Alexander Belopolsky wrote:
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
In issue11303.diff, I add similar optimization for encode('latin1') and for
'utf8' variant of utf-8. I don't think dash-less
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
I wonder what this normalize_encoding() does! Here is a pretty standard
version of mine which is a bit more expensive but catches many more cases!
This is stripped, of course, and can be rewritten very easily to Python's needs
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
(That is to say, i would do it. But not if _cpython is thrown to trash ,-);
i.e. not if there is not a slight chance that it gets actually patched in
because this performance issue probably doesn't mean a thing in real life.
Ezio Melotti ezio.melo...@gmail.com added the comment:
See also discussion on #5902.
Steffen, your normalization function looks similar to
encodings.normalize_encoding, with just a few differences (it uses spaces
instead of dashes, it divides alpha chars from digits).
If it doesn't slow down
Changes by STINNER Victor victor.stin...@haypocalc.com:
--
nosy: +haypo
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
On Thu, Feb 24, 2011 at 10:30 AM, Ezio Melotti rep...@bugs.python.org wrote:
..
See also discussion on #5902.
Mark has closed #5902 and indeed the discussion of how to efficiently
normalize encoding names (without
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
.. i haven't actually invented this algorithm (but don't ask me where i got
the idea from years ago), i've just implemented the function you see. The
algorithm itself avoids some pitfalls in respect to combining numerics and
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
(Everything else is beyond my scope. But normalizing _ to - is possibly a bad
idea as far as i can remember the situation three years ago.)
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
P.P.S.: separating alphanumerics is a win for things like, e.g. UTF-16BE: it
gets 'utf 16 be' - think about the possible misspellings here and you see this
algorithm is a good thing
--
Marc-Andre Lemburg m...@egenix.com added the comment:
Alexander Belopolsky wrote:
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
On Thu, Feb 24, 2011 at 10:30 AM, Ezio Melotti rep...@bugs.python.org wrote:
..
See also discussion on #5902.
Mark has closed
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
So, well, a-ha, i will boot my laptop this evening and (try to) write a patch
for normalize_encoding(), which will match the standard-conforming LATIN1 and
also will continue to support the illegal latin-1 without actually
Ezio Melotti ezio.melo...@gmail.com added the comment:
If the first normalization function is flexible enough to match most of the
spellings of the optimized encodings, they will all benefit from the optimization
without having to go through the long path.
(If the normalized encoding name is
STINNER Victor victor.stin...@haypocalc.com added the comment:
I think that the normalization function in unicodeobject.c (only used for
internal functions) can skip any character other than a-z, A-Z and 0-9.
Something like:
import re
def normalize(name): return re.sub('[^a-z0-9]', '',
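The suggestion above can be sketched as a complete function. The snippet in the message is cut off, so applying the substitution to the lowercased name is an assumption, and `aggressive_normalize` is a hypothetical name rather than anything in unicodeobject.c or the attached patch.

```python
import re


def aggressive_normalize(name: str) -> str:
    """Lowercase `name` and drop every character other than a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


if __name__ == "__main__":
    for spelling in ("latin-1", "Latin_1", "utf(=)-8", "iso 8859 1", "iso88591"):
        print(repr(spelling), "->", repr(aggressive_normalize(spelling)))
```

Since every separator is removed, spellings with and without dashes become indistinguishable after normalization, which is the trade-off to weigh when deciding how aggressive the internal shortcut should be.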
STINNER Victor victor.stin...@haypocalc.com added the comment:
Patch implementing my suggestion.
--
Added file: http://bugs.python.org/file20875/aggressive_normalization.patch
Ezio Melotti ezio.melo...@gmail.com added the comment:
That will also accept invalid names like 'iso88591' that are not valid now,
'iso 8859 1' is already accepted.
Changes by STINNER Victor victor.stin...@haypocalc.com:
Removed file: http://bugs.python.org/file20875/aggressive_normalization.patch
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
On Thu, Feb 24, 2011 at 11:01 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
On this ticket, we're discussing just one application area: that
of the builtin short cuts.
Fair enough. I was hoping to close this
Marc-Andre Lemburg m...@egenix.com added the comment:
As promised, here's the list of places where the wrong Latin-1 encoding
spelling is used:
Lib//test/test_cmd_line.py:
-- for encoding in ('ascii', 'latin1', 'utf8'):
Lib//test/test_codecs.py:
-- ef = codecs.EncodedFile(f,
Marc-Andre Lemburg m...@egenix.com added the comment:
STINNER Victor wrote:
STINNER Victor victor.stin...@haypocalc.com added the comment:
I think that the normalization function in unicodeobject.c (only used for
internal functions) can skip any character other than a-z, A-Z and 0-9.
Marc-Andre Lemburg m...@egenix.com added the comment:
Alexander Belopolsky wrote:
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
On Thu, Feb 24, 2011 at 11:01 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
On this ticket, we're discussing just one
STINNER Victor victor.stin...@haypocalc.com added the comment:
Ooops, I attached the wrong patch. Here is the new fixed patch.
Without the patch:
import timeit
timeit.Timer("'a'.encode('latin1')").timeit()
3.8540711402893066
timeit.Timer("'a'.encode('latin-1')").timeit()
1.4946870803833008
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
On Thu, Feb 24, 2011 at 11:31 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
I think rather than removing any hyphens, spaces, etc. the
function should additionally:
* add hyphens whenever (they are missing
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
So happy hacker haypo did it, different however. It's illegal, but since this
is a static function which only serves some specific internal strcmp(3)s it may
do for the mentioned charsets. I won't boot my laptop this evening.
Marc-Andre Lemburg m...@egenix.com added the comment:
STINNER Victor wrote:
STINNER Victor victor.stin...@haypocalc.com added the comment:
Ooops, I attached the wrong patch. Here is the new fixed patch.
That won't work, Victor, since it makes invalid encoding
names valid, e.g. 'utf(=)-8'.
Marc-Andre Lemburg m...@egenix.com added the comment:
Alexander Belopolsky wrote:
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
On Thu, Feb 24, 2011 at 11:31 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
I think rather than removing any hyphens,
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
On Thu, Feb 24, 2011 at 11:39 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
Marc-Andre Lemburg m...@egenix.com added the comment:
..
That won't work, Victor, since it makes invalid encoding
names valid, e.g.
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
'abc'.encode('utf(=)-8')
b'abc'
Ezio Melotti ezio.melo...@gmail.com added the comment:
That won't work, Victor, since it makes invalid encoding
names valid, e.g. 'utf(=)-8'.
That already works in Python (thanks to encodings.normalize_encoding).
The problem with the patch is that it makes names like 'iso88591' valid.
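To check how the existing lookup path treats a given spelling, the standard library can be queried directly. The sketch below uses only `codecs.lookup()` and `encodings.normalize_encoding()`; `canonical_name` is a hypothetical helper, and the list of spellings is just an illustrative sample.

```python
import codecs
import encodings


def canonical_name(spelling: str) -> str:
    """Return the name of the codec that `spelling` resolves to."""
    return codecs.lookup(spelling).name


if __name__ == "__main__":
    for spelling in ("latin1", "latin-1", "LATIN-1", "iso-8859-1"):
        normalized = encodings.normalize_encoding(spelling)
        print(f"{spelling!r:14} normalize_encoding -> {normalized!r:14} "
              f"codec -> {canonical_name(spelling)!r}")
```

Running the same loop with a spelling under discussion shows whether it is accepted (a codec name is printed) or rejected (`codecs.lookup()` raises `LookupError`) by the interpreter at hand.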
Éric Araujo mer...@netwok.org added the comment:
Agreed with Marc-André. It seems too magic and error-prone to do anything else
than stripping hyphens and spaces.
Steffen: This is a rather minor change in an area that is well known by several
developers, so don’t take it personally that
Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:
That's ok by me.
And 'happy hacker haypo' was not meant to be unfriendly; i've only repeated the first
response i've ever posted back to this tracker (guess who was very fast at that
time :)).
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
The attached patch is a proof of concept to see if Steffen's proposal might be
viable.
I wrote another normalize_encoding function that implements the algorithm
described in msg129259, adjusted the shortcuts and did some timings. (Note: the
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
+char lower[strlen(encoding)*2];
Is this valid in C-89?
Ezio Melotti ezio.melo...@gmail.com added the comment:
Probably not, but that part should be changed if possible, because it is less
efficient than the previous version that was allocating only 11 bytes.
The problem here is that the previous version was only changing/removing
chars, whereas
STINNER Victor victor.stin...@haypocalc.com added the comment:
That won't work, Victor, since it makes invalid encoding
names valid, e.g. 'utf(=)-8'.
.. but this *is* valid: ...
Ah yes, it's because of encodings.normalize_encoding(). It's funny: we have 3
functions to normalize an encoding
STINNER Victor victor.stin...@haypocalc.com added the comment:
more_aggressive_normalization.patch
Woops, normalizestring() comment points to itself.
normalize_encoding() might also point to the C implementations, at least in a
# comment.
New submission from Alexander Belopolsky belopol...@users.sourceforge.net:
$ ./python.exe -m timeit "b'x'.decode('latin1')"
10 loops, best of 3: 2.57 usec per loop
$ ./python.exe -m timeit "b'x'.decode('latin-1')"
100 loops, best of 3: 0.336 usec per loop
The reason for this behavior is
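The command-line comparison above can also be reproduced from inside Python with the timeit module. This is only a measurement sketch: `time_decode` is a hypothetical helper, the iteration count is arbitrary, and the absolute numbers (and whether any gap shows up at all) depend on the interpreter version being tested.

```python
import timeit


def time_decode(encoding: str, number: int = 100_000) -> float:
    """Return the seconds needed to run b'x'.decode(encoding) `number` times."""
    stmt = f"b'x'.decode({encoding!r})"
    return timeit.timeit(stmt, number=number)


if __name__ == "__main__":
    # Compare alias spellings with the canonical spellings side by side.
    for name in ("latin1", "latin-1", "utf8", "utf-8"):
        print(f"{name:8s} {time_decode(name):.4f} s")
```

Timing the alias and the canonical spelling in the same run keeps machine noise comparable between the two measurements.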
Changes by Éric Araujo mer...@netwok.org:
--
nosy: +eric.araujo
versions: +Python 3.3
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
In issue11303.diff, I add similar optimization for encode('latin1') and for
'utf8' variant of utf-8. I don't think dash-less variants of utf-16 and utf-32
are common enough to justify special-casing.
--
Added
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
nosy: +ezio.melotti
Éric Araujo mer...@netwok.org added the comment:
+1 for the patch.
Changes by Jesús Cea Avión j...@jcea.es:
--
nosy: +jcea