[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-03-04 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

M.-A. Lemburg wrote:
> Raymond Hettinger wrote:
>> Raymond Hettinger rhettin...@users.sourceforge.net added the comment:
>>
>>> If you agree, Raymond, I'll backport the patch.
>>
>> Yes.  That will address Antoine's legitimate concern about making other
>> backports harder, and it will get all the Pythons to use the canonical
>> spelling.
>
> Ok, I'll backport both the normalization and Alexander's patch.

Hmm, I wanted to start working on this just now and then saw
Georg's mail about the hg transition today, so I guess the
backport will have to wait until Monday... will be interesting
to see whether hg is really so much better than svn ;-)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11303
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com




2011-02-26 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

On Fri, Feb 25, 2011 at 03:43:06PM +, Marc-Andre Lemburg wrote:
> Marc-Andre Lemburg m...@egenix.com added the comment:
>
> r88586: Normalized the encoding names for Latin-1 and UTF-8 to
> 'latin-1' and 'utf-8' in the stdlib.

Even though - or maybe exactly because - i'm a newbie, i really 
want to add another message after all this biting is over. 
I've just read PEP 100 and msg129257 (on Issue 5902), and i feel 
a bit confused.

> Marc-Andre Lemburg m...@egenix.com added the comment:
> It turns out that there are three normalize functions that are
> successively applied to the encoding name during evaluation of
> str.encode/str.decode.
>
> 1. normalize_encoding() in unicodeobject.c
>
> This was added to have the few shortcuts we have in the C code
> for commonly used codecs match more encoding aliases.
>
> The shortcuts completely bypass the codec registry and also
> bypass the function call overhead incurred by codecs
> run via the codec registry.

The thing that i understand the least is that illegal
(according to IANA standards) names are good on the one hand
(latin-1, utf-16-be), but bad on the other, i.e. in my
group-preserving code or haypo's very fast but name-joining patch
(the first): a *local* change in unicodeobject.c, whose result is
*only* used by the two users PyUnicode_Decode() and
PyUnicode_AsEncodedString().  However:

> Marc-Andre Lemburg m...@egenix.com added the comment:
> Programmers who don't use the encoding names triggering those
> optimizations will still have a running program, it'll only be
> a bit slower and that's perfectly fine.

> Marc-Andre Lemburg m...@egenix.com added the comment:
> think rather than removing any hyphens, spaces, etc. the
> function should additionally:
>
>  * add hyphens whenever (they are missing and) there's a switch
>    from [a-z] to [0-9]
>
> That way you end up with the correct names for the given set
> of optimized encoding names.
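
The rule proposed in the quote above (insert a hyphen wherever a letter is directly followed by a digit) can be sketched in a few lines of Python. This is only an illustration of the idea, not code from any of the patches; the function name add_missing_hyphens is made up here.

```python
import re


def add_missing_hyphens(name: str) -> str:
    """Lower-case `name` and insert '-' between a letter and a following digit.

    Hypothetical helper: it only illustrates the rule from the quoted
    comment and is not taken from unicodeobject.c.
    """
    return re.sub(r"([a-z])([0-9])", r"\1-\2", name.lower())
```

With this rule 'latin1' and 'utf8' end up as 'latin-1' and 'utf-8', while names that already carry the hyphen pass through unchanged.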

haypo's patch can easily be adjusted to reflect this, resulting in
much cleaner code in the two mentioned users, because
normalize_encoding() did the job it was meant for.
(Hmmm, and my own code could also be adjusted to match Python
semantics (using a hyphen instead of a space as the group separator),
so that an end-user has the choice between *all* IANA standard
names (e.g. ISO-8859-1, ISO8859-1, ISO_8859-1, LATIN1),
and would gain the full optimization benefit of using latin-1,
which seems to be pretty useful for limburger.)

> Ezio Melotti wrote:
>> Marc-Andre Lemburg wrote:
>>> That won't work, Victor, since it makes invalid encoding
>>> names valid, e.g. 'utf(=)-8'.
>>
>> That already works in Python (thanks to encodings.normalize_encoding)

*However*: in PEP 100 Python decided to go its own way
a decade ago.

> Marc-Andre Lemburg m...@egenix.com added the comment:
> 2. normalizestring() in codecs.c
>
> This is the normalization applied by the codec registry. See PEP 100
> for details:
>
>     Search functions are expected to take one argument,
>     the encoding name in all lower case letters and with hyphens
>     and spaces converted to underscores, ...
>
> 3. normalize_encoding() in encodings/__init__.py
>
> This is part of the stdlib encodings package's codec search function.
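
To see how the three normalizations quoted above can disagree, here is a rough Python sketch of each. All three function names are invented for illustration. The first assumes the C shortcut helper lower-cases and turns underscores into hyphens, the second follows the PEP 100 wording quoted above (not necessarily what codecs.c actually does), and the third assumes the encodings package collapses runs of punctuation into one underscore.

```python
import re


def shortcut_normalize(name: str) -> str:
    """Assumed behaviour of the C shortcut helper: lower case, '_' -> '-'."""
    return name.lower().replace("_", "-")


def registry_normalize(name: str) -> str:
    """PEP 100 wording: lower case, hyphens and spaces -> underscores."""
    return name.lower().replace("-", "_").replace(" ", "_")


def package_normalize(name: str) -> str:
    """Assumed encodings-package style: runs of characters other than
    ASCII alphanumerics and '.' collapse into a single underscore."""
    return "_".join(re.findall(r"[A-Za-z0-9.]+", name))
```

Under these assumptions 'UTF_8' becomes 'utf-8' in the first function, 'Latin-1' becomes 'latin_1' in the second, and 'utf(=)-8' becomes 'utf_8' in the third, which is the case Ezio refers to.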

First: *i* go for haypo:

> It's funny: we have 3 functions to normalize an encoding name, and
> each function does something else :-)

(that's Issue 11322:)
> We should first implement the same algorithm of the 3 normalization
> functions and add tests for them

And *i* don't understand anything else (*i* do have *my* - now
further optimized, thanks - s_textcodec_normalize_name()).
However, two different ones (a very fast one which is enough to
meet unicodeobject.c and a global one for anything else) may also do.
Isn't anything else a maintenance mess?  Where is that database,
are there any known dependencies which are exposed to end-users?
Or the like.

I'm much too loud, and have a nice weekend.

--


2011-02-26 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Raymond Hettinger wrote:
> Raymond Hettinger rhettin...@users.sourceforge.net added the comment:
>
>> If you agree, Raymond, I'll backport the patch.
>
> Yes.  That will address Antoine's legitimate concern about making other
> backports harder, and it will get all the Pythons to use the canonical
> spelling.

Ok, I'll backport both the normalization and Alexander's patch.

> For other spellings like utf8 or latin1, I wonder if it would be useful
> to emit a warning/suggestion to use the standard spelling.

While it would make sense for Python programs, it would not for
cases where the encoding is read from some other source, e.g.
an XML encoding declaration.

However, perhaps we could have a warning which is disabled
by default and can be enabled using the -W option.
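
A warning that stays silent unless switched on could look roughly like the sketch below. It is an assumption about how such a feature might be built with the warnings module; the category EncodingSpellingWarning, the PREFERRED table and suggest_spelling() are all hypothetical, and nothing like them was committed for this issue.

```python
import warnings


class EncodingSpellingWarning(UserWarning):
    """Hypothetical category; nothing like it exists in the stdlib."""


# Appended to the END of the filter list, so filters given via -W or an
# explicit warnings.simplefilter() call are consulted first.
warnings.simplefilter("ignore", EncodingSpellingWarning, append=True)

# Illustrative table only; which spellings are "preferred" is debated above.
PREFERRED = {"utf8": "utf-8", "latin1": "latin-1"}


def suggest_spelling(name: str) -> str:
    """Return the preferred spelling, warning if `name` is a known variant."""
    preferred = PREFERRED.get(name.lower())
    if preferred is None:
        return name
    warnings.warn(
        f"encoding name {name!r}: consider the spelling {preferred!r}",
        EncodingSpellingWarning,
        stacklevel=2,
    )
    return preferred
```

Running the interpreter with an option such as -W always would then surface the suggestions, while names read from an XML declaration keep working silently by default.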

--

2011-02-25 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

(Not issue related)
Ezio and Alexander: after reading your posts and looking back on my code: 
you're absolutely right.  Doing resize(31) is pointless: it doesn't save space 
(mempool serves [8],16,24,32 there; and: dynamic, normalized coded names don't 
exist that long in real life, too).  And append_char() is inlined but much more 
expensive than doing (register-loaded) *(target++)=char.  Thus i now do believe 
my code is a bug and i will rewrite doing *target=cstr(resize(len(input)*2)) 
... truncate() instead!
Thanks.

--


2011-02-25 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

r88586: Normalized the encoding names for Latin-1 and UTF-8 to
'latin-1' and 'utf-8' in the stdlib.

--


2011-02-25 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

I think we should reset this whole discussion and just go with Alexander's 
original patch issue11303.diff.

I don't know who changed the encodings package's normalize_encoding() function 
(wasn't me), but it's a really slow implementation.

The original version used the .translate() method which is a lot faster.

I'll open a new issue for that part.
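
For readers unfamiliar with the .translate() approach mentioned here, the following is a minimal sketch of a translate-based normalization. It is written from the description in this thread and is an assumption, not the historical stdlib code; the names _TABLE and normalize_with_translate are made up.

```python
import string

# Characters that survive normalization; every other ASCII character is
# mapped to a space by the translation table below.
_KEEP = string.ascii_letters + string.digits + "."
_TABLE = {code: " " for code in range(128) if chr(code) not in _KEEP}


def normalize_with_translate(name: str) -> str:
    """Collapse runs of punctuation/whitespace into single underscores.

    str.translate() does the per-character work in C, and split()/join()
    squeeze the resulting runs of spaces, so no Python-level loop is needed.
    """
    return "_".join(name.translate(_TABLE).split())
```

The point of the design is that the whole string is processed by two or three C-level calls instead of one Python-level iteration per character.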

--


2011-02-25 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Marc-Andre Lemburg wrote:
> I don't know who changed the encodings package's normalize_encoding() function
> (wasn't me), but it's a really slow implementation.
>
> The original version used the .translate() method which is a lot faster.

I guess that's one of the reasons why Alexander found such a dramatic
difference between the shortcut variant of the names and the ones
going through the registry.

> I'll open a new issue for that part.

issue11322

--

2011-02-25 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Committed issue11303.diff and doc change in revision 88602.

I think the remaining ideas are best addressed in issue11322.

> Given that we are starting to have a whole set of such aliases
> in the C code, I wonder whether it would be better to make the
> string comparisons more efficient, e.g.

I don't think we can do much better than a string of strcmp()s.  Even if a more 
efficient algorithm can be found, it will certainly be less readable.  Moving 
strcmp()s before normalize_encoding() (and either forgoing optimization for 
alternative capitalizations or using case insensitive comparison) may be a more 
promising optimization strategy.  In any case all these micro-optimizations are 
dwarfed by that of bypassing Python calls and are probably not worth pursuing.
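
Whether a given spelling hits the C shortcut or goes through the codec registry can be checked empirically. The sketch below uses timeit for that; time_decode is a made-up helper, and the numbers it prints depend entirely on the interpreter version, so no expected output is shown.

```python
import timeit


def time_decode(name: str, number: int = 100_000) -> float:
    """Time `number` calls of b'x'.decode(name) and return the seconds used."""
    data = b"x"
    return timeit.timeit(lambda: data.decode(name), number=number)


if __name__ == "__main__":
    # Compare spellings of the same codec; a large gap between two of them
    # suggests that only one of them takes the shortcut path.
    for name in ("latin-1", "latin1", "iso-8859-1", "utf-8", "utf8"):
        print(f"{name:12s} {time_decode(name):.4f}s")
```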

--
assignee:  -> belopolsky
resolution:  -> fixed
stage:  -> committed/rejected
status: open -> pending
superseder:  -> encoding package's normalize_encoding() function is too slow


2011-02-25 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> r88586: Normalized the encoding names for Latin-1 and UTF-8 to
> 'latin-1' and 'utf-8' in the stdlib.

Why did you do that? We are trying to find a solution together, and you changed 
the code directly without any review. Your commit doesn't solve this issue.

Your commit is now useless, can you please revert it?

--
status: pending -> open


2011-02-25 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

What's wrong with Marc's commit?  He's using the standard names.

--
nosy: +rhettinger


2011-02-25 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

STINNER Victor wrote:
> STINNER Victor victor.stin...@haypocalc.com added the comment:
>
>> r88586: Normalized the encoding names for Latin-1 and UTF-8 to
>> 'latin-1' and 'utf-8' in the stdlib.
>
> Why did you do that? We are trying to find a solution together, and you
> changed the code directly without any review. Your commit doesn't solve this
> issue.

As discussed on python-dev, the stdlib should use Python's
default names for encodings and that's what I changed.

> Your commit is now useless, can you please revert it?

This ticket was mainly discussing use cases in
3rd party applications, not code that we have control over
in the stdlib - we can easily fix that and that's what I did
with the above checkin.

--

2011-02-25 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> What's wrong with Marc's commit?  He's using the standard names.

That's a pretty useless commit and it will make applying patches and backports 
more tedious, for no obvious benefit.
Of course that concern will be removed if Marc-André also backports it to 3.2 
and 2.7.

--
nosy: +pitrou


2011-02-25 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

I guess you could regard the wrong encoding name use as a bug - it
slows down several stdlib modules for no apparent reason.

If you agree, Raymond, I'll backport the patch.

--

2011-02-25 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

+1 on the backport.

--


2011-02-25 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Marc-Andre Lemburg wrote:
> Marc-Andre Lemburg m...@egenix.com added the comment:
>
> I guess you could regard the wrong encoding name use as a bug - it
> slows down several stdlib modules for no apparent reason.
>
> If you agree, Raymond, I'll backport the patch.

We might actually backport Alexander's patch as well - for much
the same reason.

--

2011-02-25 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

> If you agree, Raymond, I'll backport the patch.

Yes.  That will address Antoine's legitimate concern about making other 
backports harder, and it will get all the Pythons to use the canonical 
spelling.

For other spellings like utf8 or latin1, I wonder if it would be useful to 
emit a warning/suggestion to use the standard spelling.

--


2011-02-25 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

Such warnings about performance seem to me to be the domain of code analysis or 
lint tools, not the interpreter.

--


2011-02-25 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> For other spellings like utf8 or latin1, I wonder if it would be
> useful to emit a warning/suggestion to use the standard spelling.

No, it would be a useless annoyance.

--


2011-02-25 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> For other spellings like utf8 or latin1, I wonder
> if it would be useful to emit a warning/suggestion to use
> the standard spelling.

Why do you want to emit a warning? utf8 is now as fast as utf-8.

--


2011-02-25 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

> For other spellings like utf8 or latin1, I wonder if it would be
> useful to emit a warning/suggestion to use the standard spelling.

I would prefer to see the note added by Alexander in the doc mention *only* 
the preferred spellings (i.e. 'utf-8' and 'iso-8859-1') rather than all the 
variants that are actually optimized. One of the reasons that led me to open 
#5902 is that I didn't like the inconsistencies in the encoding names (utf-8 vs 
utf8 vs UTF8 etc). Suggesting only one spelling per encoding will fix the 
problem.

FWIW, the correct spelling is 'latin1', not 'latin-1', but I still prefer 
'iso-8859-1' over the two.

(The note could also use some more ``'markup'`` for the encoding names.)

--


2011-02-25 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Feb 25, 2011 at 8:29 PM, Antoine Pitrou rep...@bugs.python.org wrote:
..
>> For other spellings like utf8 or latin1, I wonder if it would be
>> useful to emit a warning/suggestion to use the standard spelling.
>
> No, it would be a useless annoyance.

If we ever decide to get rid of codec aliases in the core and require
users to translate names found in various internet standards to
canonical Python spellings, we will have to issue deprecation warnings
before that.

As long as we recommend using, say, XML encoding metadata as is, we
cannot standardize on Python spellings because they differ from the XML
standard.  (For example, Python uses latin-1 and proper XML only
accepts latin1. Of course, we can ask everyone to use iso-8859-1
instead, but how many users can remember that name?)
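
In practice the codec registry already accepts all of these spellings, which is why a name taken from an XML declaration can be passed through unchanged. A small check along these lines uses the real codecs.lookup() call; the helper same_codec is made up for illustration.

```python
import codecs


def same_codec(*names: str) -> bool:
    """Return True if every spelling resolves to the same registered codec."""
    return len({codecs.lookup(name).name for name in names}) == 1
```

For example, same_codec('latin-1', 'latin1', 'iso-8859-1') holds because all three are aliases of one codec, whereas mixing in 'utf-8' makes it fail.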

--

2011-02-25 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> If we ever decide to get rid of codec aliases in the core

If.

--


2011-02-25 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Feb 25, 2011 at 8:39 PM, Ezio Melotti rep...@bugs.python.org wrote:
..
> I would prefer to see the note added by Alexander in the doc mention *only*
> the preferred spellings (i.e. 'utf-8' and 'iso-8859-1') rather than all the
> variants that are actually optimized. One of the reasons that led me to open
> #5902 is that I didn't like the inconsistencies in the encoding names (utf-8
> vs utf8 vs UTF8 etc). Suggesting only one spelling per encoding will fix the
> problem.

I am fine with trimming the list.  In fact I deliberately did not
mention, say, the UTF-8 variant even though it is also optimized.
Unfortunately, I don't think we have a choice between 'latin-1',
'latin1', and 'iso-8859-1'.  I don't think we should recommend
'latin-1' because this may lead people to add a '-' to the very popular
and IANA-registered 'latin1' variant, and while 'iso-8859-1' is the
most pedantically correct spelling, it is very user-unfriendly.

--


2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> In issue11303.diff, I add similar optimization for encode('latin1') and for
> 'utf8' variant of utf-8.  I don't think dash-less variants of utf-16 and
> utf-32 are common enough to justify special-casing.

Looks good.

Given that we are starting to have a whole set of such aliases
in the C code, I wonder whether it would be better to make the
string comparisons more efficient, e.g.
if utf matches, the checks could then continue with 8 or -8
instead of trying to match utf again and again.
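
The idea of matching the common prefix once and then only looking at the remainder can be sketched as follows. This is a Python illustration of the comparison order only (the real shortcuts live in C), the function name match_shortcut is hypothetical, and the set of accepted names is an assumption based on what this thread says is optimized.

```python
from typing import Optional


def match_shortcut(name: str) -> Optional[str]:
    """Return the canonical name of a shortcut codec, or None.

    `name` is assumed to be lower-cased already.
    """
    if name.startswith("utf"):
        rest = name[3:]          # the "utf" prefix is compared only once
        if rest in ("8", "-8"):
            return "utf-8"
        if rest == "-16":
            return "utf-16"
        if rest == "-32":
            return "utf-32"
    elif name.startswith("latin"):
        rest = name[5:]
        if rest in ("1", "-1"):
            return "latin-1"
    return None
```

Anything that returns None here would simply fall back to the slower lookup through the codec registry.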

--

2011-02-24 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

I wonder what this normalize_encoding() does!  Here is a pretty standard 
version of mine which is a bit more expensive but catches many more cases!  
This is stripped, of course, and can be rewritten very easily to Python's needs 
(i.e. using char[32] instead of char[11]).

/*
 * @@li If a character is either ::s_char_is_space() or ::s_char_is_punct():
 *  @@li Replace with ASCII space (0x20).
 *  @@li Squeeze adjacent spaces to a single one.
 * @@li Else if a character is ::s_char_is_alnum():
 *  @@li ::s_char_to_lower() characters.
 *  @@li Separate groups of alphas and digits with ASCII space (0x20).
 * @@li Else discard character.
 * E.g. ISO_8859---1 becomes iso 8859 1
 * and ISO8859-1 also becomes iso 8859 1.
 */
s_textcodec_normalize_name(s_CString *_name) {
    enum { C_NONE, C_WS, C_ALPHA, C_DIGIT } c_type = C_NONE;
    char *name, c;
    auto s_CString input;

    /* Move the caller's string into `input`; `_name` receives the output. */
    s_cstring_swap(s_cstring_init(input), _name);
    _name = s_cstring_reserve(_name, 31, s_FAL0);
    name = s_cstring_cstr(input);

    while ((c = *(name++)) != s_NUL) {
        s_si8 sep = s_FAL0;

        if (s_char_is_space(c) || s_char_is_punct(c)) {
            /* Squeeze a run of whitespace/punctuation into one space. */
            if (c_type == C_WS)
                continue;
            c_type = C_WS;
            c = ' ';
        } else if (s_char_is_alpha(c)) {
            /* A letter directly after a digit starts a new group. */
            sep = (c_type == C_DIGIT);
            c_type = C_ALPHA;
            c = s_char_to_lower(c);
        } else if (s_char_is_digit(c)) {
            /* A digit directly after a letter starts a new group. */
            sep = (c_type == C_ALPHA);
            c_type = C_DIGIT;
        } else
            continue;

        /* Emit the separating space first (if any), then the character. */
        do
            _name = s_cstring_append_char(_name, (sep ? ' ' : c));
        while (--sep >= s_FAL0);
    }

    s_cstring_destroy(input);
    return _name;
}
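
For readers who prefer Python, the same grouping algorithm can be sketched as below. This is a loose port written for illustration, assuming ASCII input; normalize_name is a made-up name and the s_* helpers of the original are replaced by constants from the string module.

```python
import string


def normalize_name(name: str) -> str:
    """Lower-case `name`, squeeze punctuation/whitespace runs into one
    space, and separate letter groups from digit groups with a space."""
    out = []
    prev = None  # kind of the last emitted char: "ws", "alpha" or "digit"
    for ch in name:
        if ch in string.whitespace or ch in string.punctuation:
            if prev == "ws":
                continue          # squeeze adjacent separators
            prev = "ws"
            out.append(" ")
        elif ch in string.ascii_letters:
            if prev == "digit":
                out.append(" ")   # digit -> letter starts a new group
            prev = "alpha"
            out.append(ch.lower())
        elif ch in string.digits:
            if prev == "alpha":
                out.append(" ")   # letter -> digit starts a new group
            prev = "digit"
            out.append(ch)
        # any other character is discarded
    return "".join(out)
```

As in the C version, a leading separator is kept as a single space rather than stripped; a caller that wants a clean key would probably strip the result.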

--
nosy: +sdaoden


2011-02-24 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

(That is to say, i would do it.  But not if _cpython is thrown to trash ,-); 
i.e. not if there is not a slight chance that it gets actually patched in 
because this performance issue probably doesn't mean a thing in real life.  You 
know, i'm a slow programmer, i would need *at least* two hours to rewrite that 
in plain C in a way that can make it as a replacement of normalize_encoding().)

--


2011-02-24 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

See also discussion on #5902.

Steffen, your normalization function looks similar to 
encodings.normalize_encoding, with just a few differences (it uses spaces 
instead of dashes, it divides alpha chars from digits).

If it doesn't slow down the normal cases (i.e. 'utf-8', 'utf8', 'latin-1', 
etc.), a more flexible normalization done earlier might be a valid alternative.

--


2011-02-24 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
nosy: +haypo


2011-02-24 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Feb 24, 2011 at 10:30 AM, Ezio Melotti rep...@bugs.python.org wrote:
..
> See also discussion on #5902.

Mark has closed #5902 and indeed the discussion of how to efficiently
normalize encoding names (without changing what is accepted) is beyond
the scope of that or the current issue.  Can someone open a separate
issue to see if we can improve the current situation?  I don't think
having three slightly different normalize functions is optimal.  See
msg129248.

--

2011-02-24 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

.. i haven't actually invented this algorithm (but don't ask me where i got 
the idea from years ago), i've just implemented the function you see.  The 
algorithm itself avoids some pitfalls with respect to combining numerics and 
significantly reduces the number of possible normalization cases:

ISO-8859-1, ISO8859-1, ISO_8859-1, LATIN1
(+ think of additional misspellings)
all become
iso 8859 1, latin 1
in the end

--


2011-02-24 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

(Everything else is beyond my scope.  But normalizing _ to - is possibly a bad 
idea as far as i can remember the situation three years ago.)

--


2011-02-24 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

P.P.S.: separating alphanumerics is a win for things like, e.g. UTF-16BE: it 
becomes 'utf 16 be' - think about the possible misspellings here and you see this 
algorithm is a good thing

--


2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> On Thu, Feb 24, 2011 at 10:30 AM, Ezio Melotti rep...@bugs.python.org wrote:
> ..
>> See also discussion on #5902.
>
> Mark has closed #5902 and indeed the discussion of how to efficiently
> normalize encoding names (without changing what is accepted) is beyond
> the scope of that or the current issue.  Can someone open a separate
> issue to see if we can improve the current situation?  I don't think
> having three slightly different normalize functions is optimal.  See
> msg129248.

Please see my reply on this ticket: those three functions have
different application areas.

On this ticket, we're discussing just one application area: that
of the builtin shortcuts.

To have more encoding name variants benefit from the optimization,
we might want to enhance that particular normalization function
to avoid having to compare against utf8 and utf-8 in the
encode/decode functions.

--
title: b'x'.decode('latin1') is much slower than b'x'.decode('latin-1') - 
b'x'.decode('latin1') is much slower thanb'x'.decode('latin-1')




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

So, well, a-ha, i will boot my laptop this evening and (try to) write a patch 
for normalize_encoding(), which will match the standard-conforming LATIN1 and 
will also continue to support the illegal latin-1 without actually changing the 
two users PyUnicode_Decode() and PyUnicode_AsEncodedString(), which i had 
better keep my hands off.  But i'm slow, it may take until tomorrow...

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

If the first normalization function is flexible enough to match most of the 
spellings of the optimized encodings, they will all benefit from the optimization 
without having to go through the long path.

(If the normalized encoding name is then passed through, the following 
normalization functions will also have to do less work, but this is out of the 
scope of this issue.)

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

I think that the normalization function in unicodeobject.c (only used for 
internal functions) can skip any character different than a-z, A-Z and 0-9. 
Something like:

>>> import re
>>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
... 
>>> normalize("UTF-8")
'utf8'
>>> normalize("ISO-8859-1")
'iso88591'
>>> normalize("latin1")
'latin1'

So "ISO-8859-1", "ISO8859-1", "LATIN-1", "latin1", "UTF-8", "utf8", etc. will be 
normalized to "iso88591", "latin1" and "utf8".

I don't know any encoding name where a character outside a-z, A-Z, 0-9 means 
anything special. But I don't know all encoding names! :-)

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Patch implementing my suggestion.

--
Added file: http://bugs.python.org/file20875/aggressive_normalization.patch




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

That will also accept names like 'iso88591' that are not valid now; 
'iso 8859 1' is already accepted.

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


Removed file: http://bugs.python.org/file20875/aggressive_normalization.patch




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Feb 24, 2011 at 11:01 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
> On this ticket, we're discussing just one application area: that
> of the builtin short cuts.

Fair enough.  I was hoping to close this ticket by simply committing
the posted patch, but it looks like people want to do more.  I don't
think we'll get measurable performance gains but may improve code
understandability.

> To have more encoding name variants benefit from the optimization,
> we might want to enhance that particular normalization function
> to avoid having to compare against utf8 and utf-8 in the
> encode/decode functions.

Which function are you talking about?

1. normalize_encoding() in unicodeobject.c
2. normalizestring() in codecs.c

The first is s.lower().replace('-', '_') and the second is
s.lower().replace(' ', '_'). (Note space vs. dash difference.)

Why do we need both?  And why should they be different?

--
title: b'x'.decode('latin1') is much slower thanb'x'.decode('latin-1') 
- b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

As promised, here's the list of places where the wrong Latin-1 encoding 
spelling is used:

Lib//test/test_cmd_line.py:
-- for encoding in ('ascii', 'latin1', 'utf8'):
Lib//test/test_codecs.py:
-- ef = codecs.EncodedFile(f, 'utf-8', 'latin1')
Lib//test/test_shelve.py:
-- shelve.Shelf(d, keyencoding='latin1')[key] = [1]
-- self.assertIn(key.encode('latin1'), d)
Lib//test/test_uuid.py:
-- os.write(fds[1], value.hex.encode('latin1'))
-- child_value = os.read(fds[0], 100).decode('latin1')
Lib//test/test_xml_etree.py:
-- >>> ET.tostring(ET.PI('test', 'testing\xe3'), 'latin1')
-- b"<?xml version='1.0' encoding='latin1'?>\\n<?test testing\\xe3?>"
Lib//urllib/request.py:
-- data = base64.decodebytes(data.encode('ascii')).decode('latin1')
Lib//asynchat.py:
-- encoding= 'latin1'
Lib//sre_parse.py:
-- encode = lambda x: x.encode('latin1')
Lib//distutils/command/bdist_wininst.py:
-- # convert back to bytes. "latin1" simply avoids any possible
-- encoding="latin1") as script:
-- script_data = script.read().encode("latin1")
Lib//test/test_bigmem.py:
-- return s.encode("latin1")
-- return bytearray(s.encode("latin1"))
Lib//test/test_bytes.py:
-- self.assertRaises(UnicodeEncodeError, self.type2test, sample, "latin1")
-- b = self.type2test(sample, "latin1", "ignore")
-- b = self.type2test(sample, "latin1")
Lib//test/test_codecs.py:
-- self.assertEqual("\udce4\udceb\udcef\udcf6\udcfc".encode("latin1", "surrogateescape"),
Lib//test/test_io.py:
-- with open(__file__, "r", encoding="latin1") as f:
-- t.__init__(b, encoding="latin1", newline="\r\n")
-- self.assertEqual(t.encoding, "latin1")
-- for enc in "ascii", "latin1", "utf8" :# , "utf-16-be", "utf-16-le":
Lib//ftplib.py:
-- encoding = "latin1"

I'll fix those later today or tomorrow.

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

STINNER Victor wrote:
> 
> STINNER Victor victor.stin...@haypocalc.com added the comment:
> 
> I think that the normalization function in unicodeobject.c (only used for 
> internal functions) can skip any character different than a-z, A-Z and 0-9. 
> Something like:
> 
> >>> import re
> >>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
> ... 
> >>> normalize("UTF-8")
> 'utf8'
> >>> normalize("ISO-8859-1")
> 'iso88591'
> >>> normalize("latin1")
> 'latin1'
> 
> So "ISO-8859-1", "ISO8859-1", "LATIN-1", "latin1", "UTF-8", "utf8", etc. will be 
> normalized to "iso88591", "latin1" and "utf8".
> 
> I don't know any encoding name where a character outside a-z, A-Z, 0-9 means 
> anything special. But I don't know all encoding names! :-)

I think rather than removing any hyphens, spaces, etc. the
function should additionally:

 * add hyphens whenever (they are missing and) there's a switch
   from [a-z] to [0-9]

That way you end up with the correct names for the given set of
optimized encoding names.
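
A rough Python sketch of the rule proposed above, for illustration only. The helper name add_hyphens is made up here; the real shortcut normalization is C code in unicodeobject.c, and this only mirrors the described behaviour (lowercase, underscores to hyphens, then a hyphen at every switch from a letter to a digit).

```python
import re


def add_hyphens(name):
    """Hypothetical helper, not CPython code.

    Lowercase the name, map '_' to '-', then insert '-' wherever a
    letter is immediately followed by a digit.
    """
    name = name.lower().replace("_", "-")
    # Zero-width match: a position preceded by [a-z] and followed by [0-9].
    return re.sub(r"(?<=[a-z])(?=[0-9])", "-", name)


for spelling in ("latin1", "LATIN-1", "utf8", "UTF_8", "iso8859-1"):
    print(spelling, "->", add_hyphens(spelling))
```

Under this rule both 'latin1' and 'latin-1' end up as 'latin-1', so a single comparison in the shortcut code would cover both spellings.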

--
title: b'x'.decode('latin1') is much slower than b'x'.decode('latin-1') - 
b'x'.decode('latin1') is much slower thanb'x'.decode('latin-1')




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
> 
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
> 
> On Thu, Feb 24, 2011 at 11:01 AM, Marc-Andre Lemburg
> rep...@bugs.python.org wrote:
> ..
>> On this ticket, we're discussing just one application area: that
>> of the builtin short cuts.
> 
> Fair enough.  I was hoping to close this ticket by simply committing
> the posted patch, but it looks like people want to do more.  I don't
> think we'll get measurable performance gains but may improve code
> understandability.
> 
>> To have more encoding name variants benefit from the optimization,
>> we might want to enhance that particular normalization function
>> to avoid having to compare against utf8 and utf-8 in the
>> encode/decode functions.
> 
> Which function are you talking about?
> 
> 1. normalize_encoding() in unicodeobject.c
> 2. normalizestring() in codecs.c

The first one, since that's being used by the shortcuts.

> The first is s.lower().replace('-', '_') and the second is

It does this: s.lower().replace('_', '-')

> s.lower().replace(' ', '_'). (Note space vs. dash difference.)
> 
> Why do we need both?  And why should they be different?

Because the first is specifically used for the shortcuts
(which can do more without breaking anything, since it's
only used internally) and the second prepares the encoding
names for lookup in the codec registry (which has a PEP100
defined behavior we cannot easily change).
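
To make the difference concrete, here is a small Python rendering of the two transformations as they are described in this exchange. The function names are only labels for this sketch; the real functions are C code, and this does not claim to reproduce their details.

```python
def shortcut_normalize(name):
    # As described above for normalize_encoding() in unicodeobject.c:
    # lowercase, underscores become hyphens.
    return name.lower().replace("_", "-")


def registry_normalize(name):
    # As described above for normalizestring() in codecs.c:
    # lowercase, spaces become underscores.
    return name.lower().replace(" ", "_")


for spelling in ("Latin_1", "latin1", "ISO 8859 1"):
    print(repr(spelling), shortcut_normalize(spelling), registry_normalize(spelling))
```

With these rules 'latin1' stays 'latin1' in both cases, which is why it misses a shortcut that only compares against 'latin-1'.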

--
title: b'x'.decode('latin1') is much slower thanb'x'.decode('latin-1') 
- b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Ooops, I attached the wrong patch. Here is the new fixed patch.

Without the patch:

>>> import timeit
>>> timeit.Timer("'a'.encode('latin1')").timeit()
3.8540711402893066
>>> timeit.Timer("'a'.encode('latin-1')").timeit()
1.4946870803833008

With the patch:

>>> import timeit
>>> timeit.Timer("'a'.encode('latin1')").timeit()
1.4461820125579834
>>> timeit.Timer("'a'.encode('latin-1')").timeit()
1.463456153869629

>>> timeit.Timer("'a'.encode('UTF-8')").timeit()
0.9479248523712158
>>> timeit.Timer("'a'.encode('UTF8')").timeit()
0.9208409786224365

--
Added file: http://bugs.python.org/file20876/aggressive_normalization.patch




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Feb 24, 2011 at 11:31 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
> I think rather than removing any hyphens, spaces, etc. the
> function should additionally:
>
>  * add hyphens whenever (they are missing and) there's a switch
>    from [a-z] to [0-9]


This will do the wrong thing to the "cs" family of aliases:


The aliases that start with "cs" have been added for use with the
IANA-CHARSET-MIB as originally defined in RFC3808, and as currently
maintained by IANA at http://www.iana.org/assignments/ianacharset-mib.
Note that the ianacharset-mib needs to be kept in sync with this
registry.  These aliases that start with "cs" contain the standard
numbers along with suggestive names in order to facilitate applications
that want to display the names in user interfaces.  The "cs" stands
for character set and is provided for applications that need a lower
case first letter but want to use mixed case thereafter that cannot
contain any special characters, such as underbar ("_") and dash ("-").


--
title: b'x'.decode('latin1') is much slower thanb'x'.decode('latin-1') 
- b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

So happy hacker haypo did it, though differently.  It's illegal, but since this 
is a static function which only serves some specific internal strcmp(3)s, it may 
do for the mentioned charsets.  I won't boot my laptop this evening.

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

STINNER Victor wrote:
> 
> STINNER Victor victor.stin...@haypocalc.com added the comment:
> 
> Ooops, I attached the wrong patch. Here is the new fixed patch.

That won't work, Victor, since it makes invalid encoding
names valid, e.g. 'utf(=)-8'.

We really only want to add the functionality of matching
encoding names with or without a hyphen.

Perhaps it's not really worth the trouble as Alexander suggests
and we should simply add the few extra cases where needed.

--
title: b'x'.decode('latin1') is much slower than b'x'.decode('latin-1') - 
b'x'.decode('latin1') is much slower thanb'x'.decode('latin-1')




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
> 
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
> 
> On Thu, Feb 24, 2011 at 11:31 AM, Marc-Andre Lemburg
> rep...@bugs.python.org wrote:
> ..
>> I think rather than removing any hyphens, spaces, etc. the
>> function should additionally:
>>
>>  * add hyphens whenever (they are missing and) there's a switch
>>    from [a-z] to [0-9]
>>
> 
> This will do the wrong thing to the "cs" family of aliases:

We don't support those for the shortcut optimizations.

--
title: b'x'.decode('latin1') is much slower thanb'x'.decode('latin-1') 
- b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Feb 24, 2011 at 11:39 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
>
> Marc-Andre Lemburg m...@egenix.com added the comment:
..
> That won't work, Victor, since it makes invalid encoding
> names valid, e.g. 'utf(=)-8'.


.. but this *is* valid:

b'abc'

--
title: b'x'.decode('latin1') is much slower thanb'x'.decode('latin-1') 
- b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

>>> 'abc'.encode('utf(=)-8')
b'abc'

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

> That won't work, Victor, since it makes invalid encoding
> names valid, e.g. 'utf(=)-8'.

That already works in Python (thanks to encodings.normalize_encoding).
The problem with the patch is that it makes names like 'iso88591' valid.
Normalizing to 'iso 8859 1' should solve this problem.
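
The behaviour referred to here can be checked directly from Python. This is only a quick probe of encodings.normalize_encoding() on whatever interpreter is at hand, which may differ in detail from the 2011 py3k branch under discussion.

```python
import encodings

# Runs of punctuation and spaces are collapsed into a single underscore,
# so the odd spelling from this thread maps onto the usual module name.
for spelling in ("utf(=)-8", "iso 8859 1", "iso88591"):
    print(repr(spelling), "->", repr(encodings.normalize_encoding(spelling)))
```

Note that 'iso88591' comes back unchanged: this function never inserts separators, so it cannot by itself turn that spelling into 'iso8859_1'.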

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

Agreed with Marc-André.  It seems too magic and error-prone to do anything other 
than stripping hyphens and spaces.

Steffen: This is a rather minor change in an area that is well known by several 
developers, so don’t take it personally that Victor went ahead and made a quick 
patch.  Patches for other bugs are welcome!  Thanks for wanting to help.

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Steffen Daode Nurpmeso

Steffen Daode Nurpmeso sdao...@googlemail.com added the comment:

That's ok by me.
And 'happy hacker haypo' was not meant to be unfriendly; i've only repeated the first 
response i've ever posted back to this tracker (guess who was very fast at that 
time :)).

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

The attached patch is a proof of concept to see if Steffen's proposal might be 
viable.

I wrote another normalize_encoding function that implements the algorithm 
described in msg129259, adjusted the shortcuts and did some timings. (Note: the 
function is not tested extensively and might break. It might also be optimized 
further.)

These are the results:
# $ command
# result with my patch
# result without
wolf@hp:~/dev/py/py3k$ ./python -m timeit "b'x'.decode('latin1')"
100 loops, best of 3: 0.626 usec per loop
10 loops, best of 3: 2.03 usec per loop
wolf@hp:~/dev/py/py3k$ ./python -m timeit "b'x'.decode('latin-1')"
100 loops, best of 3: 0.614 usec per loop
100 loops, best of 3: 0.616 usec per loop
wolf@hp:~/dev/py/py3k$ ./python -m timeit "b'x'.decode('iso-8859-1')"
100 loops, best of 3: 0.993 usec per loop
100 loops, best of 3: 0.649 usec per loop
wolf@hp:~/dev/py/py3k$ ./python -m timeit "b'x'.decode('iso8859_1')"
100 loops, best of 3: 1.01 usec per loop
10 loops, best of 3: 2.08 usec per loop
wolf@hp:~/dev/py/py3k$ ./python -m timeit "b'x'.decode('iso_8859_1')"
100 loops, best of 3: 0.734 usec per loop
100 loops, best of 3: 0.694 usec per loop
wolf@hp:~/dev/py/py3k$ ./python -m timeit "b'x'.decode('utf8')"
100 loops, best of 3: 0.728 usec per loop
10 loops, best of 3: 6.37 usec per loop

--
Added file: http://bugs.python.org/file20878/issue11303.diff




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

+char lower[strlen(encoding)*2];

Is this valid in C-89?

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Probably not, but that part should be changed if possible, because it is less 
efficient than the previous version that was allocating only 11 bytes.

The problem here is that the previous version was only changing/removing 
chars, whereas this might add spaces too, so the string might get longer. E.g. 
'utf8' -> 'utf 8'. The worst case is 'a1a1a1' -> 'a 1 a 1 a 1', and including 
the trailing \0, the result might end up being twice as long as the original 
encoding string. It can be fixed by returning 0 as soon as the normalized string 
reaches a fixed threshold (something like 15 chars, depending on the longest 
normalized encoding name).
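
The length argument can be checked with a small Python model of the separation step. This is a sketch, not the code from issue11303.diff: separate_alnum is an invented name, and it assumes the algorithm amounts to lowercasing and joining runs of ASCII letters and runs of digits with single spaces.

```python
import re


def separate_alnum(name):
    """Hypothetical model of the proposed normalization, not the C patch.

    Lowercase the name, keep runs of ASCII letters or digits, and join
    the runs with single spaces; every other character is dropped.
    """
    return " ".join(re.findall(r"[a-z]+|[0-9]+", name.lower()))


for spelling in ("utf8", "UTF-16BE", "a1a1a1"):
    normalized = separate_alnum(spelling)
    # The +1 stands for the trailing NUL byte a C buffer would need.
    print(repr(spelling), "->", repr(normalized),
          "input length:", len(spelling),
          "buffer needed:", len(normalized) + 1)
```

For an input of n alternating letters and digits the output has n characters plus n - 1 spaces, so with the terminator the buffer needs exactly 2n bytes, which matches the bound described above.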

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

>> That won't work, Victor, since it makes invalid encoding
>> names valid, e.g. 'utf(=)-8'.
>
> .. but this *is* valid: ...

Ah yes, it's because of encodings.normalize_encoding(). It's funny: we have 3 
functions to normalize an encoding name, and each function does something else 
:-) E.g. encodings.normalize_encoding() doesn't replace non-ASCII letters, and 
doesn't convert to lowercase.

more_aggressive_normalization.patch changes all 3 normalization 
functions and adds tests on encodings.normalize_encoding().

I think that speed and backward compatibility are more important than conforming 
to IANA or other standards.

Even if "~~ utf#8 ~~" is ugly, I don't think that it really matters that we 
accept it.

--

If you don't want to touch the normalization functions and just add more 
aliases in C fast-paths: we should also add utf8, utf16 and utf32.
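
In Python terms, the "just add more aliases" alternative could look roughly like the following. The table and the function name are illustrative assumptions; the actual fast paths are string comparisons in C, and which spellings they cover is up to whatever patch gets committed.

```python
# Hypothetical table of spellings that would take a fast path.
FAST_PATH_ALIASES = {
    "utf-8": "utf-8", "utf8": "utf-8",
    "utf-16": "utf-16", "utf16": "utf-16",
    "utf-32": "utf-32", "utf32": "utf-32",
    "latin-1": "latin-1", "latin1": "latin-1",
}


def fast_path_codec(name):
    """Return the canonical name if a shortcut applies, else None.

    None stands for "fall back to the full codec registry lookup".
    """
    return FAST_PATH_ALIASES.get(name.lower().replace("_", "-"))


for spelling in ("UTF8", "Latin_1", "iso 8859 1"):
    print(repr(spelling), "->", fast_path_codec(spelling))
```

The trade-off is the one debated in this thread: a fixed table leaves the set of accepted names unchanged, but every extra spelling has to be listed by hand.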

Use of utf8 in Python: random.Random.seed(), 
smtpd.SMTPChannel.collect_incoming_data(), tarfile, multiprocessing.connection 
(xml serialization)

PS: On error, UTF-8 decoder raises a UnicodeDecodeError with utf8 as the 
encoding name :-)

--
Added file: http://bugs.python.org/file20880/more_aggressive_normalization.patch




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-24 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> more_aggressive_normalization.patch

Woops, the normalizestring() comment points to itself.

normalize_encoding() might also point to the C implementations, at least in a 
# comment.

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-23 Thread Alexander Belopolsky

New submission from Alexander Belopolsky belopol...@users.sourceforge.net:

$ ./python.exe -m timeit "b'x'.decode('latin1')"
10 loops, best of 3: 2.57 usec per loop
$ ./python.exe -m timeit "b'x'.decode('latin-1')"
100 loops, best of 3: 0.336 usec per loop

The reason for this behavior is that 'latin-1' is short-circuited in C code 
while 'latin1' has to be looked up in aliases.py.  Attached patch fixes this 
issue.
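
To repeat the comparison on another interpreter, something like the following should do. It is only a measurement sketch: timings depend on machine and Python version, and releases that include a fix for this issue may no longer show a gap between the two spellings.

```python
import codecs
import timeit

# Both spellings resolve to the same codec; only the lookup path may differ.
print(codecs.lookup("latin1").name, codecs.lookup("latin-1").name)

for spelling in ("latin1", "latin-1"):
    stmt = "b'x'.decode({!r})".format(spelling)
    seconds = timeit.timeit(stmt, number=100000)
    print(stmt, "took", seconds, "seconds for 100000 calls")
```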

--
files: latin1.diff
keywords: patch
messages: 129227
nosy: belopolsky, lemburg
priority: normal
severity: normal
status: open
title: b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')
type: performance
Added file: http://bugs.python.org/file20871/latin1.diff




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-23 Thread Éric Araujo

Changes by Éric Araujo mer...@netwok.org:


--
nosy: +eric.araujo
versions: +Python 3.3




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-23 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

In issue11303.diff, I add a similar optimization for encode('latin1') and for 
the 'utf8' variant of utf-8.  I don't think dash-less variants of utf-16 and utf-32 
are common enough to justify special-casing.

--
Added file: http://bugs.python.org/file20872/issue11303.diff




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-23 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-23 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

+1 for the patch.

--




[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

2011-02-23 Thread Jesús Cea Avión

Changes by Jesús Cea Avión j...@jcea.es:


--
nosy: +jcea
