[issue11322] encoding package's normalize_encoding() function is too slow

2022-01-24 Thread Gregory P. Smith


Change by Gregory P. Smith :


--
nosy: +gregory.p.smith

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11322] encoding package's normalize_encoding() function is too slow

2016-12-15 Thread STINNER Victor

STINNER Victor added the comment:

Oh, while reading Mercurial history, I found a note that I wrote:

"It's not exactly the same than encodings.normalize_encoding(): the C function 
also converts to lowercase."

IMHO it's fine to modify encodings.normalize_encoding() to also convert to 
lowercase.

--




[issue11322] encoding package's normalize_encoding() function is too slow

2016-12-15 Thread STINNER Victor

STINNER Victor added the comment:

It seems like encodings.normalize_encoding() currently has no unit test! Before 
modifying it, I would prefer to see a few unit tests:

* " utf 8 "
* "UtF 8"
* "utf8\xE9"
* etc.

Since we are talking about an optimization, I would like to see a benchmark 
result before/after. I would also like to test Marc-Andre's idea of exposing 
the C function _Py_normalize_encoding().
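
A before/after comparison of this kind could be scripted with timeit. The sketch below is only one assumed way to measure it: the helper name bench_normalize and the sample spellings are hypothetical and do not come from the issue. It times whatever callable is passed in, here the stdlib encodings.normalize_encoding().

```python
import encodings
import timeit

# Hypothetical sample of encoding spellings; not taken from the issue.
SAMPLE_NAMES = [" utf 8 ", "UtF 8", "latin-1", "ISO_8859-15", "cp1252"]


def bench_normalize(func, names=SAMPLE_NAMES, number=10_000, repeat=5):
    """Return the best time in seconds for `number` passes over `names`."""
    def run():
        for name in names:
            func(name)
    # Taking the minimum of several repeats reduces timing noise.
    return min(timeit.repeat(run, number=number, repeat=repeat))


if __name__ == "__main__":
    best = bench_normalize(encodings.normalize_encoding)
    print(f"normalize_encoding: {best:.4f} s for 10,000 passes")
```

Running the same helper against a patched function and comparing the two numbers would give the before/after figure.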

_Py_normalize_encoding() works on a byte string encoded to Latin1. To implement 
encodings.normalize_encoding() with it, should we rewrite the function to work 
on Py_UCS4 characters, or keep a fast version for char* and add a more generic 
version for UCS2 and UCS4?

--




[issue11322] encoding package's normalize_encoding() function is too slow

2016-12-15 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Thanks for the patch.

Victor has implemented the function in C, AFAIK, so an even better approach 
would be to expose that function at the Python level and use it in the 
encodings package.

--
versions: +Python 3.7 -Python 3.4, Python 3.5




[issue11322] encoding package's normalize_encoding() function is too slow

2016-12-15 Thread Mark Lawrence

Changes by Mark Lawrence :


--
nosy:  -BreamoreBoy




[issue11322] encoding package's normalize_encoding() function is too slow

2016-12-15 Thread INADA Naoki

Changes by INADA Naoki :


--
keywords: +patch
Added file: http://bugs.python.org/file45909/encoding_normalize_optimize.patch




[issue11322] encoding package's normalize_encoding() function is too slow

2014-06-15 Thread Mark Lawrence

Mark Lawrence added the comment:

What's the status of this issue, as we've lived with this really slow 
implementation for well over three years?

--
nosy: +BreamoreBoy
versions: +Python 3.4, Python 3.5 -Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11322
___



Re: [issue11322] encoding package's normalize_encoding() function is too slow

2014-06-15 Thread M.-A. Lemburg
On 15.06.2014 15:02, Mark Lawrence wrote:
> 
> What's the status of this issue, as we've lived with this really slow 
> implementation for well over three years?

I guess it just needs someone to write a patch.

Note that encoding lookups are cached, so the slowness only
becomes an issue if you look up lots of different encodings.
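
The caching can be observed from Python with a small sketch. The codec name "xdemocodec" and the search function demo_search are hypothetical and exist only for this illustration; the codec simply reuses the UTF-8 encode and decode functions.

```python
import codecs

_utf8 = codecs.lookup("utf-8")
calls = []  # records every time our search function handles the demo name


def demo_search(name):
    # Hypothetical codec name used only for this demonstration.
    if name != "xdemocodec":
        return None
    calls.append(name)
    return codecs.CodecInfo(_utf8.encode, _utf8.decode, name="xdemocodec")


codecs.register(demo_search)

first = codecs.lookup("xdemocodec")
second = codecs.lookup("xdemocodec")  # answered from the lookup cache

print(len(calls))       # how many times the search function actually ran
print(first is second)  # whether both lookups returned the same object
```

Only the first lookup of a given spelling reaches the search functions, which is why the cost of name normalization is paid once per spelling.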

-- 
Marc-Andre Lemburg
eGenix.com




[issue11322] encoding package's normalize_encoding() function is too slow

2012-07-15 Thread Serhiy Storchaka

Serhiy Storchaka storch...@gmail.com added the comment:

> I don't know who changed the encoding's package normalize_encoding() function 
> (wasn't me), but it's a really slow implementation.

See changeset 54ef645d08e4.

--
nosy: +storchaka




[issue11322] encoding package's normalize_encoding() function is too slow

2011-03-01 Thread Jesús Cea Avión

Changes by Jesús Cea Avión j...@jcea.es:


--
nosy: +jcea




[issue11322] encoding package's normalize_encoding() function is too slow

2011-02-26 Thread Steffen Daode Nurpmeso

Changes by Steffen Daode Nurpmeso sdao...@googlemail.com:


--
nosy: +sdaoden




[issue11322] encoding package's normalize_encoding() function is too slow

2011-02-25 Thread Marc-Andre Lemburg

New submission from Marc-Andre Lemburg m...@egenix.com:

I don't know who changed the encoding's package normalize_encoding() function 
(wasn't me), but it's a really slow implementation.

The original version used the .translate() method which is a lot faster and can 
be adapted to work with the Unicode variant of the .translate() method just as 
well.

import __builtin__  # Python 2 only

# 256-character translation table: '.', digits and ASCII letters map to
# themselves; every other character maps to a space.
_norm_encoding_map = ('                                              . '
                      '0123456789       ABCDEFGHIJKLMNOPQRSTUVWXYZ     '
                      ' abcdefghijklmnopqrstuvwxyz                     '
                      '                                                '
                      '                                                '
                      '                ')

def normalize_encoding(encoding):

    """ Normalize an encoding name.

        Normalization works as follows: all non-alphanumeric
        characters except the dot used for Python package names are
        collapsed and replaced with a single underscore, e.g. '  -;#'
        becomes '_'. Leading and trailing underscores are removed.

        Note that encoding names should be ASCII only; if they do use
        non-ASCII characters, these must be Latin-1 compatible.

    """
    # Make sure we have an 8-bit string, because .translate() works
    # differently for Unicode strings.
    if hasattr(__builtin__, "unicode") and isinstance(encoding, unicode):
        # Note that .encode('latin-1') does *not* use the codec
        # registry, so this call doesn't recurse. (See unicodeobject.c
        # PyUnicode_AsEncodedString() for details)
        encoding = encoding.encode('latin-1')
    return '_'.join(encoding.translate(_norm_encoding_map).split())
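
For comparison, a minimal sketch of how the same idea might look with the Unicode variant of .translate() in Python 3. This is an assumption for illustration, not a proposed patch: the names _KEEP, _NORM_TABLE and normalize_encoding_translate are hypothetical, and the table only covers ASCII, so non-ASCII characters pass through unchanged.

```python
import string

# Characters that survive normalization: ASCII letters, digits and the dot.
_KEEP = set(string.ascii_letters + string.digits + ".")

# str.translate() accepts a dict mapping code points to replacement strings.
# Every other ASCII character is mapped to a space so that split() can
# collapse runs of punctuation and whitespace.
_NORM_TABLE = {i: " " for i in range(128) if chr(i) not in _KEEP}


def normalize_encoding_translate(encoding):
    """Normalize an encoding name using str.translate() (sketch)."""
    if isinstance(encoding, bytes):
        encoding = str(encoding, "ascii")
    # split() drops leading/trailing blanks and collapses inner runs.
    return "_".join(encoding.translate(_NORM_TABLE).split())


print(normalize_encoding_translate(" utf 8 "))
print(normalize_encoding_translate("ISO_8859-1"))
```

The per-character work happens inside translate() and split(), which run in C, rather than in a Python-level loop.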

--
components: Unicode
messages: 129386
nosy: lemburg
priority: normal
severity: normal
status: open
title: encoding package's normalize_encoding() function is too slow
type: performance
versions: Python 3.3




[issue11322] encoding package's normalize_encoding() function is too slow

2011-02-25 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +belopolsky, ezio.melotti




[issue11322] encoding package's normalize_encoding() function is too slow

2011-02-25 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I don't think the normalize_encoding() function was the culprit for issue11303, 
because I measured timings with timeit, which averages multiple runs, while 
normalize_encoding() is called only once per encoding spelling due to 
caching.

--




[issue11322] encoding package's normalize_encoding() function is too slow

2011-02-25 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

We should first implement the same algorithm in the three normalization 
functions and add tests for them (at least for the function in normalization):

 - normalize_encoding() in encodings: it doesn't convert to lowercase and keeps 
   non-ASCII letters
 - normalize_encoding() in unicodeobject.c
 - normalizestring() in codecs.c

normalize_encoding() in encodings is more lax than the other two functions: 
it normalizes '  utf   8  ' to 'utf_8'. But it doesn't convert to lowercase and 
keeps non-ASCII letters: 'UTF-8é' is normalized to 'UTF_8é'.
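
The behaviour described for the encodings version can be reproduced with a character-by-character loop. The following is a sketch under that assumption, not necessarily the exact stdlib source; the name normalize_encoding_loop is hypothetical.

```python
def normalize_encoding_loop(encoding):
    """Collapse runs of non-alphanumeric characters (except '.') into '_'."""
    if isinstance(encoding, bytes):
        encoding = str(encoding, "ascii")
    chars = []
    punct = False  # True while we are inside a run of separator characters
    for c in encoding:
        if c.isalnum() or c == ".":
            # Emit a single underscore for the preceding separator run,
            # but never at the very start of the result.
            if punct and chars:
                chars.append("_")
            chars.append(c)
            punct = False
        else:
            punct = True
    return "".join(chars)


print(normalize_encoding_loop("  utf   8  "))
print(normalize_encoding_loop("UTF-8\xe9"))
```

Because str.isalnum() is true for non-ASCII letters such as 'é', they are kept, and no case conversion takes place.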

I don't know if the normalization functions have to be more or less strict, but 
I think that they should all give the same result.

--
nosy: +haypo




[issue11322] encoding package's normalize_encoding() function is too slow

2011-02-25 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

STINNER Victor wrote:
> 
> STINNER Victor victor.stin...@haypocalc.com added the comment:
> 
> We should first implement the same algorithm in the three normalization 
> functions and add tests for them (at least for the function in normalization):
> 
>  - normalize_encoding() in encodings: it doesn't convert to lowercase and 
>    keeps non-ASCII letters
>  - normalize_encoding() in unicodeobject.c
>  - normalizestring() in codecs.c
> 
> normalize_encoding() in encodings is more lax than the other two 
> functions: it normalizes '  utf   8  ' to 'utf_8'. But it doesn't convert to 
> lowercase and keeps non-ASCII letters: 'UTF-8é' is normalized to 'UTF_8é'.
> 
> I don't know if the normalization functions have to be more or less strict, 
> but I think that they should all give the same result.

Please see this message for an explanation of why we have those
three functions, why they are different and what their application
space is:

http://bugs.python.org/issue5902#msg129257

This ticket is just about the encoding package's codec search
function, not the other two, and I don't want to change
semantics, just its performance.

--
title: encoding package's normalize_encoding() function is too slow -> encoding 
package's normalize_encoding() function is too  slow
