[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2016-10-06 Thread INADA Naoki
Changes by INADA Naoki: resolution: -> fixed; status: open -> closed

[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2016-01-07 Thread INADA Naoki
INADA Naoki added the comment: FYI, I found a workaround. https://github.com/PyMySQL/PyMySQL/pull/409
    _table = [chr(i) for i in range(128)] + [chr(i) for i in range(0xdc80, 0xdd00)]
    def decode_surroundescape(s):
        return s.decode('latin1').translate(_table)
    In [15]: data = b'\xff' * 1024 *
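A minimal sketch of why the translate-based workaround gives the same result as the built-in error handler: decoding as latin1 maps every byte to the code point of the same value, and the 256-entry table then shifts 0x80..0xFF into the surrogate range U+DC80..U+DCFF. The names `_TABLE`, `decode_via_translate` and `decode_builtin` are illustrative and are not taken from the PyMySQL pull request.

```python
# Table indexed by code point: 0..127 stay as-is, 128..255 become U+DC80..U+DCFF.
_TABLE = [chr(i) for i in range(128)] + [chr(i) for i in range(0xDC80, 0xDD00)]


def decode_via_translate(data: bytes) -> str:
    """Workaround: latin1 never fails, then remap the high half via str.translate."""
    return data.decode("latin1").translate(_TABLE)


def decode_builtin(data: bytes) -> str:
    """Reference behaviour: the ASCII decoder with the surrogateescape handler."""
    return data.decode("ascii", "surrogateescape")


sample = bytes(range(256))
same_result = decode_via_translate(sample) == decode_builtin(sample)
# Encoding with surrogateescape restores the original bytes.
round_trip = decode_via_translate(sample).encode("ascii", "surrogateescape")
```

Whether this is faster than the built-in handler depends on the Python version; the thread reports the gain for the releases current at the time.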

[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2016-01-07 Thread STINNER Victor
STINNER Victor added the comment:
> In [18]: %timeit decode_surroundescape(data)
> 10 loops, best of 3: 40 ms per loop
Cool! Good job.

[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2015-10-10 Thread INADA Naoki
INADA Naoki added the comment: UTF-8 and Latin1 are typical encodings for MySQL queries. When inserting a BLOB:
    # Decode binary data
    x = data.decode('ascii', 'surrogateescape')
    # %-format query
    psql = sql % (escape(x),)  # sql is unicode
    # Encode sql to connection encoding (latin1 or utf8)
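The flow described above can be sketched as a round trip: binary data is smuggled through a text query as lone surrogates and comes back out as the original bytes when the query is encoded with the same error handler. This is only an illustration under stated assumptions: the table `t`, the column `payload` and the function `build_query_bytes` are hypothetical, and real drivers also escape quotes and other special bytes, which is omitted here.

```python
def build_query_bytes(blob: bytes, encoding: str) -> bytes:
    """Embed binary data in a text query, then encode for the connection.

    Hypothetical helper: SQL escaping is deliberately left out.
    """
    # Bytes >= 0x80 become lone surrogates U+DC80..U+DCFF.
    as_text = blob.decode("ascii", "surrogateescape")
    # The query template is ordinary text (str).
    query = "INSERT INTO t (payload) VALUES ('%s')" % (as_text,)
    # The same handler turns the surrogates back into the original bytes.
    return query.encode(encoding, "surrogateescape")


blob = b"\x89PNG\xff"
utf8_wire = build_query_bytes(blob, "utf-8")
latin1_wire = build_query_bytes(blob, "latin-1")
expected = b"INSERT INTO t (payload) VALUES ('" + blob + b"')"
```

Both decode and encode run through the error handler for every non-ASCII byte, which is why its speed matters for large BLOBs.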

[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2015-10-09 Thread STINNER Victor
STINNER Victor added the comment: INADA Naoki: "I want to Python 3.4 and Python 3.5 solve this issue since it's critical problem for some people." On microbenchmarks, the optimizations that I just implemented in Python 3.6 are impressive. The problem is that the implementation is quite

[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2015-10-09 Thread STINNER Victor
STINNER Victor added the comment: Short summary. Ok, I optimized ASCII, Latin1 and UTF-8 codecs (encoders and decoders) for the most common error handlers.
* ASCII and Latin1 encoders: surrogateescape, replace, ignore, backslashreplace, xmlcharrefreplace
* ASCII decoder: surrogateescape
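For readers unfamiliar with the error handlers named in this summary, here is a short sketch of what each one produces. It only demonstrates standard codec behaviour; it says nothing about how the optimization itself is implemented.

```python
text = "a\u20ac"  # 'a' followed by EURO SIGN, which ASCII cannot encode

encoded = {
    handler: text.encode("ascii", handler)
    for handler in ("replace", "ignore", "backslashreplace", "xmlcharrefreplace")
}

# surrogateescape: undecodable bytes become lone surrogates and back again.
escaped = b"a\xff".decode("ascii", "surrogateescape")
restored = escaped.encode("ascii", "surrogateescape")

# surrogatepass: lets the UTF-8 codec encode a lone surrogate as-is.
passed = "\ud800".encode("utf-8", "surrogatepass")
```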

[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2015-10-02 Thread STINNER Victor
STINNER Victor added the comment: I created issue #25301: "Optimize UTF-8 decoder with error handlers".

[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2015-10-01 Thread STINNER Victor
STINNER Victor added the comment: I just pushed my patch to optimize the UTF-8 encoder with error handlers: see the issue #25267. It's up to 70 times as fast. The patch was based on Serhiy's work: faster_surrogates_hadling.patch attached to this issue.

[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2015-09-24 Thread STINNER Victor
Changes by STINNER Victor <victor.stin...@gmail.com>: title: Optimize coding with surrogateescape and surrogatepass error handlers -> Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

[issue24870] Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers

2015-09-24 Thread STINNER Victor
STINNER Victor added the comment: Serhiy wrote: "All other error handlers lose information and can't be used per se for transcoding bytes as string or string as bytes." Well, it was very simple to implement replace and ignore in decoders. I believe that the error handlers are commonly used.

Re: ascii to latin1

2006-05-10 Thread Serge Orlov
also don't get a list as [115448,DAÇÃO] but a [u'115448,DAÇÃO'] element, which doesn't suit my needs. Would you mind telling me what should I change? Calling this process latin1 to ascii was a misnomer, sorry that I used this phrase. It should be called latin1 to search key, there is no requirement

Re: ascii to latin1

2006-05-10 Thread Luis P. Mendes
With regards to º, Richie already gave you food for thought, if you want 1 DE MO to match 1º DE MO remove that symbol from the key (linha_key = linha_key.translate({ord(u"º"): None})), if you don't want such a fuzzy matching, keep it. Thank you all

Re: ascii to latin1

2006-05-09 Thread Richie Hindle
[Serge]
    def search_key(s):
        de_str = unicodedata.normalize("NFD", s)
        return ''.join(cp for cp in de_str if not unicodedata.category(cp).startswith('M'))
Lovely bit of code - thanks for posting it! You might want to use NFKD to normalize things like LATIN SMALL
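A small sketch of the difference between the two normalization forms discussed here, written for Python 3 (the thread itself predates it). The `form` parameter is an addition for illustration and is not part of Serge's original function.

```python
import unicodedata


def search_key(s: str, form: str = "NFD") -> str:
    """Decompose the string and drop combining marks (categories starting with 'M')."""
    decomposed = unicodedata.normalize(form, s)
    return "".join(
        cp for cp in decomposed if not unicodedata.category(cp).startswith("M")
    )


# Canonical decomposition is enough for ordinary accented letters.
plain = search_key("televisão")
mixed = search_key("télévisao")

# º has only a compatibility decomposition, so NFD leaves it alone
# while NFKD maps it to a plain 'o'.
ordinal_nfd = search_key("1º")
ordinal_nfkd = search_key("1º", form="NFKD")
```

With NFD, all three spellings of the word from the original question collapse to the same key, which is what the search index needs.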

Re: ascii to latin1

2006-05-09 Thread Luis P. Mendes
Richie Hindle wrote: [Serge]
    def search_key(s):
        de_str = unicodedata.normalize("NFD", s)
        return ''.join(cp for cp in de_str if not unicodedata.category(cp).startswith('M'))
Lovely bit of code - thanks for posting

Re: ascii to latin1

2006-05-09 Thread Richie Hindle
[Luis] When I used the NFD option, I came across many errors on these and possibly other codes: \xba, \xc9, \xcd. What errors? This works fine for me, printing Ecoute:
    import unicodedata
    def search_key(s):
        de_str = unicodedata.normalize("NFD", s)
        return ''.join([cp for cp in de_str if

Re: ascii to latin1

2006-05-09 Thread Serge Orlov
Richie Hindle wrote: [Serge]
    def search_key(s):
        de_str = unicodedata.normalize("NFD", s)
        return ''.join(cp for cp in de_str if not unicodedata.category(cp).startswith('M'))
Lovely bit of code - thanks for posting it! Well, it is not so good. Please read my next

Re: ascii to latin1

2006-05-09 Thread Serge Orlov
Luis P. Mendes wrote: Richie Hindle wrote: [Serge]
    def search_key(s):
        de_str = unicodedata.normalize("NFD", s)
        return ''.join(cp for cp in de_str if not unicodedata.category(cp).startswith('M'))
Lovely bit

Re: ascii to latin1

2006-05-09 Thread Richie Hindle
[Serge] I have to admit that using normalize is a far from perfect way to implement search. The most advanced algorithm is published by Unicode guys: http://www.unicode.org/reports/tr10/ If you read it you'll understand it's not so easy. I only have to look at the length of the document to

Re: ascii to latin1

2006-05-09 Thread Luis P. Mendes
When I used the NFD option, I came across many errors on these and possibly other codes: \xba, \xc9, \xcd. What errors? normalize method is not supposed to give any errors. You mean it doesn't work as expected? Well, I have to admit that using

Re: ascii to latin1

2006-05-09 Thread Peter Otten
Luis P. Mendes wrote: The script converted the ÇÃ from the first line, but not the º from the second one. Still in *, I also don't get a list as [115448,DAÇÃO] but a [u'115448,DAÇÃO'] element, which doesn't suit my needs. Would you mind telling me what should I change? Sometimes you are

Re: ascii to latin1

2006-05-09 Thread richie
[Luis] The script converted the ÇÃ from the first line, but not the º from the second one. That's because º, 0xba, MASCULINE ORDINAL INDICATOR is classed as a letter and not a diacritic: http://www.fileformat.info/info/unicode/char/00ba/index.htm You can't encode it in ascii because it's
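The point about character classes can be checked directly with `unicodedata`. This is a quick sketch in Python 3; the exact letter subcategory reported for º depends on the Unicode version bundled with the interpreter, so only the leading 'L' is relied on here.

```python
import unicodedata

ordinal = "\u00ba"  # º
name = unicodedata.name(ordinal)
is_letter = unicodedata.category(ordinal).startswith("L")

# Ç decomposes canonically into a base letter plus a combining mark,
# which is why stripping 'M' categories removes the cedilla.
decomposed = unicodedata.normalize("NFD", "\u00c7")
mark_category = unicodedata.category(decomposed[1])

# º has no canonical decomposition and is not ASCII, so encoding fails.
try:
    ordinal.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```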

ascii to latin1

2006-05-08 Thread Luis P. Mendes
Hi, I'm developing a django based intranet web server that has a search page. Data contained in the database is mixed. Some of the words are accented, some are not but they should be. This is because the collection of data began a long time ago

Re: ascii to latin1

2006-05-08 Thread Robert Kern
Luis P. Mendes wrote: example: if the word searched is 'televisão', I want that a search by either 'televisao', 'televisão' or even 'télévisao' (this last one doesn't exist in Portuguese) is successful. The ICU library has the capability to transliterate strings via certain rulesets. One

Re: ascii to latin1

2006-05-08 Thread Rene Pijlman
Luis P. Mendes: I'm developing a django based intranet web server that has a search page. Data contained in the database is mixed. Some of the words are accented, some are not but they should be. This is because the collection of data began a long time ago when ascii was the only way to go.

Re: ascii to latin1

2006-05-08 Thread Serge Orlov
, instead of only one search, there will be several used. Is there anything already coded, or will I have to try to do it all by myself? You need to convert from latin1 to ascii not from ascii to latin1. The function below does that. Then you need to build database index not on latin1 text