[issue24870] surrogateescape is too slow
STINNER Victor added the comment: Serhiy: maybe we can start with ascii? -- title: Optimize coding with surrogateescape and surrogatepass error handlers - surrogateescape is too slow ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue24870 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue24870] surrogateescape is too slow
Serhiy Storchaka added the comment: A few months ago I wrote a patch that drastically speeds up encoding and decoding with the surrogateescape and surrogatepass error handlers. However, it causes a 25% regression when decoding some UTF-8 data (U+0100-U+07FF, if I remember correctly) with the strict error handler, so it needs more work. I hope it is possible to rewrite the UTF-8 decoder so as to avoid the regression. The patch was postponed until 3.5 is released. In any case, the patch is large and complex enough to count as a new feature, which can appear only in 3.6. -- assignee: - serhiy.storchaka nosy: +serhiy.storchaka versions: -Python 3.4, Python 3.5
[issue24870] surrogateescape is too slow
R. David Murray added the comment: Why are bytes being escaped in a binary blob? The reason to use surrogateescape is when you have data that is mostly text, should be processed as text, but can have occasional binary data. That wouldn't seem to apply to a database binary blob. But that aside, if you want to submit a patch to speed up surrogateescape without changing its functionality, I'm sure it would be considered. It would certainly be useful for the email library, which currently does do the stupid thing of encoding binary message attachments using surrogateescape (and I'm guessing the reason pymysql does it is something similar to why email does it: the code would need to be significantly reorganized to do things right). -- nosy: +r.david.murray versions: -Python 3.2, Python 3.3
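To illustrate the "mostly text with occasional binary data" case described above, here is a minimal sketch. The header bytes and variable names are made up for the example; they are not taken from the email library or from pymysql.

```python
# Hypothetical input: an ASCII header line that contains one stray Latin-1 byte.
raw_header = b"Subject: caf\xe9 menu\r\n"

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9,
# so the line can be handled with ordinary str methods.
text = raw_header.decode("ascii", "surrogateescape")
name, _, value = text.partition(": ")

# Encoding with the same handler restores the original byte exactly.
rebuilt = (name + ": " + value).encode("ascii", "surrogateescape")

# A real binary payload needs no text processing, so it can simply stay bytes.
blob = bytes(range(256))

print(repr(text))
print(rebuilt == raw_header)
```

The round trip is what makes the handler suitable for text that may carry a few undecodable bytes; for a payload that is never processed as text, keeping it as bytes avoids the decoding cost altogether.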
[issue24870] surrogateescape is too slow
New submission from INADA Naoki: surrogateescape is the recommended way to mix binary data into a string protocol, but it is too slow, and that causes usability problems. One actual problem is: https://github.com/PyMySQL/PyMySQL/issues/366 surrogateescape is slow because the error handler is called with a UnicodeError object for each error. bs.decode('utf-8', 'surrogateescape') may produce len(bs)/2 error objects internally when bs is random bytes. surrogateescape is ordinarily used with the ASCII and UTF-8 encodings, so a specialized implementation for those could make it faster. I would like Python 3.4 and Python 3.5 to fix this issue, since it is a critical problem for some people. -- components: Unicode messages: 248631 nosy: ezio.melotti, haypo, naoki priority: normal severity: normal status: open title: surrogateescape is too slow type: performance versions: Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6
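The per-error callback described above can be observed from Python with a custom error handler. This is only a sketch: the handler name "counting_surrogateescape" is invented for the example, and it imitates the built-in handler for decoding only. How many bytes each call covers is an implementation detail of the codec, so the count is printed rather than assumed.

```python
import codecs

call_count = [0]  # mutable counter shared with the handler


def counting_surrogateescape(exc):
    """Decode-only imitation of surrogateescape that counts its invocations."""
    call_count[0] += 1
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Map each undecodable byte 0xNN to the lone surrogate U+DCNN.
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    raise exc


# "counting_surrogateescape" is a hypothetical name chosen for this sketch.
codecs.register_error("counting_surrogateescape", counting_surrogateescape)

bs = bytes(range(256))  # 128 ASCII bytes followed by 128 non-ASCII bytes
text = bs.decode("ascii", "counting_surrogateescape")

print("handler calls:", call_count[0])
print("same result as built-in:", text == bs.decode("ascii", "surrogateescape"))
```

Every call receives a freshly built UnicodeDecodeError instance, which is the overhead the report points at.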
[issue24870] surrogateescape is too slow
INADA Naoki added the comment:

On a MacBook Pro (Core i5 2.6GHz), decoding 1MB of data with surrogateescape takes 250ms.

    In [1]: bs = bytes(range(256)) * (4 * 1024)
    In [2]: len(bs)
    Out[2]: 1048576
    In [3]: %timeit x = bs.decode('ascii', 'surrogateescape')
    1 loops, best of 3: 249 ms per loop

--
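For readers without IPython, the same measurement can be approximated with the standard timeit module. This is a sketch of an equivalent script, not the exact command used above, and the timing will differ by machine and Python version.

```python
import timeit

# Same 1 MiB input as in the session above: half of the bytes are non-ASCII.
bs = bytes(range(256)) * (4 * 1024)


def decode_escaped():
    return bs.decode("ascii", "surrogateescape")


# Three independent runs of one call each, reporting the best, like %timeit.
runs = timeit.repeat(decode_escaped, number=1, repeat=3)
best = min(runs)

print(f"{len(bs)} bytes, best of 3: {best * 1000:.1f} ms per loop")
```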