[issue16311] Use _PyUnicodeWriter API in text decoders
STINNER Victor added the comment:

Oh, I forgot my benchmark results.

decodebench.py results on Linux 32 bits (Linux-3.2.0-32-generic-pae-i686-with-debian-wheezy-sid):

$ ./python bench-diff.py original writer
codec     test           original          writer
ascii     'A'*1              4109  (-3%)     3974
latin1    'A'*1              3851  (-5%)     3644
latin1    '\x80'*1          14832  (-3%)    14430
utf-8     'A'*1              3747  (-4%)     3608
utf-8     '\x80'*1            976  (-2%)      961
utf-8     '\u0100'*1          974  (-2%)      959
utf-8     '\u8000'*1          804  (-14%)     694
utf-8     '\U0001'*1          666  (-5%)      635
utf-16le  'A'*1              4154  (-1%)     4117
utf-16le  '\x80'*1           4055  (-2%)     3988
utf-16le  '\u0100'*1         4047  (-2%)     3974
utf-16le  '\u8000'*1          917  (-1%)      912
utf-16le  '\U0001'*1          872  (-0%)      870
utf-16be  'A'*1              3218  (-1%)     3185
utf-16be  '\x80'*1           3163  (-2%)     3114
utf-16be  '\u0100'*1         2591  (-1%)     2556
utf-16be  '\u8000'*1          979  (-1%)      974
utf-16be  '\U0001'*1          928  (-0%)      925
utf-32le  'A'*1              1681  (+12%)    1885
utf-32le  '\x80'*1           1697  (+10%)    1865
utf-32le  '\u0100'*1         2224  (+1%)     2254
utf-32le  '\u8000'*1         2224  (+2%)     2269
utf-32le  '\U0001'*1         2234  (+1%)     2260
utf-32be  'A'*1              1685  (+11%)    1868
utf-32be  '\x80'*1           1684  (+10%)    1860
utf-32be  '\u0100'*1         2223  (+1%)     2253
utf-32be  '\u8000'*1               (+1%)     2255
utf-32be  '\U0001'*1         2243  (+1%)     2257

decodebench.py results on Linux 64 bits (Linux-3.4.9-2.fc16.x86_64-x86_64-with-fedora-16-Verne):

codec     test           original          writer
ascii     'A'*1             10043  (+1%)    10144
latin1    'A'*1              8351  (-1%)     8258
latin1    '\x80'*1          19184  (+2%)    19560
utf-8     'A'*1              8083  (+5%)     8461
utf-8     '\x80'*1            982  (+1%)      993
utf-8     '\u0100'*1          984  (+1%)      992
utf-8     '\u8000'*1          806  (+31%)    1053
utf-8     '\U0001'*1          639  (+12%)     718
utf-16le  'A'*1              5547  (-2%)     5422
utf-16le  '\x80'*1           5205  (+1%)     5271
utf-16le  '\u0100'*1         4900  (-4%)     4695
utf-16le  '\u8000'*1         1062  (+9%)     1154
utf-16le  '\U0001'*1         1040  (+4%)     1078
utf-16be  'A'*1              5416  (-5%)     5157
utf-16be  '\x80'*1           5077  (-1%)     5011
utf-16be  '\u0100'*1         4261  (-1%)     4218
utf-16be  '\u8000'*1         1146  (+0%)     1147
utf-16be  '\U0001'*1         1125  (-1%)     1119
utf-32le  'A'*1              1743  (+8%)     1880
utf-32le  '\x80'*1           1751  (+5%)     1842
utf-32le  '\u0100'*1         2114  (+29%)    2721
utf-32le  '\u8000'*1         2120  (+28%)    2718
utf-32le  '\U0001'*1         2065  (+30%)    2690
utf-32be  'A'*1              1761  (+6%)     1860
utf-32be  '\x80'*1           1749  (+6%)     1856
utf-32be  '\u0100'*1         2101  (+29%)    2715
utf-32be  '\u8000'*1         2083  (+30%)    2715
utf-32be  '\U0001'*1         2058  (+31%)    2689

Most significant changes:

* -14% to decode '\u8000'*1 from UTF-8 on Linux 32 bits
* +31% to decode '\u8000'*1 from UTF-8 on Linux 64 bits
* +28% to +31% to decode UCS-2 and UCS-4 characters from UTF-32 on Linux 64 bits

@Serhiy Storchaka: If you feel able to tune _PyUnicodeWriter to improve its performance, please open a new issue. I consider the performance changes acceptable and I don't plan to work on this topic.

--
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16311
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
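The percentages in the benchmark figures above are consistent with the relative change of the "writer" column against the "original" column, rounded to a whole percent. The sketch below only checks that reading against a few rows; it is not the bench-diff.py script, and the helper name percent_change is made up for illustration.

```python
def percent_change(original: float, patched: float) -> int:
    """Relative change of `patched` versus `original`, as a whole percent."""
    return round((patched - original) / original * 100)


# (original, writer) pairs copied from the benchmark figures above.
rows = {
    "utf-8 '\\u8000'*1, Linux 32 bits": (804, 694),
    "utf-8 '\\u8000'*1, Linux 64 bits": (806, 1053),
    "utf-32le 'A'*1, Linux 32 bits": (1681, 1885),
}

for name, (original, writer) in rows.items():
    print(f"{name}: {percent_change(original, writer):+d}%")
```

In the thread, positive values are read as speedups and negative values as regressions, so the numbers are presumably throughput figures where higher is better.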
[issue16311] Use _PyUnicodeWriter API in text decoders
Roundup Robot added the comment:

New changeset 7ed9993d53b4 by Victor Stinner in branch 'default':
Close #16311: Use the _PyUnicodeWriter API in text decoders
http://hg.python.org/cpython/rev/7ed9993d53b4

--
nosy: +python-dev
resolution:  -> fixed
stage:  -> committed/rejected
status: open -> closed
[issue16311] Use _PyUnicodeWriter API in text decoders
Serhiy Storchaka added the comment: I updated the patch to resolve the conflict with issue14625. -- Added file: http://bugs.python.org/file27806/codecs_writer_2.patch
[issue16311] Use _PyUnicodeWriter API in text decoders
Changes by Serhiy Storchaka storch...@gmail.com: Added file: http://bugs.python.org/file27807/codecs_writer_2.patch
[issue16311] Use _PyUnicodeWriter API in text decoders
Changes by Serhiy Storchaka storch...@gmail.com: Removed file: http://bugs.python.org/file27806/codecs_writer_2.patch
[issue16311] Use _PyUnicodeWriter API in text decoders
Changes by Serhiy Storchaka storch...@gmail.com: Added file: http://bugs.python.org/file27808/decodebench.res
[issue16311] Use _PyUnicodeWriter API in text decoders
Serhiy Storchaka added the comment: With the patch, the UTF-8 decoder is 20% slower for some data. The UTF-16 decoder is 20% faster for some data and 20% slower for other data. The UTF-32 decoder is slower for much of the data (even after some optimization; the naive code was up to 50% slower). The standard charmap decoder is 10% slower. Only the UTF-7, unicode-escape and raw-unicode-escape decoders have become much faster (unicode-escape and raw-unicode-escape as with the issue16334 patch). Well-optimized decoders do not benefit from _PyUnicodeWriter; they only see a slight slowdown. The patch requires some optimization (as was done for the UTF-32 decoder) to reduce the negative effect. Non-optimized decoders will benefit greatly.
[issue16311] Use _PyUnicodeWriter API in text decoders
STINNER Victor added the comment: I ran the decodebench.py and bench-diff.py scripts from #14624; I just replaced repeat=10 with repeat=100 to get more reliable numbers. I only see some performance regressions between -5% and -1%, but there are some speedups on UTF-8 and UTF-32 (between +11% and +14%). On a microbenchmark, numbers in the -10..10% range just mean no change. Using _PyUnicodeWriter should not change performance on valid data, only the performance of handling decoding errors, because the overallocation factor is different, as are the code to widen the buffer and the code to write replacement characters.
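The measurement approach described above (many repeats, and treating anything within about 10% as noise) can be sketched with the standard timeit module. This is not decodebench.py; the function names bench_decode and is_noise, the default of 100 repeats, and the 10% threshold as a parameter are illustrative assumptions based on the comment.

```python
import timeit


def bench_decode(data: bytes, encoding: str, repeat: int = 100, number: int = 10) -> float:
    """Best-of-`repeat` time in seconds for a single data.decode(encoding) call."""
    timer = timeit.Timer(lambda: data.decode(encoding))
    # Taking the minimum reduces the influence of scheduler and cache noise.
    return min(timer.repeat(repeat=repeat, number=number)) / number


def is_noise(original: float, patched: float, threshold: float = 0.10) -> bool:
    """True if the relative change is inside the +/- threshold band."""
    return abs(patched - original) / original <= threshold


if __name__ == "__main__":
    sample = ("\u8000" * 10000).encode("utf-8")
    best = bench_decode(sample, "utf-8", repeat=20)
    print(f"utf-8 decode, best of 20 repeats: {best * 1e6:.1f} us")
```

Comparing two interpreter builds would mean running the same script under each build and feeding both results to is_noise().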
[issue16311] Use _PyUnicodeWriter API in text decoders
Serhiy Storchaka added the comment: I will do some experiments and review tomorrow.
[issue16311] Use _PyUnicodeWriter API in text decoders
STINNER Victor added the comment:

> Soon I'll post a patch, which speeds up unicode-escape and raw-unicode-escape decoders to 1.5-3x. Also there are not yet reviewed patches for UTF-32 (issue14625) and charmap (issue14850) decoders. Will be merge conflicts.

codecs_writer.patch doesn't change the core of the decoders much; it mostly changes the code before and after the loop, and the error handling. You can still use PyUnicode_WRITE, PyUnicode_READ, memcpy(), etc.

> But I will review the patch.

If you review the patch, please check how the buffer is allocated. It should not be overallocated by default, only on the first error. Overallocation can kill performance when it is not necessary (especially on Windows).
[issue16311] Use _PyUnicodeWriter API in text decoders
Changes by STINNER Victor victor.stin...@gmail.com: -- nosy: +loewis, serhiy.storchaka
[issue16311] Use _PyUnicodeWriter API in text decoders
New submission from STINNER Victor:

The attached patch modifies the text decoders to use the _PyUnicodeWriter API to factorize the code. It removes the unicode_widen() and unicode_putchar() functions.

* Don't overallocate by default (except for the raw-unicode-escape codec); enable overallocation on the first decode error (as done currently)
* _PyUnicodeWriter_Prepare() only overallocates 25%, instead of 100% for unicode_decode_call_errorhandler()
* Use _PyUnicodeWriter_Prepare() + PyUnicode_WRITE() (two macros) instead of unicode_putchar() (function)
* The _PyUnicodeWriter structure stores many useful fields, so we don't have to pass multiple parameters to functions, only the writer

I wrote the patch to factorize the code, but it might also be faster.

--
files: codecs_writer.patch
keywords: patch
messages: 173695
nosy: haypo
priority: normal
severity: normal
status: open
title: Use _PyUnicodeWriter API in text decoders
type: performance
versions: Python 3.4

Added file: http://bugs.python.org/file27697/codecs_writer.patch
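The allocation policy described in this submission (exact-size buffer by default, overallocation switched on at the first decode error, 25% versus 100% growth) can be illustrated with a toy model in Python. This is not the C implementation: the ToyWriter class, its method names, and the "<?>" replacement string are hypothetical and exist only to count reallocations under the two growth factors.

```python
class ToyWriter:
    """Toy growable output buffer (hypothetical; not the C _PyUnicodeWriter)."""

    def __init__(self, size: int, growth: float = 0.25):
        self.buf = [None] * size   # exact-size allocation, no slack by default
        self.pos = 0
        self.growth = growth       # extra fraction added once overallocation is on
        self.overallocate = False  # switched on at the first decode error
        self.reallocs = 0

    def prepare(self, n: int) -> None:
        """Make room for n more characters, growing the buffer if needed."""
        needed = self.pos + n
        if needed <= len(self.buf):
            return
        new_size = needed
        if self.overallocate:
            new_size += int(needed * self.growth)
        self.buf.extend([None] * (new_size - len(self.buf)))
        self.reallocs += 1

    def write(self, text: str) -> None:
        self.prepare(len(text))
        for ch in text:
            self.buf[self.pos] = ch
            self.pos += 1

    def finish(self) -> str:
        return "".join(self.buf[:self.pos])


def decode_ascii_replace(data: bytes, growth: float = 0.25):
    """Decode ASCII, writing '<?>' for invalid bytes; return (text, reallocs)."""
    writer = ToyWriter(len(data), growth)
    for byte in data:
        if byte < 0x80:
            writer.write(chr(byte))
        else:
            writer.overallocate = True  # first error enables overallocation
            writer.write("<?>")         # longer than the input byte, so the buffer may grow
    return writer.finish(), writer.reallocs
```

On valid input the model never reallocates, which matches the point that the change should only matter on the error path. On input full of errors, the 25% factor reallocates more often than the 100% factor but wastes less memory, which is the trade-off discussed elsewhere in the thread.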
[issue16311] Use _PyUnicodeWriter API in text decoders
Serhiy Storchaka added the comment: Soon I'll post a patch that speeds up the unicode-escape and raw-unicode-escape decoders by 1.5-3x. There are also not-yet-reviewed patches for the UTF-32 (issue14625) and charmap (issue14850) decoders, so there will be merge conflicts. But I will review the patch.